Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL

gregwebs · 2024-05-23T02:17:00 1716430620

Great to see such a project. We are using datanymizer [1] right now but it has gone unmaintained and we are using my patched version [2] and it is working pretty well for us. I saw a new project that is getting close in terms of having the feature set I need and has them on their roadmap [3].

To ensure that we are marking columns as PII, we run a job that compares the anonymization configuration to a comment on the column- we have a comment on every column to mark it as PII (or not).

[1] https://github.com/datanymizer/datanymizer

[2] https://github.com/digitalmint/datanymizer/tree/digitalmint

[3] https://github.com/GreenmaskIO/greenmask

Other tools I found that do some anonymization but didn't meet my needs:

  * https://github.com/DivanteLtd/anonymizer
  * https://postgresql-anonymizer.readthedocs.io/en/stable
  * https://nitzano.github.io/dbzar/
  * https://github.com/Qovery/Replibyte

woyten · 2024-05-23T08:00:08 1716451208

Hi Greg! I am Vadim Voitenko - Greenmask [1] developer that you've mentioned in your message. Thank you so much for your interest.

[1] https://github.com/GreenmaskIO/greenmask

We are trying to deliver new features ASAP according to the user requests, if it not not a secret, could you list the features that you are looking for?

Since 5 days ago we have had a huge major beta release where we introduced important features - dynamic parameters and deterministic transformers and more. Check if there are features you need. If not so, I would love to hear them in order to prioritize them.

gregwebs · 2024-05-23T15:26:48 1716478008

Glad to see the transformers are more flexible in the new release- I think they were already better than Datanymizer. The feature request is here for conditional transforms, looks like it is planned to be implemented soon: https://github.com/GreenmaskIO/greenmask/issues/34

woyten · 2024-05-23T08:05:50 1716451550

There is the latest release

https://github.com/GreenmaskIO/greenmask/releases/tag/v0.2.0...

blopker · 2024-05-22T19:19:08 1716405548

I don't know exactly how this works, but I wanted to share my experience trying to anonymize data. Don't.

While you may be able to change or delete obvious PII, like names, every bit of real data in aggregate leads to revealing someone's identity. They are male? That's half the population. They also live in Seattle, are Hispanic, age 18-25? Down to a few hundred thousand. They use Firefox? That might be like 10 people.

This is why browser fingerprinting is so effective. It's how Ad targeting works.

Just stick with fuzzing random data during development. Many web frameworks already have libraries for doing this. Django for example has factory_boy[0]. You just tell it what model to use, and the factory class will generate data based on your schema. You'll catch more issues this way anyway because computers are better at making nonsensical data.

Keep production data in production.

[0]: https://factoryboy.readthedocs.io/en/stable/orms.html

edrenova · 2024-05-22T19:32:33 1716406353

Thanks for the comment and hear you on the anonymization. What we see is that customers will go through and categorize what is PII and what is not and anonymize as needed. If not, they'll back fill with synthetic data. You can change the gender from male to something else, same with the city, etc.

It's really down to the use-case. If you're strictly doing development, then you'll probably want to use more synthetic data than anonymization. If you care about preserving the statistical characteristics of the data then you can use ML models like CTGAN to create net new data.

Definitely a balance between when do you anonymize vs. when do you create synthetic data.

blopker · 2024-05-22T19:43:55 1716407035

Thanks for the reply, I don't mean to be discouraging! I totally believe people do this, I'm saying they shouldn't. There are other issues as well. Once production data is floating around different environments, it will be easy to lose track of. Then the first GDPR delete request comes in. Was this data synthetic? Was it real? I think Joe has a copy on his laptop, he's on vacation?

It gets messy. It also doesn't solve the main 'unsolvable' issue with production data: scale. It is difficult to test some changes locally because developers often don't have access to databases large enough that would show issues before getting to production. At a certain size, this is the #1 killer of deployments.

kingraoul · 2024-05-22T22:40:35 1716417635

Combining this tool with downsampling would allow you to run isomorphic workloads on smaller nodes and thereby reveal the yield curve.

davedx · 2024-05-23T08:30:52 1716453052

Yup - I worked on a data warehouse project that was subject to GDPR. The way we did it is we didn't do any synthetic data generation, we just blanked out any PII fields with "DELETED". Then it's still possible to action a delete request, because the PK's, emails are the same as they are in production.

It's definitely possible to practice this while adhering to GDPR, but you do need to plan carefully, and synthetic data should only be used for local dev/testing, not data warehousing.

gaogao · 2024-05-22T21:11:59 1716412319

> categorize what is PII and what is not and anonymize as needed

That sounds like just de-identification / pseudoanonymization, if you're just targeting PII or not

imiric · 2024-05-22T20:58:51 1716411531

Fuzzing random data is fine for development environments, but it won't give you the same scale or statistical significance as production data. Without that you can't really ensure that a change will work reliably in production, without actually deploying to production. Canary deployments can only give you this assurance to a certain degree, and by definition would only cover a subset of the data, so having a traditional staging environment is still valuable.

Not only that, but a staging environment with prod-like data is also useful for running load, performance and E2E tests without actually impacting production servers or touching production data. In all of these cases, anonymizing production data is important as you want to minimize the risk of data leaks, while also avoiding issues that can happen when testing against real production data (e.g. sending emails to actual customers).

blopker · 2024-05-22T22:16:17 1716416177

I don't totally understand this comment. Random data can get you more scale than production data, in that it can just be made up. All the load and E2E testing can be done with test data, no problem.

This idea of data being statistically significant has come up, but that's also easy to replicate with random data once you know the distributions of the data. In practice, those distributions rarely change, especially around demographic data. However, I don't think I've seen a case where this has been a problem. I'd be interested to learn about one.

edrenova · 2024-05-22T22:27:42 1716416862

The ideal scenario is that you're able to augment your existing data with more data that looks just like it. The matter of statistical significance really depends on the use-case. For load testing, it's probably not as important as it is for something like feature testin/debugging/analytical queries.

Even if you know the distribution of the data (which imo can be fairly difficult) replicating that can also be tricky. If you know that a gender column is 30-70 male - female, how do you create 30% male names? How about the female names? Are they the same name or do you repeat names? Does it matter? In some cases it does and in others it doesn't.

What we've seen is that it's really use-case specific and there are some tools that can help but there isn't a complete tool set. That's what we're trying to build over time.

steve_gh · 2024-05-23T15:58:48 1716479928

There are various reasons why you might want synthetic data. Anonymisation is one of them - but the issue is around which statistical relationships are preserved in the anonymizing process, while ensuring that fusion with other data sources is not going to unmask the real data hidden beneath.

yowlingcat · 2024-05-23T13:46:06 1716471966

So because you were a) too lazy to understand the concept of differential privacy b) the value of using anonymized data and c) come up with a trivial strawman, that makes the whole concept of anonymization unnecessary and something that should be replaced by oversimplified factories?

How ignorant and wrong-headed. I'd recommend learning more about the concepts of k-anonymity and differential privacy before you prematurely presume impossible a concept that others (including Google) have been able to use successfully.

blopker · 2024-05-23T15:01:28 1716476488

Are you ok?

yowlingcat · 2024-05-23T16:32:30 1716481950

Your professional laziness clearly has you in the wrong, and all you can muster is a snide, defensive "are you ok?"

I'd really advise you to work on that. There's a world of difference between "This is impossible" and "I can't be bothered to figure out how this useful thing others successfully do while managing billions of dollars of risk is possible." Otherwise, you risk professional stagnation.

davedx · 2024-05-23T08:29:02 1716452942

It depends on the use case though. It's not just about developers testing locally.

One use case that many companies have is data warehousing. Here, you want to have real customer and order data, but anonymized to a degree where only the necessary data is exposed to business analysts and so on. I once worked on a project to do exactly that: clone production to the data warehouse, stripping out only things like contact details, but preserving things like emails, what the customer ordered, and other data like that.

specialist · 2024-05-30T17:18:42 1717089522

Belated reply, sorry: This is the Correct Answer™.

Mid 2000s, I worked with electronic medical records. I eventually determined anon isn't worthwhile.

For starters, deanon will always beat anon. This statement is unambiguously true, per the research. Including the differential privacy stuff.

My efforts pivoted to adopting Translucent Database techniques. My current hill to die on is demanding that all PII be encrypted at rest, at the field level.

(It's basically applying paper password storing techniques to protecting PII. The book shows a handful of illuminating use cases. Super clever. No weird or new tech required.)

NortySpock · 2024-05-22T19:25:02 1716405902

So, how does one create synthetic relational data? Do you just crank out a list of synthetic customers, assign IDs, create between 0 and 3 synthetic orders per person, and between 0 and 3 order line entries per order?

blopker · 2024-05-22T19:33:20 1716406400

This is somewhat framework dependent, but factory_boy supports connecting factories together via SubFactory. There's a real-world example I'm building [0]. See where "author = SubFactory(UserFactory)". I'd imagine there are similar ways to do this for Rails and others too.

[0]: https://github.com/totem-technologies/totem-server/blob/main...

edrenova · 2024-05-22T19:36:19 1716406579

we're actually working on this right, can see the PR here -> https://github.com/nucleuscloud/neosync/pull/1832/files

it's a combination of creating a random number of records for foreign keys i.e 1 customer - create between 2 and 5 transctions. Working on giving you control over that, and handling referential integrity with table constraints (foreign keys, unique constraints, etc.)

ML based approaches typically are not very good at this and struggle with handling things like referential integrity. So a more "procedural" or imperative way is slightly better. The ideal is a combination of both.

beeboobaa3 · 2024-05-22T19:36:52 1716406612

Pretty much, yeah. Use normal distribution so you can get some outliers.

szundi · 2024-05-22T23:23:09 1716420189

Knowing everything about an unspecified someone is different than knowing who that is

rvdginste · 2024-05-23T08:10:33 1716451833

Well… if you know the data is real, then knowing everything about that someone can seriously limit the number of people it could actually be, which makes that someone identifiable.

Like if you only know the full address, and you see that only one person lives at that address. Or you know the exact birthdate and the school that someone went to. Or the year of birth and the small shop that the person works for. And so on…

imiric · 2024-05-22T21:30:46 1716413446

Congrats on the launch!

This topic is relevant to what I'm currently working on, and I'm finding it exhausting to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator[1] for a few weeks now. The idea in principle is great, since ClickHouse allows you to export both its data and Postgres data (via their named collections feature) into Parquet format (or a bunch of others, but we settled on Parquet), and then using clickhouse-obfuscator to anonymize the data and store it in Parquet as well, which can then be imported where needed.

The problem I'm running into is referential integrity, as importing the anonymized data is raising unique and foreign key violations. The obfuscator tool is pretty minimal and has few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.

Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.

I have a couple of questions:

1. Does Neosync ensure that anonymized data has the same statistical significance (distribution, cardinality, etc.) as source data? This is something that clickhouse-obfuscator put quite a lot of effort in addressing, as you can see from their README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.

2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer[2]? Obviously you also handle MySQL, but I'm interested in PG specifically.

As a side note, I'm not sure I understand what the value proposition of your Cloud service would be. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process, and only adds more risk. I don't think that most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say that I trust your business model to sustain a company around this product.

[1]: https://github.com/ClickHouse/ClickHouse/blob/master/program...

[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/

edrenova · 2024-05-22T21:47:47 1716414467

Thanks for the comment and feedback!

We're actually evaluating a clickhouse integration at the moment for a customer that we're working with so that might be coming in the future. Although today just PG and Mysql.

To answer your questions:

1. We don't support this quite yet although we're working on it for both anonymization and synthetic data. For anonymization, that typically means having deterministic anonymizers that output the same value for the same input (like a hash). For synthetic data that means using a model to be able maintain those same statistical characteristics. We'll have support for both of these within the next quarter. It also depends on what you want to anonymize. If the values that you want to anonymize wouldn't meaningfully change the distribution of the data (think like a name or an address and you're not doing any analytics or queries on those fields) then the statistical distribution of the data stays the same.

2. A few big differences. PG Anonymizer doesn't handle referential integrity, pretty much everything has to be defined in sql and it doesn't have a GUI and it doesn't have any orchestration across environments or databases. Neosync supports all of those.

Folks use the cloud service because they don't have the resource or time to deploy/run the OSS offering themselves. These are usually startups who are okay with us streaming their data and anonymizing it and sending it back to them. We're SOC2 type 2 compliant and usually got through a security review for these deployments. Conversely, they can also just run our managed version and keep all of their data on their infra while we host the control plane.

imiric · 2024-05-22T22:49:41 1716418181

Thanks for the answers! Good to know that ClickHouse support might be on the roadmap.

Re 1: Yeah, it's a hard problem, from what I understand. I recommend reading that clickhouse-obfuscator README, in case it helps with your implementation. The tool works similarly to what you mention: it first generates a Markov model, which is then used for deterministic output with a given seed. My only problem is the referential integrity, so solving that as well would be ideal.

Re 2: Gotcha, makes sense. I kind of assumed that it did maintain integrity, but it's not clear from their docs. It's definitely more difficult to setup as well, and while I appreciate its simplicity, a better user experience never hurts.

Sounds good about your service then. I can't say that I would personally want to rely on it, but I can see how it would be useful for others. Cheers, and good luck with the project!

edrenova · 2024-05-22T23:22:33 1716420153

Yup - totally hear you - hopefully we'll have a good solution for that in a few months :)

mathisd · 2024-05-22T21:10:24 1716412224

During an internship, I was part of a team that developed a collection of tools [0] intended to provide pseudonymization of production database for testing and development purposes. These tools were developed while used in parallel with clients that had a large number of database.

Referential constraint refer to ensuring some coherence / basic logic in the output data (ie. the anonymized street name must exist in the anonymized city). This was the most time consuming phase of the pseudonymization process. They were working on introducing pseudonymization with cross-referential constraint which is a mess as constraint were often strongly intertwined. Also, a lot of the time client had no proper idea of what the field were and what they were truly containing (what format of phone number, we did find a lot of unusual things).

[0] (LINO, PIMO, SIGO, etc.) https://github.com/CGI-FR/PIMO

edrenova · 2024-05-22T21:27:50 1716413270

yeah the referential integrity and constraints part is usually the most complicated part and everyone does things differently which adds another layer of complexity on it

enahs-sf · 2024-05-22T18:54:46 1716404086

I love that it's open-source. Great project and very applicable across a lot of industries, especially those deeply affected by compliance.

pitah1 · 2024-05-23T00:19:26 1716423566

Thanks for sharing. Happy to see another solution that doesn't just slap on AI/ML to try to solve it.

I am also among the many people who have created a solution similar[0] to this :). The approach I took though is being metadata-driven (given most anonymisation solutions cannot guarantee sensitive data not leaking and also open up network access from prod to test envs, security teams did not accept it whilst I was working at a bank), offering the option to validate based on the generated data (i.e. check if your service or job has consumed the data correctly) and ability to clean up the generated or consumed data.

Being metadata-driven opened up the possibility of linking to existing metadata services like data catalogs (OpenMetadata, Amundsen), data quality (Great Expectations, Soda), specification files (OpenAPI/Swagger), etc., which are often underutilized.

The other part that I found whilst building and getting feedback from customers, was having referential integrity across data sources. For example, account create events coming through Kafka, consumed and stored in Postgres whilst, at the end of the day, a CSV file of the same accounts would also be consumed by a job.

I'm wondering if you have come across similar thoughts or feedback from your users?

[0]: https://github.com/data-catering/data-caterer

kjuulh · 2024-05-22T22:24:35 1716416675

I just published our approach to pseudo anonymization and sort of anonymization.

We'd built a tool which can traverse data extract the pii data and put back a token into the data. Before one of our allowed systems would read the data we'd swap in the actual data or an anonymized version if we didn't permission to use it anymore. So we sort of get the best of both worlds we can use the actual data of our customers because we require it, but can safely use data for analytics and retain a lot of the statistical variance of our data.

Crazy complex project to work on given our limited resources but very fulfilling in the end.

It should be mentioned that I don't mention the difference between anonymization and pseudo anonymization in the article mostly because I didn't know it was really a thing. I just implemented a solution given or requirements

https://tech.lunar.app/blog/data-anonymization-at-scale

edrenova · 2024-05-22T23:24:34 1716420274

Nice! appreciate you sharing it - would love to see the code at some point but looks like it's confidential.

I spent a lot of time building tokenization solutions at a previous startup so we'll definitely support tokenization at some point. There is a good use-case for it as well!

3abiton · 2024-05-22T22:50:40 1716418240

Nice article, but a bit different than the OP authors who published a fully fledged open-source tool, is your code also available open-source for the community?

kjuulh · 2024-05-22T22:53:56 1716418436

Sadly not, it is very much tied to our workflows and as such wouldnt provide much value for anyone else other than a reference. Nevertheless I thought I'd share our approach to anonymization / pseudoanomyzation as I haven't really seen that done elsewhere.

aj__chan · 2024-05-22T19:45:17 1716407117

Amazing open source project! I can see pretty broad application to basically every application developers stack as they're building out their tools, but also working with real world production data in developer environments that don't break compliance. Great work, Evis & Nick!

chairmanwow1 · 2024-05-22T20:22:52 1716409372

Interesting, but why does it matter that I actually keep the same the same statistical distributions of data in development as in production? What are the use cases for that kind of feature?

edrenova · 2024-05-22T20:33:29 1716410009

yeah good question, if you're doing any sort of analytical work, then you'll care about the statistical distribution of your data. If you're running queries or sharing data with third parties, then you want to maintain the same stats. If you're just building features then you might not care as much as if you were doign analytical work. But it could still be relevant if you're building metrics/dashboards/anything visual - you'll want to see that you can render your prod data correctly. So more so for analytical work but less so for normal, run of the mill dev work.

ngcazz · 2024-05-23T07:08:55 1716448135

Hey, this looks quite cool! Just spotted this link on your site's frontpage is 404ing https://www.neosync.dev/solutions/keep-environments-in-sync (was quite keen to read this one specifically)

nickzelei · 2024-05-23T18:05:04 1716487504

Looks like the link was changed to this: https://www.neosync.dev/solutions/reproduce-prod-bugs-locall...

edrenova · 2024-05-23T16:53:00 1716483180

hey! so sorry about this - it's fixed now!

also - happy to chat further if you have any questions - evis@neosync.dev

ngcazz · 2024-05-24T21:43:23 1716587003

Thanks for addressing it and your availability. Keen to look into how Neosync might fit my team's dev needs very soon.