
Congrats on the launch!

This topic is relevant to what I'm currently working on, and I'm finding it exhausting, to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator[1] for a few weeks now. The idea is great in principle: ClickHouse lets you export both its own data and Postgres data (via its named collections feature) to Parquet (among other formats, but we settled on Parquet), and you can then run clickhouse-obfuscator to anonymize the data and write it back out as Parquet, which can be imported wherever needed.

The problem I'm running into is referential integrity, as importing the anonymized data is raising unique and foreign key violations. The obfuscator tool is pretty minimal and has few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.
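To make the failure mode concrete (a toy sketch, not how any particular tool is implemented): if each table or export file is remapped independently, with no shared state, the same source key ends up with different obfuscated values in different tables, so every foreign key dangles on import. The salts and values below are made up for illustration:

```python
import hashlib


def obfuscate(value: int, salt: str) -> int:
    # Hash-based remapping; a different salt per table/run simulates
    # an obfuscator that shares no mapping state between exports.
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


user_ids = [1, 2, 3]        # users.id
order_user_ids = [1, 2, 3]  # orders.user_id (FK -> users.id)

users_obf = {obfuscate(u, "users-run") for u in user_ids}
orders_obf = [obfuscate(u, "orders-run") for u in order_user_ids]

# Every foreign key now dangles: no order points at an existing user.
dangling = [u for u in orders_obf if u not in users_obf]
print(len(dangling))
```

Importing output like that is exactly where the unique and FK violations come from.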

Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.

I have a couple of questions:

1. Does Neosync ensure that anonymized data has the same statistical significance (distribution, cardinality, etc.) as the source data? This is something that clickhouse-obfuscator put quite a lot of effort into addressing, as you can see from their README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.

2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer[2]? Obviously you also handle MySQL, but I'm interested in PG specifically.

As a side note, I'm not sure I understand what the value proposition of your Cloud service would be. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process, and only adds more risk. I don't think that most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say that I trust your business model to sustain a company around this product.

[1]: https://github.com/ClickHouse/ClickHouse/blob/master/program...

[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/




Thanks for the comment and feedback!

We're actually evaluating a ClickHouse integration at the moment for a customer we're working with, so that might be coming in the future. Today it's just PG and MySQL, though.

To answer your questions:

1. We don't support this quite yet, although we're working on it for both anonymization and synthetic data. For anonymization, that typically means having deterministic anonymizers that output the same value for the same input (like a hash). For synthetic data, that means using a model to maintain those same statistical characteristics. We'll have support for both of these within the next quarter. It also depends on what you want to anonymize: if the values you're anonymizing wouldn't meaningfully change the distribution of the data (think a name or an address, where you're not doing any analytics or queries on those fields), then the statistical distribution of the data stays the same.
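The deterministic-anonymizer idea mentioned above can be sketched in a few lines: a keyed hash maps the same input to the same output everywhere it appears, so foreign-key relationships survive across tables. The key, table shapes, and function names here are hypothetical, just to show the mechanism:

```python
import hashlib
import hmac

SECRET = b"rotate-me-per-environment"  # hypothetical per-run key


def anonymize_id(value: int) -> int:
    """Deterministic keyed hash: same input -> same output everywhere."""
    mac = hmac.new(SECRET, str(value).encode(), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")


# Apply the SAME mapping to the PK and to every FK referencing it.
users = [(anonymize_id(uid), "redacted") for uid in (1, 2, 3)]
orders = [(oid, anonymize_id(uid)) for oid, uid in ((10, 1), (11, 3))]

user_ids = {uid for uid, _ in users}
assert all(uid in user_ids for _, uid in orders)  # joins still resolve
print("referential integrity preserved")
```

The trade-off is that a keyed hash preserves equality (and therefore joins and cardinality) but not the value distribution itself, which is where the model-based approach comes in.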

2. A few big differences: PG Anonymizer doesn't handle referential integrity; pretty much everything has to be defined in SQL; it doesn't have a GUI; and it has no orchestration across environments or databases. Neosync supports all of those.

Folks use the cloud service because they don't have the resources or time to deploy and run the OSS offering themselves. These are usually startups who are okay with us streaming their data, anonymizing it, and sending it back to them. We're SOC 2 Type 2 compliant and usually go through a security review for these deployments. Alternatively, they can run our managed version and keep all of their data on their infra while we host the control plane.


Thanks for the answers! Good to know that ClickHouse support might be on the roadmap.

Re 1: Yeah, it's a hard problem, from what I understand. I recommend reading that clickhouse-obfuscator README, in case it helps with your implementation. The tool works similarly to what you mention: it first generates a Markov model, which is then used for deterministic output with a given seed. My only problem is the referential integrity, so solving that as well would be ideal.

Re 2: Gotcha, makes sense. I kind of assumed that it did maintain integrity, but it's not clear from their docs. It's definitely more difficult to set up as well, and while I appreciate its simplicity, a better user experience never hurts.

Sounds good about your service then. I can't say that I would personally want to rely on it, but I can see how it would be useful for others. Cheers, and good luck with the project!


Yup - totally hear you - hopefully we'll have a good solution for that in a few months :)



