
Congrats on the launch!

This topic is relevant to what I'm currently working on, and I'm finding it exhausting, to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator[1] for a few weeks now. The idea is great in principle: ClickHouse lets you export both its own data and Postgres data (via its named collections feature) to Parquet (among other formats, but we settled on Parquet), and you can then run clickhouse-obfuscator to anonymize the data and write it back out as Parquet, which can be imported wherever needed.

The problem I'm running into is referential integrity, as importing the anonymized data is raising unique and foreign key violations. The obfuscator tool is pretty minimal and has few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.
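To make the failure mode concrete (a toy sketch, not how any particular tool is implemented): if each table or export file is remapped independently, with no shared state, the same source key ends up with different obfuscated values in different tables, so every foreign key dangles on import. The salts and values below are made up for illustration:

```python
import hashlib


def obfuscate(value: int, salt: str) -> int:
    # Hash-based remapping; a different salt per table/run simulates
    # an obfuscator that shares no mapping state between exports.
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


user_ids = [1, 2, 3]        # users.id
order_user_ids = [1, 2, 3]  # orders.user_id (FK -> users.id)

users_obf = {obfuscate(u, "users-run") for u in user_ids}
orders_obf = [obfuscate(u, "orders-run") for u in order_user_ids]

# Every foreign key now dangles: no order points at an existing user.
dangling = [u for u in orders_obf if u not in users_obf]
print(len(dangling))
```

Importing output like that is exactly where the unique and FK violations come from.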

Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.

I have a couple of questions:

1. Does Neosync ensure that anonymized data has the same statistical significance (distribution, cardinality, etc.) as the source data? This is something that clickhouse-obfuscator put quite a lot of effort into addressing, as you can see from their README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.

2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer[2]? Obviously you also handle MySQL, but I'm interested in PG specifically.

As a side note, I'm not sure I understand what the value proposition of your Cloud service would be. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process, and only adds more risk. I don't think that most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say that I trust your business model to sustain a company around this product.

[1]: https://github.com/ClickHouse/ClickHouse/blob/master/program...

[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/




Thanks for the comment and feedback!

We're actually evaluating a ClickHouse integration at the moment for a customer we're working with, so that might be coming in the future. Today it's just PG and MySQL, though.

To answer your questions:

1. We don't support this quite yet, although we're working on it for both anonymization and synthetic data. For anonymization, that typically means having deterministic anonymizers that output the same value for the same input (like a hash). For synthetic data, that means using a model to maintain those same statistical characteristics. We'll have support for both of these within the next quarter. It also depends on what you want to anonymize: if the values you're anonymizing wouldn't meaningfully change the distribution of the data (think a name or an address, where you're not doing any analytics or queries on those fields), then the statistical distribution of the data stays the same.
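The deterministic-anonymizer idea mentioned above can be sketched in a few lines: a keyed hash maps the same input to the same output everywhere it appears, so foreign-key relationships survive across tables. The key, table shapes, and function names here are hypothetical, just to show the mechanism:

```python
import hashlib
import hmac

SECRET = b"rotate-me-per-environment"  # hypothetical per-run key


def anonymize_id(value: int) -> int:
    """Deterministic keyed hash: same input -> same output everywhere."""
    mac = hmac.new(SECRET, str(value).encode(), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")


# Apply the SAME mapping to the PK and to every FK referencing it.
users = [(anonymize_id(uid), "redacted") for uid in (1, 2, 3)]
orders = [(oid, anonymize_id(uid)) for oid, uid in ((10, 1), (11, 3))]

user_ids = {uid for uid, _ in users}
assert all(uid in user_ids for _, uid in orders)  # joins still resolve
print("referential integrity preserved")
```

The trade-off is that a keyed hash preserves equality (and therefore joins and cardinality) but not the value distribution itself, which is where the model-based approach comes in.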

2. A few big differences: PG Anonymizer doesn't handle referential integrity; pretty much everything has to be defined in SQL; it doesn't have a GUI; and it has no orchestration across environments or databases. Neosync supports all of those.

Folks use the cloud service because they don't have the resources or time to deploy and run the OSS offering themselves. These are usually startups who are okay with us streaming their data, anonymizing it, and sending it back to them. We're SOC 2 Type 2 compliant and usually go through a security review for these deployments. Alternatively, they can run our managed version and keep all of their data on their infra while we host the control plane.


Thanks for the answers! Good to know that ClickHouse support might be on the roadmap.

Re 1: Yeah, it's a hard problem, from what I understand. I recommend reading that clickhouse-obfuscator README, in case it helps with your implementation. The tool works similarly to what you mention: it first generates a Markov model, which is then used for deterministic output with a given seed. My only problem is the referential integrity, so solving that as well would be ideal.

Re 2: Gotcha, makes sense. I kind of assumed that it did maintain integrity, but it's not clear from their docs. It's definitely more difficult to set up as well, and while I appreciate its simplicity, a better user experience never hurts.

Sounds good about your service then. I can't say that I would personally want to rely on it, but I can see how it would be useful for others. Cheers, and good luck with the project!


Yup - totally hear you - hopefully we'll have a good solution for that in a few months :)



