Would be nice to read how they collect and prepare data for training.

Which data sources? How much of Meta users' data (Facebook, Instagram, etc.)? How do they sanitize PII?


> How do they sanitize PII?

I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.

Unless you're impersonating a user on the platform (to access what PII they can see), you have to request special access for logs or database columns that contain so much as user IDs; otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.
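
A rough sketch of what such a default-deny access layer might look like. All names and structure here are hypothetical (Meta's actual infrastructure isn't public); the point is just that PII-annotated columns vanish from results unless the caller holds an explicit grant:

```python
# Hypothetical sketch: columns annotated as PII at the schema level are
# silently dropped from query results unless the caller has a grant.
# Illustrative only; not Meta's actual system.

PII_COLUMNS = {"user_id", "email", "ip_address"}  # schema-level PII annotations

def run_query(rows: list[dict], requested: list[str], grants: set[str]) -> list[dict]:
    """Return only the columns the caller is allowed to see.

    PII columns simply don't appear in the output without a grant,
    so downstream code can't use data it was never approved for.
    """
    visible = [c for c in requested if c not in PII_COLUMNS or c in grants]
    return [{c: row[c] for c in visible if c in row} for row in rows]

rows = [{"user_id": 42, "country": "DE", "email": "a@example.com"}]

# Without a grant, the PII columns vanish from the result set.
print(run_query(rows, ["user_id", "country", "email"], grants=set()))
# [{'country': 'DE'}]

# With an approved grant, the same query returns them.
print(run_query(rows, ["user_id", "country", "email"], grants={"user_id", "email"}))
# [{'user_id': 42, 'country': 'DE', 'email': 'a@example.com'}]
```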


For a convenient definition of PII, that is. Isn't everything a user does, taken in aggregate, PII?


I don't think it's PII. If you had someone's movements, you could go and spy on them, find out who they were (i.e. their PII) and then link that back and say "I now know this identified person's movements". I don't think the movements themselves are PII.

A definition that excludes things that aren't PII isn't a "convenient" definition. That doesn't mean everything that isn't PII is fine to share. It's like saying a kidnapping isn't a murder: that's not a convenient definition of murder; it's just a different thing. We shouldn't start talking like witch hunters as soon as we encounter a situation that we haven't memorised a reasonable response to. We should be able to respond reasonably to new situations.


PII is pretty intuitive to define.

Obvious examples: data that easily identifies a person (photo, name, number, UUID, etc.)
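
Blocking those obvious cases can be as simple as matching known identifier patterns. A minimal sketch (the patterns are illustrative; a production scrubber would use many more, plus named-entity recognition for things like names):

```python
import re

# Illustrative patterns for the "easy" cases: emails, phone-like numbers, UUIDs.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "UUID": re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
    ),
}

def scrub(text: str) -> str:
    """Replace obvious identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reach me at jane@example.com or +1 (555) 123-4567."))
# Reach me at [EMAIL] or [PHONE].
```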

That's trivial to block. Where it gets harder is stuff that on its own isn't PII but, combined with another source, would be.

For example, aggregating public comments on a celebrity's post (i.e., stripping out usernames and likes and assigning a new UUID to each person). For a single post, that's good enough; you're very unlikely to be able to identify a single person.

But over multiple posts, that's where it gets tricky.
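
A toy illustration of why (hypothetical data): fresh per-post pseudonyms leave nothing to link, but a pseudonym that stays stable across posts quietly becomes a quasi-identifier as the comment history accumulates.

```python
import uuid
from collections import defaultdict

# Toy comment stream across several posts; usernames already stripped upstream.
comments = [
    ("post1", "alice", "love this"),
    ("post2", "alice", "me in the 3rd row!"),
    ("post3", "alice", "my cafe is around the corner"),
    ("post1", "bob", "nice"),
]

# Per-post pseudonyms: the same user gets a fresh UUID on every post,
# so nothing links their comments together across posts.
per_post = [(post, str(uuid.uuid4()), text) for post, _user, text in comments]

# Stable pseudonyms: one UUID per user across all posts. Each comment is
# still "anonymous", but the combined history can re-identify the person.
stable = {}
profiles = defaultdict(list)
for post, user, text in comments:
    pseud = stable.setdefault(user, str(uuid.uuid4()))
    profiles[pseud].append(text)

for pseud, texts in profiles.items():
    print(pseud[:8], texts)
# One pseudonym accumulates row position + neighborhood + interests...
```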

As at most large companies, the process for getting permission to use that kind of data is rightly difficult, so it often doesn't get used like that.


What about a post or comment that includes proper names?


Then you should check out papers like https://arxiv.org/abs/2302.13971 (LLaMA) and https://arxiv.org/abs/2307.09288 (Llama 2).

In the paper covering the original LLaMA, they explicitly list their data sources in Table 1, including saying that they pretrained on the somewhat controversial Books3 dataset.

The paper for Llama 2 also explicitly says they don't take data from Meta's products and services, and that they filter out data from sites known to contain a lot of PII, although it is more coy about precisely which data sources they used, as many such papers are.
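
The paper doesn't publish how that filter works, but in spirit it could be as simple as a domain-level blocklist over the crawl. A hedged sketch with made-up domain names (the actual list and criteria used for Llama 2 are not public):

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains known to host lots of PII.
PII_HEAVY_DOMAINS = {"peoplesearch.example", "publicrecords.example"}

def keep_document(doc: dict) -> bool:
    """Drop crawled documents whose source domain is on the PII blocklist."""
    domain = urlparse(doc["url"]).netloc.lower()
    # Match the domain itself and any of its subdomains.
    return not any(
        domain == d or domain.endswith("." + d) for d in PII_HEAVY_DOMAINS
    )

corpus = [
    {"url": "https://blog.example/post", "text": "..."},
    {"url": "https://www.peoplesearch.example/jane-doe", "text": "..."},
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```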


> Would be nice to read how they collect and prepare data for training.

By literally opting everyone into their training data set and making it very cumbersome to opt out: https://threadreaderapp.com/thread/1794863603964891567.html


They explicitly train models only on public datasets.
