I can't comment on how things like faces get used, but in my experience, PII at Meta is inaccessible by default.
Unless you're impersonating a user on the platform (to see the PII that user can see), you have to request special access for logs or database columns containing so much as user IDs; otherwise the data simply won't show up when you query for it. This is baked into the infrastructure layer, so I doubt the GenAI teams are using something else.
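A minimal sketch of what default-deny column access could look like at the query layer. All names here (RESTRICTED_COLUMNS, run_query, the permission set) are invented for illustration and are not Meta's actual infrastructure:

```python
# Hypothetical default-deny column filtering: restricted columns are
# silently dropped unless the caller holds an explicit grant.
RESTRICTED_COLUMNS = {"user_id", "email", "ip_address"}

def run_query(rows, requested_columns, granted_permissions):
    """Return rows with restricted columns omitted unless granted."""
    visible = [
        c for c in requested_columns
        if c not in RESTRICTED_COLUMNS or c in granted_permissions
    ]
    return [{c: row[c] for c in visible} for row in rows]

rows = [{"user_id": 42, "email": "a@example.com", "country": "DE"}]

# Without a grant, the restricted columns simply don't appear.
print(run_query(rows, ["user_id", "email", "country"], granted_permissions=set()))
```

The point of doing this in the infrastructure rather than in each application is that a query that isn't entitled to a column behaves as if the column doesn't exist, instead of erroring or relying on every caller to remember to filter.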
I don't think it's PII. If you had someone's movements, you could go and spy on them, find out who they were (i.e. their PII) and then link that back and say "I now know this identified person's movements". I don't think the movements themselves are PII.
Saying something isn't PII isn't a "convenient" definition, and it doesn't mean everything that isn't PII is fine to share. It's like saying a kidnapping isn't a murder: that's not a convenient definition of murder; it's just a different thing. We shouldn't start talking like witch hunters as soon as we encounter a situation we haven't memorised a reasonable response to. We should be able to respond reasonably to new situations.
Obvious examples: data that easily identifies a person (photo, name, number, UUID, etc.)
That's trivial to block. Where it gets harder is data that on its own isn't PII, but combined with another source, would be.
For example, aggregating public comments on a celeb's post (i.e. stripping out usernames and likes, and assigning a new UUID to each person). For a single post, that's good enough: you're very unlikely to be able to identify a single person.
But over multiple posts, that's where it gets tricky.
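The per-post vs. cross-post distinction above can be sketched in a few lines (all usernames and data here are invented for illustration; `pseudonymise` is a hypothetical helper, not a real library):

```python
import uuid

comments = [
    {"post": 1, "user": "alice", "text": "first!"},
    {"post": 1, "user": "bob",   "text": "nice"},
    {"post": 2, "user": "alice", "text": "I was at the Berlin show"},
]

def pseudonymise(comments, per_post):
    """Replace usernames with UUIDs, either one per user (reused
    across posts) or a fresh one per (post, user) pair."""
    mapping = {}
    out = []
    for c in comments:
        key = (c["post"], c["user"]) if per_post else c["user"]
        pid = mapping.setdefault(key, str(uuid.uuid4()))
        out.append({"post": c["post"], "user": pid, "text": c["text"]})
    return out

# One UUID per user: alice's comments are still linkable across posts,
# and the linked comment history may be enough to re-identify her.
global_ids = pseudonymise(comments, per_post=False)
assert global_ids[0]["user"] == global_ids[2]["user"]

# A fresh UUID per (post, user): cross-post linkage is blocked.
per_post_ids = pseudonymise(comments, per_post=True)
assert per_post_ids[0]["user"] != per_post_ids[2]["user"]
```

This is the classic linkage-attack shape: no single pseudonymised post is identifying, but a stable pseudonym lets an attacker accumulate quasi-identifiers (locations, timing, writing style) across posts until one person fits.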
As at most large companies, the process for getting permission to use that kind of data is rightly difficult, so it often doesn't get used like that.
In the paper covering the original Llama, they explicitly list their data sources in Table 1, including saying that they pretrained on the somewhat controversial books3 dataset.
The paper for Llama 2 also explicitly says they don't take data from Meta's products and services, and that they filter out data from sites known to contain a lot of PII, although it is more coy about precisely which data sources they used, like many such papers are.
Which data sources? How much of Meta users' data (FB, Instagram, etc.)? How do they sanitize PII?