I'm not surprised. The data dump was an ugly mess of inconsistently encoded data in inconsistent formats, with "delimiters" that often appear in the data itself.
Cleaning that up is a serious effort and requires operations on huge files that are very difficult for most software to deal with.
I don't understand why a serious effort would be required. Even if the chosen delimiter showing up within the data is an issue, the phone number is the first field.
I can get all the phone numbers myself with a simple `cat * | cut -d ":" -f 1`
That's the ID number you just grabbed. The phone number is the second field :)
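For anyone who wants the second field without tripping over colons later in the record, here is a rough Python sketch. The sample record layout and the `extract_phone` name are made up for illustration; the only thing taken from the thread is that the ID comes first and the phone number second.

```python
from typing import Optional

def extract_phone(line: str) -> Optional[str]:
    """Return the second colon-separated field, or None if the line is too short."""
    # maxsplit=2: only the first two colons are used as delimiters, so colons
    # inside later fields (names, locations) never affect the result.
    parts = line.rstrip("\r\n").split(":", 2)
    if len(parts) < 2:
        return None
    return parts[1]

# Hypothetical record, not copied from the real dump.
print(extract_phone("1000001:15550000000:Some, Guy:male"))
```

This only works because the ID and phone number come before any free-text field. It does nothing for the fields further along the line.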
If that's literally all you want, yes, it's not that hard. But a non-trivial number of people decided to put commas or colons in their names and other nonsense like that. There are also lots of commas in the hometown and location fields, which makes parsing those a pain, etc.
Those are what I looked at and those are super annoying because several have commas in both the first & last name. I don't know why, but a handful of people listed their names as `some, guy, some, guy`, which I assume should be split into firstname: `some, guy` and lastname: `some, guy`. Then a lot of people have `None` for a birthday, some have something like `May 8`, and others have something like `May 8, 1990`. Both locale & hometown can be either `None` or have several commas in them.
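To give an idea of the validation involved, here is a minimal sketch of handling the three birthday shapes mentioned above (`None`, `May 8`, `May 8, 1990`). It assumes full English month names and exactly those layouts. The `parse_birthday` name and the tuple it returns are my own choices, not anything from the dump.

```python
import re
from typing import Optional, Tuple

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# "May 8" or "May 8, 1990"; the year part is optional.
BIRTHDAY_RE = re.compile(r"^([A-Za-z]+) (\d{1,2})(?:, (\d{4}))?$")

def parse_birthday(raw: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Return (month, day, year or None); None if missing or unrecognized."""
    raw = raw.strip()
    if raw in ("", "None"):
        return None
    m = BIRTHDAY_RE.match(raw)
    if not m or m.group(1) not in MONTHS:
        return None  # anything unexpected is left for manual review
    year = int(m.group(3)) if m.group(3) else None
    return (MONTHS[m.group(1)], int(m.group(2)), year)

print(parse_birthday("May 8"))
print(parse_birthday("May 8, 1990"))
print(parse_birthday("None"))
```

A regex is used instead of `datetime.strptime` so that a year-less `February 29` is not rejected just because the default year is not a leap year. It does not check that the day is valid for the month.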
I had to reformat all that data and validate that each field made sense to parse it. There are helpful "Location" and "link" markers in the CSV but it's still super annoying to parse this stuff.
Also be careful, some of these docs have BOMs that screw up parsing tools (even iconv crapped out on one of the files, Qatar I think it was) and the encoding is all over the place. At least the phone number is ASCII, but the names may be UTF-8 (with or without BOM), UTF-16-le or...
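One way to cope with the BOMs is to sniff the first bytes yourself before decoding. The sketch below is an assumption about how you might do it, not what anyone in this thread actually ran. It only knows the UTF-8 and UTF-16 marks, falls back to UTF-8 when there is no BOM, and replaces undecodable bytes rather than failing. `decode_with_bom` is a made-up helper name.

```python
import codecs

# Byte order marks to look for, paired with the codec to decode the rest with.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Strip a known BOM and decode accordingly; otherwise use the fallback codec."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding, errors="replace")
    # No BOM: this is a guess. BOM-less UTF-16 will come out as garbage here.
    return data.decode(fallback, errors="replace")

sample = codecs.BOM_UTF16_LE + "José".encode("utf-16-le")
print(decode_with_bom(sample))
```

Files in UTF-16 without a BOM, or in UTF-32, would need extra detection that this sketch does not attempt.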
If the "Who can look you up using the phone number you provided?" setting wasn't set to "Everyone" in your privacy settings, then your phone number wouldn't have been visible to the scraping campaign that was the source of this.
Thanks for saying this. Seems like most people have a misunderstanding of where the data came from.
To clarify: in 2019 you could enter a phone number into Facebook search and it would show you whichever profile was associated with the number, if it was set to public.
The “hackers” set up a script to go through every number sequentially: 15550000000, 15550000001, etc.
I would be very surprised if the original data doesn’t contain a LOT more information: basically everything that can be found by publicly viewing your profile page.