
My number was leaked (checked the dump myself) but I don't show up on this site. Seems like there are some bugs to work out.



I'm not surprised, the data dump was an ugly mess of inconsistently encoded data in inconsistent formats with "delimiters" that often appear in the data itself.

Cleaning that up is a serious effort and requires operations on huge files that are very difficult for most software to deal with.


I don't understand why a serious effort would be required. Even if the chosen delimiter shows up within the data, the phone number is the first field.

I can get all the phone numbers myself with a simple `cat * | cut -d ":" -f 1`


That's the ID number you just grabbed. The phone number is the second field :)

If that's literally all you want, yes, it's not that hard. But a non-trivial number of people decided to put commas or colons in their names and other nonsense like that, there are lots of commas in the hometown or location fields which makes parsing those a pain, etc.
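For the phone-only case, here is a minimal Python sketch of the same idea as `cut -d ":" -f 2`. It assumes the colon-delimited layout described above (ID first, phone second) and that neither of those two fields contains a colon. The name `phone_field` and the sample line are made up for illustration.

```python
def phone_field(line, delimiter=":"):
    """Return the second delimiter-separated field, or None if it is missing.

    Assumption: the ID and phone fields never contain the delimiter, so
    stray delimiters later in the line (names, locations) do not matter.
    """
    # maxsplit=2 leaves everything after the phone untouched in one piece.
    parts = line.rstrip("\r\n").split(delimiter, 2)
    return parts[1] if len(parts) >= 2 else None


# Made-up sample line with an extra colon inside the name field.
sample = "100001:15550000000:Some:Guy: Jr"
print(phone_field(sample))
```

Because only the first two fields are read, messy names further along the line do not affect the result. Anything beyond the phone number needs real per-field validation.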


Aha, we must be looking at different data then. Possibly someone has already done much of the cleanup on the version I'm looking at.


Possible, but there are also different files with different schemas, so it's hard to even say that.

The only ones that actually define the data are the 9 or so CSV files that have a header like:

id,phone,first_name,last_name,email,birthday,gender,locale,hometown,location,link

Those are what I looked at and those are super annoying because several have commas in both the first & last name. I don't know why, but a handful of people listed their names as some, guy, some, guy which I assume should be split into firstname: some, guy and lastname: some, guy. Then a lot of people have None for a birthday, some have something like May 8, and others have something like May 8, 1990. Both locale & hometown can be either None, or have several commas in them.

I had to reformat all that data and validate that each field made sense to parse it. There are helpful "Location" and "link" markers in the CSV but it's still super annoying to parse this stuff.
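A rough Python sketch of one way to start on this, not necessarily the approach used above. It assumes `id` and `phone` never contain commas and that `link` is the last field and comma-free. Rows with the expected 11 fields are taken as-is. For rows with extra commas it keeps only the fields anchored at either end and hands back the ambiguous middle for field-by-field validation. `split_row` and the sample rows are hypothetical.

```python
HEADER = "id,phone,first_name,last_name,email,birthday,gender,locale,hometown,location,link"
FIELDS = HEADER.split(",")


def split_row(line):
    """Return (row, ambiguous_middle) for one CSV line.

    row is a dict of the fields that could be assigned safely, or None.
    ambiguous_middle is None for clean rows, otherwise the raw text that
    still needs validation.
    """
    line = line.rstrip("\r\n")
    parts = line.split(",")
    if len(parts) == len(FIELDS):
        # Assumed clean: exactly one value per header column.
        return dict(zip(FIELDS, parts)), None
    if len(parts) < len(FIELDS):
        # Too few fields to anchor anything; return the whole line for review.
        return None, line
    # Extra commas somewhere in the middle: trust only both ends.
    row = {"id": parts[0], "phone": parts[1], "link": parts[-1]}
    return row, ",".join(parts[2:-1])


# Made-up rows: one clean, one with commas in names, birthday and places.
clean = "1,15550000000,Some,Guy,None,None,male,None,None,None,https://example.invalid/1"
messy = ("2,15550000001,some, guy,some, guy,None,May 8, 1990,male,None,"
         "Town, Region,Town, Region,https://example.invalid/2")

print(split_row(clean))
print(split_row(messy))
```

The middle of a messy row still has to be resolved by checking what each candidate field looks like (for example, whether a piece looks like `May 8` or `May 8, 1990`), which is where the real effort goes.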


Also be careful, some of these docs have BOMs that screw up parsing tools (even iconv crapped out on one of the files, Qatar I think it was) and the encoding is all over the place. At least the phone number is ASCII, but the names may be UTF-8 (with or without BOM), UTF-16-le or...
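A small Python sketch of BOM sniffing before decoding, in case it helps. It only uses the BOM constants from the standard `codecs` module. The fallback to UTF-8 with replacement characters when no BOM is present is an assumption, not something the files guarantee, and `decode_blob` is a made-up name.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_blob(raw: bytes) -> str:
    """Decode the bytes of a whole file, stripping a leading BOM if present."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding, errors="replace")
    # Assumption: BOM-less input is UTF-8; undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(decode_blob(codecs.BOM_UTF16_LE + "Zoë".encode("utf-16-le")))
```

Run it on the whole file (or at least from the first byte) rather than on lines split out of raw bytes, since splitting UTF-16 data on a single newline byte breaks characters apart.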


Likewise.

That said, the actual files are in fairly shabby shape and quite tedious to clean for DB import. They may have missed thousands of records.


I haven't had the chance to check the dump but I am sure Facebook had my phone number. I'm surprised this site says my number wasn't leaked.


If the "Who can look you up using the phone number you provided?" setting wasn't set to "Everyone" in your privacy settings, then your phone number wouldn't be visible to the scraping campaign that was the source of this.


Thanks for saying this. Seems like most people have a misunderstanding of where the data came from.

To clarify: in 2019 you could enter a phone number into Facebook search and it would show you whichever profile was associated with the number if it was set to public.

The “hackers” set up a script to go through every number sequentially: 15550000000, 15550000001, etc.

I would be very surprised if the original data doesn’t contain a LOT more information: basically everything that can be found by publicly viewing your profile page.


Where can I find the dump? I've wanted to check my data as well.


I found it here https://archive.is/MZqak


Search fbleaks on Telegram.


I tried and get no results.



can confirm.


+1. Found my pal's German number in the dump but not on this website.



