generally speaking, for han chinese folks, surnames are one character and given names are one or two characters (with two being more common). so if you see a name that's 1-2 or 2-1 (e.g. liu yifei) and it's not one of the few known multi-character surnames, you can safely assume the two-character part is the given name.
for a 1-1 name like yao ming, it's a little more difficult. some characters are definitely more common as surnames than others - the chinese term for 'common people' (百姓) literally means 'hundred surnames,' and there's an old classic text (百家姓) that compiled all the surnames known at the time! so when i see the name yao ming, i immediately recognize that 'yao' is a fairly common surname and 'ming' is not, and thus it's more likely (but not guaranteed!) that 'yao' is the surname here.
there are also cases that are ambiguous when romanized, but not ambiguous in chinese. for example, consider the name 'wang chen,' where both 'wang' and 'chen' are common chinese surnames. however, if i saw it written out as 王晨, i would be able to recognize that 王 'wang' is a character that's primarily used for surnames, while 晨 'chen' is not.
chinese names can absolutely be gendered - for example, if i met a 璐, there's a 99% chance that name belongs to a woman because of its meaning (beautiful jade) and its character composition (characters with the 王/jade radical tend to read as feminine). i feel like gender-ambiguous names have become more common in recent years, but maybe that's just me not keeping up with naming trends.
my understanding is that there's a bit of a catch-22 with data removal - if you request that a data broker remove ALL of your information, it's impossible for them to keep you from reappearing in their sources later on because that would require them to retain your information (so they can filter you out if you appear again).
I’ve heard this claim, but they could use some sort of bloom filter or cryptographic hashing to block profiles that contain previously-removed records.
There could also be a shared, trusted opt-out service that accepted information and returned a boolean saying “opt-out” or “opt-in”.
Ideally, it’d return “opt-out” in the no-information case.
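Purely as a sketch of the idea above (the key format, fields, and sizes are my own assumptions, not anything brokers actually do): a Bloom filter lets you test "was this removed before?" without keeping the removed profiles themselves around.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            # derive several independent bit positions from the item
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def probably_contains(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Opt-out: record a key derived from the profile, not the profile itself.
    suppressed = BloomFilter()
    suppressed.add("123-45-6789|1990-01-02")      # e.g. SSN|DOB of the removed person

    # Later, a freshly acquired profile is checked before ingestion.
    if suppressed.probably_contains("123-45-6789|1990-01-02"):
        pass  # drop the record instead of re-adding it

The usual caveat: Bloom filters give occasional false positives but never false negatives, so an unrelated record might get dropped, but a previously removed one is never silently re-admitted under the same key.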
Hash-based solutions aren't as easy as we might hope.
You store a hashed version of my SSN, or my phone number, to represent my opt-out? Someone can just hash every number from 000-00-0000 to 999-99-9999 and figure out mine from that.
You hash the entire contents of the profile - name+address+phone+e-mail+DOB+SSN - and the moment a data source provides them with a profile containing only name+address+email, the missing fields mean the hashes won't match.
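To make that concrete, here's a toy version (field names and values invented) of why a whole-profile hash stops matching as soon as a field is missing:

    import hashlib

    def profile_hash(profile: dict) -> str:
        # hash the concatenation of whatever fields happen to be present
        blob = "|".join(f"{k}={profile[k]}" for k in sorted(profile))
        return hashlib.sha256(blob.encode()).hexdigest()

    removed = profile_hash({
        "name": "Jane Doe", "address": "1 Main St", "phone": "555-0100",
        "email": "jane@example.com", "dob": "1990-01-02", "ssn": "123-45-6789",
    })

    # A later data source supplies the same person with fewer fields...
    rescraped = profile_hash({
        "name": "Jane Doe", "address": "1 Main St", "email": "jane@example.com",
    })

    print(removed == rescraped)   # False: the suppression hash never matches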
A trusted third party will work a lot better IMHO.
And of course none of the data brokers have much reason to make opt-outs work well, in the absence of legislation and strict enforcement - it's in their commercial interests to say they "can't stop your data reappearing"
> Someone can just hash every number from 000-00-0000 to 999-99-9999 and figure out mine from that.
That's what salts are for, right? It wouldn't be too hard to issue a very large, known, public salt alongside each SSN.
> And of course none of the data brokers have much reason to make opt-outs work well, in the absence of legislation and strict enforcement - it's in their commercial interests to say they "can't stop your data reappearing"
If the salt is public, what’s the point? You can just get all the salts, combine them with every possible SSN, and you’re back where you were before.
No, that's kind of the point of a salt: it doesn't need to be hidden. It's designed for a scenario where e.g. your database is hacked and the salts are visible as plaintext: https://en.wikipedia.org/wiki/Salt_(cryptography)
Since the salts are random, unique to each SSN, and long: a) you'll find no existing rainbow table that contains the correct plaintext for your SSN hash, and b) each SSN now requires its own brute-forcing, which is unhelpful for any of the other SSNs.
Combine that with a very expensive hashing method like PBKDF2 (I'm sure there's something better by now) and you've made it pretty dang hard for non-state actors to brute-force a significant chunk of SSNs. There are also peppers, which involve storing some more global secrets on HSMs.
I'm sure the crypto nerds have like a dozen better methods than what I can come up with but the point is this is not a feasibility issue.
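For what it's worth, the per-record salt plus slow hash idea sketched in code (the iteration count and field choice are arbitrary assumptions, not a recommendation):

    import hashlib, hmac, os

    def opt_out_record(ssn: str) -> tuple[bytes, bytes]:
        salt = os.urandom(16)     # random, unique per record; fine to store in the clear
        digest = hashlib.pbkdf2_hmac("sha256", ssn.encode(), salt, 600_000)
        return salt, digest       # store both instead of the raw SSN

    def matches(candidate_ssn: str, salt: bytes, digest: bytes) -> bool:
        attempt = hashlib.pbkdf2_hmac("sha256", candidate_ssn.encode(), salt, 600_000)
        return hmac.compare_digest(attempt, digest)   # constant-time compare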
I’m sorry but it’s not that simple. You can’t just say add salt, here are the benefits of salt, problem solved.
In a password database, salt is not secret because the password combined with it is secret and can be anything. Even if you know the salt for a particular user, in order to crack that user, you need to start hashing all possible passwords combined with that salt. If a user picks a dumb password like password123, then they are not safe if the salt leaks. Other users with password=password123 will not be immediately apparent because other users have different salts. You would have to try password123 combined with each user’s salt to identify all the users with password123.
You said “It wouldn't be too hard to issue a very large, known, public salt alongside each SSN.” That means there would be some theoretical service where you pass it an SSN and get back the salt, right? So what have you gained? Any attacker with an SSN can get the salt, and nothing was gained. Or if attackers don’t have SSNs, they can just ask for all the salts: the mapping from SSN to salt is public, so they know 000-00-0000 has salt1, 000-00-0001 has salt2, etc. You haven’t increased the number of hashes attackers have to do to do whatever it is they want to do.
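In code, the objection looks something like this (lookup_salt is hypothetical and stands in for whatever public SSN-to-salt service was proposed):

    import hashlib

    def hash_ssn(ssn: str, salt: bytes) -> bytes:
        return hashlib.sha256(salt + ssn.encode()).digest()

    def crack(target_hash: bytes, lookup_salt) -> str | None:
        # The SSN space is only 10^9 values and every salt is public, so the
        # attacker does at most a billion hashes at the same per-hash cost the
        # broker itself has to be able to afford for legitimate lookups.
        for n in range(1_000_000_000):
            digits = f"{n:09d}"
            ssn = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
            if hash_ssn(ssn, lookup_salt(ssn)) == target_hash:
                return ssn
        return None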
You’re right about commercial interests being at play. That’s why we don’t have laws like GDPR in the USA. Crypto nerds have thought about this long and hard, and if it were that easy we wouldn’t need stupidly complex laws like GDPR. They would “just add salt.” Or other services would “just add salt” instead of relying on more complex and expensive forms of identity verification and protection.
You don’t need to be a crypto nerd to try to describe a flow where having a public, known salt per SSN helps with privacy. You don’t need to be a crypto nerd to plug existing, secure one-way hash functions into that flow.
Yep, you are right, complete brain fart on my end. Of course it doesn't work if it's required for the salt to be publicly mappable to the SSN, since that just circumvents the whole thing. I just didn't understand what you were saying in your earlier message.
"all the salts" * "all the SSNs" becomes a very big number. With a large enough but still reasonably sized salt, you can engineer it so that hashing all combinations takes an amount of time greater than the age of the universe even if you use all the computers in the world.
All the salts * all the SSNs is a very large set, but it’s irrelevant, because in the above scenario each SSN has a public, well-known salt: you don’t have to test each salt against each possible SSN, because the mapping from one to the other is known.
Even if such a service doesn’t exist, and you just have a list of all the salts without knowing which SSN they map to, you’re just hand-waving about how hard it will be to hash the entire salt*SSN set.
Hashing a salt+SSN can’t take too long, because data brokers need to be doing it frequently in order to verify identities.
In this report, https://files.consumerfinance.gov/f/documents/cfpb_consumer-..., it says monthly volume of credit card marketing mail is in the hundreds of millions per month. Can we assume that each piece of mail is roughly associated with one instance of hashing a salt+ssn? Given that number, how expensive (in terms of time, compute cycles, whatever) can it possibly be to hash a salt+ssn? If we make it too expensive, expensive enough to support your “age of the universe” claims, credit markets would grind to a halt.
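Rough arithmetic behind that, taking 300 million checks a month as a stand-in for "hundreds of millions" (my assumption, not a figure from the report):

    monthly_checks = 300_000_000               # assumed legitimate hashing volume
    seconds_per_month = 30 * 24 * 3600         # ~2.6 million seconds
    print(monthly_checks / seconds_per_month)  # ~116 hashes/sec just to keep up with mail volume

    ssn_space = 1_000_000_000                  # every possible SSN
    print(ssn_space / monthly_checks)          # ~3.3 months of the brokers' own legitimate
                                               # workload to brute-force the whole space

So any per-hash cost the brokers can absorb at that volume is also a cost an attacker can absorb for a full sweep of the SSN space.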
I’m quite familiar with how a salt works. One might say deeply familiar since I have worked on auth services for very large, very secure organizations.
Poster above me just said “add salt” and waved their hands without describing anything concrete, like just saying some magic words can solve hard problems.
So for a perfect match, they'd need some sort of unique identifier that's present in the first set of data you ask them to remove, as well as in any subsequent "acquisitions" or "scrapes" of your data.
If the devs that scrape/dump/collate all this info are anything like the ones I've seen, and they're operating in countries like the US and UK where you don't have reliably unique individual identifiers, then I'd say the chance of them getting such a "unique" key on you to remove you perpetually is next to zero. And if it's even close to being "hard", they'll not even bother. Doubly so if this service/people/data is anything like the credit-score companies, which are notoriously bad at data deduplication and sanitation.
Likewise, if you want them to do some sort of removal using things other than a unique identifier, then you have to have some sort of function that determines closeness between two records. From what I've heard, places like Interpol and countries' border-control and police agencies usually use name, surname, and DOB as a combination to match. Amazingly unique and unchanging combination, that one! /s
Sorry, I value my legal rights over the viability of the data broker industry. If they can’t figure out a way to lawfully not collect my data, they should not collect data, period.
Which would never work because real life data is messy so the hashes would not match. Even something as simple as SSN + DOB runs into loads of potential formatting and data entry issues you'll have to perfectly solve before such a system could work, and even that makes assumptions as to what data will be available from each dataset. Some may be only name and address. Some may include DoB, but the person might have lied about their DoB when filling out the form. The people entering it might have misspelled their name. It might be a person who put in a fake SSN because they're an illegal immigrant without a real one. Data correlation in the real world is a nightmare.
When you tell a data broker to delete all of the data about you, how can you be sure they get ALL of the data about you, including the records where your name is misspelled or the DoB is wrong or it lists an old address or something? Even worse if someone comes around later, discovers the orphan data when adding new data about you, and fixes the glitch, effectively undoing the deletion.
It's a catch-22: if you want them to not collect data about you, they need a full profile on you in order to be able to reject new data. A profile that they will need to keep up to date, which is what they were doing already.
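A tiny illustration of that formatting problem (values invented): the "same" SSN and DOB, written three common ways, give three different suppression keys if you hash the raw strings.

    import hashlib

    def naive_key(ssn: str, dob: str) -> str:
        return hashlib.sha256(f"{ssn}|{dob}".encode()).hexdigest()

    print(naive_key("123-45-6789", "1990-01-02"))
    print(naive_key("123456789",  "01/02/1990"))
    print(naive_key("123-45-6789", "Jan 2, 1990"))
    # three different digests for one person, so a suppression list keyed on
    # the raw strings silently misses two of them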
> Even something as simple as SSN + DOB runs into loads of potential formatting and data entry issues you'll have to perfectly solve
You don’t have to solve it perfectly to be an improvement.
Also this is BS. Not every bit of data is perfectly formatted and structured but both of your examples are structured data. You can 100% reliably and deterministically hash this data.
There’s so much in your argument that can be answered with “imperfect is better than the status quo”. If you give someone the wrong DOB, it’s “not you” anyways; at least let me scrub my real data even if the entry is imperfect for some people or some records.
> You don’t have to solve it perfectly to be an improvement.
They don't want to solve your problem. You aren't their customer. They want to comply with the letter of the request in as much as it covers their own butt in terms of regulatory requirements and/or political optics.
The “solution” mentioned is political. A requirement that data on an individual is properly deleted when presented with the data would be “good”. A requirement that captures every nuance of mistakes would be “perfect”.
Hashing a birthday and SSN is deterministic. We could deterministically keep that data deleted. This would be better than we have today, and could be done reliably and affordably.
The companies can easily be required (by law) to implement the “good” solution. Everyone complaining it’s not “perfect” is stopping “good”.
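A minimal sketch of that "good, not perfect" version: normalize the SSN and DOB before hashing so the common formatting variants collapse to one deterministic key (the field choice and accepted date formats are my assumptions):

    import hashlib, re
    from datetime import datetime

    def normalize_ssn(ssn: str) -> str:
        return re.sub(r"\D", "", ssn)          # digits only: "123-45-6789" -> "123456789"

    def normalize_dob(dob: str) -> str:
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y"):
            try:
                return datetime.strptime(dob, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {dob}")

    def suppression_key(ssn: str, dob: str) -> str:
        return hashlib.sha256(f"{normalize_ssn(ssn)}|{normalize_dob(dob)}".encode()).hexdigest()

    # all of these now collapse to the same key
    assert suppression_key("123-45-6789", "1990-01-02") == \
           suppression_key("123456789",  "01/02/1990") == \
           suppression_key("123-45-6789", "Jan 2, 1990")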
There's a trivial way to not re-add data that was removed: don't do it without opt-in from the user, whom you admittedly only have access to ask at the moment of data collection. If you don't have the ability to ask users to opt in, you probably shouldn't be collecting the data anyway, with very few exceptions like criminal records.
edit for clarity: by criminal records, I mean for the official management of them, not for scraping their content.
my impression is that you can't replace competent programmers with chatGPT, but that hasn't stopped companies from trying (or simply accepting incompetency).
i'm bilingual and have absolutely not experienced this. that being said, i think being bilingual - rather than a learner - does make it a lot easier to code-switch instinctively between different languages.
from my experience doing animanga-adjacent translations - readers also prioritize speed first, quality second. there are a lot of people who will happily read machine-translated work (and often awkward, typo-ridden MTL at that) rather than wait a day or two for better translations. same goes for general scanlation quality, like typesetting and redraws.
this is also why the fansubbing scene is effectively dead - companies like crunchyroll get episode scripts early and can thus release subs simultaneously with the official release. most fansub groups now just fix/edit the crunchyroll script, if they even pick up series at all. there's no point in putting in the effort if no one's going to look at it, after all.
that's the main issue i have with 'more but worse' translations, honestly. you'll get more material, but the good translators won't just move to content that wasn't translated before - they'll just disappear entirely.
'translationese' is a pretty common term for it. when you translate, it's really easy to mirror the source structure/syntax even when there's more idiomatic ways to say it in the target language.
Exactly. One simple example that I see all the time comes to mind:
In English, "dozens of ____s" is a very common expression, particularly in news articles. In my local language, even though we do have a word for "dozen", it's much more common to say that in the form of "tens of ____s". Most of the "dozens of ____s" I see written in my language are from news articles that were (badly) translated from English.
English uses "dozens" in more situations than Dutch uses "tens", also because "tens" in Dutch is a three-syllable word. It's often just not idiomatic.
I often find myself having started a sentence in Dutch that I can't finish without borrowing something from English, and I remember a recent example actually involved the word "dozens" (although I forgot what the sentence was about so I can't reproduce it here). That sentence should have been constructed entirely differently, but I now use English in my day-to-day communications at work, at home, and also most online ones so some stuff slips through.
It blew my mind some months ago when I used an English saying, perfectly translated (no loan words, good sentence structure), but entirely nonexistent in Dutch. I can't ever have heard anyone say it, but it came out without any thought. The person I was talking to is also proficient in English and understood what I meant, but whether or not their face gave something away, it took me five seconds to realize what I had said. I guess the brain stores words in a form of meaning that transcends language, and just calls upon the language neural net to convert that into muscle movements for speech? Actually, no, then you'd have gotten the word-for-word translation; it must be storing more than single words in some sort of language lookup center, or maybe something that converts between the two structures if you do enough translating between a given language pair? Either way, mind-boggling stuff.
i believe one of the issues with controlled burns is that there are specific conditions necessary for them to be controllable - and with the current state of the california fire season, it's actually really difficult to set them up.
it's pretty likely that storebought stock is made with 'unattractive' produce that wouldn't otherwise sell, like how tomato sauce is made with ugly or bruised tomatoes.
you can't really have a fully 'nonsense' name - there's a list of around 3000 characters you're allowed to use in names - but in theory you can put them together however you want. you'll just get side-eyed for having a weird name, that's all.
that being said, i think most of these websites that ask for a kanji name will require you to show ID with that name when you show up in person, so you might run into trouble if you just pick random characters.