Known anomalies in Unicode character names (2017) (unicode.org)
99 points by fanf2 on May 20, 2020 | 83 comments



I am not sure if native speakers are involved in Unicode standardization at all.

Personal example.

In Georgian (a few million native speakers) we have historically had three different independent alphabets: mtavruli, nuskhuri, and mkhedruli. All represent the same letters, just different ways of writing them. The alphabets are represented in Unicode under different code points. Only mkhedruli has been in use for the last few centuries, and therefore most fonts have glyphs for mkhedruli only.

Also, there is no concept of capitalization in Georgian: there is no "a" and "A", we have only "ა". One does not start a sentence, a word, or a personal name with some special variant of a letter. Anna is ანნა in Georgian; the first and the last letters are written the same way.

Now, for some reason, the letters of the mtavruli alphabet are registered as uppercase versions of the mkhedruli letters. Mind blown.

And since well-written software uses the rules defined by the Unicode standard, via the ICU library or platform-specific routines, many desktop and mobile applications, as well as major websites, are hard to use. First letters of sentences get capitalized, but since mtavruli glyphs are missing from fonts, they are rendered as ⍰.

So instead of

მაგათი დედაც ვატირე.

we see

⍰აგათი დედაც ვატირე.

and have to guess what the first letter was. But who cares if it is not used in the U.S., right?


> I am not sure if native speakers are involved in Unicode standardization at all.

Definitely. The Georgian alphabets, for example, were among the original scripts included in Unicode 1.0, and the reference fonts were designed by the Georgian type designers Anton Dumbadze and Irakli Garibashvili. If something had been very off, they would have noticed.


Doing a bit of research, this mkhedruli/mtavruli case relationship was only added in Unicode 11, which means you can go and read the documents used to justify it.

And it seems that the Georgians themselves were the ones who pushed for this change: https://unicode.org/L2/L2016/16034-n4707-georgian.pdf. It seems this isn't a case of "ignorant Westerners screwing up Unicode" but "ignorant programmers who don't understand how Unicode works screwing it up," as noted in the sibling comment.


Or "ignorant programmers who don't understand how text works or can't bother to support it properly getting to decide character encoding standards".

We have English-speaking programmers swearing by ASCII despite it not having all the characters needed for English (I can't find it now, but there was a report by the Library of Congress that came to that conclusion). I suspect their Georgian counterparts have similar attitudes towards text.


> not having all the characters needed for English

I've heard this argument a few times, and each time it's just as contrived, particularly if you stick to American English. While I'm happy to acknowledge it's not good enough for other languages, ASCII is plenty good for English. Unless you're trying to type cafe with an accented e to be fancy or type that old conjoined-ae character, ASCII is fine for everyday usage. See the fact that most Americans have never installed a separate keyboard layout.


> While I'm happy to acknowledge it's not good enough for other languages, ASCII is plenty good for English.

It’s good enough for the style of American English adopted as a concession to typewriter limitations, since that’s pretty much exactly what ASCII is designed to support. OTOH, American English as used outside of that context doesn’t use straight quotes, but rounded ones in each direction with different meaning, uses various diacritical marks, and also uses the cent sign (¢), the obelisk (†), diesis (‡), pilcrow (¶), section sign (§), division sign (÷), multiplication sign (×), less-(and greater-)than-or-equals sign (≤ and ≥), angle brackets (⟨ and ⟩, which aren’t the same as less-than/greater-than symbols), the interrobang (‽), and—the next three distinct from the hyphen—the minus sign, en-dash, and em-dash.

> See the fact that most Americans have never installed a separate keyboard layout.

Common software used for text, like MS Word, provides convenient ways to enter many non-ASCII characters without a keyboard layout change.


Some operating systems extend this to work across all apps.


Since macOS conveniently allows typing en and em dashes (option+hyphen and shift+option+hyphen, respectively), I use both frequently in the appropriate situations instead of just putting hyphens everywhere.


I don't see at all how not being able to type certain non-alphabetic symbols means it isn't good for English. Half of what you quoted are general math symbols that have nothing to do with English.


> I don't see at all how not being able to type certain non-alphabetic symbols means it isn't good for English.

It means it isn't good for English, because the English language uses punctuation and other non-alphabetic symbols, not just letters of the alphabet, as an examination of any substantial corpus of written material in the language would instantly reveal (at least one not solely consisting of material where the writer was constrained by ASCII, though even there the set of non-alphabetic symbols used would merely be narrower).

> Half of what you quoted are general math symbols that have nothing to do with English.

5 out of 15 (7 if you mistakenly include angle brackets, which do have textual, non-mathematical uses) is not half, and mathematical symbols are used in written English anyway.


This is like saying that a Python interpreter isn't good for interpreting Python because Python programs might frequently be invoked from a bash script, and so Python is incomplete without bash.

If the weird squiggle I use in my specific niche field isn't represented by any Unicode code point, does that mean that Unicode isn't good for representing English?


I want you to explain to your grandmother why the software you wrote can handle dollar signs but not cent signs.


FIELDATA didn't support the cent sign either, which my grandmother is much more familiar with, so I think it would not be hard for her to grasp this concept.


Funny you mention American English given the fact that a great many Americans have various diacritic marks in their names. You can make the claim that those names are "foreign" but the most common ancestry group in the US is German.

Regardless of the technical arguments, from the perspective of non-programmers, when they see things like the CA DMV being incapable of doing something as simple as recording a diacritic in a person's name, it doesn't reflect well on the software industry and programmers as a profession.


> You can make the claim that those names are "foreign" but the most common ancestry group in the US is German.

While true, I think the overwhelming majority of German-Americans anglicized their names several generations ago. For instance, my great-great grandfather switched from Müller to Miller. This is probably less true of more recent arrivals.


I didn’t know ‘cafe’ was even a thing and wondered what’s fancy about café, so I had to Google a bit. I learned that, yes, Americans do spell ‘cafe’, but the most common spelling still is café. Even https://simple.wikipedia.org/wiki/Café redirects cafe to café.

https://www.merriam-webster.com/dictionary/cafe: “variants: or less commonly cafe”

https://dictionary.cambridge.org/dictionary/essential-americ...: “noun (also cafe)”

https://www.dictionary.com/browse/cafe also prefers to see the accent

That spelling preference may be shifting. I couldn’t find evidence for it, though.


> the most common spelling still is café

Really? The data say otherwise: https://books.google.com/ngrams/graph?content=caf%C3%A9%2Cca...

Do you have another source that supports the popularity of the diacritic variant? In fact, café doesn't increase in popularity until after 2000, and thus would appear to be an affectation of sophistication.


I'd hate to work for the New Yorker without ö on the keyboard, but for 99.99% of American usage ASCII is plenty good enough.


ASCII doesn't even have proper quotation marks or dashes. There's a reason software for text production has long made it easy to use characters outside of the ASCII range.


You're talking past the OP. Most people don't care about proper quotation marks or dashes.


Most people will care about something not in ASCII.

Windows-1252 encoding was supported on American computers long before Unicode. I remember “smart quotes” being promoted as a feature of word processors.

Even for people who didn't care for the quotes:

• everyone appreciates ¢, ° for °F or 46° N, ¼, ½, ¾, × and ÷,

• people handling large texts want §, †, ‡ and maybe ¶,

• businesses want ©, ® and ™,

• scientists and engineers need µ, ±, ², ³, ·.

Web browsers have made it more difficult for the average user to enter these characters, but software like Microsoft Word has supported easy ways to insert many of them automatically.


The 'long s' (ſ) was still in fairly common use in English less than 200 years ago. Consequently there are a lot of historically relevant documents written in English that can't be faithfully reproduced using ASCII. This may not be relevant to most people, but it seems plausibly relevant to libraries and archivists (for instance, the Library of Congress).

Example: https://upload.wikimedia.org/wikipedia/commons/7/79/Bill_of_...


And that's exactly why Unicode was invented.


It's naïve to assume that American English does not use diacriticals.


I have never seen the name of the city Kyïv (Kyiv, Kiev) spelled correctly in English texts.


Kyiv is the official form according to the Ukrainian government/mapping agency (according to Wikipedia).

The most common English spelling is Kiev, which reflects the English pronunciation (the British one, anyway).

But many significant European cities with a long history have a different name in English than the local languages: Cologne, Munich, Florence, Naples, Warsaw, Copenhagen, Gothenburg, Prague.


This isn't just Anglophones being boorish and Anglocentric, either. Those cities also have different names in German or, more obviously, Chinese.


Indeed, and naturally British and Irish cities have different names in other (European) languages.

In French: Douvres, Londres, Édimbourg.

In German: Edinburg.

In Italian: Edimburgo, Glascovia, Londra, Novocastro, Dublino.

In Chinese: Jiànqiáo (剑桥) (lit. "sword bridge", Cambridge)


Have you ever wondered why Russians are indifferent to the fact that Moskva is spelled Moscow in English?


Because we don’t really care.


Pretending ASCII is enough for English is at least understandable. I've seen engineers demanding that we find a localized string for "middle name" in Korean. For a social network product. (Hint: There are no middle names in Korean.)

Some people just don't care.


Not a great example if it's not a platform only for Koreans. If you have a field for middle name, there will probably be someone with their language set to Korean who is still interested in reading or writing that field (e.g., expats learning the language, or people with mixed heritage). I think assuming that because Koreans have no middle names you don't need to translate the field is more careless than the other way around.


The uppercase mapping of Mkhedruli is Mtavruli in Unicode, but the titlecase mapping is not. If you use the uppercase mapping to automatically capitalise the first letter of a sentence, you are not following the recommendations of the Unicode standard.


TIL about ToTitleCase. An easy-to-visualize example of the difference is

    ToUpperCase("dz") => "DZ"
    ToTitleCase("dz") => "Dz"


But how many programmers are even aware of the fact Unicode has both an Uppercase and Titlecase mapping? Let alone that they can do different things!

I routinely see code that treats titlecase as “set each individual character in this Unicode string to its uppercase equivalent”... It wouldn’t surprise me at all to find out that this kind of titlecase-vs-uppercase mistake is a very widespread issue.


I for one had never heard of "titlecase" before. I guess that's what I get for learning text encoding from APIs exposed in programming languages.


The thing is, one cannot type mtavruli letters. No modern keyboard layout that I am aware of allows it.

If you press "s" key it types "ს" (S), but if you press "shift+s" it types "შ" (SH), which is entirely different letter. Georgian alphabet has 33 letters, 7 more letters than Latin, so letters ჭღშჟძჩთ are mapped to shift+something keystrokes.

The Russian alphabet has 33 letters too, but Russian layouts override seven punctuation keys (,.;'[]~): pressing "," will not type a comma but the letter "б", and pressing "]" will type "ъ", so Russian layouts remap the punctuation symbols to shift+number keystrokes. Georgian layouts keep the punctuation key assignments the same as for English and use shift+letter for the additional letters.

Modern fonts may have mtavruli-styled glyphs, that is true, but those are placed on the mkhedruli code points anyway, which is exactly the problem: no widely used fonts support this case switch, and they render ⍰.


In Arch-based Linux distros you can use Caps Lock. When Caps Lock is active, each letter comes out capitalized (including shift+ combinations).

Mtavruli is already supported by the following fonts: Google Noto (Android 11 DP, Arch Linux), Segoe UI and Calibri (Windows 2019 update), and Helvetica Neue (macOS Catalina, iOS 13).


Interesting, I didn't know that there was a distinction between uppercase and titlecase. Perhaps unsurprisingly, Unicode has an FAQ about just that: https://unicode.org/faq/casemap_charprop.html


> who cares if it is not used in the U.S., right?

This isn't intended to sound rude, but why should an American care? Dismissing an argument that, for better or worse, many people hold doesn't do any good. Making it might convince them otherwise.


Please don't take HN threads further into nationalistic flamewar.


That strikes me as much closer to linguistic than nationalistic; there's more than one nation that speaks English. Could you clarify what about that statement was flame-y? All I did was ask that adontz make the argument rather than making a dismissive statement.


You took a sarcastic question that was being asked from a defensive place and reflected it back to the person in an amplified way that was guaranteed to rub salt in their wound.


The use of the obscure word "solidus" instead of "slash" for the character "/" is absurd. "Slash" should at least be a formal alias, if it isn't already. This text seems to imply it's not a formal alias, but maybe it is.

After all, the POSIX standard includes "slash" as a name for the character, and that is an ISO standard: https://pubs.opengroup.org/onlinepubs/9699919799/ What's more, Unicode refers to "\" as "backslash".

Very weird.
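
For what it's worth, the standard names are easy to check against the character database; a quick sketch with Python's unicodedata module:

    import unicodedata

    print(unicodedata.name('/'))          # SOLIDUS
    print(unicodedata.name('\\'))         # REVERSE SOLIDUS
    print(unicodedata.lookup('SOLIDUS'))  # /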


In over 40 years of computing I've never heard the backslash called anything else… until today when you made me look into it.

Apparently Unicode has renamed it to "REVERSE SOLIDUS" while still tolerating "BACKSLASH" as a name for it.

I don't expect anyone to ever use the words "REVERSE SOLIDUS" in my presence.

Interesting, it entered computer use in the era of Algol so they could write their /\ and \/ operators.


"Slash" and "virgule" are both formal aliases for U+002F SOLIDUS.


> What's more, Unicode refers to "\" as "backslash".

I'd be happy if nobody ever used the word "backslash."

I heard it read in a URL in a TV commercial this week. I can't believe it. People have been getting that wrong since AOL days.


It's a standard character, it needs a name, the standard name for "\" is backslash. I'm okay with that. The normal and widespread name for "/" is "slash", and I'm fine with that too.

Yes, there are people who get things wrong on the internet. Just gently correct them and move on. I don't think trying to change to other terms like solidus and reverse solidus is going to make it any better.

And I like :// being in URLs. It makes them incredibly easy to disambiguate from anything else.


I've been calling it a backslash since the '70's. Get off my lawn, kid.


In my neck of the woods it was slash and slosh.


What would you prefer then?


They simply don't belong in web addresses.


Antivirgule, of course.


People say backslash in web addresses when they mean a plain old slash.


I reread your post and now I understand what you were trying to say.


Considering how huge the Unicode standard is, it's more surprising that there are so few known errors. CJK characters alone account for thousands upon thousands, and these can sometimes vary by just a single stroke. I suspect the majority of redundancies or suboptimal choices were the result of subsuming so many existing standards, though. There might also be plain research errors, but those are probably in the more obscure blocks.

See also, ghost kanji: https://www.japantimes.co.jp/life/2018/10/29/language/ghost-...


> CJK characters alone account for thousands upon thousands

The vast majority of CJK characters have systematic names like "CJK UNIFIED IDEOGRAPH-72AC" which aren't really subject to errors in the same way as the more verbose names used for other scripts.
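
For example (a quick check with Python's unicodedata; the name is generated from the code point itself, so there is no prose name to get wrong):

    import unicodedata

    # U+72AC is 犬 ("dog"); its name just embeds the code point:
    print(unicodedata.name('\u72AC'))  # CJK UNIFIED IDEOGRAPH-72AC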


I have to wonder what the systematic names refer to. Is it the ordering of some Chinese/Japanese dictionary in which they were assigned, stroke count (but then, in what order are characters with the same stroke count arranged?), or some other or arbitrary ordering? That is, why is the 72AC character at 72AC and not at 72AD?


That's a good question. Unfortunately, a lot of old Unicode process documentation isn't available online -- you can see a couple of relevant-sounding documents at [1], but very little of it is available until 1999 or so.

There are some pretty clear patterns to the character ordering, though -- if you look closely, you can see big runs of characters which share radicals. For example, characters 5000 through 500F are:

倀 倁 倂 倃 倄 倅 倆 倇 倈 倉 倊 個 倌 倍 倎 倏

all of which have the 亻 radical on the left side. It's clearly not arbitrary.

[1]: https://www.unicode.org/L2/L1990/Register-1990.html
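
That run is easy to reproduce directly from the code points, for what it's worth (a quick Python snippet):

    # The sixteen characters from U+5000 through U+500F:
    print(' '.join(chr(cp) for cp in range(0x5000, 0x5010)))
    # 倀 倁 倂 倃 倄 倅 倆 倇 倈 倉 倊 個 倌 倍 倎 倏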


> There are some pretty clear patterns to the character ordering, though -- if you look closely, you can see big runs of characters which share radicals.

That's because those runs were allocated at once (e.g. CJK Unified Ideographs Extension G spanning from U+30000 to U+3134A) and they were systematically ordered by radicals and stroke counts. Note that radicals and stroke counts are fairly arbitrary and can differ among character sources (the Unihan database has a ton of them). While fairly predictable, this ordering is ultimately arbitrary.


> all of which have the 亻radical on the left side.

Except for 倉 it seems.


Supposedly the two strokes at the top that look like a roof are somehow the same character as "亻". You could argue that it's no stupider than claiming that an upside-down vee with a crossbar is the same character as a dee with the ascender cropped off, but it also isn't any less stupid, so... <shrug>.


It looks like it's merged under the roof in that character.


No, a1369209993 is correct: 亻 is the combining form of 人 (the "单人旁", i.e. the single-person side radical); 倉 uses the more basic 人 form.


Unicode character names primarily exist for identifying characters. That's why we have name stability after all: unless a name is very misleading (like the swapped Lao letters), we would like to stick with it as long as possible.

For many scripts we can come up with a set of names suitable for identification. Han characters, among others, don't work that way. One can say that Han "characters" are identified by their shape, but the line between two differently perceived characters is extremely unclear: some may recognize two characters as the same, some may not, and some would even add a stroke or so to differentiate them. Han characters are thus identified by providing multiple properties for them (the Unihan database), and the character names are just placeholders.


> There might also be plain research errors, but those are probably in the more obscure blocks.

The one with the Lao letters "LO LING" and "LO LOOT" is actually a very blunt error: their names are swapped. It's pretty safe to assume that whoever was making the standard did not know this writing system.
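
The names as they stand can be checked with Python's unicodedata (a quick sketch; per the linked note, it is these two names that are interchanged relative to the letters they should describe):

    import unicodedata

    # Frozen by the name-stability policy even though they are swapped:
    print(unicodedata.name('\u0EA3'))  # LAO LETTER LO LING
    print(unicodedata.name('\u0EA5'))  # LAO LETTER LO LOOT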


It really isn't: it's safer to assume they did know it, because full language mappings are not the job of a single person and don't get accepted willy-nilly (unlike emoji). However, clerical errors are SUPER EASY, so it's safe to assume someone accidentally pasted something into the wrong database-presented-as-a-spreadsheet close enough to the time of official revision publication, and when that wasn't caught, it became official.


Having done a Japanese localization by communicating entirely in spreadsheets, I'd say this is extremely likely. In fact, we had to fix the translations a couple of times because we'd send them off for feedback and get back a different translation for the same English.


Does anyone know the background behind:

> U+262B FARSI SYMBOL

> This symbol is so named because as symbol of Iran it cannot be encoded in ISO standards.

It would seem strange that ISO standards cannot refer to countries, so I'm guessing this is something specific to Iran?


A quick search brought me to [1] which says:

<quote>

As noted by Roozbeh Pournader:

Neither Farsi, nor a symbol. In real life, it is the official emblem of the government of the Islamic Republic of Iran.

Technically that would make it a logo and thus not a suitable candidate for encoding. But Roozbeh also noted:

Exactly. The funny fact is that it has been in Unicode since 1.0...

</quote>

How reliable this is I don't know, but it sounds plausible.

[1] http://archives.miloush.net/michkap/archive/2005/01/29/36320...


Interesting; I'd guessed it was a compatibility choice, but there doesn't seem to be any evidence for that.

Aside: Microsoft sucks for having purged blogs like that one. The archive you're looking at captures most of what was once Michael Kaplan's blog at Microsoft. Kaplan was terminated by Microsoft and then died, and his blog was one of a large number of valuable blogs with insights into Windows technologies that at some point were "tidied away" because they didn't fit whatever nonsense brand vision somebody had that week.


I really hate the way Microsoft rearranges their web pages seemingly on a whim. You can't rely on a link to them being useful for any significant period of time, and as you note useful information is lost on a frequent basis.


Where applicable, look for URLs on aka.ms. This looks superficially like a link shortener, but (whether by corporate policy or just because people inside Microsoft agree with us) the links are maintained, so that when the pages are all shuffled around, the aka.ms links still get where you were going.

https://aka.ms/RootCert will always be Microsoft's Root Trust programme documentation (how a Certificate Authority like Let's Encrypt gets themselves listed as trusted in Windows) even when Microsoft decides that page should now be in the form of a GIF anim or a 3D bullet hell game.


It’s not just Microsoft. Apple regularly deprecates, moves around, or deletes old documentation, too.


ISO standards refer to countries all the time -- ISO 3166 is literally all about countries, for instance.

I suspect this is actually shorthand for "ISO wanted to avoid referring to countries in Unicode character names". But that ship has clearly sailed nowadays with characters like U+1F5FE ("SILHOUETTE OF JAPAN").


> SILHOUETTE OF JAPAN

This may sound pedantic, but Japan the island or Japan the nation? It seems to refer to the island(s). On the other hand, the "Farsi symbol" is the symbol of a particular nation rather than of a given set of boundaries. Nations can be renamed, but the island of Japan refers to a chunk of earth. Add to that that nations in the Middle East get redrawn every so often, and I can understand the position.


Japan is not an island. Having a distinct symbol for the 4 large islands comprising most of Japan is definitely a political statement, if not an especially controversial one.


Known anomalies in “Known anomalies in Unicode character names”: they refer to “háček” as “hacek”.

I wouldn’t normally pick up on something like this, but isn’t the whole point of Unicode to have a way to represent non-English characters?


> The "caron" should have been called hacek and combining hacek. The term "caron" is suspected by some to be an invention of some early standards body

That's funny. The háček is a Czech invention (it was actually a dot originally), but I always assumed that "caron" was the official "correct" (English) name.


I don't understand why some of these errors were fixed via formal aliases but not others. Why not give them all proper aliases?


I think that for some of them the correct alias was already in use by another character, as with the digraphs.


(2017) should be appended, since both the header and footer state that the document is from 2017.

Also, from the parent page for that link:

“Unicode Technical Notes provide information on a variety of topics related to Unicode and internationalization technologies.

These technical notes are independent publications, not approved by any of the Unicode Technical Committees, nor are they part of the Unicode Standard or any other Unicode specification. Publication does not imply endorsement by the Unicode Consortium in any way. These documents are not subject to the Unicode Patent Policy.”

SOURCE: https://unicode.org/notes/


Added, thanks.



