In the Turkish locale, "INFO".lower() != "info" (github.com/python)
188 points by duckerude on Aug 16, 2020 | 282 comments



Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

It seems to me that in practice, it's extremely rare to want to change case of real, natural-language text. When I have natural-language text, it's just a blob to me, and I don't want to touch it.

The only time I ever want to lower-case or capitalize something, I'm working with identifiers meant for computer -- not human -- consumption. Usually, specifically, I'm dealing with identifiers that have annoyingly been defined to be case-insensitive even though the only humans that ever see them are programmers and programmers hate case-insensitivity. HTTP headers are a common example.

I mostly write C++, and I end up writing code like:

    for (char& c: str) {
      if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
    }
Later on, some well-meaning developer on my team will come along and say "Ugh what is this NIH syndrome?" and then they "clean it up" as:

    #include <ctype.h>

    for (char& c: str) {
      c = tolower(c);
    }
And then I have to say NOOOOOOO DON'T DO THAT YOU HAVE NO IDEA WHAT tolower() REALLY DOES!

I struggle to imagine any real use case where you'd actually want locale-dependent tolower() other than, maybe, a word processor -- but if you're writing a word processor, you're probably not going to be depending on the language's built-in string APIs to do your text manipulation.


This is a classic case of a 'why' code comment being needed. It's obvious what you're doing, but without a 2 line explanation, it's not clear why.


Seems like it would be even better to put this in its own function with a descriptive name, ascii_tolower or roman_tolower or whatever, that has exactly the semantics you want.
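
For instance, a minimal sketch of such a helper (the name and comment are just illustrative):

    #include <string>

    // Lower-cases ASCII letters only. Deliberately NOT tolower(): that is
    // locale-dependent (e.g. 'I' -> 'ı' under a Turkish locale), which is
    // wrong for protocol identifiers like HTTP header names.
    inline std::string ascii_tolower(std::string s) {
      for (char& c : s) {
        if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
      }
      return s;
    }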


This is exactly right, and is a great example of what self-documenting code can be. The function itself could have a bit more explanation but any code calling it is going to be obvious.

The big difference is it looks deliberate, instead of just code written by someone trying to micro-optimize, be very clever, or who just didn't realize tolower() exists. Most people will pause before just replacing it, and likewise it should trigger questions in the PR.


Still warrants a comment to be sure no one concludes the built in is good enough.


It certainly warrants a test to document what the function is for.

And if that test also happens to validate that the documentation is accurate, that is a nice side benefit.


> It certainly warrants a test to document what the function is for.

You might be being serious, so I'll indulge. How would a test that will never break, for a function that would never be changed (because look at it), that lives in a different part of the directory structure, be worth even one second of extra time or thought to write, over and above the descriptive comment that is longer than the code itself and lives discoverably right next to it?


Because that's how you stop regressions.

The cost of writing a single test case is not more than the cost of diagnosing what change broke your code for Turkish users.

Now there's always a point where maybe the infrastructure for testing that kind of thing doesn't exist, so writing your one "simple" test case takes a bunch of time, but on the plus side, future similar "simple" test cases will be easy at that point. And no one has to track down why your code is broken in Turkey.
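
To make it concrete, a minimal sketch of such a test (assuming an ascii_tolower helper like the one sketched above, and that the tr_TR.ISO-8859-9 locale is installed; if it isn't, setlocale() fails and the interesting case just isn't exercised):

    #include <cassert>
    #include <clocale>
    #include <string>

    // Hypothetical helper under test.
    std::string ascii_tolower(std::string s) {
      for (char& c : s) if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
      return s;
    }

    int main() {
      std::setlocale(LC_ALL, "tr_TR.ISO-8859-9");  // may fail if not installed
      // Would fail if someone "cleaned this up" to use the locale-aware tolower().
      assert(ascii_tolower("INFO") == "info");
    }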

Many years ago while I worked on IM/IMEs on Mac and Windows, I spent maybe a week working on code that allowed an IME to be implemented in JS (within the WebKit test harness), so it was possible to test and prevent regressions that kept being reintroduced by people changing layout and/or editing code in ways that are "obviously correct" for US/English keyboards. The win from that week is many regressions that were caught before they were even committed, and the ability to completely rewrite the text input system to support IMEs on non-Mac platforms.


This isn't exactly addressing my point. Never did I say that 'All tests are useless'. Refer to my answers here:

https://news.ycombinator.com/item?id=24183844


I've found that writing a unit test to verify it works at all is sometimes loads easier than manually running the app or whatever.

At that point, just leave the test and check it in.


I agree with this, kind of. Though I've found that keeping useless tests around doesn't always have 0 cost, same for any dead code.

Often, though, you can verify that your code works using a REPL.

I'm also a huge fan of doc-tests in the languages that make tests part of the documentation and neatly 'cohere' them to the function itself. At which point I'm happy to leave them in, as tests-as-documentation are harder to miss, and are actually instructive.


Comments can be ignored, moved, misinterpreted. A test asserts correct behaviour. The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test - which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

Machine-time to run tests is cheap (if it's not in your codebase, then I agree that your benefit calculus may be different) - human cognition and awareness to prevent mistakes is valuable.


Keep in mind, I'm not advocating for 'no testing, ever'. But surely there's a limit to where you say something is so trivial that it won't ever change. And it's with these preconditions I ask the question:

> a test that will never break, for a function that would never be changed

I'll refer you to the OPs low-tech to-lower function, which is reductively simple, and never should be altered, with a comment as to why.

> Comments can be ignored, moved, misinterpreted.

Don't disagree with this. Someone lacking competence and care may ignore a comment, be incapable of comprehension, or just move things for no reason. This should be caught at review.

Tests aren't infallible either. They can be invalidated, disabled because someone lacking competence decided it was the easiest way to move forward. This should be caught at review.

Edit: I'll address some of your other points more specifically...

> The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test

In OP's precise example, a worse thing can happen: they can shrug their shoulders and just use the built-in directly. A human-context, business-value explanation has a more powerful benefit here.

> which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

If no-one's reviewing when someone changes the actual code and its adjacent comments, who's reviewing the changes to the tests?

> human cognition and awareness to prevent mistakes is valuable.

Extra code comes at a maintenance and cognition cost. Maybe one trivial test seems like a minor cost, but how about the maintenance of 1000s of tests that (ought to) always pass?


C# has a `ToLowerInvariant()` variety for that.


Which iirc is an alias for ToLower on the en-us locale. (Same for the other C# *Invariant() methods)


It calls ToLower(CultureInfo.InvariantCulture);

Locale name is iv-IV.


Comments are unreliable; you should use tooling to fix this forever. E.g. findbugs has a rule for this problem: http://findbugs.sourceforge.net/bugDescriptions.html#DM_CONV...


Yeah I probably wrote that comment the first few times I did this but it's hard to write it the 50th time.

Maybe I should have my own tolower() function that I can call so I only have to write the comment once but it just feels ridiculous somehow.


It's far more ridiculous to repeat yourself over and over instead of making a simple function that describes exactly what you want and why.


> but it's hard to write it the 50th time.

You know you wrote it 49 times before, but the person reading the code probably doesn't. It only changes your experience not everyone else's.

If it's the same codebase, just write that function :)


Write the function, comment it!

Many of these are obvious to many people here, but some aren't.

Even I can admit that some of the stuff in this thread is not obvious at all.


Why does it feel ridiculous?


Because I've already rewritten more of the standard library than is healthy.

I mean, it's clearly the right thing to do here but I can predict the conversation that will inevitably result... "You wrote your own tolower() function? Why?" "The standard one is horribly broken." "How could a function that lower-cases a letter be broken??? Jesus Kenton your NIH syndrome is out of control." "Sigh..."

(Slightly more seriously, any particular time I need to lower-case something, it takes 10 seconds to write out the code, but would take 10 minutes to find a good place to define a reusable function and exactly what its API should be, and so it never seems worth the effort in the moment. Just like how most messy code comes to be.)


This conversation can simply be avoided by copy-pasting your original Hacker News comment into the library function header.

I have noticed some coworkers have their ego gratified by being right while everyone else is wrong. Instead of simply explaining what they are doing when they are doing it, they will do something that looks wrong in a very noticeable way and wait for the backlash. The backlash gives them an opportunity to show everyone else how they were right while everyone else was wrong and also an opportunity to play victim. However, in SW development - it is not just the technical details - your behavior also matters in a big way.

In this particular case, the correct approach is to create your own library function with appropriate comments. This is why the concept of a library function was invented. It is its entire raison d'être. However, you are doing everything but that. Including providing justifications in Hacker News comments instead of your source code.

Now inevitably, someone will change your inline code to use to_lower. This will give you an opportunity to scream bloody murder, show how other engineers don't really understand technical details, correct them and also play victim. Create a library utility with comments and link it in - End of story.


I’m reminded of garbage collector blog posts where they do something stupid (“disable it”, “allocate a ballast”) and then get to spend a couple pages explaining why it worked for them.


Speaking of people wanting to gratify their ego by being right: Everyone on this thread trying to lecture me on software engineering? ¯\_(ツ)_/¯


At least, you get to play victim :)


Looks like you're only 10 times away from using this to having spent your 10 minutes ;)

But seriously, I'm not here to lecture you; personally I'd appreciate having a teammate educate me on the undesired behavior and a nice function I could use to ensure my own code doesn't break user input.


Most codebases I've worked with have a StringUtils.java, or .kt, or a str.c or utils.c. Maybe just start one. Interestingly I haven't needed it as much in Ruby.

But I too feel the cognitive (and social!) burden of introducing a new function. It's not just "where do I put this", but "how do I convince the team I know what I'm doing since 15 years of experience clearly isn't enough and developers (mostly rightly) ignore positional authority and seniority".


  #include <kentonv.h>



This project has the greatest sales pitch I've ever seen: https://github.com/capnproto/capnproto/blob/master/README.md -- combined with the project name it's just perfect.


Java has two variants of toLowerCase(): one which uses the default/current locale (almost never what you want), and one which receives an explicit locale (Locale.ROOT is almost always the one you want). At work, we use the "forbidden APIs" checker (https://github.com/policeman-tools/forbidden-apis) to fail the CI if the variant which uses the default locale is ever used; if you really want to use a locale-dependent toLowerCase(), you have to explicitly call Locale.getDefault() and use it as the locale.

Is there something similar for C and C++? It could help in your case, by making your well-meaning colleagues aware of the issue.


> Locale.ROOT is almost always the one you want

At least Android developers are advised to use Locale.US: https://developer.android.com/reference/java/util/Locale

> The default locale is not appropriate for machine-readable output. The best choice there is usually Locale.US – this locale is guaranteed to be available on all devices, and the fact that it has no surprising special cases and is frequently used

It would indeed be interesting to see in which features these two locales actually differ.


Yes: the POSIX tolower_l(c, locale), which takes an explicit locale_t for each character.
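
A sketch of how that might look in practice (POSIX newlocale/tolower_l, pinning the "C" locale regardless of the process-wide setting):

    #include <ctype.h>
    #include <locale.h>
    #include <string>

    std::string tolower_c_locale(std::string s) {
      locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
      for (char& c : s) c = (char)tolower_l((unsigned char)c, c_loc);
      freelocale(c_loc);
      return s;
    }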


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

On many documents, including Turkish passports and identity cards and many (all?) other passports, names are written in all caps. Maybe toLower() is not that useful, but toUpper() is crucial in any application where you are dealing with real people's names.


toUpper is definitely language-dependent. For example, in Irish there are initial letters that are written as lower-case even in all caps. Wikipedia's example is amusing, since it's a photo of a government passport office sign - the all-caps version of Oifig na bPasanna is OIFIG NA bPASANNA (photo https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/AL..., article https://en.wikipedia.org/wiki/Irish_orthography#Capitalisati...). It would look utterly bizarre to write OIFIG NA BPASANNA. And this isn't at all an unusual construction in Irish, it happens in personal names all the time.

Plus, there's the issue of diacritical marks. Irish keeps long marks over capitals, but French drops accents. Do you plan to do é => É (required for Irish - POBLACHT NA hÉIREANN is the all-caps version of Poblacht na hÉireann [the Republic of Ireland]) or é => E (common practice for French)? You have to get it right, and you have to know the language to do that. (Poblacht na hÉireann also illustrates the fact that initial caps is also a language-dependent idea; you absolutely can't write Poblacht Na Héireann - that makes my eyes burn just looking at it.)

(And before you say, well, Irish isn't a language spoken by very many people, remember that it's an official language of the European Union. If you're writing software to be used by EU agencies, you're going to have to care.)


> French drops accents

The official position of both the Académie française and the Office québécois de la langue française is that accents must be preserved in capital letters. However, it is common in France to drop them, while they are almost always preserved in Québec. I have heard that the reason is that the European French keyboard layout makes it difficult to type accented capital letters, unlike the Québécois French layout, which makes writing them easy. But I am not sure if this is the cause rather than the effect of the practice.


I confirm that in French capital letters should have accents.

I have an anecdote on this: on birth certificates, family names are written in capital letters. It turns out my partner's name ends with a É which was written as E on her birth certificate. She never noticed (it had never prevented her from getting a national ID with her name properly accented) until we had our first kid, who has both our names, and they refused to have the name accented until we had my partner's birth certificate updated (which, as you can imagine, is quite an adventure, since you need to dig up ancient family birth certificates to prove it was originally written with an accent...).


French here, I recall my primary school textbook where they said something along the lines of "sometimes accents are dropped, that's sort of fine as long as it doesn't change the meaning". They gave the example of a fictitious newspaper whose headline was "UN POLICIER TUE": depending on the accent (tué/tue) it means either "a policeman kills" or "a policeman killed".


american here who's lived in france and still use an azerty keyboard because it lets me type in both languages. how do you get a capital A with an accent on an azerty keyboard?


Easily? You don't; with the most common AZERTY keyboard you have to use Ctrl+Alt+7, then Shift+A. That's why there is a new standard for the AZERTY keyboard, "NF Z 71-300", that is better with accents and stuff like æ, œ, Æ, Œ, «», etc.


Damn it. That's why I couldn't figure it out for years. You can't.. Is the new keyboard standard being used anywhere? Like if I walk into a common office in France and sit down at a laptop - is it likely to be using the new keyboard layout?

What's weird is I sometimes, even 10 years ago, would get an email from people in France, and it would have an accented A. Like, how did they do that..


i know that LDLC is selling one of these [1] ... and that's it, i don't even think it's coming to laptop anytime soon.

[1] https://www.ldlc.com/fiche/PB00279741.html


Not the answer to your question, but this is why I think Quebec uses accented capitals more than France. In Canadian French layout, there is a key for à. Simply using Shift+à gives you À.


classic canada, fixing the keyboard. now all that's left is above 69.

korean is worse. they base off of 10k, not 1000. so a million is hundred ten_thousand. bagman. but as bad as that is, it's no 97 amirite.


Interesting that I was wrong - another data point in the "it's more complicated than you think it is" column. I always thought you were supposed to drop them (because I was explicitly told so by a French engineer I worked with in the 90s, talking about one particular poster, and many years later I still assumed that one hallway conversation was enough to make that THE OFFICIAL RULE without bothering to actually check...)


It’s been a pet peeve of mine for quite a while, and an urban legend for quite a lot more. It was tolerated when all we had was typewriters (and even then you were supposed to add them, but it was cumbersome).


Wrongly dropping accents on uppercase letters predates computer keyboards; the French AZERTY layout puts accented letters on the first level of the number row:

http://j.poitou.free.fr/pro/img/tkn/tw-image.jpg

This idiocy carried over. The recent layout update makes dead accents more accessible:

https://norme-azerty.fr/

But I haven't seen much adoption yet.


That button to the right of the P that contains four different forms of dashes is... interesting.

Even more if you consider that the minus sign there is not the character - that is used by every programming language.


> That button to the right of the P that contains four different forms of dashes is... interesting.

And there are two more on the 8 key.

I like the mac international keyboard layout, but it still only provides for 4 of those: the non-breaking hyphen and the "proper" minus sign are lacking.

I like that the "new azerty" provides for pretty much every diacritic, even those which are not in use in french.


I blame the horrendous default settings of MS Word and Outlook for this, and the maddeningly convoluted way to enter accented caps on Windows. There is no context in which it is correct to omit accents in French, caps or otherwise.


Nope, none of them are really useful. The only useful folding function is casefold(str, locale) (the locale parameter being needed if your str type doesn't know its locale).

toLower and toUpper should only be used for presentation, but all case-insensitive operations need to be done with casefold.
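
With ICU, for example, that looks roughly like this (a sketch; the default folding is locale-independent, and Turkic dotless-i folding needs an explicit option):

    #include <unicode/unistr.h>

    // Case-insensitive equality via case folding, not via toLower()/toUpper().
    bool fold_equals(icu::UnicodeString a, icu::UnicodeString b) {
      a.foldCase();  // default full case folding
      b.foldCase();
      return a == b;
    }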


Of course! There are tons of cases where you need to store in "sentence case" (first word and proper nouns and acronyms capitalized, nothing else) so you can convert to title case or all-caps as needed for display purposes. Templates are full of this kind of stuff.

There are similarly tons of cases where you reduce everything to lowercase without accents for searching and indexing purposes. Depending on your setup, your database might handle that for you, but there are edge cases where you need to do it at the application level.

Long story short, every string has a locale, and you should never change the case of something without specifying its locale. Either be explicit that it's American English or ASCII or Latin1 or whatever... or that it's something else. Never leave someone reading the code guessing.
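
For example, with ICU the locale can be made explicit at every call site (a sketch, with the function name and locale strings as placeholders):

    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <string>

    // Lower-case UTF-8 text for a caller-specified locale, never the process default.
    std::string lower_for_locale(const std::string& utf8, const char* localeName) {
      icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
      s.toLower(icu::Locale(localeName));  // e.g. "en_US" or "tr_TR"
      std::string out;
      s.toUTF8String(out);
      return out;
    }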


> you can convert to title case or ... for display purposes.

I am skeptical when someone thinks they need to do this, and of how they will get it done.

E.g. looping through and capitalizing the first glyph after breaking whitespace regardless of locale is not the way to go, but I guarantee you a nontrivial number of people reading this would write exactly that if asked to solve the high-level problem.

I find it annoying when software or even in some cases human typists try to enforce English language title case. Some other languages have different rules for titles and capitalization and seeing the English rules enforced out of context can be jarring.


I find it amusing you are skeptical... why so distrustful? But believe me, it's quite necessary.

I use the citation manager Zotero a lot. It's necessary to store all the titles of journal articles and books in sentence case (e.g. "Issues regarding the economy of France") because some publications require citations to use sentence style (remains unchanged) while others require title style ("Issues Regarding the Economy of France").

And obviously the solution cannot be naive, but is language-dependent, so that in English words like "the" or "in" don't get capitalized. And as the rules for titles are obviously language-dependent, it goes without saying that the algorithm would have to be localized.

(Note that while it's relatively trivial to convert from sentence case to title case in English, it's impossible to automate in the opposite direction, because you never know if a capitalized term in the title is a proper noun or not.)


> (Note that while it's relatively trivial to convert from sentence case to title case in English[...])

Strictly speaking you can't do that either:

  "Latine et videtur".totitle = "Latine et Videtur"
  "I et some food".totitle = "I Et Some Food"
I suspect that fails less often though.


> I am skeptical if someone thinks they need to do this and how they will get it done.

It's not an uncommon requirement, though probably not often in a locale sensitive way, so you can often get away with just doing the right thing for one locale.

To do it generally, you probably need to research appropriate handling per locale (likely, what your client wants done in the particular locale, since I'm not sure there is usually just one way; I know there are multiple variations in en-US), and then have a master function that takes the text and target locale and applies the correct locale-specific title casing rules.


> I am skeptical if someone thinks they need to do this and how they will get it done

I most often use title case mappings in the context of replacements of names into diagnostic messages. I.e. you have n types of objects and m messages like “${foo} not found” or “More than one ${foo} required”, and perform title case mapping on ${foo} depending on whether it is at the start of the message (sentence) or not.


While I agree it's almost always a bad idea: effectively every design team I've encountered has requested stuff like this.

So yes, it's extremely common. It's done on tons of websites, for tons of mail addresses (ever receive an all-uppercase address on a delivery? same issue.), and tons and tons and tons of emails and legal documents (woe to those with last names like McCormick).


We frequently use localized upper/lower casing at my workplace, as we do not store such stylization in user facing copy. Most copy is written and translated in sentence case or title case (because both are much harder to achieve programmatically), and then our designers have the option of using that casing as-is, or using all-upper or all-lower.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

If you are collecting data which include people's names or addresses you probably want localization to be applied correctly so that you can compare data coming from different sources and possibly with different cases. Having your name spelled differently in different documents can cause a non trivial amount of problems with an overzealous bureaucracy.


I was once refused entry into a country for 6 hours because of different spellings of my last name. The (apparently quite amateur) travel agency had sent my last name written using OE instead of Ö, whereas all the documents relating to my identity use Ö (or maybe it was the other way around).


We thought about this specifically when naming our son: one with no special characters, a pure clean ASCII string like mom used to cook them.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

How do you lowercase without localization? Remember all text isn't English. Unless you're actually asking if anyone has ever had a use case for lower-casing non-English text?


I think the real question is: is there any use case for toLower() where you want the system default locale to be applied? If you want to lower-case text for "system" purposes then you need to keep track of the locale associated with that text (which won't generally be the locale of the system the program is running on); the only case where you want to use the system default locale is where you're interacting with the (human) user of the system, but it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.


You vastly underestimate the desire of people for things to look 'nice' where 'nice' is defined by exactly what they mean when they say it. If you want a 'nice' display of data that's input by users, you're often going to butcher it by doing things like converting everything to lower-case, then maybe upper-casing the first letter. Because 'nice.'

It's horrifically wrong, of course, but no one is going to think you're a reasonable person for insisting that correct wins over nice.

Imagine a real-world scenario where you're displaying a list of names of users, where the users got to type in their own names. You can either use what users typed in, or you can do something like process it so it's in the American idea of initial caps. You can't possibly do localization, since it's a list of names of people in the US, so it's a melting pot of names from all over the world, and you never asked for user input of what you'd use to localize it anyway [and no, you absolutely can't figure that out from just the name itself]. You can't use what users typed in, because the design team thinks that looks like a horrible mess (and it is; users are laughably bad at data entry). So the design team wins, and you butcher everything by pretending that toLower() plus toUpper() for the first character of every word is a sensible thing to do. (And yes, that's a painful real-world example of software I've shipped and that was used by millions of people.)


> it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.

Maybe for automatically changing the case of user input (auto-correcting capitalization, etc.).


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Yes. In a system I'm about done with, there is a sortable chart of dates and times. In some languages day and month names are capitalized, and in some they are not.


How does that work? toUpper() can't possibly know that the string is a day or month name.


I think this is why you need explicitly-ASCII and explicitly-Unicode lower-/upper-/capitalize transformations, so you don't assume these things work automagically. Sometimes you need one type, other times you need the other.


I recently ordered a Pixel, on the mail slip they had converted my name to uppercase, last name read "DUBé"

Also got my address screwed up on account of living at a half address.. 1/2 some street #42


A char in C++ is one byte, right? Is it even possible for this "fixed" code to call ctype::tolower() on something like a UTF-8 or UTF-16 code point?


Correct, it won't even work as intended with modern Unicode locales.


So maybe if the code is broken anyway for non-ASCII characters, it's fine to use tolower, since somewhere else in the code it ensures that c is a byte.


The code is not broken for non-ASCII characters. UTF-8 works just fine with 8-bit chars, and the code I wrote correctly lower-cases ASCII letters even when UTF-8 is present (it just won't touch the UTF-8 chars, which is fine in this use case).

It's only tolower() and toupper() specifically that are broken because they expect to be able to do their job on a single byte, which is no longer possible with UTF-8.

Meanwhile, using tolower() to lower-case an HTTP header name won't give you the correct results if the locale is set to Turkish with the ISO 8859-9 character set, which is 8-bit, and where tolower('I') will produce the byte 0xFD which is 'ı' in this character set.
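
A small sketch that reproduces this (it only does something interesting if the tr_TR.ISO-8859-9 locale is actually installed):

    #include <cctype>
    #include <clocale>
    #include <cstdio>

    int main() {
      // setlocale() returns nullptr if the locale isn't available.
      if (std::setlocale(LC_CTYPE, "tr_TR.ISO-8859-9")) {
        // Prints "fd" (dotless ı in ISO 8859-9) rather than "69" ('i').
        std::printf("%02x\n", std::tolower('I'));
      }
    }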


I see, thanks for the explanation.


I have a Morse code app which consistently crashed when certain users would try to translate the letter "i". It took me a long time to figure out that only the Turkish users would complain about it, and when one of them sent me a screenshot I only noticed a "wrongly" rendered capital letter i (I used toUpper). After digging around a bunch, I learned about this whole Turkish letter i business.


A small portion of people whose native writing systems are based on the Latin alphabet believe that case conversion is an essential must-have feature, and that having it in the locale library helps localization.

But if you consider other writing systems, having a case conversion feature in the locale library actually harms the localization effort. It's not easy to make it a no-op, and the implementations in locale libraries are generally poor quality because the implementers have no idea how other languages work.

Another example is singular/plural support. It just burdens the localization effort, because for languages with no such concept, the localization work must ensure that the presence of such a library doesn't harm their language.

Some people are under the delusion that a locale library must have more features to support their native language's not-so-important traits, while what's really necessary is to forget about supporting minor language traits that are not universal among languages.

Text should be considered a binary blob, and most programs should just pass it through without modification.


And a note that it assumes ASCII. On an EBCDIC system the letters aren't contiguous, so the 'A'-'Z' test will also translate other characters besides letters.


Now I’m wondering about what happens when we change email addresses to lowercase...

https://en.m.wikipedia.org/wiki/Email_address#Internationali...


You shouldn't. Email addresses are case-sensitive.



This is why I have a set of functions like AsciiToLower(char* string, size_t size). They only touch characters in the ASCII space at <0x80. Even went and implemented them with SSE for x86.
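
Roughly, such a routine might look like this (a sketch with SSE2 intrinsics, not the poster's actual implementation):

    #include <emmintrin.h>  // SSE2
    #include <stddef.h>

    void AsciiToLower(char* s, size_t size) {
      const __m128i before_A = _mm_set1_epi8('A' - 1);
      const __m128i after_Z  = _mm_set1_epi8('Z' + 1);
      const __m128i delta    = _mm_set1_epi8(0x20);
      size_t i = 0;
      for (; i + 16 <= size; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(s + i));
        // Signed compares: bytes >= 0x80 are negative, so they never match 'A'..'Z'.
        __m128i in_range = _mm_and_si128(_mm_cmpgt_epi8(v, before_A),
                                         _mm_cmplt_epi8(v, after_Z));
        v = _mm_add_epi8(v, _mm_and_si128(in_range, delta));
        _mm_storeu_si128((__m128i*)(s + i), v);
      }
      for (; i < size; ++i)
        if (s[i] >= 'A' && s[i] <= 'Z') s[i] += 0x20;
    }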


Airlines might be a good example. The back end system doesn't grok lowercase characters at all, so you need to transform data to uppercase A-Z, 0-9 and a few punctuation marks.


But they do have the most extensive transliteration rules library to match everything to that limited character set (ICAO Doc 9303[1]) that is used by many systems outside the aviation world.

[1] https://www.icao.int/publications/pages/publication.aspx?doc...


You need localization if you do any kind of multilingual text processing. Not sure how it could escape a thinking person's imagination.


File names, URLs and email addresses support UTF-8 characters and you may want to lower-case them in many situations. If the user is trying to search for a string, they probably want case insensitivity. I don't think it's that rare/weird for people to want localisation to apply when calling toLower.


Yes, semi-regularly -- lowercasing of text for user interfaces is frequently required. Similarly case-insensitive comparisons.

Human text is much much more complex than any computing protocol you're ever going to engage with.

The question is "which one should be default", and that's a more complicated question.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Well, it's always been exclusively in American English, but I've certainly used it in cases where I was doing text transforms for display, so, yeah, though it's not the most common case.


What about sorting users by name?


That's completely language+locale dependent. For example, here's an alphabetical list of Irish surnames - https://www.duchas.ie/en/nom?txt=M. You'll notice that the sort order ignores an initial O or Mac (or Ni or Bean, etc).


Honestly, strings that are intended for human and for computer consumption should just be two different basic types without any implicit conversion between them.


i like c |= 0x20; :)


I may be missing something - why is tolower(c) incorrect here?


Because if `c` is the letter 'I', and the current locale happens to be set to Turkish, then `tolower(c)` will return 'ı', not 'i'. If you are trying to lower-case an HTTP header name for the purpose of case-insensitive comparison, this is definitely not what you wanted. (And similar problems exist with several other locales; it's not just Turkish.)
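
A common fix is a comparison helper that folds ASCII only, so the result never depends on the process locale (a sketch):

    #include <cstddef>
    #include <string_view>

    // Suitable for HTTP header names and similar ASCII identifiers.
    bool ascii_iequals(std::string_view a, std::string_view b) {
      if (a.size() != b.size()) return false;
      for (size_t i = 0; i < a.size(); ++i) {
        char x = a[i], y = b[i];
        if ('A' <= x && x <= 'Z') x += 'a' - 'A';
        if ('A' <= y && y <= 'Z') y += 'a' - 'A';
        if (x != y) return false;
      }
      return true;
    }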


Ah I see, thanks for explaining


Welcome to the Turkish language, where we have ı, i, I and İ. In our language the conversion is as follows:

- i <-> İ

- ı <-> I

We love our dots and preserve them. For a more detailed read, please see:

https://blog.codinghorror.com/whats-wrong-with-turkey/


As I understand it, Turkish is one of the more important locales to test with because of things like this.


Poor encoding can lead to the odd murder too: http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...

> The use of "i" resulted in an SMS with a completely twisted meaning: instead of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan wanted to write "You change the topic every time you run out of arguments" (sounds familiar enough) but what Emine read was, "You change the topic every time they are fucking you" (sounds familiar too.)


That doesn't explain the e instead of the a, does it?


In the olden times, ending words with e instead of a was considered an acceptable typo.

Also the article explicitly says "it looked like he wrote". So when you see red, that last letter can become anything and nothing would change.


Turkish is the only language which has the ı & I pair. Similarly, AFAIK, Turkish is again the only language with the ğ and ş letters. So, by testing for Turkish, you test for a lot of European languages at once. Moreover, we share some modified letters (ç, ü) with other Central European languages.

If your program can pass “The Turkish Test”, you pass a lot of others too.


Azerbaijani too. Moreover, Azerbaijani has an additional letter ə, which sounds like /æ/.


I love the feeling of camaraderie arising from that partial mutual intelligibility of Turkish and Azerbaijani.

That connection through language goes a long way.

müqəddəs bacı millət :) (roughly, "sacred sister nation")


Turkish, Lithuanian and Korean specifically. They do have the most exceptions.


> We love our dots and preserve them.

Turkish preserves the dots of i, ö and ü in their capital versions, but not with j. The capital J is dotless.


Isn’t the capital J dotless in every language?


> - i <-> İ

> - ı <-> I

After seeing this, I don't understand how the rest of us can fail to have the same distinction. There's something logically beautiful -- like the rhyme in a good poem -- about artificial languages (or, in this case, alphabets) that naturally evolved languages just cannot compete with.


To me, the undotted-i thing is more of a hack. Far more beautiful is why Turkish distinguishes the two vowels: speakers of Turkic languages don't like to mingle bright and dark vowels in the same word. Appreciate the economy of not having to switch your enunciation between front and back in the same word.

Speakers of Turkish generally build words with vowels from either of these exclusive groups:

aıuo

eiüö

You see? They could write "ä" instead of "e" to have all bright vowels dotted and mirror the symmetry. But because "e" was already available and pronounced like that in other dominant languages of the time, they stuck to it. No need to break conventions. This didn't work for "ı" because there was no corresponding letter in the Latin alphabet. So they lopped off the dot from i and called it a day. A pragmatic decision. Except that the capital version had then to be dotted to keep the distinction. That caused a lot of headaches downstream.


This is called vowel harmony and it's a fairly common linguistic feature:

https://en.wikipedia.org/wiki/Vowel_harmony

...but it's notably absent from the Indo-European languages.


Unrelated story about Russian language.

The first letter of the Russian alphabet is А, the last one is Я. So it's natural to try to match Russian words with '[А-Яа-я]+'. But this is a recipe for disaster: this regexp doesn't match words with 'Ё' in them, like "Артём".

This is due to the fact that regexp ranges work on character code values. All letters of the Russian alphabet have neatly ordered values, except for Ё/ё.


English is probably the only commonly spoken language where naïve char range matching kind of sort of works. I say ”kind of sort of” because [a-zA-Z] trivially fails to match all words in many English texts that haven’t been lossily compressed to ASCII, including this comment.

It is practically always wrong to match on [a-z] unless you’re parsing a computer language whose spec guarantees that it works.


Forget ascii conversion, that also fails on contractions like "don't".


ï isn't an English character though! For an English document, "naïve" is a misspelling, at best. That being said, you don't always have the luxury of doing things the "correct" way, especially if users are trying to cram god-knows-what into a text field.


Nope. ”Naïve” is an accepted variant of ”naive” in every major English dictionary. [a-zA-Z] is never a ”correct” way to match natural language text.


Would it blow your mind that coöperate is technically correct?

That's the whole point of the diaeresis.


I always wanted to know, how easy is it to type naïve on a common western keyboard?

Do you have to press some obscure keyboard shortcut?


I'm Windows-based and wanted a keyboard layout that would allow me to easily type Polish and French at the same time, without switching keyboard layouts (PL == US+AltGr for accents; while the FR layout is insane, because apart from being AZERTY, all special chars are in different places, you need Shift to type numbers, and the way to type accents is also special).

I found "Polish international" [1] layout which honestly can be perfect for many people. It's optimized to be compatible with regular Polish keyboard (hence with US keyboard too), and maybe not the fastest if you type a lot special chars, but it's extremely intuitive:

ï = AltGr+:, i

ü = AltGr+:, u

é = AltGr+/, e

è = AltGr+\, e (since it's extremely common, also aliased as AltGr+w)

If you're Windows based and want US-compatible keyboard layout that allows easily typing any special chars, I highly recommend it.

[1] https://translate.google.com/translate?sl=pl&tl=en&u=https%3...


I type English and French in Windows on the same QWERTY keyboard. I once learned to type on Azerty, but I mainly type English now on a standard US keyboard layout. For the French, I find the windows alt-numbers works the easiest for accented characters. Alt-130=é, Alt-133=à, Alt-135=ç, Alt-137=ê, Alt-138=è which covers 95% of the accented character usage. I have a little chart next to my desk with all the others (ï,ô,ù) they’re nearly all Alt-14x and Alt-15x. And then I’ll put é in the paste buffer because it is the most used and a bit quicker that way (for words like “préféré”).

The Alt-13x codes are not as quick as the Azerty keys, but good enough and once memorized are fairly easy with a keyboard that has a keypad (most PCs do, even my laptop). This is especially true because they are done with both hands simultaneously, as opposed to something like Cmd-e+e on a Mac. Actually, they are faster than finding the accented characters on my QWERTY virtual keyboard as I type this comment on iOS.

Those AltGr- combos seem complicated to me, I would much prefer a system such as AltGr-e =é, then AltGr-ee=è, AltGr-eee=ê, etc. To me that would be more intuitive than remembering the composing character (slash for aigüe, etc).


You seem to be quite used to your Alt combinations but, as you said, they really are not straightforward. I found another very simple solution: on Linux you can set a compose key (typically AltGr or the contextual menu key). You type, one after another, the compose key and then any two keys that make sense, like ' followed by e (or vice versa), and it will give you é. It is both fast and easy to work with.


Reminds me of when the 'D' key broke on my physical keyboard a long time ago. I liked that keyboard a lot and couldn't find a good replacement, so I learnt to type Alt-100 to get 'd'.


> how easy is it to type naïve on a common western keyboard?

In macOS, you can either use Command-u (for "umlaut") followed by i, or hold down the i key for a second and press 2 to select the ï from the pop-up menu.


> Command-u

option-u (aka alt-u).

Generally speaking, command is for application-level or os-level commands, control is for text edition, and alt is for alternate characters (all can be shifted and command "overrides" the rest).


You're right, it's Option-u. Most of the key labels on my MacBook have long since been scratched away.

This has happened with every single Apple keyboard I've ever used. I suspect it's my fault, since I'm a key pounder, having learned to type on an IBM Selectric typewriter.


On a Swedish keyboard, there's a dead key for ¨, so you press that followed by i to get ï.

It's not very clear why the Swedish keyboard has that key, since ä and ö each have their own keys. The layout has other quirks as well, such as keys for §, ½ and the useless "currency sign", ¤.


macOS quietly removed the paragraph sign for me when switching keyboards to non-Apple ones, if you used a non-standard layout that had moved the key they use to detect what kind of keyboard you are using.

It wouldn't have been much of a bother, had I not used the key as my Emacs leader...


Yes! I think the "mine" character should be switched for the dollar sign.

BTW, the dead key could be from German, for writing their Üs.


By default on a Mac you just hold down the key to get different options, similar to on an iPhone (and I presume touch Android).

https://i.imgur.com/yuG063t.png


On both Windows and Linux, I've liked the "US international" keyboard layout as an alternate when I need to type letters with diacritical marks. In that layout, " ' ~ ` ^ are all dead keys, which modify the next letter typed. In addition, the right Alt key (usually Alt Gr on non-US keyboards) can be used to quickly type some commonly-used letters.

On Linux there is also a "US alternate international" layout with some extra dead keys to make it easier to do things like š or ž (Ctrl-RightAlt-< followed by s or z, respectively).


On any Unix, just enable the "compose" key, then <Compose>+"+i.

It's always something easy to remember, like " for umlauts, o for circles (©, ®), obviously ' for accents (ń) and so on.


You can see and modify the .XCompose file, and you can even put your own strings there like an email address or change the sequence. There are some community XCompose files on GitHub too.
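
For example, a few lines like these in ~/.XCompose (the address and sequences here are just placeholders):

    include "%L"                            # keep the locale's default sequences
    <Multi_key> <quotedbl> <i> : "ï"        # compose, ", i  ->  ï
    <Multi_key> <m> <a> <i> <l> : "me@example.org"   # expand a custom sequence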


You may or may not need to set GTK_IM_MODULE=xim to get GTK applications to use ~/.XCompose.

Also, Qt5 broke support for multi-character results when using compose sequences (everything after the first character is ignored).


On Ubuntu, I use xmodmap to turn Print Screen into a Compose key. Then it's: <compose>"i

https://en.m.wikipedia.org/wiki/Compose_key


I use the Macintosh keyboard map in Linux. So I do <right alt>+e to ‘, <right alt>+n to ~.


"i or similar works. ^i, `i, 'i, etc. for the others.


You can set up your keyboard as US International. And then type ["] + [i]. It’s a very useful keyboard layout because the punctuation characters match the keyboard and it allows you to type English, Spanish, French, Dutch, German, Swedish, Norwegian, and Portuguese.


Having punctuation keys as dead keys is too annoying, so I use my own keyboard layout that has dead keys only when used with AltGr, as well as some direct AltGr umlauts (i.e. AltGr+a = ä).


It is acceptable to write English without diacritics. "Naive" is accepted.


It is certainly acceptable, although some publications have diacritics as part of their house style, and if the author doesn't use them, the copy editor is supposed to insert them.

The New Yorker writing things like coöperate and reëlect is probably the most infamous, although they are not the only ones.

The Guardian style guide says to use spellings exposé, lamé, résumé, and roué – but not café. Although, when giving the name of an organisation/institution (restaurants and cafés included), the article should use whatever spelling is preferred by the management, including their choice about how to spell cafe/café (when that word is part of their name).

Personally, I'd always write café in formal English, because cafe just looks wrong to me. However, in something informal like a text message I probably wouldn't bother.


Sure, some publications might like that. And some publications still treat data as a plural of datum rather than a mass noun like water. But I don't think I know a native speaker who would look at "naive" and feel the way they do when they see "could of".

FWIW, The Economist explicitly names "naive" as one to use without the diacritic: https://cdn.static-economist.com/sites/default/files/store/S...


The easiest solution to this problem would be to just rename it to "naive".


Not that unusual - for German, for instance, üöäÜÖÄß need to be added so all words can be matched.


now, there is even a capital ß ;)


Less relatedly, I really hate when people use the eszett instead of a Greek beta. I just needed to get that off my chest.


Out of curiosity I tried on my phone: ß

Ss

SS

So my phone doesn't have that yet!


mine has it - ẞ the small one is ß


Is the capital supposed to be shorter?


Oddly enough, yes (depending on font of course); compare "Sf": "ẞß" should have the same capital and lowercase-ascender heights, the latter of which is often higher.


It's wider. Depends on your font and its support though.


This would be an argument for just using [:alpha:] everywhere; presumably it does the correct thing based on locale?


No, alpha doesn't work, at least in "grep -P" with "ru_RU.UTF-8" locale:

  $ echo Test | grep -oP '[[:alpha:]]+'
  Test
  $ echo Артём | grep -oP '[[:alpha:]]+'
  $ echo Артём | grep -oP '[А-Яа-я]+'
  Арт
  м
This thing works, though I've never seen one in the wild:

  $ echo Артём | grep -oP '[\p{Cyrillic}]+'
  Артём


Standard classes only work for 8-bit locales, afaik, and also in some languages (e.g. perl) only when a string's encoding corresponds to an internal representation of what its engine thinks is the "current" unicode format for the specific version of the language and a checkout location. The fact that different tools stick to different engines, modes and normalization rules (bre/ere/posix/nfd/nfc/perl/pcre/php/ruby/icu/whatever) doesn't help either. Full cross-platform unicode matching is a can of worms that you usually don't want to open. It is basically the CSV of the encoding world. /\S+?/ to the rescue.

https://regular-expressions.mobi/unicode.html

https://regular-expressions.mobi/refunicode.html

https://regular-expressions.mobi/posixbrackets.html


Well, that is horrifying. TIL.


  $ echo Артём | grep -oE '[[:alpha:]]+'
  Артём


Yep, -E vs. -P


The problem here is simply a bad regex pattern. You need to use something that supports Unicode and enable it. (Usually a flag or option you have to set, because Unicode matching is slower and often not needed.) Then, last but not least, you need to use character classes and not define ranges yourself unless you actually need a range. Like you can't use [:digit:] or \d if you really only want 1-9 without the 0.

Here are some examples to match your string; all of them use the PCRE flavor:

(*UCP)[[:alpha:]]+

(*UCP)[[:alnum:]]+ This would include digits

(*UCP)[[:word:]]+ This includes "word" chars

(*UCP)\w+ Same as above


Changes to the casing might also change the value's length. E.g. uppercasing the German ß will transform it to SS. Example using JavaScript:

'ß'.toUpperCase(); // returns 'SS'

https://en.wikipedia.org/wiki/%C3%9F


There is apparently a multi-decade controversy about that:

https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

(with German language authorities recently endorsing the idea that ß can have a distinctive uppercase form "ẞ")


Which can be both correct and wrong depending on context.

Normally there is no such thing as a capital ß, so it was decided that if for some unreasonable reason you do uppercase it you go with SS.

But then for some all-caps usages this is not right. E.g. an all-caps name of a restaurant as placed above the restaurant's door. In which case it was common to have a ß in an all-caps name like FOOßBAR. So they decided that for reasons like this we now have an (EDIT: semi?) official uppercase ß.

So all in all, this and other examples in other languages mean you should never do a case-insensitive comparison by upper/lower-casing both sides; it won't work reliably.


I've long thought programming languages need a "localizable string" (aka user-facing string) type, different from regular UTF-8 strings. Something like what gettext and other i18n libraries fake for you, but native to the language.

Behaviour like this is definitely a good reason why: sorting, changing case, etc should be consistent when dealing with strings used as constants and identifiers, but Python's .lower() behaviour makes sense in a localizable string context.
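
A rough sketch of what that separation could look like (hypothetical types, not an existing library):

    #include <string>

    // Identifiers, header names, keys: byte-oriented, ASCII-only case rules.
    struct MachineString { std::string bytes; };

    // User-facing text: case mapping, sorting, etc. require a locale.
    struct LocalizedString { std::string utf8; std::string locale; /* e.g. "tr_TR" */ };

    std::string ascii_lower(const MachineString& s);   // locale-free
    std::string lower(const LocalizedString& s);       // uses s.locale
    // ...and no implicit conversion between the two types.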


Along similar lines, I've thought that it would be useful if Unicode included language marks (i.e. codepoints to identify blocks of text as being written in a specific language). It would be strictly more useful than the barebones left-to-right/right-to-left marks (U+200E/U+200F) when deciding how to process and display text. And it would be a step towards correcting the mess that was Han unification.


See RFC 2482 — Language Tagging in Unicode Plain Text:

https://tools.ietf.org/html/rfc2482

But it was deprecated later on:

https://tools.ietf.org/html/rfc6082


Interesting. Unfortunate that the deprecation notice doesn't include much rationale. I found at least one mail thread about it[1], which seems to confirm that the main thought was that semantic information about text should be handled at a higher layer (e.g. XML). I can understand that argument for a general purpose tagging mechanism, but language and glyphs are strongly semantically linked.

(Somewhat ironically, the previous thread on that mailing list is about the struggles of case folding in a general fashion across multiple language scripts[2])

Edit: I also found [3], which offers the following:

----

- Most of the data sources used to assemble the documents on the Web will not contain these characters; producers, in the process of assembling or serializing the data, will need to introspect and insert the characters as needed—changing the data from the original source. Consumers must then deserialize and introspect the information using an identical agreement. The consumer has no way of knowing if the characters found in the data were inserted by the producer (and should be removed) or if the characters were part of the source data. Overzealous producers might introduce additional and unnecessary characters, for example adding an additional layer of bidi control codes to a string that would not otherwise require it. Equally, an overzealous consumer might remove characters that are needed by or intended for downstream processes.

- Another challenge is that many applications that use these data formats have limitations on content, such as length limits or character set restrictions. Inserting additional characters into the data may violate these externally applied requirements, and interfere with processing. In the worst case, portions (or all of) the data value itself might be rejected, corrupted, or lost as a result.

- Inserting additional characters changes the identity of the string. This may have important consequences in certain contexts.

- Inserting and removing characters from the string is not a common operation for most data serialization libraries. Any processing that adds language or direction controls would need to introspect the string to see if these are already present or might need to do other processing to insert or modify the contents of the string as part of serializing the data.

----

Other than #3 (the one about string identity), I find these wholly unpersuasive. And even #3 isn't that great a reason considering that programmatic processors have to deal with that issue anyway due to case folding.

[1] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0039....

[2] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0038....

[3] https://www.w3.org/TR/string-meta/


What this gets right down to is that Unicode is a flawed idea: the meaning/behavior/whatever of characters is insanely dependent on their context.

The problem was never gazillions of code pages, but our inability to write C to deal with that amount of complexity circa 1990.

With modern machines, and good programming languages with good type systems, I absolutely think we could store a language per string, and concatenate into a polylinguistic rope if needed.

This would hopefully push us away from stringly-typed crap in general.


Unicode goes to great pains to avoid ascribing any meaning/behavior/whatever to character sets. Because to your point you can’t. Unicode is actually incredibly well thought out. That’s why we have values, code points and grapheme clusters. I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points.

If you want to build a polylinguistic rope you can certainly do that with Unicode, but you won’t have solved anything because language alone without context doesn’t really define many of the operations you’re describing.

The answer is usually the same as “doctor it hurts when I...” — stop doing it. Stop manipulating user input without context. Stop trying to limit user visible strings by character count, use pixel width in the rendered font. And so on.


> I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points

Sure it does; the Unicode Character Database includes fields for the lowercase, uppercase and titlecase mappings. But it also acknowledges that these are just default mappings, and may need to be tailored for specific languages/locales.


Unicode is well thought out! And that's what makes it hard to critique :). I think it's one of the best-maintained, well-thought out standards there is, but I still think the premise is wrong.

If all that good effort went into something along the lines I am describing, where languages, or at least scripts, cannot be arbitrary mixed at the character level, I think we would have an even better result with the same level of effort.


If you treat Unicode as your backing representation -- a pile of glyphs -- you can build what you're asking on top right?


That is like saying I can take an untyped language and then add types. Sure you can! That said, it's much nicer (to me) to first define the typing rules (static semantics) and then define evaluation (dynamic semantics) only on well-typed programs. This avoids the need to include lots of annoying stuff in the domain. See my other comment, https://news.ycombinator.com/item?id=24180620, for an example of something I rather leave ill-typed.

That said, any "multicode" had better describe the interop with "unicode" in great detail for practical reasons. Still, this is the "FFI", and one can be careful not to let it muddle things, by e.g. not allowing every Unicode string to be imported without additional metadata.


I'm suggesting it's more like layering a programming language on top of assembly. The lower level is the universe of what you can do (in this case, the set of all glyphs) and the higher level is an imposition of specific constraints (in your case, which ones go together).


Languages need not be defined by how they compile. (If they do, we tend to call it "desugaring".) At the very least, they usually compile to multiple ISAs, and none is more definitive than the other.

I am happy to define how to translate Multicode to Unicode, but I wouldn't want any of the internal notions of Multicode to be defined in terms of that translation.


> the meaning/behavior/whatever of characters is insanely dependent on their context

I wish you would give an example instead of just proclaiming crapness. You know, so we n00bs can learn something.


Different languages have different rules for changing case (as seen here) or for what to do when transliterating to 7-bit ASCII: in French you can mostly drop accents if you need to; in German you need to transform an umlaut into an e following the vowel. Of course, many languages don't have a way to transliterate to 7-bit ASCII at all.

Sorting of strings is language-dependent, but I don't know that there's a defined order for mixed-language lists, so I guess the user's context works if you're sorting for user purposes; but if you're sorting for machine purposes, you'd better not use the locale-aware sort without telling it a hardcoded locale that doesn't change between localization library versions.
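
To make the transliteration point above concrete, a rough Python sketch (the function names and mapping tables are mine and far from complete):

    import unicodedata

    def to_ascii_french(s):
        # French: dropping accents is usually acceptable ("déjà" -> "deja")
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

    def to_ascii_german(s):
        # German: umlauts become vowel + e, and ß becomes ss
        table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                               "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})
        return s.translate(table)

    print(to_ascii_french("déjà vu"))  # deja vu
    print(to_ascii_german("Straße"))   # Strasse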


@toast0, @lazulicurio, both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea" as the original poster said. AFAICS this is intrinsic complexity showing itself and does not give any indication of how it should be done correctly, or better.


> both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea"

The flaw in Unicode is that it punts on the intrinsic complexity---pretending that codepoints have language-independent, plain-text, semantic meaning.

A couple of threads that have molded my views over time:

I can't write my name in Unicode https://news.ycombinator.com/item?id=9219162 (Specifically these two comments https://news.ycombinator.com/item?id=9220530 and https://news.ycombinator.com/item?id=9220970)

Why isn't the external link symbol in Unicode? https://news.ycombinator.com/item?id=23016832


> The flaw in Unicode is that it punts on the intrinsic complexity---pretending that codepoints have language-independent, plain-text, semantic meaning.

> Pretending "plain text" isn't an oxymoron

FTFY :)


The benefit of looking at languages/scripts in isolation is that the combinatorial explosion of all languages/scripts at once is dodged.

E.g. lookalike characters, and social engineering by using a vs а. (One is Cyrillic). I don't want to even define "a == а". I want Latin and Cyrillic to be different types of characters, and that expression to be ill-typed.

This solves the Turkish problem, where the upper case I is two different characters in two different types (Turkish Roman script?), and the case folding functions likewise have disjoint types.


> I want Latin and Cyrillic to be different types of characters

How do you concatenate English and Ру́сская text, and what is the type of this sentence?


[Either [Latin] [Cyrillic]] is a very simple type, taking advantage of the fact that the language only switches at word boundaries.
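
A rough Python rendering of that type, just to make the idea concrete (all names here are invented):

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class LatinWord:
        text: str

    @dataclass
    class CyrillicWord:
        text: str

    # ~ [Either [Latin] [Cyrillic]]: a sentence is a list of words, each tagged
    # with its script, and the script can only change at word boundaries.
    Sentence = List[Union[LatinWord, CyrillicWord]]

    s: Sentence = [LatinWord("English"), CyrillicWord("текст")]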


Huh. That doesn't quite address my objection (CamelCase like EnglishEtРу́сская still un-works), but that's actually a good point in the overwhelming majority of cases. I'm not quite convinced this approach works in practice (I'm sticking with "A"="A"="A"), but I'd definitely like to see a more technically fleshed-out design.


How about: case folding for the letter 'I' is dependent on whether the locale is Turkish or not.

;)


Unicode supported this with tag sequences but that is deprecated and unlikely to work with modern libs.



.NET is one of the few ecosystems to get this right. It offers the invariant culture for identifier-like things, "fr" for the French language and "fr-FR" for French as used in France, allowing you to specify your intention to every string-modifying function.

Support at the type level would be a lot less verbose, but support at the function level is already much better than many other popular languages.


It would be great if strings and especially date-time values always carried locale and timezone information with them.

It would take slightly more memory but not significant on modern machines.


Putting the locale information on the string sounds like a good idea. However, I'm not sure how that should handle combined strings with components from different locales. For example `logLevel + ": " + logMessage` might produce "info: bağlantı kesildi" in Turkish. How to annotate that? Neither English nor Turkish would work correctly; each would produce the wrong result when uppercasing.

You could treat it as a series of string slices with different locales `[("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]`. That would work correctly, and you could now uppercase each slice according to its appropriate locale, but it wouldn't really be low overhead anymore. Maybe still worth it. It would be an interesting approach that might even be able to be implemented pretty seamlessly as a library in some languages (C++ or rust for example)
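
Something along these lines (a rough sketch; the Turkish tailoring here is hand-rolled purely for illustration):

    def upper_localized(parts):
        # parts: list of (text, locale) slices; "" marks locale-neutral pieces
        out = []
        for text, loc in parts:
            if loc == "tr":
                # Turkish tailoring, hand-rolled for illustration: i -> İ, ı -> I
                text = text.replace("i", "\u0130").replace("\u0131", "I")
            out.append((text.upper(), loc))
        return out

    msg = [("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]
    print("".join(t for t, _ in upper_localized(msg)))  # INFO: BAĞLANTI KESİLDİ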


That just seems to be a parameter for locale-dependent functions. Very useful, but no, I'm talking about splitting the unicode-string datatype in two: "user-facing unicode string" vs "internal unicode string".

Example: logging.log("INFO", i"This is a localizable string")

In the i18n world, we could gather i-strings just like gettext does (where it looks like `logging.log("INFO", _("This is a localizable string"))`). The language could then have other useful hooks/behaviours into that datatype, and definitely one of them would be whether various methods have i18n behaviour enabled on them, versus using a C locale.


In Java, there is Locale.ROOT, which can be used in a similar way. In particular, it is useful when performing locale-dependent operations in locale-independent contexts (e.g. working with case-insensitive identifiers) where you don’t want the behavior of your code to depend on the current default locale.


That would be great! For example, in Python you currently have to do something like this

    import locale
    locale.setlocale(locale.LC_COLLATE, "")  # without this, strxfrm uses the default C locale
    sorted(list_of_strings, key=locale.strxfrm)
to sort using the current locale, which many people forget.


https://garygregory.wordpress.com/2015/11/03/java-lowercase-...

In the Turkish locale, the Unicode LATIN CAPITAL LETTER I becomes a LATIN SMALL LETTER DOTLESS I. That’s not a lowercase “i”.


My genius idea was once to use toupper() to normalise paths on Windows, which are case-insensitive. One day, a customer from Azerbaijan reported that my application failed to access a file in C:\WİNDOWS\...


i feel your pain


07/04/2008 -> April 7th seems about as reasonable a result as July 4th, especially when you've explicitly opted in to a Turkish locale. I don't agree with the article's assertion that interpreting the format according to the user's locale is wrong here; the one wrong part is a US-centric programmer's expectation that PP-QQ-YYYY is an unambiguous format. Use YYYY-mm-dd when you need a format that's not ambiguous.


YYYY-mm-dd also plays nice with lexicographic ordering, which is why I always use it when I need to put dates in e.g. filenames
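
E.g., plain string sorting already gives chronological order:

    dates = ["2020-08-16", "2019-12-31", "2020-01-05"]
    print(sorted(dates))  # ['2019-12-31', '2020-01-05', '2020-08-16']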


I'm a European working primarily with Americans. My home country uses dd/mm/YYYY (or dd/mm for short) and the US uses mm/dd/YYYY (with mm/dd for short). I've switched to YYYY-mm-dd simply for my own sanity, and if I omit the year I write the month in text format, such as "5 June".


The US military uses almost the same convention (dd-mmm-yyyy), so 07-aug-2020.


That’s dd-mmm-yyyy


Thanks!


Note: This is actually a reply to the article here: https://news.ycombinator.com/item?id=24178270 , for some reason I thought that was the top level link.

Maybe if dang sees this, it could be reparented?


> 07/04/2008 -> March 7th

I think you mean April.


Fixed


I don't get why people don't just use something like 8m16d2020y. There, it's the same number of characters and clearly unambiguous even to someone who hasn't seen the format before.


Japan does something like that; according to Wikipedia (https://en.wikipedia.org/wiki/Date_format_by_country), its date format is 2020年08月16日


> PP-QQ-YYYY is an unambiguous format

“US centric” is one way to say it


Repeat after me: don’t do string operations without explicit locale. Don’t do string operations without explicit locale.

I don’t know why so many languages have string functions that should take a locale but provide an overload that doesn’t and which uses the system locale as the default. It can’t be what many developers actually want, yet it has become the norm. Worse, code using a default locale appears to work on the developers machine and in production, until someone parses a number in France or lowercases a string in Turkey, which is a late and expensive discovery of the bug.

The default shouldn’t be the system locale, it should be an invariant locale. And I’ll go so far as arguing this invariant locale should be invariant across systems (meaning it can’t just defer to a system C library either).
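
To illustrate the difference in Python (the locale name here is an assumption and may not be installed on every system):

    import locale

    # Locale-dependent parsing: the result depends on whatever LC_NUMERIC is set to
    locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")  # comma as decimal separator
    print(locale.atof("1,3"))  # 1.3

    # Invariant parsing: the same on every machine, regardless of locale
    print(float("1.3"))        # 1.3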


> I don’t know why so many languages have string functions that should take a locale but provide an overload that doesn’t and which uses the system locale as the default.

That's a relic from the past, before Unicode became prevalent, when systems only ever worked in a single locale and users expected applications (running locally, of course) to use the local system locale. Hence applying the system locale to everything was the standard behavior for applications. The C standard library was defined that way, and since then every other runtime (usually based on C at some level) has done the same.


FTR, Python does not in fact do this. Python 2 did have this locale-dependent behavior, but Python 3 has never behaved this way. The workaround in the OP is, thankfully, quite obsolete.

If you call a case-related method like `lower` on a Python string, the behavior you get is based on tables which are built into Python, taken straight from the Unicode standard's data files, and completely independent of your system configuration.

It would be nice to also have the option of explicitly using a particular locale. Here's a discussion from 2019 about potentially adding that option: https://bugs.python.org/issue37848 You'll be glad to see everyone there agrees the default should remain invariant.


I ran into this with C#/.NET on Windows - I tried to convert a string "1.3" to the float 1.3, and it failed on languages that use comma as their decimal separator.

That was a learning experience.


Indeed. As a person from a comma country, I find these mistakes in most code bases I look at. It makes it frustrating to contribute to open source, for example.

Perhaps it’ll make you feel better about your parsing bug that even the C# compiler (Roslyn) code base had several of these issues.


For a similar reason, Java on Mac and Linux was briefly broken for anyone using it in the Turkish locale. It was because in the Turkish locale, !"POSIX".toLowerCase().equals("posix").

Relevant bug report here: https://bugs.openjdk.java.net/browse/JDK-8047340


As it isn't yet mentioned: for these cases the Python standard library explicitly has https://docs.python.org/3.8/library/stdtypes.html#str.casefo... (str.casefold), which aggressively lowercase-normalizes strings with an algorithm from the unicode standard. Every case comparison using lower() instead of casefold() can be considered a bug.
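
The canonical illustration is the German sharp s:

    print("Straße".lower() == "strasse")     # False -- lower() keeps the ß
    print("Straße".casefold() == "strasse")  # True  -- casefold() maps ß to ss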


> Every case comparison using lower() instead of casefold() can be considered a bug.

If you just casefold two strings and compare them, it's still a bug. You need to normalize them to NFKC first.
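
A small sketch of normalize-then-casefold (the helper name is mine):

    import unicodedata

    def caseless_equal(a, b):
        # Normalize first (NFKC, as suggested above), then casefold
        def nfkc(s):
            return unicodedata.normalize("NFKC", s)
        return nfkc(a).casefold() == nfkc(b).casefold()

    precomposed = "Caf\u00e9"   # é as a single code point
    decomposed = "Cafe\u0301"   # e + COMBINING ACUTE ACCENT
    print(precomposed.casefold() == decomposed.casefold())  # False
    print(caseless_equal(precomposed, decomposed))          # True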


Is NFKC necessary, isn't NFKD enough? (As in you have to normalize and decompose both strings, but at that point you can check them for equality, and doing the canonical composition isn't needed, right?)


I think that would work if you're just checking for equality and want to minimize processing. I guess as a web developer I always just assume people are going to be storing strings in a database after normalizing them, so would want to minimize string length.


Correct: you would get "ınfo", "warnıng" and "crıtıcal" in Turkish and in Azerbaijani.


Further context:

https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I

Did not know Istanbul is actually İstanbul.


Me neither. I did know it's not Constantinople, though.


Constantinople (Fatih) is the capital town of Eistipolis (Istanbul).


Please stop doing this. Don't bind the lower()/upper() functions to environment variables or anything else system-related. Sun did this in Java and didn't even bother to mention the issue in the documentation. It caused huge problems for more than a decade.

You can just make the string lowercase()/uppercase() functions work the same everywhere, regardless of locale settings. Provide a special-case function like lowercaseTR() for Turkish. This works very well in Go.

By the way, Azerbaijan has the same problem, because they accepted help from the wrong guys when they switched to the Latin alphabet.


You'll be glad to hear that Python did stop doing this: Python 3 has never behaved this way, and its `lower` and `upper` methods have always been independent of your locale or anything else from your system.

The workaround in the OP was added in 2006 (note the reference to an issue on "SF", i.e. SourceForge -- another era!), and is now long obsolete.


Very much so. Thanks.


> lowercaseTR()

Huh, that works well if we know the input string is in Turkish. What if this information is not available as you're writing the code?

And what will lowercase()/uppercase() be hard coded to do, and what are they supposed to output when the input isn't ASCII?


Give me an example. I'll try to find the best -IMHO- solution.


In C (POSIX.1-2008, specifically), there's tolower_l() and the rest of the _l functions for this use case, which take a locale as an argument. That lets you ask for the English (or even "C locale") lowercase versions of these English words, even when your process's current locale is Turkish.

https://www.man7.org/linux/man-pages/man3/tolower_l.3.html


The mention of _l functions reminded me of this gloriously over-the-top git commit message/rant.

"Those not comfortable with toxic language should pretend this is a religious text."

https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...


Looks like it's no longer the case in Python 3:

   Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
   [GCC 8.3.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> from locale import *
   >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
   'tr_TR.UTF-8'
   >>> 'INFO'.lower()
   'info'


Oddly, it also wasn’t the case for Python 2 Unicode strings (u'INFO'), only for Python 2 byte strings ('INFO'). So it’s possible that Python 3 lost this behavior by accident.


On some more digging through history, it looks like the change in behavior for byte strings was intentional: https://github.com/python/cpython/commit/6ccd3f2dbcb98b33a71...

Author: Guido van Rossum <guido@python.org>

Date: Tue Oct 9 03:46:30 2007 +0000

    Replace all (locale-dependent) uses of isupper(), tolower(), etc., by
    locally-defined macros that assume ASCII and only consider ASCII letters.


    Python 3.7.5 (default, Nov 5 2019, 22:30:48)
    [Clang 11.0.0 (clang-1100.0.33.12)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from locale import *
    >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
    'tr_TR.UTF-8'
    >>> 'INFO'.lower()
    'info'
    >>> '🧘 ️'.lower()
    '🧘\u200d️'
    >>> exit()

There's something wrong with emojis + lower() though


It lowercased the 'show this as emoji' variation selector to zero width joiner?


I remember running into problems with SQL stored procedures where column and table names were case-insensitive, so you don't know whether you've typed all the column and table names with consistent capitalization. Until a customer in Turkey eventually installs it and you find out you've missed the proper capitalization of an identifier containing the letter "I", and the stored procedure fails.


Honestly, I'm very pro case-insensitivity, but my experience with SQL servers has impressively demonstrated how not to do it.

For example, MS SqlPackage, used for deploying schema, is case-insensitive... But that also means case-only changes to text constants within your stored procs don't get treated as changes.


This is what I usually think about whenever people say yay to Unicode in language identifiers.


"I" is in ASCII.


"İ" and "ı" are not.


Note to the next language designer: don't use strings as a substitute for enums.


It might be OK if strings are immutable and therefore internable.


It doesn’t prevent someone from calling your function with “INFO” instead of “info”, does it?



ITT calling setlocale or std::locale::global(...) is ALMOST ALWAYS a heinously bad idea and should rarely be done, because it breaks tons of code (notably everything that uses printf/scanf and everything using stringstream).


I think things like these should be explicit. Even if it's convenient to have a default, it should be what most people would expect.

For example, instead of .lower(), we could have .lower_ascii(), .lower_turkish() or .lower(locale). But I know it would be tedious to specify the locale every time, so it makes sense to have .lower(locale=DEFAULT_LOWER_LOCALE). What DEFAULT_LOWER_LOCALE should be is worth debating, but I think it shouldn't introduce unexpected behavior.
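
A hypothetical sketch of that API (everything here is invented for illustration, and the Turkish tailoring is hand-rolled):

    DEFAULT_LOWER_LOCALE = "und"  # an invariant/root locale

    def lower(s, locale=DEFAULT_LOWER_LOCALE):
        if locale == "tr":
            # Turkish tailoring: I -> ı, İ -> i
            s = s.replace("I", "\u0131").replace("\u0130", "i")
        return s.lower()

    print(lower("INFO"))        # 'info' -- invariant default
    print(lower("INFO", "tr"))  # 'ınfo' -- explicitly Turkish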


Stringly typed: Play stupid games, win stupid prizes.


The PHP interpreter has an internal reimplementation of string case conversion that's ASCII-only in order to avoid this problem.


doesn't php have this exact problem with their case-insensitive (hate that btw) function/method names and turkish localization? or did they actually fix it at some point?


I'm guessing that they might have "fixed" it by implementing the ascii-only tolower function, but yes, PHP used to not work properly with Turkish localization.


Why do you think the interpreter needs such a function?


Serious question.

Why on earth would you hard-code these, instead of simply calling a lowercase function in the en-US locale?

These are English words. Naively lowercasing them according to whatever locale the server or user has set seems like a terrible programming practice. Any call to a lowercase function should be explicitly including an argument that specifies it's English, no?

In the same way we've all learned to never store times without an explicit timezone (even if it's UTC), or locate a string offset without knowing your encoding... you should never perform language transformations (case changes, accent removal, etc.) without a locale.

Hardcoding these things is just patching over the symptoms without addressing the cause, no?


Hence toUpper/toLower is not a strategy that passes the Turkey Test for case insensitivity.



This particular case seems odd to me because INFO is an English word, and ınfo is not.


You could make a case that Unicode should have different "i" characters for different languages. Then you could do all transformations unambiguously. On the other hand almost everyone abuses the minus sign as a dash, and treats the apostrophe and the prime sign (signifying feet or minutes) as interchangeable, so in all likelihood they would constantly use the wrong i too.


I have a better solution: use combining characters COMBINING DOT ABOVE (which already exists) and DELETE DOT ABOVE (which needs to be added into Unicode), which would manipulate "I" into "İ" and "i" into "ı" respectively. Those combining characters would also work perfectly with j too.


The only issue I can see is with people working in a Turkish locale writing Latin text producing, let's say English blogposts with the wrong i and I. I still think that this should have been done this way though...


Indeed. LATIN SMALL LETTER I + DELETE DOT ABOVE becomes LATIN CAPITAL LETTER I + DELETE DOT ABOVE in uppercase, which then becomes LATIN SMALL LETTER I + DELETE DOT ABOVE back in lowercase. The same thing applies to LATIN CAPITAL LETTER I + COMBINING DOT ABOVE. Survives infinite number of case conversions.


> On the other hand almost everyone abuses the minus sign as a dash

Unicode calls it HYPHEN-MINUS. It does also have an unambiguous ‘−’ MINUS SIGN as well as ‘‐’ U+2010 HYPHEN and the various dashes, but most people use bad keyboard layouts.


> You could make a case that Unicode should have different "i" characters for different languages.

And different "SS" for any case where the lowercase was an sz, of course at some point Germany introduced an uppercase SZ character to avoid that round trip loss issue, but we still have tons of text that use the old sz -> SS conversion. Also note that "y" in Germany, not all German speaking countries follow the same rules for sz, some dropped it entirely. We basically need something like the time zone database to have even a snowballs chance in hell to handle text correctly.


Well a round-trip or two could still be ambiguous which could easily fail when comparing strings later in some edge case. Especially when we can't even consistently agree to use by-application, by-OS, by-language and by-locale settings consistently. I don't have a solution, just pointing out that this is a really challenging problem to fully solve.


Pretty sure that’s not true. When you switch your keyboard you will have a proper i character in another language unless your keymap is broken. How do you think Chinese, Russians or Greek type their characters?


The grandparent obviously meant “latin i”; none of the three languages you mention have any latin letters, but at least Russian and Greek have some lowercase and some more uppercase letters with the same glyph/shape as latin ones.


Yeah, and those similar glyphs are not available on their own language keyboard.


I frequently type German with a US layout with dead keys (so I can type "a to get ä). I also imagine that most Turkish developers type English on a Turkish layout, since Turkish contains all characters used by English.


I'm going to be a bit controversial here and say that that mapping logic should always exist even if toLower() were reliable across all locales. You're mapping between different use cases here, eg. internal to logfile to API to database to method name to whatever, and inserting magic transformations in your constant values rather than treating them as different tokens for different use cases constrains you and introduces unnecessary amounts of "magic".



Just for the record, something very similar can happen when creating CDs/DVDs (worth reading up on when using mkisofs and similar tools): depending on the ISO 9660/Joliet/Rock Ridge convention in use, the dash (among other characters) becomes an underscore when names are "capitalized".

https://web.archive.org/web/20151007005513/http://www.911cd....


The practice of converting enum-like keys into their string representation by using toString, toLower, etc seems convenient but gets very contrived very fast. How do you deal with underscores? What about using the message in a sentence? I say, use the enum in your code as a conditional or something but always explicitly write out the messages intended for the user.
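
A small Python sketch of the explicit approach (the enum and labels are invented for illustration):

    from enum import Enum

    class Level(Enum):
        INFO = 1
        WARNING = 2

    # Explicit user-facing strings instead of Level.INFO.name.lower()
    LABELS = {Level.INFO: "info", Level.WARNING: "warning"}

    print(LABELS[Level.INFO])  # 'info', regardless of locale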



I learned about this in javascript when I discovered Angular has its own lowercase method. Apparently it's internal only now.

https://github.com/angular/angular.js/commit/1daa4f2231a89ee...


Yeah, there were some weird bugs about that. I remember one in a media player. Also "info".upper() would be İNFO probably.



I think we should have stopped at ASCII. I don't care that my language has letters that aren't in there; it'd be neater if we just did now what we did back then: "This is a computer, so everything is in English" :) Or adapt the alphabet to use ASCII.


In the Danish locale "aa" doesn't start with "a".


Dumb question: if you really need the exact string "info" in a given context, why not hard-code it? What does .lower() or even a map like the linked one actually buy you?


Maybe the input is case-insensitive. For example, if you work with HTML you might see "DIV" or "div", and who knows, some crazy dev or tool might generate "DIv" or "dIv", so it's simpler to lowercase the input and then work on it.


Presumably it's for normalising input. Following the principle that you ought to be permissive in what data you accept, and strict in what data you give out.


Wouldn't converting to NFKD/NFKC first solve this issue too? My understanding of those forms was that they're made exactly for this case.


Case mapping and case folding are independent of normalization (in practice, and it is the case here; see the end of SpecialCasing.txt).

There is a good Unicode FAQ on the topic: < http://unicode.org/faq/casemap_charprop.html >

E: to elaborate, I'm not sure whether the independence of case handling and normalization is guaranteed anywhere; if we were, for example, to change the uppercase of ſ to something other than S, then the case handling of its compatibility form (s) would differ. In practice, SpecialCasing.txt is designed to "make it work" (e.g. ſ uppercases to S).


No, these are ASCII strings, so they are already normalized.


Oh, I haven't used python much, but I thought it's all Unicode? If this were ascii it would work out of the box since there is no dotless lowercase i in ascii.


There is no code point for TURKISH LOWERCASE DOTTED I, nor for TURKISH UPPERCASE DOTLESS I, which means the text doesn't carry enough information for round-trip preservation.

I believe this has proven to be a mistake but I'm not an expert. I don't know why it wasn't done.


The "İ" strikes again


what a gorgeous source-comment. Makes the non-obvious crystal-clear.



