In the Turkish locale, "INFO".lower() != "info" (github.com/python)
188 points by duckerude on Aug 16, 2020 | 282 comments



Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

It seems to me that in practice, it's extremely rare to want to change case of real, natural-language text. When I have natural-language text, it's just a blob to me, and I don't want to touch it.

The only time I ever want to lower-case or capitalize something, I'm working with identifiers meant for computer -- not human -- consumption. Usually, specifically, I'm dealing with identifiers that have annoyingly been defined to be case-insensitive even though the only humans that ever see them are programmers and programmers hate case-insensitivity. HTTP headers are a common example.

I mostly write C++, and I end up writing code like:

    for (char& c: str) {
      if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
    }
Later on, some well-meaning developer on my team will come along and say "Ugh what is this NIH syndrome?" and then they "clean it up" as:

    #include <ctype.h>

    for (char& c: str) {
      c = tolower(c);
    }
And then I have to say NOOOOOOO DON'T DO THAT YOU HAVE NO IDEA WHAT tolower() REALLY DOES!

I struggle to imagine any real use case where you'd actually want locale-dependent tolower() other than, maybe, a word processor -- but if you're writing a word processor, you're probably not going to be depending on the language's built-in string APIs to do your text manipulation.


This is a classic case of a 'why' code comment being needed. It's obvious what you're doing, but without a 2 line explanation, it's not clear why.


Seems like it would be even better to put this in its own function with a descriptive name, ascii_tolower or roman_tolower or whatever, that has exactly the semantics you want.
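
For instance, a minimal sketch of such a helper (the name and comment are just illustrative):

    #include <string>

    // Lower-cases ASCII letters only. Deliberately NOT tolower(): that is
    // locale-dependent (e.g. 'I' -> 'ı' under a Turkish locale), which is
    // wrong for protocol identifiers like HTTP header names.
    inline std::string ascii_tolower(std::string s) {
      for (char& c : s) {
        if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
      }
      return s;
    }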


This is exactly right, and is a great example of what self-documenting code can be. The function itself could have a bit more explanation but any code calling it is going to be obvious.

The big difference is it looks deliberate, instead of just code written by someone trying to micro-optimize, be very clever, or who just didn't realize tolower() exists. Most people will pause before just replacing it, and likewise it should trigger questions in the PR.


Still warrants a comment to be sure no one concludes the built in is good enough.


It certainly warrants a test to document what the function is for.

And if that test also happens to validate that the documentation is accurate, that is a nice side benefit.


> It certainly warrants a test to document what the function is for.

You might be being serious, so I'll indulge. How would a test that will never break, for a function that would never be changed (because look at it), that lives in a different part of the directory structure, be worth even one second of extra time or thought to write, over and above the descriptive comment that is longer than the code itself and lives discoverably right next to it?


Because that's how you stop regressions.

The cost of writing a single test case is not more than the cost of diagnosing what change broke your code for Turkish users.

Now there's always a point where maybe the infrastructure for testing that kind of thing doesn't exist, so writing your one "simple" test case takes a bunch of time, but on the plus side, future similar "simple" test cases will be easy at that point. And no one has to track down why your code is broken in Turkey.
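
To make it concrete, a minimal sketch of such a test (assuming an ascii_tolower helper like the one sketched above, and that the tr_TR.ISO-8859-9 locale is installed; if it isn't, setlocale() fails and the interesting case just isn't exercised):

    #include <cassert>
    #include <clocale>
    #include <string>

    // Hypothetical helper under test.
    std::string ascii_tolower(std::string s) {
      for (char& c : s) if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
      return s;
    }

    int main() {
      std::setlocale(LC_ALL, "tr_TR.ISO-8859-9");  // may fail if not installed
      // Would fail if someone "cleaned this up" to use the locale-aware tolower().
      assert(ascii_tolower("INFO") == "info");
    }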

Many years ago while I worked on IM/IMEs on Mac and Windows, I spent maybe a week working on code that allowed an IME to be implemented in JS (within the WebKit test harness), so it was possible to test and prevent regressions that kept being reintroduced by people changing layout and/or editing code in ways that are "obviously correct" for US/English keyboards. The win from that week is many regressions that were caught before they were even committed, and the ability to completely rewrite the text input system to support IMEs on non-Mac platforms.


This isn't exactly addressing my point. Never did I say that 'All tests are useless'. Refer to my answers here:

https://news.ycombinator.com/item?id=24183844


I've found that writing a unit test to verify it works at all is sometimes loads easier than manually running the app or whatever.

At that point, just leave the test and check it in.


I agree with this, kind of. Though I've found that keeping useless tests around doesn't always have 0 cost, same for any dead code.

Often, though, you can verify that your code works using a REPL.

I'm also a huge fan of doc-tests in the languages that make tests part of the documentation and neatly 'cohere' them to the function itself. At which point I'm happy to leave them in, as tests-as-documentation are harder to miss, and are actually instructive.


Comments can be ignored, moved, misinterpreted. A test asserts correct behaviour. The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test - which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

Machine-time to run tests is cheap (if it's not in your codebase, then I agree that your benefit calculus may be different) - human cognition and awareness to prevent mistakes is valuable.


Keep in mind, I'm not advocating for 'no testing, ever'. But surely there's a limit to where you say something is so trivial that it won't ever change. And it's with these preconditions I ask the question:

> a test that will never break, for a function that would never be changed

I'll refer you to the OPs low-tech to-lower function, which is reductively simple, and never should be altered, with a comment as to why.

> Comments can be ignored, moved, misinterpreted.

Don't disagree with this. Someone lacking competence and care may ignore a comment, be incapable of comprehension, or just move things for no reason. This should be caught at review.

Tests aren't infallible either. They can be invalidated, disabled because someone lacking competence decided it was the easiest way to move forward. This should be caught at review.

Edit: I'll address some of your other points more specifically...

> The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test

In OP's precise example, a worse thing can happen: they can shrug their shoulders and just use the built-in directly. A human-context, business-value explanation has a more powerful benefit here.

> which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

If no-one's reviewing when someone changes the actual code and its adjacent comments, who's reviewing the changes to the tests?

> human cognition and awareness to prevent mistakes is valuable.

Extra code comes at a maintenance and cognition cost. Maybe one trivial test seems like a minor cost, but how about the maintenance of 1000s of tests that (ought to) always pass?


C# has a `ToLowerInvariant()` variety for that.


Which iirc is an alias for ToLower on the en-us locale. (Same for the other C# *Invariant() methods)


It calls ToLower(CultureInfo.InvariantCulture);

Locale name is iv-IV.


Comments are unreliable; you should use tooling to fix this forever. E.g. findbugs has a rule for this problem: http://findbugs.sourceforge.net/bugDescriptions.html#DM_CONV...


Yeah I probably wrote that comment the first few times I did this but it's hard to write it the 50th time.

Maybe I should have my own tolower() function that I can call so I only have to write the comment once but it just feels ridiculous somehow.


It's far more ridiculous to repeat yourself over and over instead of making a simple function that describes exactly what you want and why.


> but it's hard to write it the 50th time.

You know you wrote it 49 times before, but the person reading the code probably doesn't. It only changes your experience not everyone else's.

If it's the same codebase, just write that function :)


Write the function, comment it!

Many of these are obvious to many people here, but some aren't.

Even I can admit that some of the stuff in this thread is not obvious at all.


Why does it feel ridiculous?


Because I've already rewritten more of the standard library than is healthy.

I mean, it's clearly the right thing to do here but I can predict the conversation that will inevitably result... "You wrote your own tolower() function? Why?" "The standard one is horribly broken." "How could a function that lower-cases a letter be broken??? Jesus Kenton your NIH syndrome is out of control." "Sigh..."

(Slightly more seriously, any particular time I need to lower-case something, it takes 10 seconds to write out the code, but would take 10 minutes to find a good place to define a reusable function and exactly what its API should be, and so it never seems worth the effort in the moment. Just like how most messy code comes to be.)


This conversation can simply be avoided by copy-pasting your original Hacker News comment into the library function header.

I have noticed some coworkers have their ego gratified by being right while everyone else is wrong. Instead of simply explaining what they are doing when they are doing it, they will do something that looks wrong in a very noticeable way and wait for the backlash. The backlash gives them an opportunity to show everyone else how they were right while everyone else was wrong and also an opportunity to play victim. However, in SW development - it is not just the technical details - your behavior also matters in a big way.

In this particular case, the correct approach is to create your own library function with appropriate comments. This is why the concept of a library function was invented. It is its entire raison d'être. However, you are doing everything but that. Including providing justifications in Hacker News comments instead of your source code.

Now inevitably, someone will change your inline code to use to_lower. This will give you an opportunity to scream bloody murder, show how other engineers don't really understand technical details, correct them and also play victim. Create a library utility with comments and link it in - End of story.


I’m reminded of garbage collector blog posts where they do something stupid (“disable it”, “allocate a ballast”) and then get to spend a couple pages explaining why it worked for them.


Speaking of people wanting to gratify their ego by being right: Everyone on this thread trying to lecture me on software engineering? ¯\_(ツ)_/¯


At least, you get to play victim :)


Looks like you're only 10 times away from using this to having spent your 10 minutes ;)

But seriously, I'm not here to lecture you; personally I'd appreciate having a teammate educate me on the undesired behavior and a nice function I could use to ensure my own code doesn't break user input.


Most codebases I've worked with have a StringUtils.java, or .kt, or a str.c or utils.c. Maybe just start one. Interestingly I haven't needed it as much in Ruby.

But I too feel the cognitive (and social!) burden of introducing a new function. It's not just "where do I put this", but "how do I convince the team I know what I'm doing since 15 years of experience clearly isn't enough and developers (mostly rightly) ignore positional authority and seniority".


  #include <kentonv.h>



This project has the greatest sales pitch I've ever seen: https://github.com/capnproto/capnproto/blob/master/README.md -- combined with the project name it's just perfect.


Java has two variants of toLowerCase(): one which uses the default/current locale (almost never what you want), and one which receives an explicit locale (Locale.ROOT is almost always the one you want). At work, we use the "forbidden APIs" checker (https://github.com/policeman-tools/forbidden-apis) to fail the CI if the variant which uses the default locale is ever used; if you really want to use a locale-dependent toLowerCase(), you have to explicitly call Locale.getDefault() and use it as the locale.

Is there something similar for C and C++? It could help in your case, by making your well-meaning colleagues aware of the issue.


> Locale.ROOT is almost always the one you want

At least Android developers are advised to use Locale.US: https://developer.android.com/reference/java/util/Locale

> The default locale is not appropriate for machine-readable output. The best choice there is usually Locale.US – this locale is guaranteed to be available on all devices, and the fact that it has no surprising special cases and is frequently used

It would indeed be interesting to see in which features these two locales actually differ.


Yes: the POSIX tolower_l(c, locale), which takes an explicit locale_t for each character.
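
A sketch of how that might look in practice (POSIX newlocale/tolower_l, pinning the "C" locale regardless of the process-wide setting):

    #include <ctype.h>
    #include <locale.h>
    #include <string>

    std::string tolower_c_locale(std::string s) {
      locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
      for (char& c : s) c = (char)tolower_l((unsigned char)c, c_loc);
      freelocale(c_loc);
      return s;
    }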


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

On many documents, including Turkish passports and identity cards and many (all?) other passports, names are written in all caps. Maybe toLower() is not that useful, but toUpper() is crucial in any application where you are dealing with real people's names.


toUpper is definitely language-dependent. For example, in Irish there are initial letters that are written as lower-case even in all caps. Wikipedia's example is amusing, since it's a photo of a government passport office sign - the all-caps version of Oifig na bPasanna is OIFIG NA bPASANNA (photo https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/AL..., article https://en.wikipedia.org/wiki/Irish_orthography#Capitalisati...). It would look utterly bizarre to write OIFIG NA BPASANNA. And this isn't at all an unusual construction in Irish, it happens in personal names all the time.

Plus, there's the issue of diacritical marks. Irish keeps long marks over capitals, but French drops accents. Do you plan to do é => É (required for Irish - POBLACHT NA hÉIREANN is the all-caps version of Poblacht na hÉireann [the Republic of Ireland]) or é => E (common practice for French)? You have to get it right, and you have to know the language to do that. (Poblacht na hÉireann also illustrates the fact that initial caps is also a language-dependent idea; you absolutely can't write Poblacht Na Héireann - that makes my eyes burn just looking at it.)

(And before you say, well, Irish isn't a language spoken by very many people, remember that it's an official language of the European Union. If you're writing software to be used by EU agencies, you're going to have to care.)


> French drops accents

The official position of both the Académie française and the Office québécois de la langue française is that accents must be preserved in capital letters. However, it is common in France to drop them, while they are almost always preserved in Québec. I have heard that the reason is that the European French keyboard layout makes it difficult to type accented capital letters, unlike the Québécois French layout, which makes writing them easy. But I am not sure if this is the cause rather than the effect of the practice.


I confirm that in French capital letters should have accents.

I have an anecdote on this: on birth certificates, family names are written in capital letters. It turns out my partner's name ends with a É which was written as E on her birth certificate. She never noticed (it had never prevented her from getting a national ID with her name properly accented) until we had our first kid, who has both our names, and they refused to have the name accented until we had my partner's birth certificate updated (which, as you can imagine, is quite an adventure, since you need to dig up ancient family birth certificates to prove it was originally written with an accent...).


French here, I recall my primary school textbook where they said something along the lines of "sometimes accents are dropped, that's sort of fine as long as it doesn't change the meaning". They gave the example of a fictitious newspaper whose headline was "UN POLICIER TUE": depending on the accent (tué/tue) it means either "a policeman kills" or "a policeman killed".


american here who's lived in france and still use an azerty keyboard because it lets me type in both languages. how do you get a capital A with an accent on an azerty keyboard?


Easily? You don't; with the most common AZERTY keyboard you have to use Ctrl+Alt+7, then Shift+A. That's why there is a new standard for the AZERTY keyboard, "NF Z 71-300", that is better with accents and stuff like æ, œ, Æ, Œ, «», etc.


Damn it. That's why I couldn't figure it out for years. You can't.. Is the new keyboard standard being used anywhere? Like if I walk into a common office in France and sit down at a laptop - is it likely to be using the new keyboard layout?

What's weird is I sometimes, even 10 years ago, would get an email from people in France, and it would have an accented A. Like, how did they do that..


i know that LDLC is selling one of these [1] ... and that's it, i don't even think it's coming to laptop anytime soon.

[1] https://www.ldlc.com/fiche/PB00279741.html


Not the answer to your question, but this is why I think Quebec uses accented capitals more than France. In Canadian French layout, there is a key for à. Simply using Shift+à gives you À.


classic canada, fixing the keyboard. now all that's left is above 69.

korean is worse. they base off of 10k, not 1000. so a million is hundred ten_thousand. bagman. but as bad as that is, it's no 97 amirite.


Interesting that I was wrong - another data point in the "it's more complicated than you think it is" column. I always thought you were supposed to drop them (because I was explicitly told so by a French engineer I worked with in the 90s, talking about one particular poster, and many years later I still assumed that one hallway conversation was enough to make that THE OFFICIAL RULE without bothering to actually check...)


It’s been a pet peeve of mine for quite a while, and an urban legend for quite a lot more. It was tolerated when all we had was typewriters (and even then you were supposed to add them, but it was cumbersome).


Wrongly dropping accents on uppercase letters predates computer keyboards; the French AZERTY layout puts accented letters on the first level of the number row:

http://j.poitou.free.fr/pro/img/tkn/tw-image.jpg

This idiocy carried over. The recent layout update makes dead accents more accessible:

https://norme-azerty.fr/

But I haven't seen much adoption yet.


That button to the right of the P that contains four different forms of dashes is... interesting.

Even more if you consider that the minus sign there is not the character - that is used by every programming language.


> That button to the right of the P that contains four different forms of dashes is... interesting.

And there are two more on the 8 key.

I like the mac international keyboard layout, but it still only provides for 4 of those: the non-breaking hyphen and the "proper" minus sign are lacking.

I like that the "new azerty" provides for pretty much every diacritic, even those which are not in use in french.


I blame the horrendous default settings of MS Word and Outlook for this, and the maddeningly convoluted way to enter accented caps on Windows. There is no context in which it is correct to omit accents in French, caps or otherwise.


Nope, none of them are really useful. The only useful folding function is casefold(str, locale) (the locale parameter being needed if your str type doesn't know its locale).

toLower and toUpper should only be used for presentation, but all case-insensitive operations need to be done with casefold.
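
With ICU, for example, that looks roughly like this (a sketch; the default folding is locale-independent, and Turkic dotless-i folding needs an explicit option):

    #include <unicode/unistr.h>

    // Case-insensitive equality via case folding, not via toLower()/toUpper().
    bool fold_equals(icu::UnicodeString a, icu::UnicodeString b) {
      a.foldCase();  // default full case folding
      b.foldCase();
      return a == b;
    }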


Of course! There are tons of cases where you need to store in "sentence case" (first word and proper nouns and acronyms capitalized, nothing else) so you can convert to title case or all-caps as needed for display purposes. Templates are full of this kind of stuff.

There are similarly tons of cases where you reduce everything to lowercase without accents for searching and indexing purposes. Depending on your setup, your database might handle that for you, but there are edge cases where you need to do it at the application level.

Long story short, every string has a locale, and you should never change the case of something without specifying its locale. Either be explicit that it's American English or ASCII or Latin1 or whatever... or that it's something else. Never leave someone reading the code guessing.
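
For example, with ICU the locale can be made explicit at every call site (a sketch, with the function name and locale strings as placeholders):

    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <string>

    // Lower-case UTF-8 text for a caller-specified locale, never the process default.
    std::string lower_for_locale(const std::string& utf8, const char* localeName) {
      icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
      s.toLower(icu::Locale(localeName));  // e.g. "en_US" or "tr_TR"
      std::string out;
      s.toUTF8String(out);
      return out;
    }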


> you can convert to title case or ... for display purposes.

I am skeptical when someone thinks they need to do this, and of how they will get it done.

E.g. looping through and capitalizing the first glyph after breaking whitespace regardless of locale is not the way to go, but I guarantee you a nontrivial number of people reading this would write exactly that if asked to solve the high-level problem.

I find it annoying when software or even in some cases human typists try to enforce English language title case. Some other languages have different rules for titles and capitalization and seeing the English rules enforced out of context can be jarring.


I find it amusing you are skeptical... why so distrustful? But believe me, it's quite necessary.

I use the citation manager Zotero a lot. It's necessary to store all the titles of journal articles and books in sentence case (e.g. "Issues regarding the economy of France") because some publications require citations to use sentence style (remains unchanged) while others require title style ("Issues Regarding the Economy of France").

And obviously the solution cannot be naive, but is language-dependent, so that in English words like "the" or "in" don't get capitalized. And as the rules for titles are obviously language-dependent, it goes without saying that the algorithm would have to be localized.

(Note that while it's relatively trivial to convert from sentence case to title case in English, it's impossible to automate in the opposite direction, because you never know if a capitalized term in the title is a proper noun or not.)


> (Note that while it's relatively trivial to convert from sentence case to title case in English[...])

Strictly speaking you can't do that either:

  "Latine et videtur".totitle = "Latine et Videtur"
  "I et some food".totitle = "I Et Some Food"
I suspect that fails less often though.


> I am skeptical if someone thinks they need to do this and how they will get it done.

It's not an uncommon requirement, though probably not often in a locale sensitive way, so you can often get away with just doing the right thing for one locale.

To do it generally, you probably need to research appropriate handling per locale (likely, what your client wants done in the particular locale, since I'm not sure there is usually just one way; I know there are multiple variations in en-US), and then have a master function that takes the text and target locale and applies the correct locale-specific title casing rules.


> I am skeptical if someone thinks they need to do this and how they will get it done

I most often use title case mappings in the context of replacements of names into diagnostic messages. I.e. you have n types of objects and m messages like “${foo} not found” or “More than one ${foo} required”, and perform title case mapping on ${foo} depending on whether it is at the start of the message (sentence) or not.


While I agree it's almost always a bad idea: effectively every design team I've encountered has requested stuff like this.

So yes, it's extremely common. It's done on tons of websites, for tons of mail addresses (ever receive an all-uppercase address on a delivery? same issue.), and tons and tons and tons of emails and legal documents (woe to those with last names like McCormick).


We frequently use localized upper/lower casing at my workplace, as we do not store such stylization in user facing copy. Most copy is written and translated in sentence case or title case (because both are much harder to achieve programmatically), and then our designers have the option of using that casing as-is, or using all-upper or all-lower.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

If you are collecting data which include people's names or addresses you probably want localization to be applied correctly so that you can compare data coming from different sources and possibly with different cases. Having your name spelled differently in different documents can cause a non trivial amount of problems with an overzealous bureaucracy.


I was once refused entry into a country for 6 hours because of different spellings of my last name. The (apparently quite amateur) travel agency had sent my last name written using OE instead of Ö, whereas all the documents relating to my identity use Ö (or maybe it was the other way around).


We thought about this specifically when naming our son: one with no special characters, a pure clean ASCII string like mom used to cook them.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

How do you lowercase without localization? Remember all text isn't English. Unless you're actually asking if anyone has ever had a use case for lower-casing non-English text?


I think the real question is: is there any use case for toLower() where you want the system default locale to be applied? If you want to lower-case text for "system" purposes then you need to keep track of the locale associated with that text (which won't generally be the locale of the system the program is running on); the only case where you want to use the system default locale is where you're interacting with the (human) user of the system, but it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.


You vastly underestimate the desire of people for things to look 'nice' where 'nice' is defined by exactly what they mean when they say it. If you want a 'nice' display of data that's input by users, you're often going to butcher it by doing things like converting everything to lower-case, then maybe upper-casing the first letter. Because 'nice.'

It's horrifically wrong, of course, but no one is going to think you're a reasonable person for insisting that correct wins over nice.

Imagine a real-world scenario where you're displaying a list of names of users, where the users got to type in their own names. You can either use what users typed in, or you can do something like process it so it's in the American idea of initial caps. You can't possibly do localization, since it's a list of names of people in the US, so it's a melting pot of names from all over the world, and you never asked for user input of what you'd use to localize it anyway [and no, you absolutely can't figure that out from just the name itself]. You can't use what users typed in, because the design team thinks that looks like a horrible mess (and it is; users are laughably bad at data entry). So the design team wins, and you butcher everything by pretending that toLower() plus toUpper() for the first character of every word is a sensible thing to do. (And yes, that's a painful real-world example of software I've shipped and that was used by millions of people.)


> it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.

Maybe for automatically changing the case of user input (auto-correcting capitalization, etc.).


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Yes. In a system I'm about done with, there is a sortable chart of dates and times. In some languages day and month names are capitalized, and in some they are not.


How does that work? toUpper() can't possibly know that the string is a day or month name.


I think this is why you need explicitly-ASCII and explicitly-Unicode lower-/upper-/capitalize transformations, so you don't assume these things work automagically. Sometimes you need one type, other times you need the other.


I recently ordered a Pixel, on the mail slip they had converted my name to uppercase, last name read "DUBé"

Also got my address screwed up on account of living at a half address.. 1/2 some street #42


A char in C++ is one byte, right? Is it even possible for this "fixed" code to call ctype::tolower() on something like a UTF-8 or UTF-16 code point?


Correct, it won't even work as intended with modern Unicode locales.


So maybe if the code is broken anyway for non-ASCII characters, it's fine to use tolower, since somewhere else in the code it ensures that c is a byte.


The code is not broken for non-ASCII characters. UTF-8 works just fine with 8-bit chars, and the code I wrote correctly lower-cases ASCII letters even when UTF-8 is present (it just won't touch the UTF-8 chars, which is fine in this use case).

It's only tolower() and toupper() specifically that are broken because they expect to be able to do their job on a single byte, which is no longer possible with UTF-8.

Meanwhile, using tolower() to lower-case an HTTP header name won't give you the correct results if the locale is set to Turkish with the ISO 8859-9 character set, which is 8-bit, and where tolower('I') will produce the byte 0xFD which is 'ı' in this character set.
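
A small sketch that reproduces this (it only does something interesting if the tr_TR.ISO-8859-9 locale is actually installed):

    #include <cctype>
    #include <clocale>
    #include <cstdio>

    int main() {
      // setlocale() returns nullptr if the locale isn't available.
      if (std::setlocale(LC_CTYPE, "tr_TR.ISO-8859-9")) {
        // Prints "fd" (dotless ı in ISO 8859-9) rather than "69" ('i').
        std::printf("%02x\n", std::tolower('I'));
      }
    }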


I see, thanks for the explanation.


I have a Morse code app which consistently crashed when certain users would try to translate the letter "i". It took me a long time to figure out that only the Turkish users would complain about it, and when one of them sent me a screenshot I only noticed a "wrongly" rendered capital letter i (I used toUpper). After digging around a bunch, I learned about this whole Turkish letter i business.


A small portion of people whose native writing systems are based on the Latin alphabet believe that case conversion is an essential must-have feature, and that having it in the locale library helps localization.

But if you consider other writing systems, having a case conversion feature in the locale library actually harms the localization effort. It's not easy to make it a no-op, and the implementations in locale libraries are generally poor quality because the implementers have no idea how other languages work.

Another example is singular/plural support. It just burdens the localization effort, because for languages with no such concept, the localization work must ensure that the presence of such a library doesn't harm their language.

Some people are under the delusion that a locale library must have more features to support their native language's not-so-important traits, while what's really necessary is to forget about supporting minor language traits that are not universal among languages.

Text should be considered a binary blob, and most programs should just pass it through without modification.


And a note that it assumes ASCII. On an EBCDIC system the letters aren't contiguous, so the 'A'-'Z' test will also translate other characters besides letters.


Now I’m wondering about what happens when we change email addresses to lowercase...

https://en.m.wikipedia.org/wiki/Email_address#Internationali...


You shouldn't. Email addresses are case-sensitive.



This is why I have a set of functions like AsciiToLower(char* string, size_t size). They only touch characters in the ASCII space at <0x80. Even went and implemented them with SSE for x86.
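
Roughly, such a routine might look like this (a sketch with SSE2 intrinsics, not the poster's actual implementation):

    #include <emmintrin.h>  // SSE2
    #include <stddef.h>

    void AsciiToLower(char* s, size_t size) {
      const __m128i before_A = _mm_set1_epi8('A' - 1);
      const __m128i after_Z  = _mm_set1_epi8('Z' + 1);
      const __m128i delta    = _mm_set1_epi8(0x20);
      size_t i = 0;
      for (; i + 16 <= size; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(s + i));
        // Signed compares: bytes >= 0x80 are negative, so they never match 'A'..'Z'.
        __m128i in_range = _mm_and_si128(_mm_cmpgt_epi8(v, before_A),
                                         _mm_cmplt_epi8(v, after_Z));
        v = _mm_add_epi8(v, _mm_and_si128(in_range, delta));
        _mm_storeu_si128((__m128i*)(s + i), v);
      }
      for (; i < size; ++i)
        if (s[i] >= 'A' && s[i] <= 'Z') s[i] += 0x20;
    }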


Airlines might be a good example. The back end system doesn't grok lowercase characters at all, so you need to transform data to uppercase A-Z, 0-9 and a few punctuation marks.


But they do have the most extensive transliteration rules library to match everything to that limited character set (ICAO Doc 9303[1]) that is used by many systems outside the aviation world.

[1] https://www.icao.int/publications/pages/publication.aspx?doc...


You need localization if you do any kind of multilingual text processing. Not sure how it could escape a thinking person's imagination.


File names, URLs and email addresses support UTF-8 characters and you may want to lower-case them in many situations. If the user is trying to search for a string, they probably want case insensitivity. I don't think it's that rare/weird for people to want localisation to apply when calling toLower.


Yes, semi-regularly -- lowercasing of text for user interfaces is frequently required. Similarly case-insensitive comparisons.

Human text is much much more complex than any computing protocol you're ever going to engage with.

The question is "which one should be default", and that's a more complicated question.


> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Well, it's always been exclusively in American English, but I've certainly used it in cases where I was doing text transforms for display, so, yeah, though it's not the most common case.


What about sorting users by name?


That's completely language+locale dependent. For example, here's an alphabetical list of Irish surnames - https://www.duchas.ie/en/nom?txt=M. You'll notice that the sort order ignores an initial O or Mac (or Ni or Bean, etc).


Honestly, strings that are intended for human and for computer consumption should just be two different basic types without any implicit conversion between them.


i like c |= 0x20; :)


I may be missing something - why is tolower(c) incorrect here?


Because if `c` is the letter 'I', and the current locale happens to be set to Turkish, then `tolower(c)` will return 'ı', not 'i'. If you are trying to lower-case an HTTP header name for the purpose of case-insensitive comparison, this is definitely not what you wanted. (And similar problems exist with several other locales; it's not just Turkish.)
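
A common fix is a comparison helper that folds ASCII only, so the result never depends on the process locale (a sketch):

    #include <cstddef>
    #include <string_view>

    // Suitable for HTTP header names and similar ASCII identifiers.
    bool ascii_iequals(std::string_view a, std::string_view b) {
      if (a.size() != b.size()) return false;
      for (size_t i = 0; i < a.size(); ++i) {
        char x = a[i], y = b[i];
        if ('A' <= x && x <= 'Z') x += 'a' - 'A';
        if ('A' <= y && y <= 'Z') y += 'a' - 'A';
        if (x != y) return false;
      }
      return true;
    }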


Ah I see, thanks for explaining


Welcome to the Turkish language, where we have ı, i, I and İ. In our language the conversion is as follows:

- i <-> İ

- ı <-> I

We love our dots and preserve them. For a more detailed read, please see:

https://blog.codinghorror.com/whats-wrong-with-turkey/


As I understand it, Turkish is one of the more important locales to test with because of things like this.


Poor encoding can lead to the odd murder too: http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...

> The use of "i" resulted in an SMS with a completely twisted meaning: instead of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan wanted to write "You change the topic every time you run out of arguments" (sounds familiar enough) but what Emine read was, "You change the topic every time they are fucking you" (sounds familiar too.)


That doesn't explain the e instead of the a, does it?


In the olden times, ending words with e instead of a was considered an acceptable typo.

Also the article explicitly says "it looked like he wrote". So when you see red, that last letter can become anything and nothing would change.


Turkish is the only language which has the ı & I pair. Similarly, AFAIK, Turkish is again the only language with the ğ and ş letters. So, by testing for Turkish, you test for a lot of European languages at once. Moreover, we share some modified letters (ç, ü) with other Central European languages.

If your program can pass “The Turkish Test”, you pass a lot of others too.


Azerbaijani too. Moreover, Azerbaijani has an additional letter ə, which sounds like /æ/.


I love the feeling of camaraderie arising from that partial mutual intelligibility of Turkish and Azerbaijani.

That connection through language goes a long way.

müqəddəs bacı millət :) (roughly, "sacred sister nation")


Turkish, Lithuanian and Korean specifically. They do have the most exceptions.


> We love our dots and preserve them.

Turkish preserves the dots of i, ö and ü in their capital versions, but not with j. The capital J is dotless.


Isn’t the capital J dotless in every language?


> - i <-> İ

> - ı <-> I

After seeing this, I don't understand how the rest of us can fail to have the same distinction. There's something logically beautiful -- like the rhyme in a good poem -- about artificial languages (or, in this case, alphabets) that naturally evolved languages just cannot compete with.


To me, the undotted-i thing is more of a hack. Far more beautiful is why Turkish distinguishes the two vowels: speakers of Turkic languages don't like to mingle bright and dark vowels in the same word. Appreciate the economy of not having to switch your enunciation between front and back in the same word.

Speakers of Turkish generally build words with vowels from either of these exclusive groups:

aıuo

eiüö

You see? They could write "ä" instead of "e" to have all bright vowels dotted and mirror the symmetry. But because "e" was already available and pronounced like that in other dominant languages of the time, they stuck to it. No need to break conventions. This didn't work for "ı" because there was no corresponding letter in the Latin alphabet. So they lopped off the dot from i and called it a day. A pragmatic decision. Except that the capital version had then to be dotted to keep the distinction. That caused a lot of headaches downstream.


This is called vowel harmony and it's a fairly common linguistic feature:

https://en.wikipedia.org/wiki/Vowel_harmony

...but it's notably absent from the Indo-European languages.


Unrelated story about Russian language.

The first letter of the Russian alphabet is А, the last one is Я. So it's natural to try to match Russian words with '[А-Яа-я]+'. But this is a recipe for disaster: this regexp doesn't match words with 'Ё' in them, like "Артём".

This is due to the fact that regexp ranges work on character code values. All letters of the Russian alphabet have neatly ordered values, except for Ё/ё.


English is probably the only commonly spoken language where naïve char range matching kind of sort of works. I say ”kind of sort of” because [a-zA-Z] trivially fails to match all words in many English texts that haven’t been lossily compressed to ASCII, including this comment.

It is practically always wrong to match on [a-z] unless you’re parsing a computer language whose spec guarantees that it works.


Forget ascii conversion, that also fails on contractions like "don't".


ï isn't an English character though! For an English document, "naïve" is a misspelling, at best. That being said, you don't always have the luxury of doing things the "correct" way, especially if users are trying to cram god-knows-what into a text field.


Nope. ”Naïve” is an accepted variant of ”naive” in every major English dictionary. [a-zA-Z] is never a ”correct” way to match natural language text.


Would it blow your mind that coöperate is technically correct?

That's the whole point of the diaeresis.


I always wanted to know, how easy is it to type naïve on a common western keyboard?

Do you have to press some obscure keyboard shortcut?


I'm Windows-based and wanted a keyboard layout that would allow me to easily type Polish and French at the same time, without switching keyboard layouts (PL == US+AltGr for accents; while the FR layout is insane, because apart from being AZERTY, all special chars are in different places, you need Shift to type numbers, and the way to type accents is also special).

I found "Polish international" [1] layout which honestly can be perfect for many people. It's optimized to be compatible with regular Polish keyboard (hence with US keyboard too), and maybe not the fastest if you type a lot special chars, but it's extremely intuitive:

ï = AltGr+:, i

ü = AltGr+:, u

é = AltGr+/, e

è = AltGr+\, e (since it's extremely common, also aliased as AltGr+w)

If you're Windows based and want US-compatible keyboard layout that allows easily typing any special chars, I highly recommend it.

[1] https://translate.google.com/translate?sl=pl&tl=en&u=https%3...


I type English and French in Windows on the same QWERTY keyboard. I once learned to type on Azerty, but I mainly type English now on a standard US keyboard layout. For the French, I find the windows alt-numbers works the easiest for accented characters. Alt-130=é, Alt-133=à, Alt-135=ç, Alt-137=ê, Alt-138=è which covers 95% of the accented character usage. I have a little chart next to my desk with all the others (ï,ô,ù) they’re nearly all Alt-14x and Alt-15x. And then I’ll put é in the paste buffer because it is the most used and a bit quicker that way (for words like “préféré”).

The Alt-13x codes are not as quick as the Azerty keys, but good enough and once memorized are fairly easy with a keyboard that has a keypad (most PCs do, even my laptop). This is especially true because they are done with both hands simultaneously, as opposed to something like Cmd-e+e on a Mac. Actually, they are faster than finding the accented characters on my QWERTY virtual keyboard as I type this comment on iOS.

Those AltGr- combos seem complicated to me, I would much prefer a system such as AltGr-e =é, then AltGr-ee=è, AltGr-eee=ê, etc. To me that would be more intuitive than remembering the composing character (slash for aigüe, etc).


You seem to be quite used to your Alt combinations but, as you said, they really are not straightforward. I found another very simple solution: on Linux you can set a compose key (typically AltGr or the contextual menu key). You type, one after another, the compose key and then any two keys that make sense, like ' followed by e (or vice versa), and it will give you é. It is both fast and easy to work with.


Reminds me of when the 'D' key broke on my physical keyboard a long time ago. I liked that keyboard a lot and couldn't find a good replacement, so I learnt to type Alt-100 to get 'd'.


> how easy is it to type naïve on a common western keyboard?

In macOS, you can either use Command-u (for "umlaut") followed by i, or hold down the i key for a second and press 2 to select the ï from the pop-up menu.


> Command-u

option-u (aka alt-u).

Generally speaking, command is for application-level or os-level commands, control is for text edition, and alt is for alternate characters (all can be shifted and command "overrides" the rest).


You're right, it's Option-u. Most of the key labels on my MacBook have long since been scratched away.

This has happened with every single Apple keyboard I've ever used. I suspect it's my fault, since I'm a key pounder, having learned to type on an IBM Selectric typewriter.


On a Swedish keyboard, there's a dead key for ¨, so you press that followed by i to get ï.

It's not very clear why the Swedish keyboard has that key, since ä and ö each have their own keys. The layout has other quirks as well, such as keys for §, ½ and the useless "currency sign", ¤.


macOS quietly removed the paragraph sign for me when switching keyboards to non-Apple ones, if you used a non-standard layout that had moved the key they use to detect what kind of keyboard you are using.

It wouldn't have been much of a bother, had I not used the key as my Emacs leader...


Yes! I think the "mine" character should be switched for the dollar sign.

BTW, the dead key could be from German, for writing their Üs.


By default on a Mac you just hold down the key to get different options, similar to on an iPhone (and I presume touch Android).

https://i.imgur.com/yuG063t.png


On both Windows and Linux, I've liked the "US international" keyboard layout as an alternate when I need to type letters with diacritical marks. In that layout, " ' ~ ` ^ are all dead keys, which modify the next letter typed. In addition, the right Alt key (usually Alt Gr on non-US keyboards) can be used to quickly type some commonly-used letters.

On Linux there is also a "US alternate international" layout with some extra dead keys to make it easier to do things like š or ž (Ctrl-RightAlt-< followed by s or z, respectively).


On any Unix, just enable the "compose" key, then <Compose>+"+i.

It's always something easy to remember, like " for umlauts, o for circles (©, ®), obviously ' for accents (ń) and so on.


You can see and modify the .XCompose file, and you can even put your own strings there like an email address or change the sequence. There are some community XCompose files on GitHub too.
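
For example, a few lines like these in ~/.XCompose (the address and sequences here are just placeholders):

    include "%L"                            # keep the locale's default sequences
    <Multi_key> <quotedbl> <i> : "ï"        # compose, ", i  ->  ï
    <Multi_key> <m> <a> <i> <l> : "me@example.org"   # expand a custom sequence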


You may or may not need to set GTK_IM_MODULE=xim to get GTK applications to use ~/.XCompose.

Also, Qt5 broke support for multi-character results when using compose sequences (everything after the first character is ignored).


On Ubuntu, I use xmodmap to turn Print Screen into a Compose key. Then it's: <compose>"i

https://en.m.wikipedia.org/wiki/Compose_key


I use the Macintosh keyboard map in Linux. So I do <right alt>+e to ‘, <right alt>+n to ~.


"i or similar works. ^i, `i, 'i, etc. for the others.


You can set up your keyboard as US International. And then type ["] + [i]. It’s a very useful keyboard layout because the punctuation characters match the keyboard and it allows you to type English, Spanish, French, Dutch, German, Swedish, Norwegian, and Portuguese.


Having punctuation keys as dead keys is too annoying, so I use my own keyboard layout that has dead keys only when used with AltGr, as well as some direct AltGr umlauts (i.e. AltGr+a = ä).


It is acceptable to write English without diacritics. "Naive" is accepted.


It is certainly acceptable, although some publications have diacritics as part of their house style, and if the author doesn't use them, the copy editor is supposed to insert them.

The New Yorker writing things like coöperate and reëlect is probably the most infamous, although they are not the only ones.

The Guardian style guide says to use spellings exposé, lamé, résumé, and roué – but not café. Although, when giving the name of an organisation/institution (restaurants and cafés included), the article should use whatever spelling is preferred by the management, including their choice about how to spell cafe/café (when that word is part of their name).

Personally, I'd always write café in formal English, because cafe just looks wrong to me. However, in something informal like a text message I probably wouldn't bother.


Sure, some publications might like that. And some publications still treat data as a plural of datum rather than a mass noun like water. But I don't think I know a native speaker who would look at "naive" and feel the way they do when they see "could of".

FWIW, The Economist explicitly names "naive" as one to use without the diacritic: https://cdn.static-economist.com/sites/default/files/store/S...


The easiest solution to this problem would be to just rename it to "naive".


Not that unusual - for German, for instance, üöäÜÖÄß need to be added so all words can be matched.


now, there is even a capital ß ;)


Less relatedly, I really hate when people use the eszett instead of a Greek beta. I just needed to get that off my chest.


Out of curiosity I tried on my phone: ß

Ss

SS

So my phone doesn't have that yet!


mine has it - ẞ the small one is ß


Is the capital supposed to be shorter?


Oddly enough, yes (depending on font of course); compare "Sf": "ẞß" should have the same capital and lowercase-ascender heights, the latter of which is often higher.


It's wider. Depends on your font and its support though.


This would be an argument for just using [:alpha:] everywhere; presumably it does the correct thing based on locale?


No, alpha doesn't work, at least in "grep -P" with "ru_RU.UTF-8" locale:

  $ echo Test | grep -oP '[[:alpha:]]+'
  Test
  $ echo Артём | grep -oP '[[:alpha:]]+'
  $ echo Артём | grep -oP '[А-Яа-я]+'
  Арт
  м
This thing works, though I've never seen one in the wild:

  $ echo Артём | grep -oP '[\p{Cyrillic}]+'
  Артём


Standard classes only work for 8-bit locales, afaik, and also in some languages (e.g. perl) only when a string's encoding corresponds to an internal representation of what its engine thinks is the "current" unicode format for the specific version of the language and a checkout location. The fact that different tools stick to different engines, modes and normalization rules (bre/ere/posix/nfd/nfc/perl/pcre/php/ruby/icu/whatever) doesn't help either. Full cross-platform unicode matching is a can of worms that you usually don't want to open. It is basically the CSV of the encoding world. /\S+?/ to the rescue.

https://regular-expressions.mobi/unicode.html

https://regular-expressions.mobi/refunicode.html

https://regular-expressions.mobi/posixbrackets.html


Well, that is horrifying. TIL.


  $ echo Артём | grep -oE '[[:alpha:]]+'
  Артём


Yep, -E vs. -P


The problem here is simply a bad regex pattern. You need to use something that supports Unicode and enable it. (Usually a flag or option you have to set, because Unicode matching is slower and often not needed.) Then, last but not least, you need to use character classes and not define ranges yourself unless you actually need a range. Like you can't use [:digit:] or \d if you really only want 1-9 without the 0.

Here are some examples to match your string; all of them use the PCRE flavor:

(*UCP)[[:alpha:]]+

(*UCP)[[:alnum:]]+ This would include digits

(*UCP)[[:word:]]+ This includes "word" chars

(*UCP)\w+ Same as above


Changes to the casing might also change the value's length. E.g. uppercasing the German ß will transform it to SS. Example using JavaScript:

'ß'.toUpperCase(); // returns 'SS'

https://en.wikipedia.org/wiki/%C3%9F


There is apparently a multi-decade controversy about that:

https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

(with German language authorities recently endorsing the idea that ß can have a distinctive uppercase form "ẞ")


Which can be both correct and wrong depending on context.

Normally there is no such thing as a capital ß, so it was decided that if for some unreasonable reason you do uppercase it you go with SS.

But then for some all-caps usages this is not right. E.g. an all-caps name of a restaurant as placed above the restaurant's door. In which case it was common to have a ß in an all-caps name like FOOßBAR. So they decided that for reasons like this we now have an (EDIT: semi?) official uppercase ß.

So all in all, this and other examples in other languages mean you should never do a case-insensitive comparison by upper/lower-casing both sides; it won't work reliably.


I've long thought programming languages need a "localizable string" (aka user-facing string) type, different from regular UTF-8 strings. Something like what gettext and other i18n libraries fake for you, but native to the language.

Behaviour like this is definitely a good reason why: sorting, changing case, etc should be consistent when dealing with strings used as constants and identifiers, but Python's .lower() behaviour makes sense in a localizable string context.
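
A rough sketch of what that separation could look like (hypothetical types, not an existing library):

    #include <string>

    // Identifiers, header names, keys: byte-oriented, ASCII-only case rules.
    struct MachineString { std::string bytes; };

    // User-facing text: case mapping, sorting, etc. require a locale.
    struct LocalizedString { std::string utf8; std::string locale; /* e.g. "tr_TR" */ };

    std::string ascii_lower(const MachineString& s);   // locale-free
    std::string lower(const LocalizedString& s);       // uses s.locale
    // ...and no implicit conversion between the two types.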


Along similar lines, I've thought that it would be useful if Unicode included language marks (i.e. codepoints to identify blocks of text as being written in a specific language). It would be strictly more useful than the barebones left-to-right/right-to-left marks (U+200E/U+200F) when deciding how to process and display text. And it would be a step towards correcting the mess that was Han unification.


See RFC 2482 — Language Tagging in Unicode Plain Text:

https://tools.ietf.org/html/rfc2482

But it was deprecated later on:

https://tools.ietf.org/html/rfc6082


Interesting. Unfortunate that the deprecation notice doesn't include much rationale. I found at least one mail thread about it[1], which seems to confirm that the main thought was that semantic information about text should be handled at a higher layer (e.g. XML). I can understand that argument for a general purpose tagging mechanism, but language and glyphs are strongly semantically linked.

(Somewhat ironically, the previous thread on that mailing list is about the struggles of case folding in a general fashion across multiple language scripts[2])

Edit: I also found [3], which offers the following:

----

- Most of the data sources used to assemble the documents on the Web will not contain these characters; producers, in the process of assembling or serializing the data, will need to introspect and insert the characters as needed—changing the data from the original source. Consumers must then deserialize and introspect the information using an identical agreement. The consumer has no way of knowing if the characters found in the data were inserted by the producer (and should be removed) or if the characters were part of the source data. Overzealous producers might introduce additional and unnecessary characters, for example adding an additional layer of bidi control codes to a string that would not otherwise require it. Equally, an overzealous consumer might remove characters that are needed by or intended for downstream processes.

- Another challenge is that many applications that use these data formats have limitations on content, such as length limits or character set restrictions. Inserting additional characters into the data may violate these externally applied requirements, and interfere with processing. In the worst case, portions (or all of) the data value itself might be rejected, corrupted, or lost as a result.

- Inserting additional characters changes the identity of the string. This may have important consequences in certain contexts.

- Inserting and removing characters from the string is not a common operation for most data serialization libraries. Any processing that adds language or direction controls would need to introspect the string to see if these are already present or might need to do other processing to insert or modify the contents of the string as part of serializing the data.

----

Other than #3 (the one about string identity), I find these wholly unpersuasive. And even #3 isn't that great a reason considering that programmatic processors have to deal with that issue anyway due to case folding.

[1] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0039....

[2] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0038....

[3] https://www.w3.org/TR/string-meta/


What this gets right down to is that Unicode is a flawed idea: the meaning/behavior/whatever of characters is insanely dependent on their context.

The problem was never gazillions of code pages, but our inability to write C to deal with that amount of complexity circa 1990.

With modern machines, and good programming languages with good type systems, I absolutely think we could store a language per string, and concatenate into a polylinguistic rope if needed.

This would hopefully push us away from stringly-typed crap in general.


Unicode goes to great pains to avoid ascribing any meaning/behavior/whatever to character sets. Because to your point you can’t. Unicode is actually incredibly well thought out. That’s why we have values, code points and grapheme clusters. I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points.

If you want to build a polylinguistic rope you can certainly do that with Unicode, but you won’t have solved anything because language alone without context doesn’t really define many of the operations you’re describing.

The answer is usually the same as “doctor it hurts when I...” — stop doing it. Stop manipulating user input without context. Stop trying to limit user visible strings by character count, use pixel width in the rendered font. And so on.


> I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points

Sure it does; the Unicode Character Database includes fields for the lowercase, uppercase and titlecase mappings. But it also acknowledges that these are just default mappings, and may need to be tailored for specific languages/locales.


Unicode is well thought out! And that's what makes it hard to critique :). I think it's one of the best-maintained, well-thought out standards there is, but I still think the premise is wrong.

If all that good effort went into something along the lines I am describing, where languages, or at least scripts, cannot be arbitrary mixed at the character level, I think we would have an even better result with the same level of effort.


If you treat Unicode as your backing representation -- a pile of glyphs -- you can build what you're asking on top right?


That is like saying I can take an untyped language and then add types. Sure you can! That said, it's much nicer (to me) to first define the typing rules (static semantics) and then define evaluation (dynamic semantics) only on well-typed programs. This avoids the need to include lots of annoying stuff in the domain. See my other comment, https://news.ycombinator.com/item?id=24180620, for an example of something I rather leave ill-typed.

That said, any "multicode" had better describe the interop with "unicode" in great detail for practical reasons. Still, this is the "FFI", and one can be careful not to let it muddle things, by e.g. not allowing every Unicode string to be imported without additional metadata.


I'm suggesting it's more like layering a programming language on top of assembly. The lower level is the universe of what you can do (in this case, the set of all glyphs) and the higher level is an imposition of specific constraints (in your case, which ones go together).


Languages need not be defined by how they compile. (If they do, we tend to call it "desugaring".) At the very least, they usually compile to multiple ISAs, and none is more definitive than the other.

I am happy to define how to translate Multicode to Unicode, but I wouldn't want any of the internal notions of Multicode to be defined in terms of that translation.


> the meaning/behavior/whatever of characters is insanely dependent on their context

I wish you would give an example instead of just proclaiming crapness. You know, so we n00bs can learn something.


Different languages have different rules for changing case (as seen here) or for what to do when transliterating to 7-bit ASCII: in French you can mostly drop accents if you need to; in German you need to transform an umlaut into an e following the vowel. Of course, many languages don't have a way to transliterate to 7-bit ASCII at all.

Sorting of strings is language-dependent, but I don't know that there's a defined order for mixed-language lists, so I guess the user's context works if you're sorting for user purposes; but if you're sorting for machine purposes, you'd better not use the locale-aware sort without telling it a hardcoded locale that doesn't change between localization library versions.
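
To make the transliteration point above concrete, a rough Python sketch (the function names and mapping tables are mine and far from complete):

    import unicodedata

    def to_ascii_french(s):
        # French: dropping accents is usually acceptable ("déjà" -> "deja")
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

    def to_ascii_german(s):
        # German: umlauts become vowel + e, and ß becomes ss
        table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                               "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})
        return s.translate(table)

    print(to_ascii_french("déjà vu"))  # deja vu
    print(to_ascii_german("Straße"))   # Strasse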


@toast0, @lazulicurio, both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea" as the original poster said. AFAICS this is intrinsic complexity showing itself and does not give any indication of how it should be done correctly, or better.


> both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea"

The flaw in Unicode is that it punts on the intrinsic complexity---pretending that codepoints have language-independent, plain-text, semantic meaning.

A couple of threads that have molded my views over time:

I can't write my name in Unicode https://news.ycombinator.com/item?id=9219162 (Specifically these two comments https://news.ycombinator.com/item?id=9220530 and https://news.ycombinator.com/item?id=9220970)

Why isn't the external link symbol in Unicode? https://news.ycombinator.com/item?id=23016832


> The flaw in Unicode is that it punts on the intrinsic complexity---pretending that codepoints have language-independent, plain-text, semantic meaning.

> Pretending "plain text" isn't an oxymoron

FTFY :)


The benefit of looking at languages/scripts in isolation is that the combinatorial explosion of all languages/scripts at once is dodged.

E.g. lookalike characters, and social engineering by using a vs а. (One is Cyrillic). I don't want to even define "a == а". I want Latin and Cyrillic to be different types of characters, and that expression to be ill-typed.

This solves the Turkish problem, where the upper case I is two different characters in two different types (Turkish Roman script?), and the case folding functions likewise have disjoint types.


> I want Latin and Cyrillic to be different types of characters

How do you concatenate English and Ру́сская text, and what is the type of this sentence?


[Either [Latin] [Cyrillic]] is a very simple type, taking advantage of the fact that the language only switches at word boundaries.
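
A rough Python rendering of that type, just to make the idea concrete (all names here are invented):

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class LatinWord:
        text: str

    @dataclass
    class CyrillicWord:
        text: str

    # ~ [Either [Latin] [Cyrillic]]: a sentence is a list of words, each tagged
    # with its script, and the script can only change at word boundaries.
    Sentence = List[Union[LatinWord, CyrillicWord]]

    s: Sentence = [LatinWord("English"), CyrillicWord("текст")]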


Huh. That doesn't quite address my objection (CamelCase like EnglishEtРу́сская still un-works), but that's actually a good point in the overwhelming majority of cases. I'm not quite convinced this approach works in practice (I'm sticking with "A"="A"="A"), but I'd definitely like to see a more technically fleshed-out design.


How about: case folding for the letter 'I' is dependent on whether the locale is Turkish or not.

;)


Unicode supported this with tag sequences but that is deprecated and unlikely to work with modern libs.



.NET is one of the few ecosystems to get this right. It offers the invariant culture for identifier-like things, "fr" for the French language and "fr-FR" for French as used in France, allowing you to specify your intention to every string-modifying function.

Support at the type level would be a lot less verbose, but support at the function level is already much better than many other popular languages.


It would be great if strings and especially date-time values always carried locale and timezone information with them.

It would take slightly more memory but not significant on modern machines.


Putting the locale information on the string sounds like a good idea. However, I'm not sure how that should handle combined strings with components from different locales. For example `logLevel + ": " + logMessage` might produce "info: bağlantı kesildi" in Turkish. How to annotate that? Neither English nor Turkish would work correctly; each would produce the wrong result when uppercasing.

You could treat it as a series of string slices with different locales `[("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]`. That would work correctly, and you could now uppercase each slice according to its appropriate locale, but it wouldn't really be low overhead anymore. Maybe still worth it. It would be an interesting approach that might even be able to be implemented pretty seamlessly as a library in some languages (C++ or rust for example)
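
Something along these lines (a rough sketch; the Turkish tailoring here is hand-rolled purely for illustration):

    def upper_localized(parts):
        # parts: list of (text, locale) slices; "" marks locale-neutral pieces
        out = []
        for text, loc in parts:
            if loc == "tr":
                # Turkish tailoring, hand-rolled for illustration: i -> İ, ı -> I
                text = text.replace("i", "\u0130").replace("\u0131", "I")
            out.append((text.upper(), loc))
        return out

    msg = [("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]
    print("".join(t for t, _ in upper_localized(msg)))  # INFO: BAĞLANTI KESİLDİ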


That just seems to be a parameter for locale-dependent functions. Very useful, but no, I'm talking about splitting the unicode-string datatype in two: "user-facing unicode string" vs "internal unicode string".

Example: logging.log("INFO", i"This is a localizable string")

In the i18n world, we could gather i-strings just like gettext does (where it looks like `logging.log("INFO", _("This is a localizable string"))`). The language could then have other useful hooks/behaviours into that datatype, and definitely one of them would be whether various methods have i18n behaviour enabled on them, versus using a C locale.


In Java, there is Locale.ROOT, which can be used in a similar way. In particular, it is useful when performing locale-dependent operations in locale-independent contexts (e.g. working with case-insensitive identifiers) where you don’t want the behavior of your code to depend on the current default locale.


That would be great! For example, in Python you currently have to do something like this

    import locale
    locale.setlocale(locale.LC_COLLATE, "")  # without this, strxfrm uses the default C locale
    sorted(list_of_strings, key=locale.strxfrm)
to sort using the current locale, which many people forget.


https://garygregory.wordpress.com/2015/11/03/java-lowercase-...

In the Turkish locale, the Unicode LATIN CAPITAL LETTER I becomes a LATIN SMALL LETTER DOTLESS I. That’s not a lowercase “i”.


My genius idea was once to use toupper() to normalise paths on Windows, which are case-insensitive. One day, a customer from Azerbaijan reported that my application failed to access a file in C:\WİNDOWS\...


i feel your pain


07/04/2008 -> April 7th seems about as reasonable a result as July 4th, especially when you've explicitly opted in to a Turkish locale. I don't agree with the article's assertion that interpreting the format according to the user's locale is wrong here; the one wrong part is a US-centric programmer's expectation that PP-QQ-YYYY is an unambiguous format. Use YYYY-mm-dd when you need a format that's not ambiguous.


YYYY-mm-dd also plays nice with lexicographic ordering, which is why I always use it when I need to put dates in e.g. filenames
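
E.g., plain string sorting already gives chronological order:

    dates = ["2020-08-16", "2019-12-31", "2020-01-05"]
    print(sorted(dates))  # ['2019-12-31', '2020-01-05', '2020-08-16']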


I'm a European working primarily with Americans. My home country uses dd/mm/YYYY (or dd/mm for short) and the US uses mm/dd/YYYY (with mm/dd for short). I've switched to YYYY-mm-dd simply for my own sanity, and if I omit the year I write the month in text format, such as "5 June".


The US military uses almost the same convention (dd-mmm-yyyy), so 07-aug-2020.


That’s dd-mmm-yyyy


Thanks!


Note: This is actually a reply to the article here: https://news.ycombinator.com/item?id=24178270 , for some reason I thought that was the top level link.

Maybe if dang sees this, it could be reparented?


> 07/04/2008 -> March 7th

I think you mean April.


Fixed


I don't get why people don't just use something like 8m16d2020y. There, it's the same number of characters and clearly unambiguous even to someone who hasn't seen the format before.


Japan does something like that; according to Wikipedia (https://en.wikipedia.org/wiki/Date_format_by_country), its date format is 2020年08月16日


> PP-QQ-YYYY is an unambiguous format

“US centric” is one way to say it


Repeat after me: don’t do string operations without explicit locale. Don’t do string operations without explicit locale.

I don’t know why so many languages have string functions that should take a locale but provide an overload that doesn’t and which uses the system locale as the default. It can’t be what many developers actually want, yet it has become the norm. Worse, code using a default locale appears to work on the developers machine and in production, until someone parses a number in France or lowercases a string in Turkey, which is a late and expensive discovery of the bug.

The default shouldn’t be the system locale, it should be an invariant locale. And I’ll go so far as arguing this invariant locale should be invariant across systems (meaning it can’t just defer to a system C library either).
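
To illustrate the difference in Python (the locale name here is an assumption and may not be installed on every system):

    import locale

    # Locale-dependent parsing: the result depends on whatever LC_NUMERIC is set to
    locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")  # comma as decimal separator
    print(locale.atof("1,3"))  # 1.3

    # Invariant parsing: the same on every machine, regardless of locale
    print(float("1.3"))        # 1.3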


> I don’t know why so many languages have string functions that should take a locale but provide an overload that doesn’t and which uses the system locale as the default.

That's a relic from the past, before Unicode became prevalent, when systems only ever worked in a single locale and users expected applications (running locally, of course) to use the local system locale. Hence applying the system locale to everything was the standard behavior for applications. The C standard library was defined that way, and since then every other runtime (usually based on C at some level) has done the same.


FTR, Python does not in fact do this. Python 2 did have this locale-dependent behavior, but Python 3 has never behaved this way. The workaround in the OP is, thankfully, quite obsolete.

If you call a case-related method like `lower` on a Python string, the behavior you get is based on tables which are built into Python, taken straight from the Unicode standard's data files, and completely independent of your system configuration.

It would be nice to also have the option of explicitly using a particular locale. Here's a discussion from 2019 about potentially adding that option: https://bugs.python.org/issue37848 You'll be glad to see everyone there agrees the default should remain invariant.


I ran into this with C#/.NET on Windows - I tried to convert a string "1.3" to the float 1.3, and it failed on languages that use comma as their decimal separator.

That was a learning experience.


Indeed. As a person from a comma country, I find these mistakes in most code bases I look at. It makes it frustrating to contribute to open source, for example.

Perhaps it’ll make you feel better about your parsing bug that even the C# compiler (Roslyn) code base had several of these issues.


For a similar reason, Java on Mac and Linux was briefly broken for anyone using it in the Turkish locale. It was because in the Turkish locale, !"POSIX".toLowerCase().equals("posix").

Relevant bug report here: https://bugs.openjdk.java.net/browse/JDK-8047340


As it isn't yet mentioned: for these cases the Python standard library explicitly has https://docs.python.org/3.8/library/stdtypes.html#str.casefo... (str.casefold), which aggressively lowercase-normalizes strings with an algorithm from the unicode standard. Every case comparison using lower() instead of casefold() can be considered a bug.
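
The canonical illustration is the German sharp s:

    print("Straße".lower() == "strasse")     # False -- lower() keeps the ß
    print("Straße".casefold() == "strasse")  # True  -- casefold() maps ß to ss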


> Every case comparison using lower() instead of casefold() can be considered a bug.

If you just casefold two strings and compare them, it's still a bug. You need to normalize them to NFKC first.
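
A small sketch of normalize-then-casefold (the helper name is mine):

    import unicodedata

    def caseless_equal(a, b):
        # Normalize first (NFKC, as suggested above), then casefold
        def nfkc(s):
            return unicodedata.normalize("NFKC", s)
        return nfkc(a).casefold() == nfkc(b).casefold()

    precomposed = "Caf\u00e9"   # é as a single code point
    decomposed = "Cafe\u0301"   # e + COMBINING ACUTE ACCENT
    print(precomposed.casefold() == decomposed.casefold())  # False
    print(caseless_equal(precomposed, decomposed))          # True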


Is NFKC necessary, isn't NFKD enough? (As in you have to normalize and decompose both strings, but at that point you can check them for equality, and doing the canonical composition isn't needed, right?)


I think that would work if you're just checking for equality and want to minimize processing. I guess as a web developer I always just assume people are going to be storing strings in a database after normalizing them, so would want to minimize string length.


Correct: you would get "ınfo", "warnıng" and "crıtıcal" in Turkish and in Azerbaijani.


Further context:

https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I

Did not know Istanbul is actually İstanbul.


Me neither. I did know it's not Constantinople, though.


Constantinople (Fatih) is the capital town of Eistipolis (Istanbul).


Please stop doing this. Don't bind the lower()/upper() functions to environment variables or anything else system-related. Sun did this in Java and didn't even bother to mention the issue in the documentation. It caused huge problems for more than a decade.

You can just make the string lowercase()/uppercase() functions work the same everywhere, regardless of locale settings. Provide a special-case function like lowercaseTR() for Turkish. This works very well in Go.

By the way, Azerbaijan has the same problem, because they accepted help from the wrong guys when they switched to the Latin alphabet.


You'll be glad to hear that Python did stop doing this: Python 3 has never behaved this way, and its `lower` and `upper` methods have always been independent of your locale or anything else from your system.

The workaround in the OP was added in 2006 (note the reference to an issue on "SF", i.e. SourceForge -- another era!), and is now long obsolete.


Very much so. Thanks.


> lowercaseTR()

Huh, that works well if we know the input string is in Turkish. What if this information is not available as you're writing the code?

And what will lowercase()/uppercase() be hard coded to do, and what are they supposed to output when the input isn't ASCII?


Give me an example. I'll try to find the best -IMHO- solution.


In C (POSIX.1-2008, specifically), there's tolower_l() and the rest of the _l functions for this use case, which take a locale as an argument. That lets you ask for the English (or even "C locale") lowercase versions of these English words, even when your process's current locale is Turkish.

https://www.man7.org/linux/man-pages/man3/tolower_l.3.html


The mention of _l functions reminded me of this gloriously over-the-top git commit message/rant.

"Those not comfortable with toxic language should pretend this is a religious text."

https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...


Looks like it's no longer the case in Python 3:

   Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
   [GCC 8.3.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> from locale import *
   >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
   'tr_TR.UTF-8'
   >>> 'INFO'.lower()
   'info'


Oddly, it also wasn’t the case for Python 2 Unicode strings (u'INFO'), only for Python 2 byte strings ('INFO'). So it’s possible that Python 3 lost this behavior by accident.


On some more digging through history, it looks like the change in behavior for byte strings was intentional: https://github.com/python/cpython/commit/6ccd3f2dbcb98b33a71...

Author: Guido van Rossum <guido@python.org>

Date: Tue Oct 9 03:46:30 2007 +0000

    Replace all (locale-dependent) uses of isupper(), tolower(), etc., by
    locally-defined macros that assume ASCII and only consider ASCII letters.


    Python 3.7.5 (default, Nov 5 2019, 22:30:48)
    [Clang 11.0.0 (clang-1100.0.33.12)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from locale import *
    >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
    'tr_TR.UTF-8'
    >>> 'INFO'.lower()
    'info'
    >>> '🧘 ️'.lower()
    '🧘\u200d️'
    >>> exit()

There's something wrong with emojis + lower() though


It lowercased the 'show this as emoji' variation selector to zero width joiner?


I remember running into problems with SQL stored procedures where column and table names were case-insensitive, so you don't know whether you've typed all the column and table names with consistent capitalization. Until a customer in Turkey eventually installs it and you find out you've missed the proper capitalization of an identifier containing the letter "I", and the stored procedure fails.


Honestly, I'm very pro case-insensitivity, but my experience with SQL servers has impressively demonstrated how not to do it.

For example, MS SqlPackage, used for deploying schema, is case-insensitive... But that also means case-only changes to text constants within your stored procs don't get treated as changes.


This is what I usually think about whenever people say yay to Unicode in language identifiers.


"I" is in ASCII.


"İ" and "ı" are not.


Note to the next language designer: don't use strings as a substitute for enums.


It might be OK if strings are immutable and therefore internable.


It doesn’t prevent someone from calling your function with “INFO” instead of “info”, does it?



ITT calling setlocale or std::locale::global(...) is ALMOST ALWAYS a heinously bad idea and should rarely be done, because it breaks tons of code (notably everything that uses printf/scanf and everything using stringstream).


I think things like these should be explicit. Even if it's convenient to have a default, it should be what most people would expect.

For example, instead of .lower(), we could have .lower_ascii(), .lower_turkish() or .lower(locale). But I know it would be tedious to specify the locale every time, so it makes sense to have .lower(locale=DEFAULT_LOWER_LOCALE). What DEFAULT_LOWER_LOCALE should be is worth debating, but I think it shouldn't introduce unexpected behavior.
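
A hypothetical sketch of that API (everything here is invented for illustration, and the Turkish tailoring is hand-rolled):

    DEFAULT_LOWER_LOCALE = "und"  # an invariant/root locale

    def lower(s, locale=DEFAULT_LOWER_LOCALE):
        if locale == "tr":
            # Turkish tailoring: I -> ı, İ -> i
            s = s.replace("I", "\u0131").replace("\u0130", "i")
        return s.lower()

    print(lower("INFO"))        # 'info' -- invariant default
    print(lower("INFO", "tr"))  # 'ınfo' -- explicitly Turkish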


Stringly typed: Play stupid games, win stupid prizes.


The PHP interpreter has an internal reimplementation of string case conversion that's ASCII-only in order to avoid this problem.


doesn't php have this exact problem with their case-insensitive (hate that btw) function/method names and turkish localization? or did they actually fix it at some point?


I'm guessing that they might have "fixed" it by implementing the ascii-only tolower function, but yes, PHP used to not work properly with Turkish localization.


Why do you think the interpreter needs such a function?


Serious question.

Why on earth would you hard-code these, instead of simply calling a lowercase function in the en-US locale?

These are English words. Naively lowercasing them according to whatever locale the server or user has set seems like a terrible programming practice. Any call to a lowercase function should be explicitly including an argument that specifies it's English, no?

In the same way we've all learned to never store times without an explicit timezone (even if it's UTC), or locate a string offset without knowing your encoding... you should never perform language transformations (case changes, accent removal, etc.) without a locale.

Hardcoding these things is just patching over the symptoms without addressing the cause, no?


Hence toUpper/toLower is not a strategy that passes the Turkey Test for case insensitivity.



This particular case seems odd to me because INFO is an English word, and ınfo is not.


You could make a case that Unicode should have different "i" characters for different languages. Then you could do all transformations unambiguously. On the other hand almost everyone abuses the minus sign as a dash, and treats the apostrophe and the prime sign (signifying feet or minutes) as interchangeable, so in all likelihood they would constantly use the wrong i too.


I have a better solution: use combining characters COMBINING DOT ABOVE (which already exists) and DELETE DOT ABOVE (which needs to be added into Unicode), which would manipulate "I" into "İ" and "i" into "ı" respectively. Those combining characters would also work perfectly with j too.


The only issue I can see is with people working in a Turkish locale writing Latin text producing, let's say English blogposts with the wrong i and I. I still think that this should have been done this way though...


Indeed. LATIN SMALL LETTER I + DELETE DOT ABOVE becomes LATIN CAPITAL LETTER I + DELETE DOT ABOVE in uppercase, which then becomes LATIN SMALL LETTER I + DELETE DOT ABOVE back in lowercase. The same thing applies to LATIN CAPITAL LETTER I + COMBINING DOT ABOVE. Survives infinite number of case conversions.


> On the other hand almost everyone abuses the minus sign as a dash

Unicode calls it HYPHEN-MINUS. It does also have an unambiguous ‘−’ MINUS SIGN as well as ‘‐’ U+2010 HYPHEN and the various dashes, but most people use bad keyboard layouts.


> You could make a case that Unicode should have different "i" characters for different languages.

And different "SS" for any case where the lowercase was an sz, of course at some point Germany introduced an uppercase SZ character to avoid that round trip loss issue, but we still have tons of text that use the old sz -> SS conversion. Also note that "y" in Germany, not all German speaking countries follow the same rules for sz, some dropped it entirely. We basically need something like the time zone database to have even a snowballs chance in hell to handle text correctly.


Well a round-trip or two could still be ambiguous which could easily fail when comparing strings later in some edge case. Especially when we can't even consistently agree to use by-application, by-OS, by-language and by-locale settings consistently. I don't have a solution, just pointing out that this is a really challenging problem to fully solve.


Pretty sure that’s not true. When you switch your keyboard you will have a proper i character in another language unless your keymap is broken. How do you think Chinese, Russians or Greek type their characters?


The grandparent obviously meant “latin i”; none of the three languages you mention have any latin letters, but at least Russian and Greek have some lowercase and some more uppercase letters with the same glyph/shape as latin ones.


Yeah, and those similar glyphs are not available on their own language keyboard.


I frequently type German with a US layout with dead keys (so I can type "a to get ä). I also imagine that most Turkish developers type English on a Turkish layout, since Turkish contains all characters used by English.


I'm going to be a bit controversial here and say that that mapping logic should always exist even if toLower() were reliable across all locales. You're mapping between different use cases here, eg. internal to logfile to API to database to method name to whatever, and inserting magic transformations in your constant values rather than treating them as different tokens for different use cases constrains you and introduces unnecessary amounts of "magic".



Just for the record, something very similar can happen when creating CDs/DVDs (worth reading up on when using mkisofs and similar tools): depending on the ISO 9660/Joliet/Rock Ridge convention in use, the dash (among other characters) becomes an underscore when names are "capitalized".

https://web.archive.org/web/20151007005513/http://www.911cd....


The practice of converting enum-like keys into their string representation by using toString, toLower, etc seems convenient but gets very contrived very fast. How do you deal with underscores? What about using the message in a sentence? I say, use the enum in your code as a conditional or something but always explicitly write out the messages intended for the user.
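
A small Python sketch of the explicit approach (the enum and labels are invented for illustration):

    from enum import Enum

    class Level(Enum):
        INFO = 1
        WARNING = 2

    # Explicit user-facing strings instead of Level.INFO.name.lower()
    LABELS = {Level.INFO: "info", Level.WARNING: "warning"}

    print(LABELS[Level.INFO])  # 'info', regardless of locale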



I learned about this in javascript when I discovered Angular has its own lowercase method. Apparently it's internal only now.

https://github.com/angular/angular.js/commit/1daa4f2231a89ee...


Yeah, there were some weird bugs about that. I remember one in a media player. Also "info".upper() would be İNFO probably.



I think we should have stopped at ASCII. I don't care that my language has letters that aren't in there; it'd be neater if we just did now what we did back then: "This is a computer, so everything is in English" :) Or adapt the alphabet to use ASCII.


In the Danish locale "aa" doesn't start with "a".


Dumb question: if you really need the exact string "info" in a given context, why not hard-code it? What does .lower() or even a map like the linked one actually buy you?


Maybe the input is case-insensitive. For example, if you work with HTML you might see "DIV" or "div", and who knows, some crazy dev or tool might generate "DIv" or "dIv", so it's simpler to lowercase the input and then work on it.


Presumably it's for normalising input. Following the principle that you ought to be permissive in what data you accept, and strict in what data you give out.


Wouldn't converting to NFKD/NFKC first solve this issue too? My understanding of those forms was that they're made exactly for this case.


Case mapping and case folding are independent of normalization (in practice, and it is the case here; see the end of SpecialCasing.txt).

There is a good Unicode FAQ on the topic: < http://unicode.org/faq/casemap_charprop.html >

E: to elaborate, I'm not sure whether the independence of case handling and normalization is guaranteed anywhere; if we were, for example, to change the uppercase of ſ to something other than S, then the case handling of its compatibility form (s) would differ. In practice, SpecialCasing.txt is designed to "make it work" (e.g. ſ uppercases to S).


No, these are ASCII strings, so they are already normalized.


Oh, I haven't used python much, but I thought it's all Unicode? If this were ascii it would work out of the box since there is no dotless lowercase i in ascii.


There is no code point for TURKISH LOWERCASE DOTTED I, nor for TURKISH UPPERCASE DOTLESS I, which means the text doesn't carry enough information for round-trip preservation.

I believe this has proven to be a mistake but I'm not an expert. I don't know why it wasn't done.


The "İ" strikes again


what a gorgeous source-comment. Makes the non-obvious crystal-clear.



