The string type is broken

tomp · on Nov 27, 2013

The problem with text (that Unicode solves only partially) is that text representation, being a representation of human thought, in inherently ambiguous and imprecise.

Some examples:

(1) A == A but A != Α. The last letter is not uppercase "a", but uppercase "α". Most of the time, the difference is important, but sometimes humans want to ignore it (imagine you can't find an entry in a database since it contains Α that looks just like A). Google gives different autocomplete suggestions for A and Α. Is this outcome expected? is it desired?

(2) The Turkish alphabet is mostly the same as the Latin alphabet, except for the letter "i", which exists in two variants: dotless ı and dotted i (as in Latin). For the sake of consistency, this distinction is kept in the upper case as well: dotless I (as in Latin) and dotted İ. We can see that not even the uppercase <==> lowercase transformation is defined for text independently of language.

These are just two examples of problems with text processing that arise even before all the problems with Unicode (combining characters, ligatures, double-width characters, ...) and without considering all the conventions and exceptions that exist in richer (mostly Asian) alphabets.

taeric · on Nov 27, 2013

I think your first assertion can be strengthened even further. It isn't like this is unique to letters that look the same. That is, sometimes WORD != WORD. Consider a few common words. Time? As in Time of day? As in how long you have? An interesting combination of the two? Day? As in a marker on the calendar? Just the time when the sun is out? Then we get into names. Imagine the joy of having to find someone named "Brad" that isn't famous. From a city named Atlanta, but not the one in GA. (If you really want some fun, consider the joy that is abbreviations. Dr?)

derleth · on Nov 27, 2013

Except these are all well outside the ambit of what programmers usually think of as text processing, so they won't try to solve them using the same tools.

More to the point, they sound hard, so people won't be so quick to claim they've solved them.

On the other hand, case-insensitive string matching sounds easy, even if it's actually somewhat difficult due to the language dependencies mentioned above, so people will claim to have a general solution that fails the first time it's faced with i up-casing to İ instead of I, or the fact the German 'ß' up-cases to 'SS' as opposed to any single character. (Unicode does contain 'ẞ', a single-character capital 'ß', which occurs in the real world but is vanishingly rare. As far as modern German speakers are concerned, the capital form of 'ß' is 'SS'.)

http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

http://opentype.info/blog/2013/11/18/capital-sharp-s-design-...

http://blogs.msdn.com/b/michkap/archive/2009/07/28/9850675.a...

http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a...

taeric · on Nov 27, 2013

Right, I do not disagree. I just feel better treating them the same. That is, both are actually easy and reliable so long as you realize you have to make some gross simplifications. And most of the time your life will be much easier if you start with the gross simplifications and try to expand beyond them only when necessary. (This is also why I'm loathe to try programming in unicode...)

zokier · on Nov 27, 2013

I think (2) is an issue with Unicode specifically. They should have specified Turkish alphabet to use ı and a diacritic to make the dotted one. That would have made (in this case) capitalization locale-independent.

samatman · on Nov 27, 2013

While that's a problem with Unicode, it's a really big problem with Unicode. As the name alludes to, Unicode preserved as much as possible of existing regional encodings, which is why (among other reasons) there's a pre-composed version of basically every accented Latin letter.

jheriko · on Nov 27, 2013

isn't this solving the wrong side of the problem? how about not having to think about such things at all and just accepting that uppercase/lowercase conversion is never going to be language agnostic.

thats futureproof and powerful, rather than extra thinking and work...

zokier · on Nov 27, 2013

Most likely case-changes need to be locale-aware, that is true. But still I think minimizing number of locale-specifics is a reasonable goal and in that light I dislike the common usage of turkish i as a example because it is such a obviously fixable (if legacy stuff wasn't concern) flaw in Unicode rather than fundamental issue.

jheriko · on Nov 28, 2013

You are right, everything should be as easy as possible. This is a good philosophy for design in general...

ygra · on Nov 27, 2013

Homoglyphs vary sometimes with text styles, though. So Α doesn't always have to look like A. Or, more to the point, while T and т might look alike, T and т often do not (the latter of which often looks like m). So even as humans we need to keep track of the script at times.

pfortuny · on Nov 27, 2013

The funny thing is that, according to "the rules" (the Real Academia de la Lengua Española), in Spanish we should be always using \u0130, but of course no one does...

pilif · on Nov 27, 2013

A nitpick from the article

>This spells trouble for languages using UTF-16 encodings (Java, C#, JavaScript).

if they were using UTF-16, this wouldn't be a problem as UTF-16 can be used to perfectly well encode code points outside of the BMP (at the cost of losing ability for O(1) access to specific code points of course. If you need to know what the n-th code point is, you have to scan the string until the n-th position).

They are, however, using UCS-2 which can't. If you use a library that knows about UCS-2 to work on strings encoded in UTF-16, then you will get broken characters, your counts will be off and case transformations might fail.

Most languages that claim Unicode support still only have UCS-2 libraries (Python 3 is a notable exception)

Renaud · on Nov 27, 2013

Exactly, UTF-16 perfectly defines surrogate pairs for code points that do not fit into the 16 bit plane. A perfect implementation of UTF-16 should have no problem, unfortunately, most are broken when it comes to surrogate pairs.

Many languages pre-date the introduction of UTF-16 and implemented 16 bit string encoding as UCS-2, and still do.

Then there are oddities like VBA using UTF-16 internally, but converting all strings going through the Win32 API as 8-bit (relying on the current code page for character translation!)...

kvb · on Nov 27, 2013

.NET uses 16-bit characters, but you can use the System.Globalization.StringInfo class to iterate through a string one Unicode character at a time, index into strings by Unicode character, etc. The API's a bit awkward, but it works.

panzi · on Nov 27, 2013

One unicode character at a time or one unicode codepoint at a time? (see character composition)

masklinn · on Nov 27, 2013

Codepoint. Java 5 also added new string APIs for this.

IIRC, Cocoa is one of the very few frameworks/languages/whatever which provides APIs for manipulating and iterating on grapheme clusters out of the box. And provides a page explaining some of the unicode concepts and how they map to NSString: https://developer.apple.com/library/mac/documentation/Cocoa/...

kvb · on Nov 28, 2013

This is incorrect - see the definition of a text element in the remarks of http://msdn.microsoft.com/en-us/library/system.globalization....

masklinn · on Nov 28, 2013

Thanks for the info.

danbruc · on Nov 27, 2013

The .NET StringInfo class provides methods to work at the grapheme level, not code points.

kvb · on Nov 27, 2013

Good point - I meant one codepoint at a time.

kvb · on Nov 27, 2013

But I was wrong and it's actually by grapheme, as danbruc correctly notes.

jeltz · on Nov 27, 2013

> Most languages that claim Unicode support still only have UCS-2 libraries (Python 3 is a notable exception)

Most non-JVM languages[1] actually use UTF-8 as the internal encoding so they should not suffer from this. Python 3 does not use UTF-16 either, it selects an encoding based on the contents of the string.

http://www.python.org/dev/peps/pep-0393/

1. I think .NET too uses UCS-2 or UTF-16, but I am not a Windows developer.

pilif · on Nov 27, 2013

.NET uses UCS-2 because the Windows API uses UCS-2 (so when you use Visual Studio out of the box, you will get UCS-2). ECMAScript (JS) uses UCS-2 because that's all there was when the spec was written.

Other scripting languages I know for certain are

- PHP doesn't care and treats strings as arrays of bytes. All the str functions operate on these byte arrays and thus happily destroy your strings if they are encoded as anything but the old 8-bit encodings. If you need to support utf-8, you have to use different library functions (mb_*) and a special syntax in their regex support (/u modifier).

- Python < 3 treats strings as byte arrays or UCS-2 depending on whether you use the byte type or the Unicode type. As such, it has all the same issues as all other UCS-2 libraries

- Ruby < 1.9 treats strings a byte arrays. There is some limited UTF-8 support, but it's in additional libraries. The internal API is treating strings as byte arrays. Ruby >= 1.9 lets you chose your internal encoding. Most people use utf-8, but you don't have to.

- Perl I don't know enough about, but I hear it as an UTF-8 mode that is actually well-supported by the language itself and gets almost everything right.

These are the more common scripting languages.

Of the compiled languages, I know for certain about Go (utf-8; good library support), C (OS dependent, but the standard string API treats strings as byte arrays), C++ (dito) and Delphi (UCS-2 since 2010, byte arrays before that)

I would say that there are so many exceptions to the UTF-8 rule that I wouldn't say "most" languages are using UTF-8.

masklinn · on Nov 27, 2013

> - Python < 3 treats strings as byte arrays or UCS-2 depending on whether you use the byte type or the Unicode type. As such, it has all the same issues as all other UCS-2 libraries

It's Python < 3.3 (the Flexible String Represrntation was introduced in 3.3), there's a byte array type (str in P2, bytes in P3) and a string type (unicode/str), which may be UCS2 ("narrow" builds, the default) or UCS4 ("wide" builds, set by many linux distros)

berdario · on Nov 27, 2013

Python <3.3 uses UCS2 or UCS4, depending on the build

Ruby >1.8 lets you choose the encoding

.NET UCS2/UTF-16 (I know the difference, imho if the stdlib has a .size, .length or .count that works on code units instead of code points it's broken... thus I'll mention only UCS2 from now on)

Java UCS2

Clojure UCS2

Scala UCS2

QT UCS2

Haskell String UCS4

Haskell Data.Text UTF-16 (yes, not a naive UCS-2)

Rust UCS4 (last time I checked)

Javascript UCS2

Dart UCS2

PHP Unicode-oblivious

Vala UCS4

Go UTF-8 (but it lets you call len() on strings, and it doesn't return the length of the string, but its size in bytes)

I can't really think of another language that uses UTF-8 internally, are you sure?

pitterpatter · on Nov 27, 2013

> Rust UCS4 (last time I checked)

Rust chars are 32bit Unicode codepoints. But strings themselves are utf-8. That is the string type, ~str, is basically just ~[u8], a vector of bytes and not ~[char].

`.len()` [O(1)] gives you byte length while `.char_len()` [O(n)] gives you the number of codepoints.

So strings in rust are just vectors of bytes with the invariant that it's valid utf-8.

berdario · on Nov 28, 2013

Thanks, I didn't know that

sedachv · on Nov 27, 2013

Common Lisp comes with two character types, base-char and character, the former being allowed to be a subset of the latter. Clozure Common Lisp uses UTF-32 for all characters and strings internally. SBCL uses base-char and simple-base-string types for ASCII and character and (simple-array character) types for UTF-32 internally. IMO having this option for two types of characters that are compatible but may have different internal representations is a really good part of the Common Lisp standard.

jeltz · on Nov 28, 2013

Perl, Rust, Go and Vala. I take back the "most" part though. It seems like there are many popular solutions.

Erlang uses Unicode code points and also binaries.

pilif · on Nov 27, 2013

Python 3 gets so much of this right. It's one of the things I really loved about python 3 as it allows for correct string handling in most cases (see below).

Note that this is only really true with Python 3.3 and later as in earlier versions stuff would start breaking for characters outside of the BMP (which is where JS is still stuck at, btw) unless you had a wide build which was using a lot of memory for strings (4 bytes per character)

In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

But: Due to http://en.wikipedia.org/wiki/Han_unification being locked into Unicode with a language might not be feasible for everybody - especially in Asian regions, Unicode isn't yet as widely spread and you still need to deal with regional encodings, mainly because even with the huge character set of Unicode, we still can't reliably write in every language.

Ruby 1.9 and later helps here by having many, many string types (as many as it knows encodings), which can't be assigned to each other without conversion.

This allows you to still have an internal character set for your application and doing encoding/decoding at i/o time, but you're not stuck with unicode if that's not feasible for your use-case.

People hate this though because it seems to interfere with their otherwise perfectly fine workflow ("why can't I assign this "string" I got from a user to this string variable here??"), but it's actually preventing data corruption (once strings of multiple encodings are mixed up, it's often impossible to un-mix them, if they have the same characer width).

I don't know how good the library support for the various Unicode encodings is in Ruby though. According to the article, there still is trouble with correctly doing case transformations and reversing them.

Which brings me to another point: Some of the stuff you do with strings isn't just dependent on string encoding, but also locale.

Uppercasing rules for example depend on locale, so you need to keep that into account too. And, of course, deal with cases when you don't know the locale the string was in (encoding is hard enough and most of the cases undetectable - but locales - next to impossible).

I laugh at people who constantly tell me that this isn't hard and that "it's just strings".

lelf · on Nov 27, 2013

> Python 3 gets so much of this right

What does it gets right????? It's all broken as nearly everything else!

It's sad 99% comments there are “oh see, I can run some examples from page just fine. So everything's all right, I've got full Unicode!”

The reality is there's 1-2 languages that are trying to make it correct from the beginning (perl6, I'm looking at you). It's 2013 and if language can compose bytes to code points everyone declares a win, sticks "full unicode support" label to it and continues to str[5:9].

”But I've got UnicodeUtils!” — it won't help. People just don't want or cannot write it correctly. Word is not [a-z]. Not [[:alpha:]] either. And not [insert regex here]. You cannot reverse by reversing codepoint list. And you cannot reverse by reversing grapheme list. And string length is hard to compute and then it doesn't any sense. And indexing string doesn't make any sense and it's far away from O(1)

wbond · on Nov 27, 2013

Can you provide some examples of Python 3 getting strings wrong?

Between strings being native unicode code points (you have to encode to bytes to get UTF-8) and unicodedata for normalization and decomposition (http://docs.python.org/3.3/library/unicodedata.html) I've found Python 3 pretty robust. Python 3.3 also uses appropriate Unicode data for regular expressions, as mentioned on http://docs.python.org/3.3/howto/regex.html.

If you want to compare strings you should really normalize them first, which is where unicodedata comes in. In my programming situations it would be wrong to conflate different decomposition of the same unicode string. Why is this? Because other software you interact with uses encodings and the UTF-8 encoding of two different decompositions if different. I've run into this with UTF-8 filenames on OS X when working with Subversion.

lelf · on Nov 27, 2013

Did you read the comment you're replying to at all? You can start at “It's sad 99% comments”.

PS:

  Python 3.3.2 (default, Nov 27 2013, 20:04:48)
  [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> 'öo̧'[1:]
  '̈o̧'

And sorry, those new regexes don't even support \X (grapheme matching)

Edit: python version

wbond · on Nov 27, 2013

Yes, I did, and you did not provide a single example. You just said "“oh see, I can run some examples from page just fine. So everything's all right, I've got full Unicode!".

Taking the time to actually prove your point it useful. However, your recent example seems to be running fine on Python 3.3. You did not include any version info in your example output.

    Python 3.3.0 (default, Mar 11 2013, 00:32:12) 
    [GCC 4.7.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "öo̧"[1:]
    'o̧'
    >>>

I haven't run across any situations where Python 3.3 is doing wrong, which is why I am asking for some examples.

lelf · on Nov 27, 2013

3.3.2. No, it is not. Use 'o\u0308o\u0327'

wbond · on Nov 27, 2013

Oh, I see the issue here. You are expecting the string class to function via graphemes rather than characters. It should be possible to implement grapheme support since character support is there, but I imagine the reverse is not true.

A little googling turned this up. https://mail.python.org/pipermail/python-ideas/2013-July/021...

tekacs · on Nov 28, 2013

TL;DR of parent comment here for those skimming:

  x = 'o\u0308o\u0327'
  len(x) == 4
  x == "öo̧"
  x[:2] == "ö"
  x[2:] == "o̧"
  x[0] == x[2] == "o"

wbond · on Nov 27, 2013

You got me curious about grapheme matching in Python with regex. It looks like it is not in the stdlib yet with 3.3. However, it you install https://pypi.python.org/pypi/regex and then replace:

    import re

with

    import regex as re

Then if you want to get into using graphemes slicing, you could use something like:

    import regex as re
    
    decomposed_str = 'o\u0308o\u0327'
    graphemes = re.findall('(\\X)', decomposed_str)
    sub_graphemes = grapheme[1:]
    decomposed_substr = ''.join(sub_graphemes)

e12e · on Nov 28, 2013

But what is that sequence (I know the unicode sequence is listed below -- but is it some wierd edge-case)? — because if I manually compose/type those (and a few other characters) everything seems to work fine:

    [edit: Python 3.2.3]
    [edit: [GCC 4.7.2] on linux2]

    >>> 'öo̧'[1:] #copy-paste
    'o̧'
    >>> 'öo̧'[::-1] # "reverse" also breaks
    '̧oö'
    #But for Japanese:
    >>> '日本語'[1:]
    '本語'
    >>> '日本語'[:-1]
    '日本'
    >>> '日本語'[-1:]
    '語'
    >>> '日本語'[::-1]
    '語本日'
    # And Norwegian
    >>> 'æåø'[::-1]
    'øåæ'
    # And a few "French" characters (in this case
    # manually typed as alt+~+e, etc
    >>> 'ẽêèe'[::-1]
    'eèêẽ'
    # And crucially for your example, typed as
    # alt+"+o
    >>> 'öo'[::-1]
    'oö'

So is your initial example some kind of unicode-without-bom(b) or something?

[edit2: I gather, that working with "pre-composed" characters work, and working with "de-composed" ones break. Which, while expected, is a little sad, I agree.]

exDM69 · on Nov 27, 2013

> Python 3 gets so much of this right. It's one of the things I really loved about python 3 as it allows for correct string handling in most cases (see below).

One of the biggest things that I feel Python gets right with the string type is that strings are immutable. It makes a lot of things easier.

It really makes sense to have a good string type for small strings, stored in unicode. Immutability makes everything simpler.

The string type is not a good fit for handling large amounts of text. There are trade offs for efficiency that have to be made to create a handy string type. It really makes sense to have a separate "bytes" type or some kind of StringBuffer for doing big text operations.

wsc981 · on Nov 27, 2013

Isn't the string type immutable in many (most?) other languages as well? In Objective-C the default is an immutable string (though optionally

one can create mutable strings as well). Lua also uses immutable strings. In Java and C# I think the situation is the same, since if you want

to use high performance string manipulation, you'll generally resort some form of StringBuilder helper class.

SideburnsOfDoom · on Nov 27, 2013

Correct, C# and .Net have an immutable string class and a mutable StringBuilder helper class.

meepmorp · on Nov 27, 2013

I believe strings are mutable in Ruby.

reflectiv · on Nov 27, 2013

They are...

    s = "hello"
    s << "   world"
    s # hello world

dbaupp · on Nov 27, 2013

Is that allocating a new buffer, leaving the "hello" string to be collected by the GC?

jeorgun · on Nov 28, 2013

No, it's operating in place:

    def append_world(str)
        str << " world"
    end

    a = "hello"
    append_world(a)
    a                       #=> "hello world"

tedunangst · on Nov 27, 2013

No, it expands the existing buffer. (leaving " world" to be collected). Note that the following is different and more like what you're thinking.

    a = a + " world"

fauigerzigerk · on Nov 27, 2013

>In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

I'm not sure what "internally using unicode" means. Pyhon's internal representation of strings has changed a lot. It hasn't even been stable in Python 3. Now they are apparently using an internal representation that varies depending on the "widest" character stored.

The only solution that isn't driving me insane is to use UTF-8 everywhere. The Python 3 unicode situation is actually the main reason why I'm not using Python much these days.

pilif · on Nov 27, 2013

In Python 3, you don't care about what they use internally. You don't need to.

If you want to work with strings, you work with strings. If you want to work with bytes, you work with bytes. If you want to convert bytes into strings (maybe because it's user input that you want to work with), then you tell Python what encoding these bytes are in and you have it create a string for you. You don't care what Python uses internally, because their string API is correct and correctly works on characters.

That noël example of the original article consists of 4 characters in Python 3 which is exactly what you want.

I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).

UTF-8 also isn't widely used by current operating systems (Mac OS and Windows use UCS-2). It's also not what's used by way too many legacy systems still around.

So as long as the data you work with likely isn't in UTF-8, the encoding and decoding steps will be needed if you want to be correct. Otherwise, you risk mixing strings in different encodings together which is an irrecoverable error (aside of using heuristics based on the language of the content).

fauigerzigerk · on Nov 27, 2013

>In Python 3, you don't care about what they use internally. You don't need to.

I do need to know and I always care. My requirements may be different than those of most others because I write text analysis code and I need to optimize the hell out of every single step. I shiver at the thought that any representation could be chosen for me automatically.

Of course, nothing is stopping me from simply using the bytes type instead of str, but clearly the Python community has decided to go down a road I feel is entirely wrong so I'm not coming along.

>I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).

I'm bound to live in a variable length character world unless I decide to use 4 byte code points everywhere, which is prohibitive in terms of memory usage. Memory usage is absolutely critical. Iterating over a few characters now and then to count them is almost always negligible.

The need to index into a string to find the nth character only comes up when I know what I'm looking for. Things like parsing syntax or protocol headers come to mind, and they are always ASCII. I don't remember a situation where I needed to know the nth character of some arbitrary piece of text and repeat that operation in a tight loop.

If one day I find myself in such a situation I will have to convert to an array of code points anyway.

wbond · on Nov 27, 2013

So in your one, specific, performance-limited situation, Python 3's implementation of unicode doesn't work for you. Mostly because you are trying to optimize based on implementation details.

I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is of concern, I would certainly think anything like Python and Ruby would be out of the running?

fauigerzigerk · on Nov 27, 2013

>I don't see how this equates to a general purpose language failing at strings

And I don't see where I said it did.

I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.

I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.

berdario · on Nov 27, 2013

I strongly disagree

and I'd like to know what do you think the "right thing" would be

I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"... the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings...

there're some weird exceptions, like Haskell Data.Text (I think that's due to haskell laziness)

would you prefer to have O(n) indexing and slicing of strings... or you'd prefer to get rid of these operations altogheter?

if the latter, what'd you prefer to do? force the developers to use .find() and handle such things manually... or create some compatibility string type restricted to non composable codepoints?

Getting an implementation out to see it used in the wild might be an interesting endeavor... probably it'd be easier to do in a language that allows you to customize it's reader/parser... like some lisp... clojure

fauigerzigerk · on Nov 27, 2013

>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"

Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one seperately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.

>the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings

If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

>would you prefer to have O(n) indexing and slicing of strings

I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

berdario · on Nov 27, 2013

> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

Actually, you can in Python... and obviously most developers ignore such issues [citation needed]

My point is that most developers don't know these details, a lot of idioms are ingrained... get them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)

> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

Ok, so with your proposal an hypothetical slicing method on a String class in a java-like language would have this signature?

byte[] slice(int start, int end);

I've been fancying the idea of writing a custom String type/protocol for clojure that deals with the shortcoming of Java's strings... I'll probably have a try with your idea as well :)

masklinn · on Nov 27, 2013

> Actually, you can in Python...

No, you can only get random access on codepoints which will break text as soon as combining characters are involved. Even if you normalize everything beforehand (which most people don't do) as not all possible combinations have precomposed forms.

Unicode makes random access useless at anything other than destroying text.

> but a good stdlib would obviously help immensely in this regard

Which is extremely rare, and which Python does not have.

fauigerzigerk · on Nov 27, 2013

>Actually, you can in Python

You are right (apart from combining characters as masklinn explained), but as I said, that's only possible if an array of 32 bit ints is used to hold string data or if it can be guaranteed that there are no characters from outside ASCII or BMP. If I understand PEP 393 correctly, what Python 3.3 does is to use 32 bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently. http://www.python.org/dev/peps/pep-0393/#new-api

judk · on Nov 28, 2013

Sounds like you want to use Go. Feels like Python, but technically correct implementations of concepts.

microtherion · on Nov 28, 2013

Mac OS and Windows use UCS-2

Which parts of Mac OS? You'd have a lot of problems with Emoji support if that were true. To the best of my knowledge, it's UTF-16 everywhere.

Or do you actually mean Mac OS as in Mac OS 9, and not OS X?

panzi · on Nov 27, 2013

Agreed with most of it except:

"because their string API is correct"

Apparently they have a bug in their UTF-7 parser that can lead to invalid unicode strings. Don't know if it's already fixed.

berdario · on Nov 27, 2013

It was a bug in the decoding: it raised an unexpected exception, nothing that couldn't be worked around with a check (afaik it didn't crash the interpreter)

and it has been fixed since more than 1 month, just 2 days after it was reported

http://bugs.python.org/issue19279

Let's avoid spreading fud, shall we? :)

zokier · on Nov 27, 2013

That would be an implementation flaw, not an API issue.

panzi · on Nov 27, 2013

Indeed.

mathias · on Nov 27, 2013

> stuff would start breaking for characters outside of the BMP (which is where JS is still stuck at, btw)

ECMAScript 6 fixes that, mostly. See http://mathiasbynens.be/notes/javascript-unicode for details.

this_user · on Nov 27, 2013

The string type isn't broken. If anything these "X is broken" posts are broken. Taking one special case, finding problems with that case and deducing that the whole concept must therefore be discarded is just silly. Strings work fine for the vast majority of use cases. No technology is free of flaws and engineering decisions are almost always based on weighting the pros and cons and choosing a solution that on balance works best. Strings are a useful feature and Unicode is a notoriously hard problem. Proposing to go back to arrays of characters makes things worse for most people in most cases and therefore is not a practical solution.

gutnor · on Nov 27, 2013

Vast majority of use cases in the English-speaking world.

In other countries like China, Japan, India, ... those edge cases are common enough to represent a significant portion of use cases and make X truly broken.

The article is maybe a bit provocative, but you know what, that's exactly what is needed to raise awareness of mainly US-centric developers who would completely ignore the technical issues until they face a clone in China whose only innovative feature is not breaking on Chinese text.

hrktb · on Nov 27, 2013

The point is not to reduce the number of options (everyone going back to arrays of characters) but to put the spotlight on some problems where going a level lower could help a lot.

> *Strings work fine for the vast majority of use cases

In the CKJ space (a third of the population ?) strings are "broken" in the vast majority of use cases (really, things like what format you should accept for a telephone number). It get exponentially dirty as you try more complex manipulations, and I think these are interesting problems. It helps discussing them from time to time.

regularfry · on Nov 27, 2013

Conflating the responsibilities of "character list" with "byte array" is always going to go badly.

lnanek2 · on Nov 27, 2013

That's OK, sounds like he is writing a new language, so screwing up on the strings implementation is par for the course. Languages and databases don't typically get correct string handling for many years later after they are born, if ever. Supporting all the unicode and other character set insanity takes years of work. Asking someone writing a language to get strings right is like asking a five year old to obtain a drivers license.

mercurial · on Nov 27, 2013

In my experience, the world is full of software which "work fine for the majority of use cases" until the point where you take the wrong code path and things go south.

jodrellblank · on Nov 27, 2013

Much like human brains, and business processes.

jckt · on Nov 27, 2013

Exactly. Engineering =/= Maths.

judofyr · on Nov 27, 2013

In many languages it's difficult fixing the string type without breaking existing code. In Ruby: String#upcase only handles ASCII (by spec), #length counts codepoints, #reverse reverses codepoints.

You can use UnicodeUtils if you need "full" Unicode support:

    >> UnicodeUtils.upcase("baﬄe")
    => "BAFFLE"
    >> graphemes = UnicodeUtils.each_grapheme("noe\u0308l").to_a
    >> graphemes.reverse.join
    => "lëon"
    >> graphemes.size
    => 4
    >> graphemes[0, 3]
    => "noë"

lelf · on Nov 27, 2013

> String#upcase only handles ASCII (by spec)

Bad for Ruby

> You can use UnicodeUtils if you need "full" Unicode support:

Oh, sure

  Betty:~ lelf$ ruby -r unicode_utils/u -e 'puts UnicodeUtils.each_grapheme("A‮͜CB‬D").to_a.reverse.join'
  D‬BC͜‮A

So, "full" (it's not) Unicode support won't help you if you have little idea about what you're doing (like indexing stringه҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿҈̿s)

cpeterso · on Nov 27, 2013

Strange. Chrome indents those characters into a >-shaped "flock formation", but Firefox renders them as a vertical column.

jhull · on Nov 27, 2013

what are these characters printing here?

preek · on Nov 27, 2013

It's an awesome little gadget - looks like one character, but is a really big messy bunch of bytes:

"\xD9\x87\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\ x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD 2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\ xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\ xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC C\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\ xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x8 8\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\ x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD 2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\ xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xB F\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\ xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xCC\xBF\xD2\x88\xC C\xBF\n"

jhull · on Nov 27, 2013

funny that an ad comes up when you search that character on google. http://imgur.com/Y5vkbHF

notJim · on Nov 27, 2013

Ugh your post is totally breaking the page layout!

jbert · on Nov 27, 2013

Perl seems to pass nearly all the tests (including uppercasing baffle):

  $ perl -E 'use utf8; binmode STDOUT, ":utf8"; say uc("baﬄe");'

BAFFLE

The only failure I can see is that it treats "no<combining diaresis>el" as 5 characters (so reports length as 5 and reversing places the accent on the wrong character). That's documented here: http://perldoc.perl.org/perluniintro.html#Handling-Unicode "Note that Perl considers grapheme clusters to be separate characters"

All else seems to work though (including precomposed/decomoposed string equiality etc). The docco also says that perl's regex engine with Do The Right Thing with matching the entire grapheme cluster as a single char.

cursork · on Nov 27, 2013

Perl is actually very good with Unicode. Note that a character is "The smallest component of written language that has semantic value" according to the Unicode glossary - I'd say Perl respects that meaning. As noted in the docs, graphemes can be handled with \X in regular expressions (although admittedly that's not pretty):

    my $length = 0; $length++ while $dec =~ /\X/g;

Note that a grapheme is defined as "A minimally distinctive unit of writing in the context of a particular writing system" - i.e. context is required to determine what a grapheme actually is. A few others have pointed that out... Given the definitions from Unicode, Perl does a pretty good job (esp. when using Unicode::Normalize to normalize input).

http://www.unicode.org/glossary/

llimllib · on Nov 27, 2013

Python 3 gets that one, but python 2.7 doesn't:

    $ python3 -c 'print("baﬄe".upper())'
    BAFFLE
    $ python -c 'print "baﬄe".upper()'
    BAﬄE
    $ python -c 'print u"baﬄe".upper()'
    BAÏ¬E

deathanatos · on Nov 28, 2013

It's interesting that you get BAFFLE for the first one. I get the same result in both 3 and 2.

Note first that the reason you get "BAÏ¬E" is a bit of garbarge-in garbage-out. Strangely, the interpreter isn't rejecting that with the typical "SyntaxError: Non-ASCII character <char> in file" error; instead, it appears to be assuming ISO-8859-1, and then performing .upper(). You can fix that:

    python2 -c '# coding: utf-8
    print u"baﬄe".upper()'

(Note, of course, that the #coding needs to match your terminals encoding, which is likely UTF-8, but it isn't guaranteed.)

That, for me, prints "BAﬄE" in both Python 2 and 3 (adjusting for 3 by adding parens around print, and removing the u prefix on the literal.) I'm on Python 3.2, so perhaps 3.3 does better. (I'm behind on updates, but last I did update, Gentoo stable was still on 3.2.)

llimllib · on Nov 28, 2013

Interesting.

I am on python 3.3, but I don't know if it's the updated interpreter that fixes the bug.

jmah · on Nov 27, 2013

Also Cocoa's NSString:

  [@"baﬄe" uppercaseString]; // @"BAFFLE"

edent · on Nov 27, 2013

˙ƃuᴉuɐǝɯ ⅋ 'spɹoʍ 'sɥdʎlƃ 'sɹǝʇɔɐɹɐɥɔ uǝǝʍʇǝq 'ɹǝʌǝʍoɥ 'ǝɔuǝɹǝɟɟᴉp ɐ sᴉ ǝɹǝɥ┴ ˙ʇxǝʇ ɥʇᴉʍ punoɹɐ ƃuᴉsɹɐ oʇ sǝɯoɔ ʇᴉ uǝɥʍ sǝᴉʇᴉlᴉqᴉssod ƃuᴉʇsǝɹǝʇuᴉ ǝɯos sɹǝɟɟo ǝpoɔᴉu∩

vorg · on Nov 27, 2013

Which looks cleaner, "丄" or "┴" ? I.e. "ǝɹǝɥ┴" or ǝɹǝɥ丄" ?

antocv · on Nov 27, 2013

Awesome way to exercise the brain.

afandian · on Nov 27, 2013

Interesting. I had no problem reading that (except for 'arsing' which I thought I'd misread). The ability of the brain to pattern-match upside down is amazing.

aroman · on Nov 28, 2013

Fascinatingly, I read your comment first, then tried to read the upside down post — the only word I had trouble with was arsing.

Sight reading is really fascinating.

ceras · on Nov 28, 2013

Varies by person. I'm nearly 100% incapable of reading upside down text.

lmm · on Nov 27, 2013

I think the mistake here is seeing a string as an extension of an array or vector. What I would prefer is a string type that didn't support all the operations of vectors. The length of a string is not inherently a meaningful question (and for the cases where it is, what you want is something like a vector of grapheme clusters - which is a useful type to have, but not so useful that every string in your program should incur the overhead of creating such a thing); likewise reversing and splitting are operations that simply shouldn't be allowed for your "fast path, undecoded string" type.

rlpb · on Nov 27, 2013

I'm with you here; but in that case, I'd like an ascii_string type, which most languages don't provide specifically. This type _would_ support string reversal, substring slices, and so on, but be limited to 7-bit ASCII only. I think there are many use cases that are purely internal, and don't need i8n. It's handy to be able to do things, including operations on strings, for internal things. Filename handling where you control the filename, the "keys" in languages which use strings for a dictionary type, and so forth.

jeltz · on Nov 27, 2013

I think this might just confuse new programmers and the filename thing is especially dangerous since at some point you might want to support i18n there. I think it would be better to have two types of string: 1) unicode strings and 2) arrays of 8 byte data with some string like functions (essentially C strings). The second case is essentially binary data strings.

tshaddox · on Nov 27, 2013

A big problem here is a lack of clear definitions for various concepts like "character," "reversed string," "upper case," etc. The author briefly recognizes this, but brushes it off with statements like "I generally expect that..." and "I assume most people would not be happy with the current result."

I think these hand-wavings aren't helpful. Short of extensive surveying, which is bound to be controversial no matter what the result, talking about "general expectations" is a purely subjective notion, and not a good way to evaluate the actions of cold, soulless silicon that is just following orders.

Like the author, I also consider myself a mostly reasonable person, yet is might come up with very different expectations. If I saw that "ffl" ligature, how would I know it's a ligature and not some single unrelated character in another language? You might respond "but it's clearly part of the word 'baffle' and should be capitalized thusly." But would you suggest that string libraries ship with word lists and perform contextual analysis to determine how to perform string operations? Surely that's a fool's errand, not to mention that it would inevitably produce unexpected results.

itsybitsycoder · on Nov 27, 2013

"If I saw that "ffl" ligature, how would I know it's a ligature and not some single unrelated character in another language?"

Because the name of that character is "Latin Small Ligature ffl". Knowing to capitalize ﬄ as FFL doesn't require a word list any more than knowing to capitalize "ffl" does.

masklinn · on Nov 27, 2013

I'm not sure I agree with the title, although I do agree with just about all of the content:

* a string type is probably a good idea to bundle the subtleties of unicode, a plain array or list (whether it's of bytes or of codepoints) won't cut it: standard array operations are incorrect/invalid on unicode streams

* the vast majority of string types are broken anyway, as even in the best case they're codepoint arrays (possibly with a smart implementation). The bad cases are just code unit arrays, which break before you even reach fine points of unicode manipulation

And then, you've got the issue that a lot of unicode manipulation is locale-dependent, which most languages either ignore completely or fuck up (or half and half, for extra fun)

b-johansson · on Nov 27, 2013

If you are actually manipulating strings rather than just storing and pushing them around I would suggest looking at ICU. Handling Unicode is difficult and it's easy to confuse encodings, code points and glyphs or make assumptions based on your own culture and language.

ICU has support for a lot of the basic operations you would want to perform on strings as well as conversion to whatever format is suitable for your platform and environment.

billpg · on Nov 27, 2013

Do people really need to reverse strings in the real world?

I don't think I've ever written code to do that outside of homework assignments and interviews.

ygra · on Nov 27, 2013

Substrings exhibit similar problems and those are used quite often. It's just that in this case the effect of seeing it fail is a little more dramatic (i.e., l̈ – which doesn't even seem to render properly here).

jeltz · on Nov 27, 2013

"l̈" renders just fine for me, maybe your font does not include it.

ygra · on Nov 27, 2013

Verdana doesn't seem to properly support U+0308, apparently. It's wrong (with that font) in Chrome, IE 10, Firefox and Word 2010. Other operating systems might substitute a different font that works better, perhaps.

jeltz · on Nov 27, 2013

Yes, I am running Debian without having installed the Microsoft core fonts so Verdana is substituted for DejaVu Sans.

abcd_f · on Nov 27, 2013

May be not reversing, but trimming a Unicode string to certain character count is a close relative and it is a very common operation.

lmm · on Nov 27, 2013

Right, but what's the count you want there? It's either a byte count or a grapheme cluster count. The .count() on most current languages' string types doesn't correspond to either of those, so isn't really useful.

jeltz · on Nov 27, 2013

What do you use it for? Unless you have a monospaced font the number of characters do not mean much. So unless you are implementing command line tools or text editors it should not be that common.

Pxtl · on Nov 27, 2013

Truncating with ellipsis in the GUI in a desktop app. I can measure rendered length on a desktop, so I can truncate down to the desired number of pixels, round down to the nearest char, and then tack on "...". I would hate to see a semantically-important accent mark lost this way.

jerf · on Nov 27, 2013

I have a database field limited to 100 "characters" [1]. The user sent me a form submission with 150. I need to do something to resolve that. This is incredibly common. Truncation to a defined size is routine.

[1]: I'm leaving "characters" undefined here, because no matter what Unicode-aware definition you apply here, you've got trouble.

reginaldjcooper · on Nov 27, 2013

This is a good real-world example and the response is an armchair programmer informing you that you are doing it wrong. The internet is rife with know-it-alls. "Just do X." Well, I cannot because I am contractually obligated to write the software as specified and not cowboy up and do whatever I like.

Maybe someone decided 100 characters was a reasonable cutoff and that field is not important enough to reject (read: increase bounce rate) on if someone manages to send too much.

Maybe the 100 characters is a short string generated from an unrestricted long string and cached on a separate server.

al2o3cr · on Nov 27, 2013

"I have a database field limited to 100 "characters"."

Well there's your problem right there...

"The user sent me a form submission with 150. I need to do something to resolve that."

Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash. If you must have fixed-length fields, surely telling the user "much characters, wow overflow" is better than just chopping the input.

jerf · on Nov 27, 2013

Since this seems to be confusing people, I'm providing a small hypothetical example here.

"Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash."

You are reading far more in than I put in. I merely said somehow you need to resolve this; you put a particular resolution in my mouth, then attacked.

I did choose the web for one reason, which is that you can't avoid this case; you can try to limit the UI to generating 100 characters only (and I still haven't defined "characters"...), but it's 15 seconds for a user to pull open Firebug and smash 150 characters into your form submission anyhow. Somehow, you better resolve this, and as quickly as you mounted the high horse when faced with the prospect of mere truncation, throwing the entire request out for that will cause somebody else to mount an equally high horse....

glhaynes · on Nov 27, 2013

What if it's a batch ETL process where there is no "user" to tell that it went wrong?

The point that when you're worrying about string length, it's often an indicator of a separate problem is a good one. But some things really do need the ability to measure/truncate strings and not every situation allows just throwing the software in the trash as an option.

jeltz · on Nov 27, 2013

Have you checked how your database counts? Does it count code points or does it try to count graphemes? I assume the former, but I guess you would still have to cut the input at a grapheme border when truncating the input.

ygra · on Nov 27, 2013

Ellipsisising text when it does not fit into a label, for example. And if you just remove code points from the end (instead of graphemes) until the string (including ellipsis) fits then you might just drop a diacritic.

ashashwat · on Nov 27, 2013

You have a search query and you want to remove stopwords and normalize the query.

chiph · on Nov 27, 2013

I've been waiting for someone to ask me to reverse a string in an interview, so I can tell them why the code I just wrote for them (using the XOR trick, which is what they're usually expecting), is wrong.

zwily · on Nov 27, 2013

When I've asked people to reverse a char* in the past, it's just been to see if they understand the basics of pointers. The XOR "trick" hasn't been impressive since high school. :)

jmpe · on Nov 27, 2013

Had such a case a few months back. Strings of single-byte characters are Endian-agnostic but multi-byte character encoding is affected by Endianness. To cope with it I read the sequence as single byte, then reversed, then changed the encoding to proper encoding and reversed again. The data came from a binary dump where I only needed a section that contained a few strings.

I admit it's dirty but it was throwaway code for an isolated case.

Edit: eh, guys, as I stated the string came from a binary dump. I didn't get to choose the encoding, it came from ROM in an embedded system with a different Endianness. I had to figure out a way to make it human readable.

justincormack · on Nov 27, 2013

Use UTF8, no endian issues. Thats yet another reason why UTF16 and UTF32 are broken.

monkeyninja · on Nov 27, 2013

language will not store unicode string internally with UTF8. Yes, we use it as input and output, but in memory, utf8 is terrible for random access characters. endian is only an issue (normally) for input and output, not really an issue for internal storage. especially when using UTF16 and UTF32 you know exactly the size of items.

ygra · on Nov 27, 2013

UTF-16 is just as bad as UTF-8 regarding variable-width code points. The only thing you always have (unless using compression schemes like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, that's still of questionable value because when dealing with text you often want to handle graphemes, not code points, code units or bytes.

jeltz · on Nov 27, 2013

You cannot do random access at all in Unicode, not even UTF-32 (and absolutely not UTF-16), due to combining characters.

jmdavis · on Nov 27, 2013

UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).

jeltz · on Nov 27, 2013

For this you do not need to reverse the string in a unicode aware way. You need to operate on the raw bytes.

markild · on Nov 27, 2013

The same mechanism that is used for reversing a string can be very useful though.

Think in the lines of python's:

>>> 'abcd'[:-1]

'abc'

rayiner · on Nov 27, 2013

This is why the U.S. dominates the software world. Back when everyone was figuring out how to express their languages, we had the option to punt on complexity and just use ASCII.

adamtj · on Nov 27, 2013

ASCII doesn't make the U.S. special. ASCII is special because it's from the U.S.

Lots of people speak languages that trivially fit in 8 bits with no real "figuring out" to do. Before Unicode, we all had our different codepages or encodings. Including the U.S.

The U.S. is pretty central to computing. Because of that, and because ASCII only uses 7 bits, some other 8-bit cultures use it as a subset for their native 8-bit encodings. Even in the U.S, we use extensions to ASCII so we can represent text in languages that are close cousins to English. I doubt you actually use ASCII much. You've probably been using either ISO 8859-1 (aka Latin-1), which is a superset of ASCII, or Windows-1252, which is a superset of Latin-1.

http://msdn.microsoft.com/en-us/library/cc194884.aspx

This mess of incompatible codepages and culture specific encodings is one of the main problems that Unicode was invented to solve. It also happens to help languages which need more than 8 bits.

rayiner · on Nov 27, 2013

Many languages fit into 8 bits, but English is particularly simple in its alphabet. Even many of the European languages that can fit in 8 bits have things like accented characters that complicates things somewhat.

Of course this isn't to say English is simple overall. Just that it's complexities lie elsewhere, and it's simplicities lie in an area that made it particularly simple for early computer systems to process.

meepmorp · on Nov 27, 2013

> Even many of the European languages that can fit in 8 bits have things like accented characters that complicates things somewhat.

I don't see your point here, with respect to English orthography making computer implementation easier. How exactly does not needing representations for accented characters make anything easier?

ianbicking · on Nov 27, 2013

If it was just some additional characters like ñ (which is considered a letter of its own, not an accented n) then it wouldn't be a big deal – but e and é are the same letter with different accents, which adds some subtlety that English simply doesn't have. Given a small enough number of accented characters you can punt on that, call them each a character, but English is objectively simpler since the only real distinction it has between letters is caps or not-caps. (I was just watching the Mother Of All Demos, though, and everything was in caps but they put an overline over capital letters. So even normal English lettering was too complicated for a while.)

Groxx · on Nov 27, 2013

It has fewer characters (don't need one for each accent, possibly exceeding 8 bits otherwise) and/or no variable width characters. Also capitalization rules are trivial.

Not that I'm claiming English is unique here, just convenient, and many languages can't claim that.

sesqu · on Nov 27, 2013

Meanwhile, many of those languages with accented characters have no use for letters like z.

It's not really worth mentioning the alphabet when talking about unique features of English.

Dewie · on Nov 27, 2013

It seems rather that it is the other way around - the US dominated (and still does) the computer industry, and so ASCII, the English-centered character set, became the standard. ASCII is good enough (you might lose some accents on certain characters in certain words and such, but nothing much) for English but has no consideration for any other characters that might be used in other languages.

If Turkey was the dominant country in IT, I don't see why they wouldn't do the same thing only for their own alphabet; include all the characters of their alphabet (latin alphabet plus a few more), plus some more common characters used in math etc.

frezik · on Nov 27, 2013

The OP probably needed to clarify. English having a simple alphabet gave the US a leg up on personal computing compared to the KJC countries. It's only one contributing factor, though.

Pitarou · on Nov 27, 2013

Hat tip to Guido van Rossum for passing (nearly) all the tests in Python 3.

Is the "ffl-ligature to uppercase" test really relevant? Isn't that fixed by appropriate use of string normalisation?

riquito · on Nov 27, 2013

The ﬄ ligature passes

  $ python3
  Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
  [GCC 4.8.1] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> "baﬄe".upper()
  'BAFFLE'

Strange that the article claims that no languages passes it. It seems from another post that perl passes it too

Pitarou · on Nov 27, 2013

Embarrassing. I should have checked for myself.

And it is strange. Maybe the author needs to check his locale settings?

unwind · on Nov 27, 2013

I also doubt the validity of the upper-casing, it feels like in an internationalization/localization context, converting a string to all upper case is not a valid thing to be doing.

Not all languages (or even characters) have a well-defined upper-case versions of their glyphs.

Even if they all did, I would expect the interpretation by a (human) reader to vary culturally.

judofyr · on Nov 27, 2013

The Unicode standard includes uppercase rules. If you're already representing strings using Unicode codepoints, why not follow the whole Unicode standard?

EDIT: And yes, different languages can have different uppercasing rules. There's still a standard: https://github.com/lang/unicode_utils/blob/master/data/CaseF...

unwind · on Nov 28, 2013

Thanks, I was not aware of that.

I guess "uppercase this string" goes from being a tiny loop to a big ... thing based on a lot of hardcoded knowledge, which in turn might indicate that it's not a very simple operation any more.

XorNot · on Nov 27, 2013

The usual goal is to apply a consistent transform though, to smooth out interpretation differences - i.e. when looking for command input I either lowercase or uppercase things to smooth over the fact that "yes" "YES" "Yes" are all completely valid ways of saying the same thing with those characters.

If there's only one way of expressing the thing - i.e. a single chinese character - then it would be valid to do nothing. It's just in english "y" and "Y" might change context, but as far as computer input is generally concerned they are the same thing.

ianburrell · on Nov 27, 2013

To compare strings case-insensitively, you want case-folding instead of lowercase or uppercase. Unicode defines case-folding for comparing strings. There are enough complexities with case, like characters that don't have other case or multiple mappings, that it can't be correctly used for comparison.

TorKlingberg · on Nov 27, 2013

For that use case it is better to compare case insensitively with "yes" instead of converting the input to lower case first.

asveikau · on Nov 27, 2013

If you can compare case-insensitively then you (or the library you call into) must be aware of case and you face the exact same problems. It's a pretty good thought exercise to attempt writing your own Unicode-aware case insensitive string compare. A lot of people call into libraries for this stuff without realizing how complex the problem gets.

zokier · on Nov 27, 2013

How do you do case-insensitive comparison without normalizing the case of the operands?

gnaritas · on Nov 28, 2013

Most string compare routines in the library offer a case insensitive compare option already, you don't have to normalize it.

DougBTX · on Nov 27, 2013

Here are the Unicode rules, which do consider localization: ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

It would be interesting to know how often these rules are actually used though...

russellsprouts · on Nov 27, 2013

Many designers use

    abbr{
        text-transform:lowercase;
        font-variant:small-caps;
        letter-spacing:.1em;
    }

To make acronyms like HTML and CSS look better on the page. To support i18n, HTML allows setting the language on a per-document or even per-element basis. That way the upper- or lower-casing can be done following the rules of the language.

andrewcooke · on Nov 27, 2013

[edit:] ah, ok, so on python 3 it depends how it's constructed.

[originally i had a post here saying i couldn't get the noel to work on 3.3.2]

mercurial · on Nov 27, 2013

Seems to work fine on 3.3.3 (Linux)

  Python 3.3.3 (default, Nov 23 2013, 09:49:26)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> a = "noël"
  >>> len(a)
  4
  >>> a[::-1]
  'lëon'

judofyr · on Nov 27, 2013

This is the test case: https://eval.in/73766

    print(len("noe\u0308l"))

mercurial · on Nov 27, 2013

Ah, I see, the decomposed case indeed doesn't work as well.

masklinn · on Nov 27, 2013

The test case is decomposed.

panzi · on Nov 27, 2013

While JavaScript (in browsers) has no way to normalize precomposed/decomposed strings, it has standard methods to correctly compare them: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

E.g.:

    var decomp="noël";
    var precomp="noël";
    console.log(decomp.split(""));
    console.log(precomp.split(""));
    console.log(decomp.localeCompare(precomp));

Prints:

    ["n", "o", "e", "̈", "l"]
    ["n", "o", "ë", "l"]
    0

Browser support for this varies, The Intl.Collator interface is currently only supported Chrome (maybe also in Opera? Idk).

Note: In Chrome when comparing (e.g. sorting) a lot of strings String.prototype.localeCompare is much slower than using a pre-composed Intl.Collator instance (because internally localeCompare creates a new collator for each call). Using Intl.Collator rediced startup time of my http://greattuneplayer.jit.su/ immensely. node.js currently has no support for Intl.*. It probably will be a compile time option for 0.12.

brihat · on Nov 27, 2013

This article is mostly written from a European language perspective. For Indian scripts, storing combining characters as a separate code points is the right thing to do.

For example, कि (ki) is composed of क and ि When I'm writing this in an editor, say, I typed ku (कु) instead of ki (कि) and I press backspace, I indeed want to see क rather than deleting the whole "कि".

frenchy · on Nov 27, 2013

Only some times I figure, because if you want to make the first letter green, you'd want that to apply to the whole कि.

gcr · on Nov 27, 2013

For the record, Racket gets the "baﬄe" example right:

    racket@> (string-upcase "baﬄe")
    "BAFFLE"

It also passes all of the author's other tests (except for the ones involving combining diacritics, but racket includes built-in functions for normalizing such strings so you can work with them)

revelation · on Nov 27, 2013

I seem to have rather little use for the cases the author presents here. If I'm working with strings, they are either of the debug or internal variant, where even basic ASCII would suffice, or I get them from somewhere and don't touch them at all, just pass them around.

But what I absolutely need in a language is to have a very very clear seperation between strings and byte arrays, or raw data, and ideally a way to transform between the two. C# gets this right with its byte and string types, the framework uses them correctly, and there is the wonderful Encoding namespace to interchange the two. Python 2.7 is the absolte worst, it's apparently impossible to get anything done with raw data and not run into some obscure 'ASCII codec can't handle octet 128' whatever exception (reminds you why we have strict typing: magic is fucking annoying).

duncan_bayne · on Nov 28, 2013

I'd have hoped Common Lisp would fare well here, but SBCL (1.1.11 on 64-bit Linux Mint 15) is pretty broken. My results:

string: noël, reversed: l̈eon, first 3 chars: noe, length: 5

string: 😸😾, reversed: 😾😸, first 1 char: 😸, length: 2

string: baﬄe, upcase: BAﬄE

string: noël, equals precomposed: NIL

Edited: GNU CLISP 2.49 produces identical results.

aerique · on Nov 28, 2013

I was somewhat disappointed as well. I wrote some tests here: http://paste.lisp.org/display/140280

Perhaps playing around with different internal representations as pointed out by sedachv (https://news.ycombinator.com/item?id=6811407) would work but the initial, naive string usage doesn't work.

While I expected the default usage to work correctly in Common Lisp.

implicit · on Nov 27, 2013

What we really should be doing is doing away with broken nomenclature.

What does the "length" of a string even mean? A database will tell you it has to do with storage. A nontechnical person will say it's the number of symbols. A visual designer might say that it has to do with onscreen width when rasterized in a particular way. None of these people are obviously right or wrong.

It's very useful to be able to count the number of glyphs in a string, or the number of unicode codepoints, or bytes, or pixels when rasterized in a particular way, but "length" isn't clear enough to unambiguously refer to any of them. Any meaning you try to ascribe to the "length" operation is going to be wrong to someone.

NkVczPkybiXICG · on Nov 27, 2013

All of these examples work in Haskell's canonical text library, 'text'! It's the only language I know of that works.

FreeFull · on Nov 27, 2013

The reversal of the decomposed noël doesn't produce the right result. Converting baﬄe to uppercase does do the right thing though, and the rest works as expected.

AndyKelley · on Nov 27, 2013

I don't think the solution to this problem is to make our string classes more complicated. I think it's to make our languages and character sets less complicated. I can't believe that multiple codepoints being used to generate a single glyph made it into the Unicode spec. That breaks a bunch of extremely useful abstractions. I think it is reasonable to expect human languages to be made up of distinct glypths that do not interfere with each other. Any language that does not is too complicated to be worth supporting. Let it die.

delinka · on Nov 27, 2013

Now let's take the lower case of "BAFFLE" - should we get "baffle" or should the string class/function/wtfe attempt to recognize that a ligature can replace "ffl" and return to us "baﬄe"? More generally, should the string library ever attempt to replace letter with ligatures? Should this be yet another option?

And as I type this, another issue manifests: the spelling correction can't even recognize baﬄe as a properly spelled word; it highlights the 'ba' and ignores the rest.

ygra · on Nov 27, 2013

Uppercasing and lowercasing is inherently lossy. E.g. the German ß becomes SS when uppercased, yet there is no way to know whether SS should be lowercased to ss or ß again. That's a reason why those things should be used, if at all, only as display transformations. Same goes for ligatures, but even those actually shouldn't be applied automatically, depending on the language. E.g. in German ligatures cannot span syllables and few layout engines can detect that.

zokier · on Nov 27, 2013

I feel like I should learn German only so that I would be able to comment on the ß issue every time a Unicode thread pops up. From my uninformed point of view it is not really clear if ß should really be handled as a separate character/grapheme, or just as a ligature in rendering phase and stored as 'ss'. Or even if current-day orthography should be held at such a sacrosanct position that it shouldn't be changed to save significant amount of collective effort.

TillE · on Nov 27, 2013

> or just as a ligature in rendering phase and stored as 'ss'.

Probably.

> to save significant amount of collective effort

I've seen this kind of suggestion a number of times on HN, and I find it highly amusing. When confronted with a difficult challenge in representing the world on a computer, apparently the answer is to instead change the world.

OK, but then how are you going to handle hundreds of years of legacy texts?

rolux · on Nov 28, 2013

In German, 'ß' is definitely not just a ligature of 'ss'.

Consider 'Masse' (mass) vs. 'Maße' (dimensions).

Uppercasing these words will necessarily produce ambiguity.

It would be equally tempting -- and wrong -- to treat the German characters 'ä', 'ö' and 'ü' as ligatures of 'ae', 'oe' and 'ue'. They're pronounced the same, and the latter forms commonly occur as substitutions in informal writing, but they also occur in proper names, where it would be incorrect to substitute them with the former. However, if you want to sort German strings, 'ä', 'ö' and 'ü' sort as 'ae', 'oe' and 'ue'.

ygra · on Nov 27, 2013

The point is, while it may have started out as a ligature (of either ſs or ſz, no one really knows for sure), it has long become a letter in its own right. You cannot treat it like a display-only ligature without throwing away information, e.g. the difference between Maße (measurements) and Masse (mass). People in Switzerland made a conscious decision not to use ß anymore, but that's not the case in other countries where the language is used.

chrisoverzero · on Nov 27, 2013

As "ß" vs. "ss" changes pronunciation of preceding vowels, I can't see how it could be anything other than its own letter.

* "Fuß" ("foot") roughly rhymes with "loose."

* "Fluss" ("river") roughly rhymes with… um, nothing I can think of. It has the vowel sound of "look" and "book," at least as pronounced in the American Northeast.

Since the orthographic reform of 1996, this has become a big deal.

al2o3cr · on Nov 27, 2013

If anybody hasn't seen it, Glitchr's twitter is a fantastic example of how bizarro things can get with "140 characters".

https://twitter.com/glitchr_

Note: may freak out browsers with a flaky Unicode implementation. For instance, scrolling that stream on the iOS Twitter client can get very laggy.

JulianMorrison · on Nov 27, 2013

For Go: the for-range loop iterates 5 times, reversed (manually, using the resulting runes) is l̈eon, utf8.RuneCount is 5. The blog has just recently been talking about text normalization[1] via a library, but it isn't built into the core.

[1] http://blog.golang.org/normalization

brihat · on Nov 27, 2013

The author intentionally chooses decomposed form. Indeed all of them work with Python 3. Here:

    Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
    [GCC 4.8.1] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> noel="noël"
    >>> noel[::-1]       # reverse
    'lëon'
    >>> noel[0:3]        # first three characters 
    'noë'
    >>> len(noel)        # length
    4

The point is, defining what is a character based on how it is displayed is flawed. Just precompose the string ifg you want and carry on. Like I said in my other comment, making automatic conversion of decomposed -> precomposed wrecks havoc with Indian languages.

jfim · on Nov 27, 2013

Works as expected too in Scala, although it might be because the terminal does normalization.

scala> val noel = "Noël" noel: String = Noël

scala> noel.reverse res0: String = lëoN

scala> noel.take(3) res1: String = Noë

scala> noel.length res2: Int = 4

scala> import java.text.Normalizer

val nfdNoel = Normalizer.normalize(noel, Normalizer.Form.NFD) import java.text.Normalizer

scala> nfdNoel: String = Noël

scala> nfdNoel.length res3: Int = 5

scala> nfdNoel.reverse res4: String = l̈eoN

scala> nfdNoel.take(3) res5: String = Noe

The problem with an array of characters, as he mentions, is that it doesn't work properly in many use cases. If your array of characters stores 16 bit codepoints, it breaks with the 32 bit codepoints (Java got bit hard by that, where a char used to be a character prior to the introduction of surrogate pairs in Unicode); if it stores 32 bit codepoints, then it's pretty wasteful in most cases, which is exactly why you'd want a string type that handles storage of series of characters in an optimal fashion.

klrr · on Nov 27, 2013

I hope Haskell Prime solves this. In Haskell String is literally a list of characters. This causes some overhead and leads to bad performance. Of course we got Text and for binary data you can use ByteString, but it's a bit of pain compared to having a real string type by default.

jeorgun · on Nov 28, 2013

I think the specific case of ligatures isn't a failure in strings per se, but a failure in Unicode in that it includes them in the first place. What "ﬁ".upper() (or whatever) should do is kind of ambiguous. The following doesn't really seem appropriate:

  "ﬁ".upper().lower() #=> "fi"

But obviously nor does

   "ﬁ".upper() #=> "ﬁ"

In Turkish (which distinguishes between dotted and dotless 'i'), this issue exists already:

   "ı".upper().lower() #=> "i"

This case couldn't (so far as I know) be fixed by any string library without breaking Unicode compatibility, so it seems slightly disingenuous to call it an issue with strings.

wazoox · on Nov 28, 2013

Tom Christiansen (of Perl fame) made a much, much thorough analysis of Unicode problems in his OSCON 2011 presentation: http://www.oscon.com/oscon2012/public/schedule/detail/24252

Here are the slides: http://training.perl.com/OSCON2011/gbu/gbu.pdf

The site seems down ATM, but Internet Archive has it: https://web.archive.org/web/20121224081332/http://98.245.80....

mathias · on Nov 27, 2013

The article briefly mentions JavaScript, which uses something similar to UTF-16/UCS-2: http://mathiasbynens.be/notes/javascript-encoding>

Here’s a slightly more in-depth blog post on the many issues this causes, and how to avoid them in JavaScript: http://mathiasbynens.be/notes/javascript-unicode Some of these problems are briefly mentioned in the above post, too.

ddebernardy · on Nov 28, 2013

This is misinformation. OP's strings are just wrong...

    >> "\u0308"
    => "̈"
    >> "\u00eb"
    => "ë"
    >> "noe\u0308l"
    => "noël"
    >> "no\u00ebl"
    => "noël"

His noël examples work just fine if you don't copy/paste the string he posts, and instead type them in like I just did.

If anything, languages are reporting correct reverses and length, since he's really manipulating 5 characters rather than four.

enko · on Nov 28, 2013

Congratulations, you've discovered unicode composition!

  2.0.0p247 :045 > Unicode::compose("e\u0308").unpack('U').first.to_s(16)
   => "eb" 
  2.0.0p247 :046 > Unicode::compose("\u00eb").unpack('U').first.to_s(16)
   => "eb" 
  2.0.0p247 :047 > Unicode::decompose("e\u0308").unpack('U').first.to_s(16)
   => "65" 
  2.0.0p247 :048 > Unicode::decompose("\u00eb").unpack('U').first.to_s(16)
   => "65"

I presume the ones you pasted in were changed by the browser. His examples are not wrong at all, indeed how can a string be "wrong"?

ddebernardy · on Nov 28, 2013

His example is "wrong" in the sense that you cannot reasonably complain that "noe¨l" gets reversed to "l¨eon" and put the "¨" part on top of the "l" when it does — which seems entirely correct. Or for that matter, that the string's length is 5 when there are indeed 5 characters.

As for being changed by the browser, the latter (or rather the OS) copied what there was, and the OS pasted it verbatim insofar as I can tell.

vfclists · on Nov 28, 2013

Honestly I think 'you' computer programmers love useless challenges too much. Why can't you adopt lessons from Q?

If it isn't easy to get some languages working with Unicode properly then fix the languages and leave Unicode alone. Remove all the language characteristics that makes working with Unicode difficult. If Unicode will not go to the language then the language must go to Unicode, or opt out of the computer era, or die!!

KISS!!

mbq · on Nov 27, 2013

There is a one more issue -- the easier it is to manipulate strings in some language the greater chance that they will be used as an internal data structure for things that certainly aren't texts. And this almost always causes substantial performance loss and awful bugs that are either untraceable due to a dependence on subtle configuration details or form security holes. Or both.

on Nov 27, 2013

[deleted]

alayne · on Nov 27, 2013

Try the correct test input: noe\u0308l

ademarre · on Nov 27, 2013

I think a lot of programmers don't properly understand character encoding simply because their programming languages don't give them the proper treatment. We need more APIs that force developers to acknowledge character encodings, probably in the type system.

jheriko · on Nov 27, 2013

this hits on one of my biggest problems with native android and ios development. the wcs/wchar functions are largely broken or unusable... it caused me a real headache from not knowing upfront.

the idea of the string type is just fine though (or a character array) broken implementations don't invalidate it, they just invalidate the myth of '3rd party libraries must be good because hundreds of programmers worked on them for years' - which is exactly a myth. it doesn't just apply to strings but everything. (not brokeness, just that you shouldn't expect them to work beyond what you can measure, and certainly shouldn't expect that they are flawless or even good implementations)

rverghes · on Nov 27, 2013

Out of curiosity, why only have one string type? We don't do the same for numbers. Many languages don't have "number", they have int, float, long, etc.

Instead of just String, maybe we should have ASCIIString, UTF8String, and UTF16String.

monkeyninja · on Nov 27, 2013

I don't understand the reason of using C++ char array to store unicode text....

drdaeman · on Nov 27, 2013

That's because in a truly sane languages there should be a distinction between data type and its implementation.

Then it would be not "string" type, that's broken, but an implementation of "string" type.

agravier · on Nov 27, 2013

I agree, it seems like a much saner thing to do. Now that you make me think of that, I do not know many instances of this. I just could think of https://github.com/clojure-numerics/core.matrix upon which I stumbled recently. Do you have other example of efforts to separate a type from its implementations?

lmm · on Nov 27, 2013

Most collection libraries (e.g. the Java one) work like this - you have List as an interface and can use LinkedList or ArrayList or so on. I particularly like scala's approach to factory-like methods combined with this; Seq is an interface, as is List, with implementations like LinkedList. But you can do any of LinkedList(1, 2, 3), List(1, 2, 3), or Seq(1, 2, 3) - and get back a LinkedList, a List (which will be an implementation-selected implementation, possibly LinkedList), or a Seq (which again will be an implementation-selected implementation, possibly LinkedList).

riffraff · on Nov 27, 2013

doesn't every statically typed imperative language do this, and recommend it?

drdaeman · on Nov 27, 2013

Not to my knowledge.

C++ and Java are statically typed and they, as far as I know, don't have distinction between string interface and implementation, just a standard string type. You can't make your own string implementation and make others (given that - would it exist - they use standard string interface) transparently accept them instead of language's standard string implementation.

Even Haskell (with standard Prelude) doesn't have a readily available and widely accepted typeclass for strings. As String is just an alias to [Char], if library writer used that, they won't accept, say, Data.Text (I know, it's a bit distinct thing, but...)

riffraff · on Nov 27, 2013

I was referring to "efforts to separate a type from its implementations", not String specifically, and thinking of containers & co.

Although, for example, even java has CharSequence which only gives you access to codepoints in a char sequence, you can inherit from that and create your own.

agravier · on Nov 27, 2013

You are right. It's interesting that I didn't think of it, probably because switching implementations in compiled languages is often less trivial, and I don't remember doing it. Actually, are there many alternative implementations of, say, the C++ STL?

daGrevis · on Nov 27, 2013

Logically equivalent doesn't mean equivalent for computers. While you can't define why reverse of “noël“ is “lëon“ by set of rules that computer can follow, computer just can't know.

zokier · on Nov 27, 2013

Umm. For that case you definitely can define a valid reversing algorithm. The key is using grapheme clusters as the indivisible base unit. Sure, there are probably some weird languages that will not reverse properly with such algorithm, but it would still be a significant improvement over the current situation.

sp332 · on Nov 27, 2013

That's why the article says, it's better to have a bare-bytes data structure, than a broken string type.