The point is that these are trade-offs. The reason for the stupid API is because...

jerven · on Dec 5, 2014

Even if you use 3 bytes per char you are not guaranteed O(1) indexing of what users think of as characters, as you will still have combining character sequences :( So you need to normalize these first... or deal with them in searching :(
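A rough sketch of the normalization point, using only the standard `java.text.Normalizer`. The class and method names (`CombiningDemo`, `codePoints`, `nfc`) are made up for illustration. It shows that one visible character can span several code points however wide the storage unit is, and that NFC normalization collapses this particular sequence.

```java
import java.text.Normalizer;

final class CombiningDemo {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Canonical composition (NFC) via the standard library.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
        String decomposed = "e\u0301";
        System.out.println(codePoints(decomposed));      // 2
        System.out.println(codePoints(nfc(decomposed))); // 1 (U+00E9)
    }
}
```

Normalization only helps where a precomposed form exists, so searching code may still have to handle multi-code-point sequences.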

Strings in more modern languages are standardised on one form that is used across all libraries. C and C++ are the exceptions to this.

For example, I hardly ever see CharSequence implementations in Java code. It is the most underused interface I know of, because for most people String is good enough. In idiomatic Java code the performance issue discussed here does not occur. Because Strings are final and passed by reference (or more correctly, the reference is copied by value), the same string does not get reallocated all over the place.

The C++ problem is that std::string does not suffice for the common case, so we end up with QStrings that are difficult to convert to Boost strings.

In Java land when String does not meet the perf needs we implement a custom CharSequence but these are easy to convert to Strings if needed (easier than QString to Boost string).
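As a minimal sketch of what such a custom CharSequence might look like: the `CharSlice` class below is hypothetical, not taken from any library. It is a view over part of an existing `char[]` that avoids copying until someone actually asks for a String.

```java
final class CharSlice implements CharSequence {
    private final char[] data;   // shared backing array, not copied
    private final int offset;
    private final int length;

    CharSlice(char[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException();
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException();
        }
        return data[offset + index];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        // Another view over the same array; still no copy.
        return new CharSlice(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // The only place a copy is made.
        return new String(data, offset, length);
    }

    public static void main(String[] args) {
        char[] buf = "hello world".toCharArray();
        CharSequence word = new CharSlice(buf, 6, 5);
        System.out.println(word);                             // world
        System.out.println(new StringBuilder().append(word)); // world
    }
}
```

Anything that accepts a CharSequence (StringBuilder.append, regex Matcher, and so on) can take this directly, and `toString()` is the conversion to a plain String when one is needed.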

So we end up with most Java programmers knowing just StringBuilder, String and char[]. ByteBuffer is really about bytes and is easy to convert to a CharBuffer.

The Java language has some downsides from standardising on UTF-16 for its common String type (moving on from UCS-2). But even if it had chosen extended ASCII like Oberon, it would at least be consistent everywhere.

I believe that javac replaces + with StringBuilder calls. And in loops the multiple StringBuilder allocs are removed and replaced by appending to one StringBuilder.
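For comparison, here are the two styles side by side. This is only a sketch with made-up names (`ConcatDemo`, `joinWithPlus`, `joinWithBuilder`). Whether the `+=` loop gets rewritten into a single builder depends on the compiler and JIT in use, so writing the StringBuilder explicitly is the version that does not rely on any such optimization.

```java
final class ConcatDemo {
    // Each += may build a new String, copying everything accumulated so far.
    static String joinWithPlus(String[] parts) {
        String result = "";
        for (String p : parts) {
            result += p;
        }
        return result;
    }

    // One builder for the whole loop, one final copy in toString().
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithPlus(parts));    // abc
        System.out.println(joinWithBuilder(parts)); // abc
    }
}
```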

Java Strings are far from perfect in Unicode terms, but a whole lot better than std::string and its 16- and 32-bit friends.

MrBuddyCasino · on Dec 5, 2014

You made some good points, but I'm not sure all of these are necessary trade-offs.

Let's say that the issue of code points not fitting into 16-bit chars requires a separate indexOf() API, to preserve O(1). They could have solved this one with better method naming, or at least improved documentation, so people are aware of the fact. The String#charAt() Javadoc is not really useful unless you fully understand the implications of: "If the char value specified by the index is a surrogate, the surrogate value is returned." Also, if you are handling CJK strings, is there actually a need to split them by charAt()? The runtime could just tag such strings, using the fastest indexing method for everything else and falling back to the slow one for those.
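To make the charAt() caveat concrete, here is a small sketch using only standard String methods. The class name `SurrogateDemo` and the helper `nthCodePoint` are made up for illustration. It shows charAt() handing back half of a surrogate pair, and the slower code-point-aware way of indexing.

```java
final class SurrogateDemo {
    // Returns the n-th code point (0-based). offsetByCodePoints has to walk
    // the string, so this is O(n) rather than O(1).
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                          // 4 chars
        System.out.println(s.codePointCount(0, s.length()));     // 3 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600
        System.out.println((char) nthCodePoint(s, 2));           // b
    }
}
```

charAt(1) is O(1) but returns only the high surrogate, while nthCodePoint gives the full character at the cost of a scan.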

Concatenation: true, and it is a little embarrassing that the JIT can't optimize most loops by using StringBuilder. Abstractions are always leaky, but I think this one could be improved, so that more cases would be made faster by the runtime.

Of course language designers aren't usually stupid, but it's not like all languages were created in some cozy Languages Workshop, where they could take their time to ripen, watched over by benevolent gardeners. See PHP, see JavaScript, and of course Java, too. We don't have to accept accidental complexity as a given; there are useful abstractions, and we should use them. Rust and C# are better than previous attempts, imho, because they got proper funding and are designed by people who know what they're doing. And the whole field of software is better for it!

If you are Google, you are in the unique position of having top talent and a huge number of servers. In that case, the trade-off to use C++ is probably the right one. People can handle its complexity, and it pays off in increased performance, because the code runs on a million servers. But this simply isn't true for 99% of businesses, and using C or C++ is not the optimal choice for them.