
The point is that these are trade-offs. The reason for the stupid API is that, to get precise multilingual handling, you give up either O(1) string indexing or representing characters in fewer than 3 bytes/char. If you naively use offsetByCodePoints on megabyte-long strings, you may find your performance slows to a crawl.
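To make the indexing trade-off concrete, here is a minimal sketch. The class name CodePointIndexing and the helper codePointAtIndex are made up for illustration; String.offsetByCodePoints and String.codePointAt are the standard JDK methods. The sample string holds "a", U+1F600 (which needs a surrogate pair in UTF-16), and "b".

```java
final class CodePointIndexing {
    // Hypothetical helper: returns the code point at a *code point* index
    // (not a char index). offsetByCodePoints has to walk the string from
    // the given start, so each call is O(n) rather than O(1).
    static int codePointAtIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                       // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // code points
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1)));
    }
}
```

Calling a helper like this inside a loop over every index is what turns a linear scan into a quadratic one on large strings.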

The reason for the memory leak is that you can either have fast substring() or accidental memory leaks when substrings are stored in persistent data structures. Not both. If they removed the optimization, then suddenly str.substring(1) takes O(n) time and, on a megabyte string, allocates a copy of nearly the entire megabyte. In some other use cases, the full copy is far worse.
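A rough sketch of the usual workaround for the leak side of this trade-off, assuming a JVM whose substring() shares the parent's backing array (as far as I know, OpenJDK builds from 7u6 onward copy instead, so there the wrapper is unnecessary). The class SubstringRetention and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstringRetention {
    // Hypothetical long-lived store of small tokens cut out of large inputs.
    private final List<String> tokens = new ArrayList<>();

    // Keeps the first `len` chars of a large string.
    // Where substring() shares the parent's array, new String(...) forces a
    // right-sized copy, so the large parent can be garbage collected.
    // Where substring() already copies, the wrapper is harmless but redundant.
    void remember(String large, int len) {
        tokens.add(new String(large.substring(0, len)));
    }

    int count() {
        return tokens.size();
    }

    String get(int i) {
        return tokens.get(i);
    }
}
```

Either way, the caller has to know which behaviour the runtime has, which is the cost of the trade-off being invisible in the API.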

There are others, too. Idiomatic Java uses '+' for string concatenation, but this is O(n^2) time when done repeatedly. (Except that HotSpot optimizes out the repeated allocations when all concatenations appear in the source code, but it still can't do anything about loops.) To get around this performance pitfall, there is StringBuilder. Now Java programmers need to know 2 APIs. Well, more like 6, considering there's also CharSequence, ByteBuffer, StringBuffer, and char[].
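For illustration, the two styles side by side. This is a sketch with made-up names (ConcatInLoop, joinWithPlus, joinWithBuilder); it shows the shape of the problem and is not a benchmark.

```java
final class ConcatInLoop {
    // Repeated '+': every iteration builds a new String and copies everything
    // accumulated so far, so the total work is O(n^2) character copies.
    static String joinWithPlus(String[] parts) {
        String out = "";
        for (String p : parts) {
            out = out + p;
        }
        return out;
    }

    // A single StringBuilder: appends are amortized O(1) per character,
    // so the total work is O(n).
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```

Both methods return the same string; only the amount of copying differs.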

C++ has always had the philosophy of "The programmer knows their requirements better than we do". This is why it is complex: problems are complex, and it is designed to solve many problems. It's very rare for the designer of a popular language to actually be stupid; it's very common for a language to get used outside of the domains that its designer optimized for.




Even if you use 3 bytes per char you are not guaranteed O(1) string indexing as you will still have surrogate pairs :( So you need to normalize these first... or deal with them in searching :(

Strings in more modern languages are standardised in one form that is widely used in all libraries. C and C++ are the exceptions to this.

For example, I hardly ever see CharSequence implementations in Java code. It is the most underused interface I know of, because for most people String is good enough. In idiomatic Java code the performance issue discussed here does not occur. Because Strings are final and passed by reference (or, more correctly, the reference is copied by value), the same string does not get reallocated all over the place.

The C++ problem is that std::string does not suffice for the common case, and we end up with QStrings that are difficult to convert to Boost strings.

In Java land when String does not meet the perf needs we implement a custom CharSequence but these are easy to convert to Strings if needed (easier than QString to Boost string).
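As a sketch of what such a custom CharSequence might look like: the CharSlice class below is hypothetical, a read-only view over part of a char[] that avoids copying until a String is actually needed.

```java
final class CharSlice implements CharSequence {
    private final char[] data;  // shared backing array, not copied
    private final int start;    // inclusive
    private final int end;      // exclusive

    CharSlice(char[] data, int start, int end) {
        if (start < 0 || end > data.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException();
        }
        return data[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException();
        }
        return new CharSlice(data, start + from, start + to); // O(1), no copy
    }

    // The conversion back to String is a single copy of just this slice.
    @Override
    public String toString() {
        return new String(data, start, end - start);
    }
}
```

Anything that accepts a CharSequence, such as StringBuilder.append, can take a CharSlice directly, and toString() is the only conversion step.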

So we end up with most Java programmers knowing just StringBuilder, String and char[]. ByteBuffer is really about bytes and is easy to convert to a CharBuffer.

The Java language has some downsides in standardising on UTF-16 for its common String type (moving on from UCS-2). But even if it had chosen extended ASCII like Oberon, it would at least be consistent everywhere.

I believe that javac replaces + with StringBuilder calls. And in loops the multiple StringBuilder allocs are removed and replaced by appending to one StringBuilder.

Java Strings are far from perfect in Unicode terms, but they are a whole lot better than std::string and its 16- and 32-bit friends.


You made some good points, but I'm not sure all of these are necessary trade-offs.

Let's say that the issue of code points not fitting into 16-bit chars requires a separate indexOf() API, to preserve O(1). They could have solved this one with better method naming, or at least improved documentation, so people are aware of the fact. The String#charAt() Javadoc is not really useful unless you fully understand the implications of: "If the char value specified by the index is a surrogate, the surrogate value is returned." Also, if you are handling CJK strings, is there actually a need to split them by charAt()? The runtime could just tag such strings and use the fastest indexing method, falling back to the slow one for those.
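A rough sketch of the tagging idea, done in user code rather than in the runtime. The TaggedString wrapper and its methods are hypothetical; Character.isSurrogate, String.codePointAt and String.offsetByCodePoints are standard JDK calls. The tag is computed once at construction, and indexing only takes the slow path for strings that actually contain surrogates.

```java
final class TaggedString {
    private final String value;
    private final boolean hasSurrogates; // the "tag", computed once in O(n)

    TaggedString(String value) {
        this.value = value;
        boolean found = false;
        for (int i = 0; i < value.length(); i++) {
            if (Character.isSurrogate(value.charAt(i))) {
                found = true;
                break;
            }
        }
        this.hasSurrogates = found;
    }

    boolean hasSurrogates() {
        return hasSurrogates;
    }

    // Returns the code point at a code point index.
    int codePointAtIndex(int codePointIndex) {
        if (!hasSurrogates) {
            // Fast path: every code point is exactly one char, so O(1).
            return value.charAt(codePointIndex);
        }
        // Slow path: walk the string to find the char offset, so O(n).
        return value.codePointAt(value.offsetByCodePoints(0, codePointIndex));
    }
}
```

Most CJK text sits in the Basic Multilingual Plane and would stay on the fast path; only strings with supplementary characters pay for the walk.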

Concatenation: true, and it is a little embarrassing that the JIT can't optimize most loops by using StringBuilder. Abstractions are always leaky, but I think this one could be improved so that the runtime makes more cases fast.

Of course language designers aren't usually stupid, but it's not like their languages were all created in some cozy Languages Workshop, where they could take their time to ripen, watched over by benevolent gardeners. See PHP, see JavaScript, and of course Java, too. We don't have to accept accidental complexity as a given; there are useful abstractions, and we should use them. Rust and C# are better than previous attempts, imho, because they got proper funding and are designed by people who know what they're doing. And the whole field of software is better for it!

If you are Google, you are in the unique position of having top talent and huge numbers of servers. In that case, the trade-off to use C++ is probably the right one. People can handle its complexity, and it pays off due to increased performance, because the code runs on a million servers. But this simply isn't true for 99% of businesses, and using C or C++ is not the optimal choice for them.



