Dive into Python 3: Strings (diveintopython3.org)
91 points by arthurk on May 22, 2009 | 17 comments



This is actually one of the best intros to Unicode (and string encoding in general) I've seen. If the rest of the book ends up being of this quality, I'll be pretty pumped.


This especially: "In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. "Is this string UTF-8?" is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions."

Couldn't have said it better myself.
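If it helps, here is roughly what that round trip looks like in Python 3 (the string and encodings are just for illustration):

    s = 'café'                      # a str: abstract characters, no encoding attached

    # Turn the string into bytes in a particular character encoding.
    as_utf8 = s.encode('utf-8')     # b'caf\xc3\xa9' -- 5 bytes
    as_cp1252 = s.encode('cp1252')  # b'caf\xe9'     -- 4 bytes

    # Turn bytes back into a string; you have to know (or guess) the encoding.
    print(as_utf8.decode('utf-8'))      # café
    print(as_cp1252.decode('cp1252'))   # café

    # Decoding with the wrong encoding gives garbage or an error, not "the string".
    print(as_utf8.decode('cp1252'))     # cafÃ© -- mojibake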


Agreed. If only this had been around 7 years ago, when I was first introduced to Unicode and character encoding while trying to internationalize a C++ app. It took me forever to wrap my head around this idea, as I was generally never forced to deal with the actual conversion and strings were always just strings. Even 7 years later, this document completely cleared up the way I was thinking about it.


"In Python 3, all strings are sequences of Unicode characters"

What does that mean exactly? Everything seemed really well explained until then. The line above lost me.

EDIT: I think I'm confused about the difference between UTF-32 and Unicode. Is there one?


Yes. Unicode is a mapping between code points and characters. A character is a symbol used in a human language or other written communication. A code point is a number, conventionally written as "U+" followed by that number in hexadecimal (e.g. U+0041 for 'A').

There are many different ways to actually store these sequences of code points in a computer. UTF-32 is one of those ways. It takes the code point number, converts it to a 32-bit binary integer, and then splits that number into four 8-bit bytes. As the book says, there are problems with space usage -- in ordinary English text, all the code points will be U+007F (decimal 127) or less, which leads to a lot of zero bytes taking up space. In addition to the waste of space, zero bytes can cause problems in C, since they're the marker for the end of a string. So people invented other 'encodings' to convert Unicode code points into bytes: UTF-16, UTF-8, etc.
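To make that concrete, a rough Python 3 illustration (the characters are arbitrary examples):

    ch = 'A'
    print(ord(ch))                  # 65, i.e. code point U+0041
    print(chr(0x4E2D))              # the character at code point U+4E2D

    # Different encodings turn the same code points into different byte sequences.
    s = 'A' + chr(0x4E2D)
    print(s.encode('utf-32-be'))    # b'\x00\x00\x00A\x00\x00N-' -- 4 bytes per character
    print(s.encode('utf-16-be'))    # b'\x00AN-'                 -- 2 bytes each here
    print(s.encode('utf-8'))        # b'A\xe4\xb8\xad'           -- 1 byte, then 3 bytes

    # Note all the zero bytes UTF-32 spends on an ASCII-range character like 'A'.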


I love how you can hover your mouse over certain lines of code and it highlights a paragraph below that explains something about it, and vice versa.


Thanks for noticing!

Fun fact: I originally coded that "by hand," i.e. manipulating the DOM in pure JavaScript. Then I decided to rewrite it in jQuery, which I had heard about but never used. Then I realized that I will never voluntarily write JavaScript without jQuery, ever again.


Nice work, this is probably the nicest looking technical material I've ever seen (but then I'm a sucker for nice clean typography).


I also love how he uses the "Etc" ligature. This feels exceptionally well-done.


I am, in fact, particularly proud of that. 3.5% of my (compressed) CSS is devoted to using the best available ampersand. http://simplebits.com/notebook/2008/08/14/ampersands.html


   "On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters."
This is not exactly true. Chinese has an iconic lexicon where each glyph is a single word. Cantonese and Mandarin speakers both use the same lexicon, but have different pronunciations. There are several lexicons (pinyin, big5, ancient) and thousands of glyphs in each.

Japanese has three lexicons: hiragana, katakana, and kanji. Kanji is the oldest and was adapted from Chinese; its glyphs are iconic. Hiragana and katakana were developed in Japan and are phonetic. Together they form most of what you see as Japanese text today. I think katakana is used more for foreign, non-Chinese words. There is also romaji, which is essentially English letters. Anyway, apart from kanji, which is Chinese, there are fewer than 100 hiragana and katakana glyphs.

Korean is even simpler. They too borrowed from the Chinese and occasionally still use some Chinese glyphs, but the official lexicon is Hangul. Hangul is phonetic and has 24(?) basic glyphs. Some glyphs can be combined into compound glyphs called double consonants and double vowels making about 40 glyphs total. A Korean word can be written by breaking it down into syllables, combining glyphs to form a syllable super-glyph, then putting those together to form a word.
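As an aside, Unicode actually models that composition: the individual jamo and the composed syllable blocks both have code points, and normalization can build one from the other. A small Python 3 illustration (the syllable is just an example):

    import unicodedata

    jamo = '\u1112\u1161\u11AB'      # three individual jamo: consonant, vowel, final consonant
    print([unicodedata.name(c) for c in jamo])
    # ['HANGUL CHOSEONG HIEUH', 'HANGUL JUNGSEONG A', 'HANGUL JONGSEONG NIEUN']

    # NFC normalization composes them into one precomposed syllable block.
    syllable = unicodedata.normalize('NFC', jamo)
    print(hex(ord(syllable)))        # 0xd55c -- a single code point for the whole block
    print(len(jamo), len(syllable))  # 3 1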

The explanation of Unicode and Python 3 is great. I just wanted to clear up the misconception stated in the first paragraph. Nothing more to see here...


I am having trouble seeing what you're claiming. Are you saying that there are not thousands of characters in Japanese and Korean because they borrowed from the Chinese? If so, I'd say your correction is misplaced. These characters must still be taken into account for a charset to be used for these languages.

In any case, I found many of your comments to be irrelevant to the question of how many characters must be used in a language. For instance, the difference between Cantonese and Mandarin pronunciation doesn't have anything to do with this issue. Nor does Chinese origin.

edit: I just spent a little time researching Korean (which I know much less of than Chinese or Japanese), and now I understand more what you were saying about it. However, it seems to me that each "super-glyph" as you call them counts as a character, as far as any charset is concerned. The fact that they can be broken up into constituent glyphs is irrelevant.


Really? I thought the idea of Han Unification was that the duplicated characters between the CJK languages all map to the same unicode codepoints.


Yes that's true, but even so, you can't fit all those characters into 8 bits, which is basically the point of that first section. Also, I don't see how mapping to the same codepoint would mean that Chinese has thousands of characters, while Japanese does not. They just share many of those thousands in common.
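To put rough numbers on it in Python 3 (character picked arbitrarily):

    ch = chr(0x6F22)             # a common CJK character
    print(ord(ch))               # 28450 -- far too big for a single byte (max 255)

    # A single-byte encoding simply can't represent it...
    try:
        ch.encode('latin-1')
    except UnicodeEncodeError as e:
        print(e)                 # 'latin-1' codec can't encode character '\u6f22' ...

    # ...so multi-byte encodings spend several bytes on it instead.
    print(ch.encode('utf-8'))    # b'\xe6\xbc\xa2' -- 3 bytes for 1 character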


Also, now that I am focusing on your words and not trying to figure out what you mean: a lexicon is a collection of words, not a collection of characters. The concepts overlap somewhat in the case of Chinese, but katakana, for instance, is certainly not a lexicon.

(apologies for replying again, but the time for editing has passed)


Here's another great explanation of Unicode from PyCon '08: http://farmdev.com/talks/unicode/

It is really helpful for understanding Unicode handling in Python pre-3.0.


Mark... thank you for doing this and making it available on the web. I will buy a hardcopy of it when it's out.



