Dive into Python 3: Strings (diveintopython3.org)
91 points by arthurk on May 22, 2009 | 17 comments



This is actually one of the best intros to Unicode (and string encoding in general) I've seen. If the rest of the book ends up being of this quality, I'll be pretty pumped.


This especially: "In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. "Is this string UTF-8?" is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions."

Couldn't have said it better myself.
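If it helps, here is roughly what that round trip looks like in Python 3 (the string and encodings are just for illustration):

    s = 'café'                      # a str: abstract characters, no encoding attached

    # Turn the string into bytes in a particular character encoding.
    as_utf8 = s.encode('utf-8')     # b'caf\xc3\xa9' -- 5 bytes
    as_cp1252 = s.encode('cp1252')  # b'caf\xe9'     -- 4 bytes

    # Turn bytes back into a string; you have to know (or guess) the encoding.
    print(as_utf8.decode('utf-8'))      # café
    print(as_cp1252.decode('cp1252'))   # café

    # Decoding with the wrong encoding gives garbage or an error, not "the string".
    print(as_utf8.decode('cp1252'))     # cafÃ© -- mojibake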


Agreed. If only this had been around 7 years ago, when I was first introduced to Unicode and character encoding while trying to internationalize a C++ app. It took me forever to wrap my head around this idea, as I was generally never forced to deal with the actual conversion and strings were always just strings. Even 7 years later, this document completely cleared up the way I was thinking about it.


"In Python 3, all strings are sequences of Unicode characters"

What does that mean exactly? Everything seemed really well explained until then. The line above lost me.

EDIT: I think I'm confused about the difference between UTF-32 and Unicode. Is there one?


Yes. Unicode is a mapping between code points and characters. A character is a symbol used in a human language or other written communication. A code point is a number, conventionally written as "U+" followed by that number in hexadecimal (e.g. U+0041 for 'A').

There are many different ways to actually store these sequences of code points in a computer. UTF-32 is one of those ways. It takes the code point number, converts it to a 32-bit binary integer, and then splits that number into four 8-bit bytes. As the book says, there are problems with space usage -- in ordinary English text, all the code points will be U+007F (decimal 127) or less, which leads to a lot of zero bytes taking up space. In addition to the waste of space, zero bytes can cause problems in C, since they're the marker for the end of a string. So people invented other 'encodings' to convert Unicode code points into bytes: UTF-16, UTF-8, etc.
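To make that concrete, a rough Python 3 illustration (the characters are arbitrary examples):

    ch = 'A'
    print(ord(ch))                  # 65, i.e. code point U+0041
    print(chr(0x4E2D))              # the character at code point U+4E2D

    # Different encodings turn the same code points into different byte sequences.
    s = 'A' + chr(0x4E2D)
    print(s.encode('utf-32-be'))    # b'\x00\x00\x00A\x00\x00N-' -- 4 bytes per character
    print(s.encode('utf-16-be'))    # b'\x00AN-'                 -- 2 bytes each here
    print(s.encode('utf-8'))        # b'A\xe4\xb8\xad'           -- 1 byte, then 3 bytes

    # Note all the zero bytes UTF-32 spends on an ASCII-range character like 'A'.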


I love how you can hover your mouse over certain lines of code and it highlights a paragraph below that explains something about it, and vice versa.


Thanks for noticing!

Fun fact: I originally coded that "by hand," i.e. manipulating the DOM in pure JavaScript. Then I decided to rewrite it in jQuery, which I had heard about but never used. Then I realized that I will never voluntarily write JavaScript without jQuery, ever again.


Nice work, this is probably the nicest looking technical material I've ever seen (but then I'm a sucker for nice clean typography).


I also love how he uses the "Etc" ligature. This feels exceptionally well-done.


I am, in fact, particularly proud of that. 3.5% of my (compressed) CSS is devoted to using the best available ampersand. http://simplebits.com/notebook/2008/08/14/ampersands.html


   "On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters."
This is not exactly true. Chinese has an iconic lexicon where each glyph is a single word. Cantonese and Mandarin speakers both use the same lexicon, but have different pronunciations. There are several lexicons (pinyin, big5, ancient) and thousands of glyphs in each.

Japanese has three lexicons: hiragana, katakana, and kanji. Kanji is the oldest and was adapted from Chinese; its glyphs are iconic. Hiragana and katakana were developed in Japan and are phonetic. Together they form most of what you see as Japanese text today. I think katakana is used more for foreign, non-Chinese words. There is also romaji, which is essentially English letters. Anyway, apart from kanji, which is Chinese, there are fewer than 100 hiragana and katakana glyphs.

Korean is even simpler. They too borrowed from the Chinese and occasionally still use some Chinese glyphs, but the official lexicon is Hangul. Hangul is phonetic and has 24(?) basic glyphs. Some glyphs can be combined into compound glyphs called double consonants and double vowels making about 40 glyphs total. A Korean word can be written by breaking it down into syllables, combining glyphs to form a syllable super-glyph, then putting those together to form a word.
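As an aside, Unicode actually models that composition: the individual jamo and the composed syllable blocks both have code points, and normalization can build one from the other. A small Python 3 illustration (the syllable is just an example):

    import unicodedata

    jamo = '\u1112\u1161\u11AB'      # three individual jamo: consonant, vowel, final consonant
    print([unicodedata.name(c) for c in jamo])
    # ['HANGUL CHOSEONG HIEUH', 'HANGUL JUNGSEONG A', 'HANGUL JONGSEONG NIEUN']

    # NFC normalization composes them into one precomposed syllable block.
    syllable = unicodedata.normalize('NFC', jamo)
    print(hex(ord(syllable)))        # 0xd55c -- a single code point for the whole block
    print(len(jamo), len(syllable))  # 3 1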

The explanation of Unicode and Python 3 is great. I just wanted to clear up the misconception stated in the first paragraph. Nothing more to see here...


I am having trouble seeing what you're claiming. Are you saying that there are not thousands of characters in Japanese and Korean because they borrowed from the Chinese? If so, I'd say your correction is misplaced. These characters must still be taken into account for a charset to be used for these languages.

In any case, I found many of your comments to be irrelevant to the question of how many characters must be used in a language. For instance, the difference between Cantonese and Mandarin pronunciation doesn't have anything to do with this issue. Nor does Chinese origin.

edit: I just spent a little time researching Korean (which I know much less of than Chinese or Japanese), and now I understand more what you were saying about it. However, it seems to me that each "super-glyph" as you call them counts as a character, as far as any charset is concerned. The fact that they can be broken up into constituent glyphs is irrelevant.


Really? I thought the idea of Han Unification was that the duplicated characters between the CJK languages all map to the same unicode codepoints.


Yes that's true, but even so, you can't fit all those characters into 8 bits, which is basically the point of that first section. Also, I don't see how mapping to the same codepoint would mean that Chinese has thousands of characters, while Japanese does not. They just share many of those thousands in common.
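To put rough numbers on it in Python 3 (character picked arbitrarily):

    ch = chr(0x6F22)             # a common CJK character
    print(ord(ch))               # 28450 -- far too big for a single byte (max 255)

    # A single-byte encoding simply can't represent it...
    try:
        ch.encode('latin-1')
    except UnicodeEncodeError as e:
        print(e)                 # 'latin-1' codec can't encode character '\u6f22' ...

    # ...so multi-byte encodings spend several bytes on it instead.
    print(ch.encode('utf-8'))    # b'\xe6\xbc\xa2' -- 3 bytes for 1 character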


Also, now that I am focusing on your words and not trying to figure out what you mean: a lexicon is a collection of words, not a collection of characters. The concepts overlap somewhat in the case of Chinese, but katakana, for instance, is certainly not a lexicon.

(apologies for replying again, but the time for editing has passed)


Here's another great explanation of Unicode from PyCon '08: http://farmdev.com/talks/unicode/

It is really helpful for understanding Unicode handling in Python pre-3.0.


Mark... thank you for doing this and making it available on the web. I will buy a hardcopy of it when it's out.



