APPENDIX -- UNDERSTANDING UNICODE
------------------------------------------------------------------------

SECTION -- Some Background on Characters
-------------------------------------------------------------------

Before we see what Unicode is, it makes sense to step back slightly to think about just what it means to store "characters" in digital files. Anyone who uses a tool like a text editor usually just thinks of what they are doing as entering some characters--numbers, letters, punctuation, and so on. But behind the scenes a little bit more is going on. "Characters" that are stored on digital media must be stored as sequences of ones and zeros, and some encoding and decoding must happen to turn those ones and zeros into the characters we see on a screen or type in with a keyboard.

Sometime around the 1960s, a few decisions were made about just which ones and zeros (bits) would represent which characters. One important choice that most modern computer users give no thought to was the decision to use 8-bit bytes on nearly all computer platforms. In other words, bytes have 256 possible values. Within these 8-bit bytes, a consensus was reached to represent one character in each byte. So at that point, computers needed a particular -encoding- of characters into byte values; there were 256 "slots" available, but just which character would go in each slot? The most popular encoding developed was Bob Bemer's American Standard Code for Information Interchange (ASCII), which is now specified in exciting standards like ISO-14962-1997 and ANSI-X3.4-1986(R1997). But other options, like IBM's mainframe EBCDIC, linger on, even now.

ASCII itself is of somewhat limited extent. Only the lower-order 7 bits of each byte are used for ASCII-encoded characters; the top half of the byte values (the 128 values with the high bit set) are "reserved" for other uses (we will come back to this). So, for example, a byte that contains "01000001" -might- be an ASCII encoding of the letter "A", but a byte containing "11000001" cannot be an ASCII encoding of anything. Of course, a given byte may or may not -actually- represent a character; if it is part of a text file, it probably does, but if it is part of object code, a compressed archive, or other binary data, ASCII decoding is misleading. It depends on context.

The reserved upper half of the byte values has been used for a number of things in a character-encoding context. On traditional textual terminals (and printers, etc.) it has been common to allow switching between -codepages- to display a variety of national-language characters (and special characters like box-drawing borders), depending on the needs of a user. In the world of Internet communications, something very similar to the codepage system exists with the various ISO-8859-* encodings. What all these systems do is assign a set of characters to the 128 slots that ASCII reserves for other uses. These might be accented Roman characters (used in many Western European languages), or they might be non-Roman character sets like Greek, Cyrillic, Hebrew, or Arabic (or, in the future, Thai and Hindi). By using the right codepage, 8-bit bytes can be made quite suitable for encoding reasonably sized (phonetic) alphabets.

Codepages and ISO-8859-* encodings, however, have some definite limitations. For one thing, a terminal can only display one codepage at a given time, and a document with an ISO-8859-* encoding can only contain one character set.
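To see this limitation concretely, consider a quick interactive example (shown in the Python 2 style used elsewhere in this appendix). The very same high-bit byte decodes to a completely different character depending on which single ISO-8859 encoding is assumed for the whole document:

>>> unicode('\xE9', 'iso-8859-1')    # Latin-1: e-acute
u'\xe9'
>>> unicode('\xE9', 'iso-8859-7')    # Latin/Greek: Greek small iota
u'\u03b9'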
Documents that need to contain text in multiple languages simply cannot be represented by these encodings. A second issue is equally important: Many ideographic and pictographic character sets have far more than 128 or 256 characters in them (128 is all we would have under the codepage system; 256 if we used the whole byte and gave up ASCII compatibility). It is simply not possible to encode languages like Chinese, Japanese, and Korean in 8-bit bytes. Systems like ISO-2022-JP-1 and codepage 943 allow larger character sets to be represented using two or more bytes per character. But even when using these language-specific multibyte encodings, the problem of mixing languages is still present.

SECTION -- What is Unicode?
-------------------------------------------------------------------

Unicode solves the problems of previous character-encoding schemes by providing a unique code number for -every- character needed, worldwide and across languages. Over time, more characters are being added, but the allocation of available ranges for future uses has already been planned out, so room exists for new characters. In Unicode-encoded documents, no ambiguity exists about how a given character should display (for example, should the byte value '0x89' appear as e-umlaut, as in codepage 850, or as the per-mil mark, as in codepage 1004?). Furthermore, by giving each character its own code, there is no problem or ambiguity in creating multilingual documents that utilize multiple character sets at the same time--or rather, such documents actually utilize the single (very large) character set of Unicode itself.

Unicode is managed by the Unicode Consortium (see Resources), a nonprofit group with corporate, institutional, and individual members. Originally, Unicode was planned as a 16-bit specification. However, this original plan failed to leave enough room for national variations on related (but distinct) ideographs across East Asian languages (Chinese, Japanese, and Korean), nor for specialized alphabets used in mathematics and the scholarship of historical languages. As a result, the code space of Unicode is currently 32 bits (and is anticipated to remain fairly sparsely populated, given the roughly 4 billion allowed characters).

SECTION -- Encodings
-------------------------------------------------------------------

A full 32 bits of encoding space leaves plenty of room for every character we might want to represent, but it has its own problems. If we need to use 4 bytes for every character we want to encode, that makes for rather verbose files (or strings, or streams). Furthermore, these verbose files are likely to cause a variety of problems for legacy tools. As a solution to this, Unicode is itself often encoded using "Unicode Transformation Formats" (abbreviated as 'UTF-*'). The encodings 'UTF-8' and 'UTF-16' use rather clever techniques to encode characters in a variable number of bytes, with the most common situation being that just the number of bits indicated in the encoding name is used. In addition, the specific byte value ranges used in multibyte characters are designed in such a way as to be friendly to existing tools. 'UTF-32' is also an available encoding, one that simply uses all four bytes in a fixed-width encoding.

The design of 'UTF-8' is such that 'US-ASCII' characters are simply encoded as themselves. For example, the English letter "e" is encoded as the single byte '0x65' in both ASCII and in 'UTF-8'.
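A quick interactive check (again in Python 2 style) confirms that encoding an ASCII letter with the 'UTF-8' codec produces exactly the same single byte that ASCII does:

>>> u'e'.encode('ascii'), u'e'.encode('utf-8')
('e', 'e')
>>> hex(ord('e'))
'0x65'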
However, the non-English letter "e-umlaut" (an "e" with a diaeresis), which is Unicode character '0x00EB', is encoded with the two bytes '0xC3 0xAB' in 'UTF-8'. In contrast, the 'UTF-16' representation of every character is always at least 2 bytes (and sometimes 4 bytes). 'UTF-16' has the rather straightforward (little-endian) representations of the letters "e" and "e-umlaut" as '0x65 0x00' and '0xEB 0x00', respectively.

So where does the odd pair of byte values for the e-umlaut in 'UTF-8' come from? Here is the trick: No byte in a multibyte 'UTF-8' character is allowed to fall in the 7-bit range used by ASCII, to avoid confusion. So the 'UTF-8' scheme uses some bit shifting and encodes every Unicode character using up to 6 bytes. The byte values allowed in each position are arranged in such a manner as to prevent confusion of byte positions (for example, if you read a file nonsequentially).

Let's look at another example, just to see it laid out. Here is a simple text string encoded in several ways. The view presented is similar to what you would see in a hex-mode file viewer. This way, it is easy to see both a likely on-screen character representation (on a legacy, non-Unicode terminal) and a representation of the underlying hexadecimal values each byte contains:

#---- Hex view of several character string encodings ----#
------------------- Encoding = us-ascii ---------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20  | Unicode
------------------- Encoding = utf-8 ------------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20  | Unicode
------------------- Encoding = utf-16 -----------------------------
FF FE 55 00 6E 00 69 00 63 00 6F 00 64 00 65 00  | ..U.n.i.c.o.d.e.

SECTION -- Declarations
-------------------------------------------------------------------

We have seen, at least briefly, how Unicode characters are actually encoded, but how do applications know to use a particular decoding procedure when Unicode is encountered? How applications are alerted to a Unicode encoding depends upon the type of data stream in question.

Normal text files do not have any special header information attached to them to explicitly specify their type. However, some operating systems (like MacOS, OS/2, and BeOS--Windows and Linux only in a more limited sense) have mechanisms to attach extended attributes to files; increasingly, MIME header information is stored in such extended attributes. If this happens to be the case, it is possible to store MIME header information such as:

#*------------- MIME Header -----------------------------#
Content-Type: text/plain; charset=UTF-8

Nonetheless, having MIME headers attached to files is not a safe, generic assumption. Fortunately, the actual byte sequences in Unicode files provide a clue to applications. A Unicode-aware application, absent contrary indication, is supposed to assume that a given file is encoded with 'UTF-8'. A non-Unicode-aware application reading the same file will find a mixture of ASCII characters and high-bit characters (for multibyte 'UTF-8' sequences). All the ASCII-range bytes have the same values as if they were ASCII encoded. Any multibyte 'UTF-8' sequences will appear as non-ASCII bytes and should be treated as noncharacter data by the legacy application. This may result in those extended characters not being processed, but that is pretty much the best we can expect from a legacy application that, by definition, does not know how to deal with the extended characters.
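One way to see exactly what such a legacy tool is faced with--and to reproduce the hex-mode views shown earlier--is to dump a string's encoded bytes from Python. The helper below is only an illustrative sketch in the Python 2 style used elsewhere in this appendix; the 'hexview' function and its formatting choices are this appendix's own invention, not a standard library facility:

#---------- hexview.py ----------#
# Print a hex-mode view of a byte string: two-digit hex values on
# the left, a printable-ASCII rendering on the right ('.' for the rest)
def hexview(bytestr, width=16):
    for pos in range(0, len(bytestr), width):
        chunk = bytestr[pos:pos+width]
        hexcol = ' '.join(['%02X' % ord(c) for c in chunk])
        display = ''
        for c in chunk:
            if 32 <= ord(c) < 127:
                display = display + c
            else:
                display = display + '.'
        print hexcol.ljust(3*width), '|', display

hexview(u'Unicode'.encode('utf-8'))    # plain ASCII bytes pass through
hexview(u'Unicode'.encode('utf-16'))   # byte-order mark plus null bytes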
For 'UTF-16' encoded files, a special convention is followed for the first two bytes of the file. Either the sequence '0xFF 0xFE' or the sequence '0xFE 0xFF' acts as a small header for the file. The choice between the two specifies the byte order (endianness) of the platform that wrote the file; most common platforms are little-endian and will use '0xFF 0xFE'. It was decided that the collision risk of a legacy file beginning with these bytes was small, and therefore these could be used as a reliable indicator for 'UTF-16' encoding. Within a 'UTF-16' encoded text file, plain ASCII characters appear in every other byte, interspersed with '0x00' (null) bytes. Of course, extended characters will produce non-null bytes and, in some cases, double-word (4-byte) representations. But a legacy tool that ignores embedded nulls will wind up doing the right thing with 'UTF-16' encoded files, even without knowing about Unicode.

Many communications protocols--and more recent document specifications--allow for explicit encoding specification. For example, an HTTP daemon application (a Web server) can return a header such as the following to provide explicit instructions to a client:

#*------------- HTTP Header -----------------------------#
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8

Similarly, an NNTP or SMTP/POP3 message can carry a 'Content-Type:' header field that makes explicit the encoding to follow (most likely as 'text/plain' rather than 'text/html', however; or at least we can hope).

HTML and XML documents can contain tags and declarations to make Unicode encoding explicit. An HTML document can provide a hint in a 'META' tag, like:

#*------------- Content-Type META tag -------------------#
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

However, a 'META' tag should properly take lower precedence than an HTTP header in a situation where both are part of the communication (for a local HTML file, of course, no HTTP header exists). In XML, the document declaration itself should indicate the encoding, as in:

#*------------- XML Encoding Declaration ----------------#
<?xml version="1.0" encoding="UTF-8"?>

Other formats and protocols may provide explicit encoding specification by similar means.

SECTION -- Finding Codepoints
-------------------------------------------------------------------

Each Unicode character is identified by a unique codepoint. You can find information on particular character codepoints on official Unicode Web sites.
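If a character's standard name or codepoint is already known, you can also query it directly in a Python 2 interactive session; the '\N{...}' escape in unicode literals relies on the same character database as the [unicodedata] module used later in this section:

>>> hex(ord(u'\N{GREEK SMALL LETTER ALPHA}'))
'0x3b1'
>>> unichr(0x3b1)
u'\u03b1'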

A quick way to survey the visual forms of many characters at once, however, is to generate an HTML page with charts of Unicode characters. The script below does this:

#---------- mk_unicode_chart.py ----------#
# Create an HTML chart of Unicode characters by codepoint
import sys
head = '<html><head><title>Unicode Code Points</title>\n' +\
       '<META HTTP-EQUIV="Content-Type"' +\
       ' CONTENT="text/html; charset=UTF-8">\n' +\
       '</head><body>\n<h1>Unicode Code Points</h1>'
foot = '</body></html>'
fp = sys.stdout
fp.write(head)
num_blocks = 32     # Up to 256 in theory, but IE5.5 is flaky
for block in range(0, 256*num_blocks, 256):
    # A heading for each block of 256 codepoints
    fp.write('\n\n<h2>Range %5d-%5d</h2>' % (block, block+256))
    # Column indices 0-15 across the top of a preformatted grid
    fp.write('\n<pre>     ')
    for col in range(16): fp.write(str(col).ljust(3))
    fp.write('\n')
    # One row of 16 UTF-8 encoded characters, labeled by its offset
    for offset in range(0, 256, 16):
        fp.write('\n+'+str(offset).rjust(3)+' ')
        line = '  '.join([unichr(n+block+offset) for n in range(16)])
        fp.write(line.encode('UTF-8'))
    fp.write('</pre>')
fp.write(foot)
fp.close()
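Since the script writes its HTML to standard output, the output can be redirected to a file and opened in a Web browser--for example, with a command along the lines of 'python mk_unicode_chart.py > codepoint_chart.html' (the output filename here is arbitrary).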
Exactly what you see when looking at the generated HTML page depends on just what Web browser and OS platform the page is viewed on--as well as on installed fonts and other factors. Generally, any character that cannot be rendered by the current browser will appear as some sort of square, dot, or question mark. Anything that -is- rendered is generally accurate. Once a character is visually identified, further information can be generated with the [unicodedata] module:

>>> import unicodedata
>>> unicodedata.name(unichr(1488))
'HEBREW LETTER ALEF'
>>> unicodedata.category(unichr(1488))
'Lo'
>>> unicodedata.bidirectional(unichr(1488))
'R'

A variant here would be to include the information provided by [unicodedata] within a generated HTML chart, although such a listing would be far more verbose than the chart above.

SECTION -- Resources
-------------------------------------------------------------------

More-or-less definitive information on all matters Unicode can be found at the Unicode Consortium's site:

    http://www.unicode.org/

Unicode Technical Report #17--Character Encoding Model:

    http://www.unicode.org/reports/tr17/

A brief history of ASCII is also available online (see, for example, Bob Bemer's own account).