Technical Tips
Understanding Unicode
David Mertz, Ph.D.
February, 2001
Unicode provides a unified means of representing character sets across languages, and makes multi-language and multi-national computing much more transparent. This article provides some background on the development of Unicode, and some beginning hints on working with Unicode.
Some Background On Characters
Before we see what Unicode is, it makes sense to step back slightly to think about just what it means to store "characters" in digital files. Anyone who uses a tool like a text editor usually just thinks of what they are doing as entering some characters--numbers, letters, punctuation, etc. But behind the scenes a little bit more is going on. "Characters" that are stored on digital media must be stored as sequences of ones and zeros, and some encoding and decoding must happen to make these ones and zeros into characters we see on a screen or type in with a keyboard.
Sometime around the 1960s, a few decisions were made about just which ones and zeros (bits) would represent characters. One important choice that most modern computer users give no thought to was the decision to use 8-bit bytes on nearly all computer platforms. In other words, bytes have 256 possible values. Within these 8-bit bytes, a consensus was reached to represent one character in each byte. So at that point, computers needed a particular encoding of characters into byte values; there were 256 "slots" available, but just which character would go in each slot? The most popular encoding developed was Bob Bemer's American Standard Code for Information Interchange (ASCII), which is now specified in exciting standards like ISO-14962-1997 and ANSI-X3.4-1986(R1997). But other options, like IBM's mainframe EBCDIC, linger on even now.
ASCII itself is of somewhat limited extent. Only the lower-order 7 bits of each byte are used for ASCII-encoded characters; the upper 128 byte values--those with the high-order bit set--are "reserved" for other uses (more on this below). So, for example, a byte that contains "01000001" might be an ASCII encoding of the letter "A", but a byte containing "11000001" cannot be an ASCII encoding of anything. Of course, a given byte may or may not actually represent a character; if it is part of a text file, it probably does, but if it is part of object code, a compressed archive, or other binary data, ASCII decoding is misleading. It depends on context.
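As a minimal sketch of that distinction in Python (the function name here is purely illustrative, not part of any standard), a byte value can be ASCII only when its high-order bit is clear:

    # Only byte values with the high-order bit clear can be ASCII characters.
    def is_ascii_byte(value):
        """Return True if a byte value (0-255) falls in the 7-bit ASCII range."""
        return value < 0x80              # equivalently: value & 0x80 == 0

    print(is_ascii_byte(0b01000001))     # True  -- 0x41, the ASCII letter "A"
    print(is_ascii_byte(0b11000001))     # False -- 0xC1, not ASCII
    print(chr(0b01000001))               # "A"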
The 128 reserved values in common 8-bit bytes have been used for a number of things in a character-encoding context. On traditional textual terminals (and printers, etc.) it has been common to allow switching between codepages to display a variety of national-language characters (and special characters like box-drawing borders), depending on the needs of a user. In the world of internet communications, something very similar to the codepage system exists in the various ISO-8859-* encodings. What all these systems do is assign a set of characters to the 128 slots that ASCII reserves for other uses. These might be accented Roman characters (used in many Western European languages), or they might be non-Roman character sets like Greek, Cyrillic, Hebrew, or Arabic (or, in the future, Thai and Hindi). By using the right codepage, 8-bit bytes can be made quite suitable for encoding reasonably sized (phonetic) alphabets.
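For instance, the very same high-bit byte names a different character under different ISO-8859-* encodings; a minimal sketch using codecs that Python ships:

    # One byte value, several national-language interpretations.
    b = bytes([0xE9])
    print(b.decode('iso-8859-1'))    # an accented Roman letter (e-acute)
    print(b.decode('iso-8859-7'))    # a Greek letter occupies the same slot
    print(b.decode('iso-8859-5'))    # a Cyrillic letter occupies the same slot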
Codepages and ISO-8859-* encodings, however, have some definite limitations. For one thing, a terminal can only display one codepage at a given time, and a document with an ISO-8859-* encoding can only contain one character set; documents that need to contain text in multiple languages cannot be represented by these encodings at all. A second issue is equally important: many ideographic and pictographic character sets have far more than 128 or 256 characters in them (the former is all we would have in the codepage system, the latter if we used the whole byte and discarded the ASCII part). It is simply not possible to encode languages like Chinese, Japanese, and Korean in single 8-bit bytes. Systems like ISO-2022-JP-1 and codepage 943 allow larger character sets to be represented using two or more bytes per character. But even with these language-specific multi-byte encodings, the problem of mixing languages remains.
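As a rough illustration of the multi-byte idea, here it is in Python, using the iso2022_jp and shift_jis codecs that Python happens to ship (stand-ins for the ISO-2022-JP-1 and codepage 943 variants named above):

    # A single Japanese ideograph needs more than one byte in language-specific
    # multi-byte encodings, and the two encodings disagree with each other.
    ji = '\u65e5'                       # the ideograph for "sun/day"
    print(ji.encode('iso2022_jp'))      # escape sequence plus a two-byte code
    print(ji.encode('shift_jis'))       # a different two-byte code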
What Is Unicode?
Unicode solves the problems of previous character encoding schemes by providing a unique code number for every character needed, worldwide and across languages. Over time, more characters are being added, but the allocation of available ranges for future uses has already been planned out, so the room exists. In Unicode-encoded documents, no ambiguity exists about how a given character should display (for example, should the byte value 0x89 appear as e-umlaut, as in codepage 850, or as the per-mil mark, as in codepage 1004?). Furthermore, by giving each character its own code, there is no problem or ambiguity in creating multi-lingual documents that utilize multiple character sets at the same time--or rather, such documents actually utilize the single (very large) character set of Unicode itself.
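To make the 0x89 example concrete, here is a small Python sketch; Python has no codepage 1004 codec, so Windows-1252 (which, as far as I know, agrees with it at this position) stands in for it:

    # One legacy byte, two plausible readings -- versus two distinct Unicode codes.
    b = bytes([0x89])
    print(b.decode('cp850'))     # e-umlaut in codepage 850
    print(b.decode('cp1252'))    # per-mil mark in Windows-1252 (cf. codepage 1004)
    print(hex(ord('\u00eb')), hex(ord('\u2030')))   # the two unambiguous code points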
Unicode is managed by the Unicode Consortium (see Resources), a non-profit group with corporate, institutional, and individual members. Originally, Unicode was planned as a 16-bit specification. However, this original plan failed to leave enough room for national variations on related (but distinct) ideographs across East Asian languages (Chinese, Japanese, and Korean), nor for specialized alphabets used in mathematics and the scholarship of historical languages. As a result, the code space of Unicode is currently 32 bits (and anticipated to remain fairly sparsely populated, given the roughly 4 billion allowed characters).
Encodings
A full 32 bits of encoding space leaves plenty of room for every character we might want to represent, but it has its own problems. If we need to use 4 bytes for every character we want to encode, that makes for rather verbose files (or strings, or streams). Furthermore, these verbose files are likely to cause a variety of problems for legacy tools. As a solution to this, Unicode is itself usually encoded using "Unicode Transformation Formats" (abbreviated as UTF-*). The encodings UTF-8 and UTF-16 use rather clever techniques to encode characters in a variable number of bytes, with the most common situation being the use of just the number of bits indicated in the encoding name. In addition, the specific byte-value ranges used in multi-byte characters are designed in such a way as to be friendly to existing tools. UTF-32 is also an available encoding, one that simply uses all four bytes for every character in a fixed-width encoding.
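A quick Python comparison of the three transformation formats; note that the UTF-16 and UTF-32 byte counts include the byte-order marks Python prepends by default:

    # The same short ASCII string grows by a different factor under each UTF-*.
    s = 'Unicode'
    for enc in ('utf-8', 'utf-16', 'utf-32'):
        data = s.encode(enc)
        print(enc, len(data), 'bytes')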
The design of UTF-8 is such that US-ASCII characters are simply encoded as themselves. For example, the English letter "e" is encoded as the single byte 0x65 in both ASCII and in UTF-8. However, the non-English "e-umlaut" character, which is Unicode character 0x00EB, is encoded with the two bytes 0xC3 0xAB. In contrast, the UTF-16 representation of every character is always at least 2 bytes (and sometimes 4 bytes). UTF-16 has the rather straightforward (little-endian) representations of the letters "e" and "e-umlaut" as 0x65 0x00 and 0xEB 0x00, respectively. So where does the odd value for the e-umlaut in UTF-8 come from? Here is the trick: no byte in a multi-byte UTF-8 character is allowed to fall in the 7-bit range used by ASCII, to avoid confusion. So the UTF-8 scheme uses some bit shifting, and encodes every Unicode character using up to 6 bytes. But the byte values allowed in each position are arranged in such a manner as not to allow confusion of byte positions (for example, if you read a file non-sequentially).
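These byte values are easy to confirm in Python; the utf-16-le codec is used here so the output matches the little-endian byte order shown above, without a byte-order mark:

    # Confirm the encodings of "e" and "e-umlaut" discussed above.
    print('e'.encode('utf-8'))           # b'e'         -- the single byte 0x65, as in ASCII
    print('\u00eb'.encode('utf-8'))      # b'\xc3\xab'  -- two bytes for e-umlaut
    print('e'.encode('utf-16-le'))       # b'e\x00'     -- 0x65 0x00
    print('\u00eb'.encode('utf-16-le'))  # b'\xeb\x00'  -- 0xEB 0x00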
Let's look at another example, just to see it laid out. Here is a simple text string encoded in several ways. The view presented in the graphic is what you would see in a hex-mode file viewer. This way, it is easy to see both a likely onscreen character representation (on a legacy, non-Unicode terminal) and a representation of the underlying hexadecimal values each byte contains:
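A rough Python sketch of producing such a hex-mode view (the sample string and the helper name are merely illustrative, not taken from the original figure):

    # Print a crude hex-dump line for a string under several encodings.
    def hex_view(s, encoding):
        return ' '.join('%02X' % byte for byte in s.encode(encoding))

    sample = 'Hello \u00eb\u00e9'        # an arbitrary sample string
    for enc in ('iso-8859-1', 'utf-8', 'utf-16-le'):
        print('%-12s %s' % (enc, hex_view(sample, enc)))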
Declarations
We have seen how Unicode characters are actually encoded, at least briefly, but how do applications know to use a particular decoding procedure when Unicode is encountered? How applications are alerted to a Unicode encoding depends upon the type of data stream in question.
Normal text files do not have any special header information attached to them to explicitly specify type. However, some operating systems (like MacOS, OS/2 and BeOS--Windows and Linux only in a more limited sense) have mechanisms to attach extended attributes to files; increasingly, MIME header information is stored in such extended attributes. If this happens to be the case, it is possible to store MIME header information such as:
Content-Type: text/plain; charset=UTF-8
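Assuming such a header were available (from an extended attribute or anywhere else), pulling the charset out of it is straightforward with Python's standard email machinery; a minimal sketch:

    # Parse the charset parameter out of a stored Content-Type header.
    from email.parser import Parser

    header = 'Content-Type: text/plain; charset=UTF-8\n\n'
    msg = Parser().parsestr(header, headersonly=True)
    print(msg.get_content_type())       # text/plain
    print(msg.get_content_charset())    # utf-8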
Nonetheless, having MIME headers attached to files is not a safe generic assumption. Fortunately, the actual byte sequences in Unicode files provide a tip to applications. A Unicode-aware application, absent contrary indication, is supposed to assume that a given file is encoded with UTF-8. A non-Unicode-aware application reading the same file will find a file that contains a mixture of ASCII characters and high-bit characters (for multi-byte UTF-8 encodings). All the ASCII-range bytes will have the same values as if they were ASCII encoded. If any multi-byte UTF-8 sequences were used, those will appear as non-ASCII bytes, and should be treated as non-character data by the legacy application. This may result in non-processing of those extended characters, but that is pretty much the best we could expect from a legacy application (that, by definition, does not know how to deal with the extended characters).
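A quick Python illustration of that split (the sample word is merely illustrative): the ASCII characters keep their ASCII byte values, and only the extended character contributes high-bit bytes:

    # Separate the ASCII-range bytes from the multi-byte UTF-8 sequences.
    data = 'na\u00efve'.encode('utf-8')             # "naive" with i-umlaut
    print(data)                                     # b'na\xc3\xafve'
    print([hex(b) for b in data if b < 0x80])       # plain ASCII bytes: n, a, v, e
    print([hex(b) for b in data if b >= 0x80])      # the two-byte sequence for i-umlaut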
For UTF-16 encoded files, a special convention is followed for the first two bytes of the file. One of the sequences 0xFF 0xFE or 0xFE 0xFF acts as a small header to the file. The choice between the two indicates the endianness of a platform's bytes (most common platforms are little-endian, and will use 0xFF 0xFE). It was decided that the collision risk of a legacy file beginning with these bytes was small, and therefore these could be used as a reliable indicator for UTF-16 encoding. Within a UTF-16 encoded text file, plain ASCII characters will appear every other byte, interspersed with 0x00 (null) bytes. Of course, extended characters will produce non-null bytes, and in some cases double-word (4 byte) representations. But a legacy tool that ignores embedded nulls will wind up doing the right thing with UTF-16 encoded files, even without knowing about Unicode.
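A hedged sketch of the kind of check an application might make on a file's opening bytes (the function name is mine, not from any library):

    # Guess between UTF-16 (either byte order) and UTF-8 from the first two bytes.
    def sniff_and_decode(raw):
        if raw[:2] in (b'\xff\xfe', b'\xfe\xff'):   # UTF-16 byte-order marks
            return raw.decode('utf-16')             # the codec consumes the mark itself
        return raw.decode('utf-8')                  # default assumption otherwise

    sample = 'e\u00eb'.encode('utf-16')   # Python prepends a byte-order mark
    print(repr(sample))                   # b'\xff\xfee\x00\xeb\x00' on little-endian platforms
    print(sniff_and_decode(sample))       # "e" followed by e-umlaut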
Many communications protocols--and more recent document specifications--allow for explicit encoding specification. For example, an HTTPd application (a web server) can return a header such as the following to provide explicit instructions to a client:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Similarly, an NNTP or SMTP/POP3 message can carry a similar Content-Type: header field that makes explicit the encoding to follow (most likely as text/plain rather than text/html, however; or at least we can hope).
HTML and XML documents can contain tags and declarations to make Unicode encoding explicit. An HTML document can provide a hint in a META tag, like:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
However, a META tag should properly take lower precedence than an HTTP header in a situation where both are part of the communication (but for, e.g., a local HTML file, such an HTTP header does not exist).
In XML, the actual document declaration should indicate the Unicode encoding, as in:
<?xml version="1.0" encoding="UTF-8"?>
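Python's bundled XML parser, for one, honors such a declaration on its own; a minimal sketch (the document content is purely illustrative):

    # The XML declaration tells the parser how the byte stream is encoded.
    import xml.etree.ElementTree as ET

    raw = '<?xml version="1.0" encoding="UTF-8"?>\n<note>Gr\u00fc\u00dfe</note>'.encode('utf-8')
    root = ET.fromstring(raw)
    print(root.text)    # the German greeting decodes correctly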
Other formats and protocols may provide explicit encoding specification by similar means.
Resources
More-or-less definitive information on all matters Unicode can be found at:
http://www.unicode.org/
The Unicode Consortium:
http://www.unicode.org/unicode/consortium/consort.html
ISO 8859 Alphabet Soup:
http://czyborra.com/charsets/iso8859.html
Unicode Technical Report #17--Character Encoding Model:
http://www.unicode.org/unicode/reports/tr17/
Yudit Unicode Text Editor:
http://yudit.org
A number of TrueType Unicode fonts can be found at:
http://www.ccss.de/slovo/unifonts.htm
A brief history of ASCII:
http://www.bobbemer.com/ASCII.HTM
About The Author
David Mertz knows a little bit about a lot of things, but a lot about fewer things than he once did. The smooth overcomes the striated. David can be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcomed.