Technical Tips
Understanding Unicode
David Mertz, Ph.D.
February, 2001
Unicode provides a unified means of representing character sets across languages, and makes multi-language and multi-national computing much more transparent. This article provides some background on the development of Unicode, and some beginning hints on working with Unicode.
Some Background On Characters
Before we see what Unicode is, it makes sense to step back slightly to think about just what it means to store "characters" in digital files. Anyone who uses a tool like a text editor usually just thinks of what they are doing as entering some characters--numbers, letters, punctuation, etc. But behind the scenes a little bit more is going on. "Characters" that are stored on digital media must be stored as sequences of ones and zeros, and some encoding and decoding must happen to make these ones and zeros into characters we see on a screen or type in with a keyboard.
Sometime around the 1960s, a few decisions were made about just which ones and zeros (bits) would represent characters. One important choice that most modern computer users give no thought to was the decision to use 8-bit bytes on nearly all computer platforms. In other words, bytes have 256 possible values. Within these 8-bit bytes, a consensus was reached to represent one character in each byte. So at that point, computers needed a particular encoding of characters into byte values; there were 256 "slots" available, but just which character would go in each slot? The most popular encoding developed was Bob Bemer's American Standard Code for Information Interchange (ASCII), which is now specified in exciting standards like ISO-14962-1997 and ANSI-X3.4-1986(R1997). But other options, like IBM's mainframe EBCDIC, linger on even now.
ASCII itself is of somewhat limited extent. Only the lower-order 7 bits of each byte are used for ASCII-encoded characters; the upper 128 byte values--those with the high-order bit set--are "reserved" for other uses (more on this below). So, for example, a byte that contains "01000001" might be an ASCII encoding of the letter "A", but a byte containing "11000001" cannot be an ASCII encoding of anything. Of course, a given byte may or may not actually represent a character; if it is part of a text file, it probably does, but if it is part of object code, a compressed archive, or other binary data, ASCII decoding is misleading. It depends on context.
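As a minimal sketch of that distinction in Python (the function name here is purely illustrative, not part of any standard), a byte value can be ASCII only when its high-order bit is clear:

    # Only byte values with the high-order bit clear can be ASCII characters.
    def is_ascii_byte(value):
        """Return True if a byte value (0-255) falls in the 7-bit ASCII range."""
        return value < 0x80              # equivalently: value & 0x80 == 0

    print(is_ascii_byte(0b01000001))     # True  -- 0x41, the ASCII letter "A"
    print(is_ascii_byte(0b11000001))     # False -- 0xC1, not ASCII
    print(chr(0b01000001))               # "A"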
The 128 reserved values in common 8-bit bytes have been used for a number of things in a character-encoding context. On traditional textual terminals (and printers, etc.) it has been common to allow switching between codepages to display a variety of national-language characters (and special characters like box-drawing borders), depending on the needs of a user. In the world of internet communications, something very similar to the codepage system exists in the various ISO-8859-* encodings. What all these systems do is assign a set of characters to the 128 slots that ASCII reserves for other uses. These might be accented Roman characters (used in many Western European languages), or they might be non-Roman character sets like Greek, Cyrillic, Hebrew, or Arabic (or, in the future, Thai and Hindi). By using the right codepage, 8-bit bytes can be made quite suitable for encoding reasonably sized (phonetic) alphabets.
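For instance, the very same high-bit byte names a different character under different ISO-8859-* encodings; a minimal sketch using codecs that Python ships:

    # One byte value, several national-language interpretations.
    b = bytes([0xE9])
    print(b.decode('iso-8859-1'))    # an accented Roman letter (e-acute)
    print(b.decode('iso-8859-7'))    # a Greek letter occupies the same slot
    print(b.decode('iso-8859-5'))    # a Cyrillic letter occupies the same slot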
Codepages and ISO-8859-* encodings, however, have some definite limitations. For one thing, a terminal can only display one codepage at a given time, and a document with an ISO-8859-* encoding can only contain one character set; documents that need to contain text in multiple languages cannot be represented by these encodings at all. A second issue is equally important: many ideographic and pictographic character sets have far more than 128 or 256 characters in them (the former is all we would have in the codepage system, the latter if we used the whole byte and discarded the ASCII part). It is simply not possible to encode languages like Chinese, Japanese, and Korean in single 8-bit bytes. Systems like ISO-2022-JP-1 and codepage 943 allow larger character sets to be represented using two or more bytes per character. But even with these language-specific multi-byte encodings, the problem of mixing languages remains.
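As a rough illustration of the multi-byte idea, here it is in Python, using the iso2022_jp and shift_jis codecs that Python happens to ship (stand-ins for the ISO-2022-JP-1 and codepage 943 variants named above):

    # A single Japanese ideograph needs more than one byte in language-specific
    # multi-byte encodings, and the two encodings disagree with each other.
    ji = '\u65e5'                       # the ideograph for "sun/day"
    print(ji.encode('iso2022_jp'))      # escape sequence plus a two-byte code
    print(ji.encode('shift_jis'))       # a different two-byte code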
What Is Unicode?
Unicode solves the problems of previous character encoding schemes by providing a unique code number for every character needed, worldwide and across languages. Over time, more characters are being added, but the allocation of available ranges for future uses has already been planned out, so the room exists. In Unicode-encoded documents, no ambiguity exists about how a given character should display (for example, should the byte value 0x89 appear as e-umlaut, as in codepage 850, or as the per-mil mark, as in codepage 1004?). Furthermore, by giving each character its own code, there is no problem or ambiguity in creating multi-lingual documents that utilize multiple character sets at the same time--or rather, such documents actually utilize the single (very large) character set of Unicode itself.
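To make the 0x89 example concrete, here is a small Python sketch; Python has no codepage 1004 codec, so Windows-1252 (which, as far as I know, agrees with it at this position) stands in for it:

    # One legacy byte, two plausible readings -- versus two distinct Unicode codes.
    b = bytes([0x89])
    print(b.decode('cp850'))     # e-umlaut in codepage 850
    print(b.decode('cp1252'))    # per-mil mark in Windows-1252 (cf. codepage 1004)
    print(hex(ord('\u00eb')), hex(ord('\u2030')))   # the two unambiguous code points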
Unicode is managed by the Unicode Consortium (see Resources), a non-profit group with corporate, institutional, and individual members. Originally, Unicode was planned as a 16-bit specification. However, this original plan failed to leave enough room for national variations on related (but distinct) ideographs across East Asian languages (Chinese, Japanese, and Korean), nor for specialized alphabets used in mathematics and the scholarship of historical languages. As a result, the code space of Unicode is currently 32 bits (and anticipated to remain fairly sparsely populated, given the roughly 4 billion allowed characters).
Encodings
A full 32 bits of encoding space leaves plenty of room for every character we might want to represent, but it has its own problems. If we need to use 4 bytes for every character we want to encode, that makes for rather verbose files (or strings, or streams). Furthermore, these verbose files are likely to cause a variety of problems for legacy tools. As a solution to this, Unicode is itself usually encoded using "Unicode Transformation Formats" (abbreviated as UTF-*). The encodings UTF-8 and UTF-16 use rather clever techniques to encode characters in a variable number of bytes, with the most common situation being the use of just the number of bits indicated in the encoding name. In addition, the specific byte-value ranges used in multi-byte characters are designed in such a way as to be friendly to existing tools. UTF-32 is also an available encoding, one that simply uses all four bytes for every character in a fixed-width encoding.
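A quick Python comparison of the three transformation formats; note that the UTF-16 and UTF-32 byte counts include the byte-order marks Python prepends by default:

    # The same short ASCII string grows by a different factor under each UTF-*.
    s = 'Unicode'
    for enc in ('utf-8', 'utf-16', 'utf-32'):
        data = s.encode(enc)
        print(enc, len(data), 'bytes')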
The design of UTF-8 is such that US-ASCII characters are simply encoded as themselves. For example, the English letter "e" is encoded as the single byte 0x65 in both ASCII and in UTF-8. However, the non-English "e-umlaut" character, which is Unicode character 0x00EB, is encoded with the two bytes 0xC3 0xAB. In contrast, the UTF-16 representation of every character is always at least 2 bytes (and sometimes 4 bytes). UTF-16 has the rather straightforward (little-endian) representations of the letters "e" and "e-umlaut" as 0x65 0x00 and 0xEB 0x00, respectively. So where does the odd value for the e-umlaut in UTF-8 come from? Here is the trick: no byte in a multi-byte UTF-8 character is allowed to fall in the 7-bit range used by ASCII, to avoid confusion. So the UTF-8 scheme uses some bit shifting, and encodes every Unicode character using up to 6 bytes. But the byte values allowed in each position are arranged in such a manner as not to allow confusion of byte positions (for example, if you read a file non-sequentially).
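These byte values are easy to confirm in Python; the utf-16-le codec is used here so the output matches the little-endian byte order shown above, without a byte-order mark:

    # Confirm the encodings of "e" and "e-umlaut" discussed above.
    print('e'.encode('utf-8'))           # b'e'         -- the single byte 0x65, as in ASCII
    print('\u00eb'.encode('utf-8'))      # b'\xc3\xab'  -- two bytes for e-umlaut
    print('e'.encode('utf-16-le'))       # b'e\x00'     -- 0x65 0x00
    print('\u00eb'.encode('utf-16-le'))  # b'\xeb\x00'  -- 0xEB 0x00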
Let's look at another example, just to see it laid out. Here is a simple text string encoded in several ways. The view presented in the graphic is what you would see in a hex-mode file viewer. This way, it is easy to see both a likely onscreen character representation (on a legacy, non-Unicode terminal) and a representation of the underlying hexadecimal values each byte contains:
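A rough Python sketch of producing such a hex-mode view (the sample string and the helper name are merely illustrative, not taken from the original figure):

    # Print a crude hex-dump line for a string under several encodings.
    def hex_view(s, encoding):
        return ' '.join('%02X' % byte for byte in s.encode(encoding))

    sample = 'Hello \u00eb\u00e9'        # an arbitrary sample string
    for enc in ('iso-8859-1', 'utf-8', 'utf-16-le'):
        print('%-12s %s' % (enc, hex_view(sample, enc)))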
Declarations
We have seen how Unicode characters are actually encoded, at least briefly, but how do applications know to use a particular decoding procedure when Unicode is encountered? How applications are alerted to a Unicode encoding depends upon the type of data stream in question.
Normal text files do not have any special header information attached to them to explicitly specify type. However, some operating systems (like MacOS, OS/2 and BeOS--Windows and Linux only in a more limited sense) have mechanisms to attach extended attributes to files; increasingly, MIME header information is stored in such extended attributes. If this happens to be the case, it is possible to store MIME header information such as:
Content-Type: text/plain; charset=UTF-8
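Assuming such a header were available (from an extended attribute or anywhere else), pulling the charset out of it is straightforward with Python's standard email machinery; a minimal sketch:

    # Parse the charset parameter out of a stored Content-Type header.
    from email.parser import Parser

    header = 'Content-Type: text/plain; charset=UTF-8\n\n'
    msg = Parser().parsestr(header, headersonly=True)
    print(msg.get_content_type())       # text/plain
    print(msg.get_content_charset())    # utf-8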
Nonetheless, having MIME headers attached to files is not a safe generic assumption. Fortunately, the actual byte sequences in Unicode files provide a tip to applications. A Unicode-aware application, absent contrary indication, is supposed to assume that a given file is encoded with UTF-8. A non-Unicode-aware application reading the same file will find a file that contains a mixture of ASCII characters and high-bit characters (for multi-byte UTF-8 encodings). All the ASCII-range bytes will have the same values as if they were ASCII encoded. If any multi-byte UTF-8 sequences were used, those will appear as non-ASCII bytes, and should be treated as non-character data by the legacy application. This may result in non-processing of those extended characters, but that is pretty much the best we could expect from a legacy application (that, by definition, does not know how to deal with the extended characters).
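A quick Python illustration of that split (the sample word is merely illustrative): the ASCII characters keep their ASCII byte values, and only the extended character contributes high-bit bytes:

    # Separate the ASCII-range bytes from the multi-byte UTF-8 sequences.
    data = 'na\u00efve'.encode('utf-8')             # "naive" with i-umlaut
    print(data)                                     # b'na\xc3\xafve'
    print([hex(b) for b in data if b < 0x80])       # plain ASCII bytes: n, a, v, e
    print([hex(b) for b in data if b >= 0x80])      # the two-byte sequence for i-umlaut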
For UTF-16 encoded files, a special convention is followed for the first two bytes of the file. One of the sequences 0xFF 0xFE or 0xFE 0xFF acts as a small header to the file. The choice between the two indicates the endianness of a platform's bytes (most common platforms are little-endian, and will use 0xFF 0xFE). It was decided that the collision risk of a legacy file beginning with these bytes was small, and therefore these could be used as a reliable indicator for UTF-16 encoding. Within a UTF-16 encoded text file, plain ASCII characters will appear every other byte, interspersed with 0x00 (null) bytes. Of course, extended characters will produce non-null bytes, and in some cases double-word (4 byte) representations. But a legacy tool that ignores embedded nulls will wind up doing the right thing with UTF-16 encoded files, even without knowing about Unicode.
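A hedged sketch of the kind of check an application might make on a file's opening bytes (the function name is mine, not from any library):

    # Guess between UTF-16 (either byte order) and UTF-8 from the first two bytes.
    def sniff_and_decode(raw):
        if raw[:2] in (b'\xff\xfe', b'\xfe\xff'):   # UTF-16 byte-order marks
            return raw.decode('utf-16')             # the codec consumes the mark itself
        return raw.decode('utf-8')                  # default assumption otherwise

    sample = 'e\u00eb'.encode('utf-16')   # Python prepends a byte-order mark
    print(repr(sample))                   # b'\xff\xfee\x00\xeb\x00' on little-endian platforms
    print(sniff_and_decode(sample))       # "e" followed by e-umlaut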
Many communications protocols--and more recent document specifications--allow for explicit encoding specification. For example, an HTTPd application (a web server) can return a header such as the following to provide explicit instructions to a client:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Similarly, an NNTP or SMTP/POP3 message can carry a similar Content-Type: header field that makes explicit the encoding to follow (most likely as text/plain rather than text/html, however; or at least we can hope).
HTML and XML documents can contain tags and declarations to make Unicode encoding explicit. An HTML document can provide a hint in a META tag, like:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
However, a META tag should properly take lower precedence than an HTTP header in a situation where both are part of the communication (but for, e.g., a local HTML file, such an HTTP header does not exist).
In XML, the actual document declaration should indicate the Unicode encoding, as in:
<?xml version="1.0" encoding="UTF-8"?>
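Python's bundled XML parser, for one, honors such a declaration on its own; a minimal sketch (the document content is purely illustrative):

    # The XML declaration tells the parser how the byte stream is encoded.
    import xml.etree.ElementTree as ET

    raw = '<?xml version="1.0" encoding="UTF-8"?>\n<note>Gr\u00fc\u00dfe</note>'.encode('utf-8')
    root = ET.fromstring(raw)
    print(root.text)    # the German greeting decodes correctly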
Other formats and protocols may provide explicit encoding specification by similar means.
Resources
More-or-less definitive information on all matters Unicode can be found at:
http://www.unicode.org/
The Unicode Consortium:
http://www.unicode.org/unicode/consortium/consort.html
ISO 8859 Alphabet Soup:
http://czyborra.com/charsets/iso8859.html
Unicode Technical Report #17--Character Encoding Model:
http://www.unicode.org/unicode/reports/tr17/
Yudit Unicode Text Editor:
http://yudit.org
A number of TrueType Unicode fonts can be found at:
http://www.ccss.de/slovo/unifonts.htm
A brief history of ASCII:
http://www.bobbemer.com/ASCII.HTM
About The Author
David Mertz knows a little bit about a lot of things, but a lot about fewer things than he once did. The smooth overcomes the striated. David can be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcomed.