Level: Intermediate
Ken Lunde (lunde@adobe.com), Manager, CJKV Type Development, Adobe Systems
01 Sep 2001
This article summarizes the latest developments in Unicode, and provides an overview of its encodings, specifically UTF-8, UTF-16, and UTF-32. In addition, this article demonstrates how these three encodings can interoperate through the use of simple algorithms. Working Perl functions are provided as an example of Unicode Transformation Format (UTF) interoperability, and three Unicode-enabling libraries that offer full UTF interoperability are introduced. Now that there are characters in Unicode beyond the Basic Multilingual Plane (BMP), it is critical that operating systems and applications support the full range of 1,112,064 valid Unicode code points, and interoperate between the UTFs.
The Unicode Standard, while considered a single (yet huge) character set, can be represented by three encodings: UTF-8, UTF-16, and UTF-32. Unicode Version 3.1, the latest version, is equivalent to two ISO standards: ISO 10646-1:2000 (Part 1: Architecture and Basic Multilingual Plane) and ISO 10646-2:2000 (Part 2: Supplementary Planes). The lock-step relationship between Unicode and ISO 10646 is important, and is expected to continue. This ensures that any Unicode-based software is also compliant with an accepted international standard.
Unicode as a character set
As a character set, Unicode is composed of 17 planes of up to 65,536 code points each. A plane is a grouping of characters within a 256 x 256 matrix, and each plane thus contains up to 65,536 characters. A plane can also be thought of as 65,536 contiguous code points. The first plane is special: it is referred to as Plane 00 or the Basic Multilingual Plane (BMP), and has only 63,488 available code points. The remaining 16 planes are referred to as Supplementary Planes, and have 65,536 code points each.
The missing 2,048 code points in the BMP (65,536 minus 63,488) are called surrogates -- specifically, 1,024 high surrogates followed by 1,024 low surrogates. They are used together to gain access to the 1,048,576 code points in the 16 Supplementary Planes. The 2,048 surrogates are used only for UTF-16 encoding. Thus, there are a total of 1,112,064 available code points in Unicode.
The latest version of Unicode is Version 3.1, and has a staggering 94,140 characters assigned to the BMP and three of the Supplementary Planes, as shown in the following table:
| Plane | Plane name | Characters |
|---|---|---|
| 0 (0x00) | Basic Multilingual Plane (BMP) | 49,196 |
| 1 (0x01) | Supplementary Multilingual Plane for scripts and symbols (SMP) | 1,594 |
| 2 (0x02) | Supplementary Ideographic Plane (SIP) | 43,253 |
| 14 (0x0E) | Supplementary Special-purpose Plane (SSP) | 97 |
Unicode Version 3.0 defined 49,194 characters, all of which are in the BMP. Unicode Version 3.1 added two characters to the BMP, and the remaining 44,944 characters were assigned to three of the Supplementary Planes. The most significant aspect of Version 3.1 is that it is the first version of Unicode to assign characters outside of the BMP. Previous versions already defined encodings capable of representing such characters, but none were actually assigned. This has major implications for software developers.
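The counts quoted in this section follow from simple arithmetic, which can be checked with a few lines of Perl. This is only a minimal sketch: the helper name plane_of is hypothetical, and the constants are taken from the text above.

```perl
use strict;
use warnings;

# Constants taken from the description of Unicode as a character set.
my $planes          = 17;
my $plane_size      = 65_536;                          # 256 x 256
my $surrogates      = 2_048;                           # 1,024 high + 1,024 low
my $all_code_points = $planes * $plane_size;           # 1,114,112
my $valid           = $all_code_points - $surrogates;  # 1,112,064

# Hypothetical helper: the plane number is the code point divided by 65,536,
# which is the same as shifting it right by 16 bits.
sub plane_of {
    my ($code_point) = @_;
    die "Outside of Unicode\n" if $code_point < 0 || $code_point > 0x10FFFF;
    return $code_point >> 16;
}

printf "%d code points in all, %d valid\n", $all_code_points, $valid;
printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for 0x4E00, 0x20000, 0xE0001, 0x10FFFF;
```

The four sample code points fall in planes 0, 2, 14, and 16 respectively.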
Unicode as an encoding
The latest version of Unicode supports three encodings, UTF-8, UTF-16, and UTF-32. The numbers used in these names -- 8, 16, and 32 -- represent the basic unit in terms of number of bits. For example, UTF-8 is made up of eight-bit units (each of which equals one byte). UTF-16 is made up of 16-bit units, and UTF-32 uses 32-bit units. These three encodings have one aspect in common. The 1,048,576 code points of the 16 Supplementary Planes are represented by 4 bytes or 32 bits. UTF-8 uses four bytes, UTF-16 uses two 16-bit units (high plus low surrogate), and UTF-32 uses a single 32-bit unit.
UTF-8 encoding
UTF-8 encoding is variable-length, and characters are encoded with one, two, three, or four bytes. The first 128 characters of Unicode (BMP), U+0000 through U+007F, are encoded with a single byte, and are equivalent to ASCII. U+0080 through U+07FF (BMP) are encoded with two bytes, and U+0800 through U+FFFF (still BMP) are encoded with three bytes. The 1,048,576 characters of the 16 Supplementary Planes are encoded with four bytes.
UTF-16 encoding
UTF-16 encoding is a variable-length 16-bit representation. Each character is made up of one or two 16-bit units. In terms of bytes, each character is made up of two or four bytes. The single 16-bit portion of this encoding is used to encode the entire BMP, except for 2,048 code points known as "surrogates" that are used in pairs to encode the 1,048,576 characters of the 16 Supplementary Planes. U+D800 through U+DBFF are the 1,024 high surrogates, and U+DC00 through U+DFFF are the 1,024 low surrogates. A high surrogate followed by a low surrogate (that is, two 16-bit units) represents a single character in the 16 Supplementary Planes.
UTF-32 encoding
UTF-32 encoding is a fixed 32-bit (four-byte) representation. Those who are familiar with UCS-4 encoding should note that UTF-32 encoding is simply a subset of UCS-4 encoding that specifically covers only the 17 planes of Unicode. In other words, UTF-32's encoding range is 0x00000000 through 0x0010FFFF.
Beware of UTF-16 and UTF-32 byte order
UTF-8 encoding is made up of bytes. Each character is represented by one, two, three, or four bytes. UTF-16 and UTF-32 encodings are made up of 16- and 32-bit units, respectively. This means that byte order is significant. Luckily, developers are encouraged to use the Byte Order Mark (BOM) as the first character in UTF-16 or UTF-32 test data. This tells the interpreting software what byte order to use.
The two byte orders are called little- and big-endian. Intel processors, which typically power computers running Windows, use little-endian byte order. Most computers running Mac OS and most flavors of Unix use big-endian byte order. As a byte sequence, the BOM is represented in UTF-16 encoding as 0xFEFF for big-endian byte order and 0xFFFE for little-endian. The corresponding byte sequences in UTF-32 encoding are 0x0000FEFF and 0xFFFE0000.
As an example, consider the two bytes 0x4E and 0x00, in that order. As a 16-bit unit, they become 0x4E00 or 0x004E, depending on byte order. Read as big-endian, they form 0x4E00, the Chinese character meaning "one"; read as little-endian, they form 0x004E, the Latin character "N." As you can see, if the byte order is not interpreted correctly, disaster can result.
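The BOM check and the 0x4E 0x00 example can be sketched in Perl as follows. The helper name detect_bom is hypothetical, and the sketch assumes the data is available as a plain byte string; unpack's "n" and "v" templates read a 16-bit unit in big- and little-endian order, respectively.

```perl
use strict;
use warnings;

# Hypothetical helper: inspect the leading bytes of a buffer and report the
# encoding and byte order announced by its BOM, or an empty list if none.
sub detect_bom {
    my ($data) = @_;
    # UTF-32 is tested first, because its little-endian BOM (FF FE 00 00)
    # begins with the UTF-16 little-endian BOM (FF FE).
    return ("UTF-32", "big")    if $data =~ /^\x00\x00\xFE\xFF/;
    return ("UTF-32", "little") if $data =~ /^\xFF\xFE\x00\x00/;
    return ("UTF-16", "big")    if $data =~ /^\xFE\xFF/;
    return ("UTF-16", "little") if $data =~ /^\xFF\xFE/;
    return;
}

# The two bytes 0x4E 0x00 from the example, read in both byte orders.
my $bytes = "\x4E\x00";
printf "big-endian:    U+%04X\n", unpack("n", $bytes);
printf "little-endian: U+%04X\n", unpack("v", $bytes);

# A big-endian UTF-16 buffer that starts with a BOM.
my @found = detect_bom("\xFE\xFF\x4E\x00");
print "BOM says: @found\n";
```

Note that the bytes FF FE 00 00 could also be a little-endian UTF-16 BOM followed by U+0000, so a check like this is a heuristic rather than a guarantee.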
Interoperability between Unicode encodings
Interoperating between the three Unicode encodings is purely an algorithmic problem. I have found that four basic code conversion algorithms can suffice, but bear in mind that software must also handle byte order correctly, and must also recognize and properly handle the BOM. The following table shows how the 16 Supplementary Planes correspond to UTF-32 and UTF-16 encodings, as an example of how these encodings relate to one another:
| Plane | UTF-32 Encoding | UTF-16 Encoding |
|---|---|---|
| 1 | 0x00010000-0x0001FFFF | 0xD800DC00-0xD83FDFFF |
| 2 | 0x00020000-0x0002FFFF | 0xD840DC00-0xD87FDFFF |
| 3 | 0x00030000-0x0003FFFF | 0xD880DC00-0xD8BFDFFF |
| 4 | 0x00040000-0x0004FFFF | 0xD8C0DC00-0xD8FFDFFF |
| 5 | 0x00050000-0x0005FFFF | 0xD900DC00-0xD93FDFFF |
| 6 | 0x00060000-0x0006FFFF | 0xD940DC00-0xD97FDFFF |
| 7 | 0x00070000-0x0007FFFF | 0xD980DC00-0xD9BFDFFF |
| 8 | 0x00080000-0x0008FFFF | 0xD9C0DC00-0xD9FFDFFF |
| 9 | 0x00090000-0x0009FFFF | 0xDA00DC00-0xDA3FDFFF |
| 10 | 0x000A0000-0x000AFFFF | 0xDA40DC00-0xDA7FDFFF |
| 11 | 0x000B0000-0x000BFFFF | 0xDA80DC00-0xDABFDFFF |
| 12 | 0x000C0000-0x000CFFFF | 0xDAC0DC00-0xDAFFDFFF |
| 13 | 0x000D0000-0x000DFFFF | 0xDB00DC00-0xDB3FDFFF |
| 14 | 0x000E0000-0x000EFFFF | 0xDB40DC00-0xDB7FDFFF |
| 15 | 0x000F0000-0x000FFFFF | 0xDB80DC00-0xDBBFDFFF |
| 16 | 0x00100000-0x0010FFFF | 0xDBC0DC00-0xDBFFDFFF |
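The rows of this table can be regenerated from the surrogate arithmetic itself. The following Perl sketch does so; the helper name surrogate_pair is hypothetical, and the output layout is only an approximation of the table above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Supplementary Plane code point into its
# high and low surrogates.
sub surrogate_pair {
    my ($code_point) = @_;
    die "Not a Supplementary Plane code point\n"
        if $code_point < 0x10000 || $code_point > 0x10FFFF;
    my $offset = $code_point - 0x10000;      # 20 significant bits remain
    return (0xD800 + ($offset >> 10),        # top 10 bits -> high surrogate
            0xDC00 + ($offset & 0x3FF));     # bottom 10 bits -> low surrogate
}

# Print the first and last code point of each plane in both encodings.
for my $plane (1 .. 16) {
    my $first = $plane << 16;
    my $last  = $first | 0xFFFF;
    printf "%2d  0x%08X-0x%08X  0x%04X%04X-0x%04X%04X\n",
        $plane, $first, $last, surrogate_pair($first), surrogate_pair($last);
}
```

Each plane consumes 64 (0x40) consecutive high surrogates, which is why the high-surrogate column advances by 0x40 from one row to the next.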
I am including some simple Perl functions that illustrate how one can convert between the three UTFs. (Note: these functions are not very efficient; there are commercial libraries, described later in this article, that offer improved efficiency.) The Perl function in Listing 1 converts a single UTF-16 character into UTF-32 encoding, and assumes big-endian byte order:
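sub UTF16toUTF32 ($) {
    my ($bytes) = @_;
    if ($bytes =~ /^([\x00-\xD7\xE0-\xFF][\x00-\xFF])\z/) {
        # One 16-bit unit outside the surrogate range: widen it to 32 bits.
        return pack("N", unpack("n", $bytes));
    } elsif ($bytes =~ /^([\xD8-\xDB][\x00-\xFF])([\xDC-\xDF][\x00-\xFF])\z/) {
        # High surrogate (counted from 0xD800 = 55296) plus low surrogate
        # (counted from 0xDC00 = 56320), offset by 0x10000 = 65536.
        return pack("N", ((unpack("n", $1) - 55296) * 1024)
                       + (unpack("n", $2) - 56320) + 65536);
    } else {
        die "Whoah! Bad UTF-16 data!\n";
    }
}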
Listing 2 converts a single UTF-8 character into UTF-32 encoding, and again assumes big-endian byte order:
sub UTF8toUTF32 ($) {
    my ($bytes) = @_;
    # Note: overlong forms and values above 0x10FFFF are not rejected.
    if ($bytes =~ /^([\x00-\x7F])\z/) {
        return pack("N", ord($1));
    } elsif ($bytes =~ /^([\xC0-\xDF])([\x80-\xBF])\z/) {
        return pack("N", ((ord($1) & 31) << 6) | (ord($2) & 63));
    } elsif ($bytes =~ /^([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])\z/) {
        return pack("N", ((ord($1) & 15) << 12) | ((ord($2) & 63) << 6)
                       | (ord($3) & 63));
    } elsif ($bytes =~ /^([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])\z/) {
        # The lead byte supplies the top three bits, shifted left by 18.
        return pack("N", ((ord($1) & 7) << 18) | ((ord($2) & 63) << 12)
                       | ((ord($3) & 63) << 6) | (ord($4) & 63));
    } else {
        die "Whoah! Bad UTF-8 data! Perhaps outside of Unicode (5- or 6-byte).\n";
    }
}
Listing 3 converts a single UTF-32 character into UTF-8 encoding, and again assumes big-endian byte order:
sub UTF32toUTF8 ($) {
    my ($ch) = unpack("N", $_[0]);
    if ($ch <= 127) {
        # U+0000 through U+007F: one byte, identical to ASCII.
        return chr($ch);
    } elsif ($ch <= 2047) {
        # U+0080 through U+07FF: two bytes.
        return pack("C*", 192 | ($ch >> 6), 128 | ($ch & 63));
    } elsif ($ch <= 65535) {
        # U+0800 through U+FFFF: three bytes.
        return pack("C*", 224 | ($ch >> 12), 128 | (($ch >> 6) & 63),
                          128 | ($ch & 63));
    } elsif ($ch <= 1114111) {
        # Supplementary Planes (up to 0x10FFFF): four bytes.
        return pack("C*", 240 | ($ch >> 18), 128 | (($ch >> 12) & 63),
                          128 | (($ch >> 6) & 63), 128 | ($ch & 63));
    } else {
        die "Whoah! Bad UTF-32 data! Perhaps outside of Unicode (UCS-4).";
    }
}
Finally, Listing 4 converts a single UTF-32 character into UTF-16 encoding, and once again assumes big-endian byte order:
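sub UTF32toUTF16 ($) {
    my ($ch) = unpack("N", $_[0]);
    if ($ch <= 65535) {
        # BMP: a single 16-bit unit.
        return pack("n", $ch);
    } elsif ($ch <= 1114111) {
        # Supplementary Planes: subtract 0x10000 (65536), then split the
        # remaining 20 bits into a high and a low surrogate.
        return pack("n*", int(($ch - 65536) / 1024) + 55296,
                          (($ch - 65536) % 1024) + 56320);
    } else {
        die "Whoah! Bad UTF-32 data! Perhaps outside of Unicode (UCS-4).";
    }
}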
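One way to check single-character converters such as these is to compare their output with an independent implementation. The sketch below assumes a Perl that bundles the core Encode module (Perl 5.8 or later, which postdates these listings); the helper name hex_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the bytes of one code point in the named
# encoding, written as uppercase hexadecimal.
sub hex_bytes {
    my ($encoding, $code_point) = @_;
    return uc(unpack("H*", encode($encoding, chr($code_point))));
}

# One sample each from the one-byte, three-byte, and four-byte UTF-8 ranges.
for my $cp (0x4E, 0x4E00, 0x20000) {
    printf "U+%04X  UTF-8: %s  UTF-16BE: %s  UTF-32BE: %s\n", $cp,
        map { hex_bytes($_, $cp) } qw(UTF-8 UTF-16BE UTF-32BE);
}
```

The hexadecimal strings printed here are the values a hand-written big-endian converter should produce for the same code points.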
Keep in mind that these Perl functions have been written to handle only big-endian byte order because in my development environment, I do not need to handle little-endian data. The end results of my work are files that deal with PostScript, which uses big-endian byte order.
Beware of binary ordering
Database developers need to be aware of different binary orderings when representing data in Unicode encodings. UTF-8 and UTF-32 encodings share the same binary ordering. That is, if you order character codes according to their byte values, they are ordered the same. UTF-16 encoding has a different binary ordering, due to the 2,048 high and low surrogates that it uses to represent the 1,048,576 code points in the 16 Supplementary Planes. For example, U+FFFD (0xFFFD in UTF-16) sorts after U+10000 (0xD800 0xDC00 in UTF-16), but before it in UTF-8 and UTF-32.
Implementations of UTF interoperability
There are at least three Unicode-enabling libraries that provide full UTF interoperability. That is, given the 1,112,064 valid Unicode code points, they are able to convert between the three UTFs through their APIs. These implementations are IBM's International Components for Unicode (ICU), X.Net's xIUA (Internationalization & Unicode Adaptor) which interfaces with IBM's ICU, and Basis Technology's Rosette. ICU is available for Java and C/C++, xIUA is available for C/C++, and Rosette is available for C/C++ (see Resources). I encourage you to explore all three of them to determine which one best fits your development needs.
Some practical examples
For the past several years I have been maintaining the UCS-2 (that is, UTF-16 encoding without surrogates, and thus no access to the Supplementary Planes) and UTF-8 CMap files for Adobe Systems' CJKV character collections for CID-keyed fonts. A CMap file is analogous to the "cmap" tables in TrueType and OpenType fonts, and serves to map encodings to CIDs (Character Identifiers), which are simple integers that identify a glyph in CIDFonts.
Due to the algorithmic relationship between UCS-2 and UTF-8, I maintained only the UCS-2 CMap files, then used a tool to derive the UTF-8 CMap files from the UCS-2 ones in a semi-automatic fashion. This kept the UTF-8 CMap files in sync with the original UCS-2 ones. I used a simple Perl script for this purpose. It supported conversion from the 16-bit UCS-2 representation to the one-, two-, and three-byte representation in UTF-8 encoding.
I recently started to develop a new suite of Unicode CMap files that support the Supplementary Planes in the three Unicode encodings, UTF-8, UTF-16, and UTF-32. I enhanced my Perl tools to be able to interoperate between these three Unicode encodings, and handle all 1,112,064 valid code points correctly. I first wrote a tool that can convert between the three Unicode encodings, and found that I only needed the following code conversion algorithms: UTF-8 to UTF-32, UTF-32 to UTF-8, UTF-16 to UTF-32, and UTF-32 to UTF-16. Conversion between UTF-8 and UTF-16 can be handled by using UTF-32 as an intermediate representation, although direct code conversion algorithms could have been just as easily implemented. My concern was not with speed, but with accuracy, so this solution worked out perfectly for my needs. Others' needs may differ.
Next, I decided to use UTF-32 as the beginning representation for the CMap files, then derive the UTF-8 and UTF-16 CMap files from them. Through the use of Perl (again), this process has now been fully automated. I maintain only the UTF-32 CMap files, and the equivalent UTF-8 and UTF-16 ones are automatically derived through the use of a single tool. This reduces the amount of time used for CMap file development, and also significantly reduces the possibility of discrepancies between the Unicode CMap files.
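The idea of using UTF-32 as the intermediate representation between UTF-8 and UTF-16 can be sketched as follows. This is not the tool described above; it is a compact illustration with hypothetical function names, it carries the UTF-32 value as a plain integer rather than a packed string, it assumes big-endian output, and it performs only minimal validation of its input.

```perl
use strict;
use warnings;

# Payload-bit masks for the lead byte, keyed by total sequence length.
my %lead_mask = (1 => 0x7F, 2 => 0x1F, 3 => 0x0F, 4 => 0x07);

# Step 1 (hypothetical name): one UTF-8 character to its code point.
sub utf8_to_code_point {
    my ($bytes) = @_;
    die "Empty input\n" unless length $bytes;
    my ($lead, @trail) = unpack("C*", $bytes);
    my $mask = $lead_mask{ 1 + @trail } or die "Bad UTF-8 length\n";
    my $cp = $lead & $mask;
    # Each trailing byte contributes its low six bits.
    $cp = ($cp << 6) | ($_ & 0x3F) for @trail;
    return $cp;
}

# Step 2 (hypothetical name): a code point to big-endian UTF-16.
sub code_point_to_utf16be {
    my ($cp) = @_;
    return pack("n", $cp) if $cp <= 0xFFFF;
    my $offset = $cp - 0x10000;
    return pack("n2", 0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# UTF-8 to UTF-16 is simply the composition of the two steps.
sub utf8_to_utf16be {
    return code_point_to_utf16be(utf8_to_code_point($_[0]));
}

for my $utf8 ("\x4E", "\xE4\xB8\x80", "\xF0\xA0\x80\x80") {
    printf "%-8s -> %s\n", uc(unpack("H*", $utf8)),
        uc(unpack("H*", utf8_to_utf16be($utf8)));
}
```

Routing every conversion through the code point keeps the number of algorithms at four (two per encoding) instead of one per ordered pair of encodings, at the cost of an extra step per character.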
Summary
This article has briefly described Unicode as a character set, has shown that there are three representations, and that it is trivial to interoperate between them. Armed with this information, developers can more easily extend their applications to handle the Supplementary Planes, which, as of Unicode Version 3.1, have characters assigned. Up until only recently, developers were able to avoid the Supplementary Planes, and thus the four-byte representations. Clearly, this has changed.
Resources
- The Unicode Consortium's Web site provides the most up-to-date information about the Unicode Standard, as well as its own sample UTF code conversion routines, which are written in C.
- More detailed information about Unicode (up through Version 2.1) and its encodings can also be found in Chapters 3 and 4 of my book, CJKV Information Processing (O'Reilly, 1999), specifically on pp 120-130 (Chapter 3) and 186-196 (Chapter 4). Note that UTF-32 encoding didn't exist when my book was published, but because it is a subset of UCS-4 encoding (that is, 0x00000000 through 0x0010FFFF), which is described in Chapter 4 of my book, UCS-4 descriptions can be used for UTF-32.
- More information on X.Net's xIUA can be found at the xIUA Home Page.
- Find more information about Basis Technology's Rosette Unicode-enabling library.
- IBM's ICU Unicode-enabling library can be found at the ICU Home Page.
About the author
Ken Lunde has been working for San Jose-based Adobe Systems Incorporated for over 10 years, and currently manages CJKV Type Development. He earned a Ph.D. in linguistics from The University of Wisconsin at Madison in 1994, and authored Understanding Japanese Information Processing (O'Reilly, 1993) and CJKV Information Processing (O'Reilly, 1999). He can be reached at lunde@adobe.com.