Level: Introductory
Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
07 Mar 2006
The Unicode Consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think of databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.
From the earliest conception of Unicode, a character representation standard designed to cover the vast array of human writing systems, it was clear that it would be essential to provide a convenient way for computers to query information about characters. Part of each version of the Unicode standard has been a corresponding version of The Unicode Character Database (UCD). The Unicode Character Database (UCD) home page describes the Unicode Character Database as follows:
The Unicode Character Database (UCD) consists of a number of data files listing character properties and related data along with a documentation file that explains the organization of the database and the format and meaning of the data in the files.
Character properties are information needed to understand and use a character. Some character properties are independent of the context in which the character is used, for example, whether or not a character is customarily used as a numerical digit. Some character properties depend on its role in a sequence of characters, such as directionality (some writing systems proceed from left to right on a page, some from right to left, and some in further variations).
Navigating the Unicode database
The third edition of XML 1.0 incorporates by reference The Unicode Standard, Version 3.2, so this is the version of the UCD that XML developers will be most concerned with. The UCD includes a couple dozen data files and about a half dozen documentation files. In this article I'll focus on the main file, UnicodeData-3.2.0.txt. You can find a link to the 3.2 UCD directory in Resources. If you open this file, you'll find a line for each character. Each line is a set of fields delimited by semicolons. As an example, the following is the line for the uppercase A character.
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
The first field is the code point, 0041. Obtain the conventional identifier for this character by prepending "U+" (in this case, U+0041). You can represent the character in XML, regardless of encoding, using the numeric character reference &#x41;. Beware that code points can use up to six hexadecimal digits, although five digits is the limit you'll find in UnicodeData-3.2.0.txt. The second field, LATIN CAPITAL LETTER A, is the character name, which is very important when discussing the character. The third field, Lu, is the general category, which is the most important key in Unicode's system for organizing characters. The value Lu is an abbreviation for "Letter, Uppercase". Examples of other categories are Nd ("Number, Decimal Digit"), Pd ("Punctuation, Dash"), Sc ("Symbol, Currency") and Zs ("Separator, Space"). There are several more fields in each UnicodeData-3.2.0.txt line, but the ones I've mentioned are the most widely used, and give you a flavor of the sort of information you can find in the UCD.
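The field layout described above can be sketched in a few lines of Python. This is only an illustration, assuming Python 3; the names UcdRecord, parse_ucd_line, and xml_char_ref are hypothetical helpers invented for this example and are not part of the UCD or of any library. It reads only the three fields discussed here and ignores the rest.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    """The three UnicodeData fields discussed in the text (hypothetical type)."""
    code_point: int
    name: str
    general_category: str


def parse_ucd_line(line):
    """Split one semicolon-delimited UnicodeData line and keep fields 0-2."""
    fields = line.rstrip("\n").split(";")
    # Field 0 is the code point, written in hexadecimal.
    return UcdRecord(int(fields[0], 16), fields[1], fields[2])


def xml_char_ref(code_point):
    """Build an XML hexadecimal numeric character reference."""
    return "&#x{:X};".format(code_point)


record = parse_ucd_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
# Conventional "U+" identifier, name, and general category.
print("U+{:04X}".format(record.code_point), record.name, record.general_category)
# The character reference you could paste into an XML document.
print(xml_char_ref(record.code_point))
```

For the sample line this prints the identifier U+0041 with its name and category, followed by the reference &#x41;.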
Finding characters for fun and profit
Since the UCD files are simple text, you can use all sorts of generic tools to process them. You can often find a character of interest by loading UnicodeData-3.2.0.txt in a text editor and searching for a keyword in the character name. You can also use command line tools such as grep in UNIX. In the following example, I look for the dagger characters commonly used to mark notes.
$ grep -i "dagger" UnicodeData-3.2.0.txt
2020;DAGGER;Po;0;ON;;;;;N;;;;;
2021;DOUBLE DAGGER;Po;0;ON;;;;;N;;;;;
The -i option makes the search case insensitive. You can also find fun characters by name. Some that I have come across are pencil, skull-and-crossbones and snowman.
$ grep -i skull UnicodeData-3.2.0.txt
2620;SKULL AND CROSSBONES;So;0;ON;;;;;N;;;;;
$ grep -i pencil UnicodeData-3.2.0.txt
270E;LOWER RIGHT PENCIL;So;0;ON;;;;;N;;;;;
270F;PENCIL;So;0;ON;;;;;N;;;;;
2710;UPPER RIGHT PENCIL;So;0;ON;;;;;N;;;;;
$ grep -i snowman UnicodeData-3.2.0.txt
2603;SNOWMAN;So;0;ON;;;;;N;;;;;
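If grep is not available, the same kind of search can be approximated in Python. The following is a minimal sketch, assuming Python 3; find_by_name is a hypothetical helper, and SAMPLE simply repeats a few of the lines shown in the grep output above so that the example runs without the data file. Unlike grep, it matches the keyword against the name field only, not the whole line.

```python
# A few lines copied from the grep output above, standing in for the real file.
SAMPLE = """\
2020;DAGGER;Po;0;ON;;;;;N;;;;;
2021;DOUBLE DAGGER;Po;0;ON;;;;;N;;;;;
2603;SNOWMAN;So;0;ON;;;;;N;;;;;
2620;SKULL AND CROSSBONES;So;0;ON;;;;;N;;;;;
270F;PENCIL;So;0;ON;;;;;N;;;;;
"""


def find_by_name(lines, keyword):
    """Yield (code point, name) pairs whose name contains keyword, ignoring case."""
    keyword = keyword.upper()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        # Skip blank or malformed lines that lack a name field.
        if len(fields) > 1 and keyword in fields[1].upper():
            yield fields[0], fields[1]


for code, name in find_by_name(SAMPLE.splitlines(), "dagger"):
    print("U+" + code, name)
```

To search the real database, you could pass an open file object for UnicodeData-3.2.0.txt in place of SAMPLE.splitlines(), since the function accepts any iterable of lines.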
Help from the library
Your programming language of choice might provide convenient tools for accessing the Unicode database, so that you do not need to parse it yourself. The unicodedata module in the Python standard library provides a thin layer on the UnicodeData-3.2.0.txt file from Unicode 3.2. The following interactive Python session demonstrates a few queries.
>>> import unicodedata
>>> print unicodedata.name(u'5')
DIGIT FIVE
>>> print unicodedata.lookup('DIGIT FIVE') #The lookup is case-insensitive
5
>>> print unicodedata.digit(u'5')
5
>>> #There are many ways in the world for writing the digit five
>>> print unicodedata.digit(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE'))
5
>>> #Get the code point in decimal and hex ...
>>> print ord(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE'))
1637
>>> print hex(ord(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE')))
0x665
>>>
In Python you indicate a Unicode string by prepending a u to the string literal, as you can see in the second line of the session.
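The session above uses the print statement of the Python 2 series. As a rough sketch for readers on Python 3 (an assumption, since the article predates it), the same queries look like the following. In Python 3 every string is a Unicode string, so the u prefix is optional, and the bundled database tracks a newer Unicode version than 3.2. The category function, also part of the unicodedata module, returns the general category abbreviation described earlier.

```python
import unicodedata

# Look a character up by name, then query its properties.
five = unicodedata.lookup("ARABIC-INDIC DIGIT FIVE")
print(unicodedata.name(five))          # the character name
print(unicodedata.category(five))      # general category: Nd
print(unicodedata.digit(five))         # digit value: 5
print("U+{:04X}".format(ord(five)))    # conventional identifier: U+0665

# General categories of two characters seen earlier in the article.
print(unicodedata.category("A"))       # Lu
print(unicodedata.category("\u2020"))  # Po (DAGGER)
```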
Wrap up
Once you are familiar with the UCD, you can use it in all sorts of advanced ways: sorting values in XML files in internationally sound ways, normalizing data in XML files for easier comparison and digital signing, and much more. I use the database a lot because I'm not good at remembering the code points for some uncommon characters I use in XML files, such as special bullet points and international symbols. With the UCD, the wealth of the world's writing systems is right at your fingertips.
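As a small taste of the normalization use case, here is a sketch that assumes Python 3 and its unicodedata.normalize function. It shows why two strings that render identically can fail a plain comparison, and how normalizing to a common form (NFC here, chosen only for illustration) makes them compare equal.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

# The raw strings differ even though they display the same way.
print(composed == decomposed)

# After normalizing to NFC, the decomposed sequence matches the composed form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)

# The character names come straight from the Unicode database.
print([unicodedata.name(ch) for ch in decomposed])
```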
Resources Learn
Get products and technologies
About the author
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.