Level: Introductory
Hernan Silberman, Writer, Freelance
03 Jan 2007
One key benefit of XML is the fact that it was designed for international use. But do you really understand the concepts of internationalization and localization? This article explains what they are, how they work, and why you want to use them.
Why internationalization?
Economists have reported for a long while now that the world is indeed flat. What this assertion attempts to highlight is the increasing amount of political, cultural, environmental and, of course, economic interconnectivity of the world's population. The vast world of the past has yielded to the global village of today.
Technology has played a lead role in this transformation, delivering easy and inexpensive options for individuals and organizations to communicate and collaborate across short and vast distances. Such collaboration can be complicated by the presence of different spoken languages, cultural norms, and legal realities.
It's easy to forget that the World Wide Web is worldwide. Most of the people in the world do not read or speak English and the most popular language in the world, Mandarin, doesn't even rely on the Roman character set that is familiar to speakers of English or European languages. As the world gets flatter and smaller, it becomes riskier to ignore the worldwide audience that exists for your software. It's necessary to understand how to build applications that you can easily adapt to work in multiple geographic locations and across different languages and cultures.
XML has become a trusted technology to represent and transmit information of all kinds. XML's designers were visionary in making XML flexible enough to support multiple languages and character encodings, features that make it especially suited for applications that work in multiple locales. Today, XML is the fundamental technology driving the internationalization of applications in the increasingly flat world.
Internationalization and localization
Automobile manufacturers are quite experienced at internationalizing and localizing their products. Depending on the locale, drivers will sit on the right or left side of the vehicle, will see their speed in kilometers or miles per hour, and will find a vehicle's owner's manual in one of a number of possible languages. In addition, safety and environmental laws can differ in each locale and automobile manufacturers have to adapt their cars to sell legally and to appeal to as many drivers as possible in each locale.
The process of modifying a product in this manner is often described as internationalization and localization. Understanding both will help you apply the very same techniques to your applications using XML.
Internationalization
Internationalization is a design and engineering approach that anticipates the need to adapt a product for different regions. Internationalization, sometimes abbreviated to i18n ('i' followed by 18 other letters followed by 'n'), is a design-time activity where the process of adapting a product to different target languages and cultures is thought through and all aspects of the product that need to be adapted or localized are identified. i18n influences the design of a product and its production processes in ways which simplify the localization effort required to adapt it for the various regions where it will be sold.
The usual deliverable for an i18n effort is a set of instructions and validation tools that can be used to adapt the base product to a target population in a specific locale. Usually, this set of instructions, referred to as a localization kit, is used by translators to assist them in identifying the language portions of the product that need translation. Localization engineers would also use these instructions to configure the product so that it operates as required by the target locale. This, of course, assumes that the product was built in such a way that it's configurable for all target locales, a key goal of the internationalization process. Getting back to the real-world example, an i18n process for an automobile would surely provide tools allowing engineers to choose which side the driver should be on, what unit of measure the speedometer uses, and in which language the owner's manual is printed.
Experience has shown that internationalization saves lots of time and money by anticipating the localization efforts that your software will require and accommodating them during the design phase of a project. Waiting until a product is finished before considering how it will work in other locales is a common source of rollout delays. As you will see, adapting products for new locales is often a complicated process involving many people and it should be considered throughout the development of a software product.
Localization
Localization is the act of specializing a product for a specific population. Localization is often abbreviated to l10n ('l' followed by 10 other letters followed by 'n'). Localization involves the translating of natural language used in a product and its documentation, configuration of a product for a particular set of cultural norms, and any other configuration aimed at tailoring a product to a set of legal and environmental norms.
Marching instructions for localizing a product to a new region are often simply stated: "Make it marketable in Sweden" or "We need to sell this in Mexico". The localization process itself can be fairly straightforward if a corresponding i18n process was in place, or can be a complete nightmare if the software that needs to be localized has no support for localization.
As an example, imagine that you have to dub an American film for a Spanish audience. If you're given a single audio track that includes all of the English dialogue mixed with all of the other sounds heard in the film, you will have a really difficult time separating the two to produce a Spanish version of the film. If the filmmakers anticipated a dubbed version, they might have delivered the voices on one audio track and everything else on another, making your job much easier. Such anticipation is another example of an i18n effort and shows how a bit of planning can save lots of time and frustration when it comes to actually localizing a product.
Most software developers have had some experience with localization. At a minimum, you might have had to make your software work across different time zones or maybe you've externalized textual messages in Java properties files (or something similar). These are important techniques for localizing a product and their disciplined use as part of an overall internationalization and localization effort is the best way to add predictability to the otherwise daunting process of adapting your application for multiple locales.
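As a rough illustration of externalized messages, the sketch below keeps user-visible strings in per-locale catalogs and falls back to a default locale when a translation is missing. The MESSAGES table, its contents, and the translate helper are hypothetical names invented for this example; a real application would more likely load such catalogs from properties files or a similar external resource.

```python
# Hypothetical message catalogs keyed by locale. In practice these would be
# loaded from external resource files rather than defined in code.
MESSAGES = {
    "en": {"greeting": "Welcome, {name}!", "farewell": "Goodbye"},
    "es": {"greeting": "¡Bienvenido, {name}!"},
}

DEFAULT_LOCALE = "en"


def translate(key, locale, **params):
    """Look up a message for a locale, falling back to the default locale."""
    catalog = MESSAGES.get(locale, {})
    if key in catalog:
        template = catalog[key]
    else:
        template = MESSAGES[DEFAULT_LOCALE][key]
    return template.format(**params)


if __name__ == "__main__":
    print(translate("greeting", "es", name="Ana"))
    print(translate("farewell", "es"))  # falls back to the English text
```

Because the code never embeds display text directly, adding a locale in this sketch means adding a catalog, not changing program logic.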
XML
XML was built to support internationalization and localization. By using ISO-10646/Unicode, XML supports text in multiple languages, which might read from right to left or left to right, might employ their own rules for whitespace, line wrapping, and composite characters, and might require additional locale-specific adaptations. XML also supports various text encodings, leaving it up to the content author to declare the encoding used in each document. While UTF-8 is the recommended encoding, others are possible, which keeps XML from being tied to any particular encoding and provides a clear extension point for the future. With XML, users choose the encoding that makes the most sense for their application.
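To make the role of the encoding declaration concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The make_document helper and the sample text are assumptions made for this example. It writes the same content in three encodings and lets the parser decode each one from the declaration in the prolog.

```python
import xml.etree.ElementTree as ET

TEXT = "Tack så mycket"


def make_document(encoding):
    """Build an XML document as bytes, declaring its encoding in the prolog."""
    source = f'<?xml version="1.0" encoding="{encoding}"?><quote>{TEXT}</quote>'
    return source.encode(encoding)


if __name__ == "__main__":
    for encoding in ("UTF-8", "ISO-8859-1", "UTF-16"):
        data = make_document(encoding)
        # The parser reads the declaration and decodes the bytes accordingly.
        parsed = ET.fromstring(data)
        print(encoding, len(data), parsed.text == TEXT)
```

The bytes on disk differ for each encoding, but the application sees the same Unicode text after parsing.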
The design of XML gives you one important tool which makes internationalization possible: it allows you to create a markup vocabulary that reveals intent without committing to one specific rendering approach or locale. As an example, consider the XML snippet in Listing 1.
Listing 1. An example of intention-revealing markup
<description>The Battle of New Orleans was fought in January 1815, two weeks
<emphasize>after</emphasize> the peace treaty had been signed.</description>
In Listing 1 the emphasize element prescribes that the enclosed text should be emphasized in whatever way the program that reads and displays it decides is appropriate. This is a nice and general expression of your intent, which in some locales might result in the contained text being rendered in italics and in others might result in underlined text or quoted text. By allowing you to create document types of your own that express intent like this, you are able to produce documents that are easy to localize correctly and fully for each locale.
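The following sketch shows one way a display program might decide how to render the emphasize element. The EMPHASIS_STYLES table and the render function are hypothetical, and the three styles are arbitrary choices for illustration, not rules for any real locale.

```python
import xml.etree.ElementTree as ET

LISTING_1 = (
    "<description>The Battle of New Orleans was fought in January 1815, two weeks "
    "<emphasize>after</emphasize> the peace treaty had been signed.</description>"
)

# Hypothetical rendering choices, invented for illustration only.
EMPHASIS_STYLES = {
    "italic": ("<i>", "</i>"),
    "underline": ("<u>", "</u>"),
    "quoted": ("\u201c", "\u201d"),
}


def render(xml_text, style):
    """Flatten a description to a string, decorating emphasize elements."""
    opening, closing = EMPHASIS_STYLES[style]
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "emphasize":
            parts.append(opening + (child.text or "") + closing)
        else:
            parts.append(child.text or "")
        # Text that follows the child element still belongs to the sentence.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    for style in EMPHASIS_STYLES:
        print(render(LISTING_1, style))
```

The source document stays the same in every case; only the rendering decision changes.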
Common localization issues
Localizing an application includes translating all of the natural language elements that a user will potentially see. Also important is ensuring that dates, times, and currency values are rendered in a way that makes sense for the target locale. Sometimes there are also images and layout elements to update as part of the localization process.
Language identification
XML was designed to support documents written in many different languages. For any given document, it's important to be able to identify the language the author uses, and you can do this with the lang attribute, which is defined in the XML namespace. Valid values for xml:lang include a two- or three-letter language code, optionally followed by a two-letter country code. For example, en, en-US, and en-GB are all valid language values (the rules for acceptable values are spelled out in RFC 4646).
In general, you should always use xml:lang in your XML documents to specify the language the document is written in. XML also supports documents which contain more than one language and you can use xml:lang to specify different languages for individual elements and their children as shown in Listing 2.
Listing 2. A sample document containing both English and Swedish language elements
<document xml:lang="en">
<paragraph>
<quote xml:lang="sv">Tack så mycket</quote> means thanks
in Swedish.
</paragraph>
</document>
Notice that you declare English as the language on the root element in Listing 2 and then override that declaration for a short bit of Swedish. This is the recommended way to use the xml:lang attribute, rather than specifying it on each element that contains text.
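The inheritance behavior described here can be sketched in a few lines of Python. The sample document below is a hypothetical one shaped like Listing 2, and effective_languages is a helper name invented for this example. It resolves the language in effect for each element by carrying the nearest ancestor's xml:lang value downward.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in the same shape as Listing 2.
SAMPLE = """<document xml:lang="en">
<paragraph>
<quote xml:lang="fr">Merci beaucoup</quote> means thanks a lot
in French.
</paragraph>
</document>"""


def effective_languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)


if __name__ == "__main__":
    for tag, language in effective_languages(ET.fromstring(SAMPLE)):
        print(tag, language)
```

Only the root and the quote carry the attribute, yet every element ends up with a well-defined language.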
Knowing what to translate
The most visible task involved in localizing XML documents is the translation of their natural language content from the source language to all of the target languages. There's no magic way to do this and it's typical to rely on professional translators to produce the translations for you. It's important that you have a way to clearly identify those portions of your XML documents that you want to translate and those which should not be translated.
The Internationalization Tag Set (ITS), soon to be a W3C Recommendation, provides a solution to this and other internationalization issues. ITS defines markup that you can include in your schemas and documents to highlight your intent with respect to internationalization. For example, ITS includes a translate attribute that tells a translator, or an automated tool used by a translator, whether the element's text should be translated.
Listing 3. An example showing use of its:translate
<document xmlns:its="http://www.w3.org/2005/11/its" xml:lang="en"
its:version="1.0">
<paragraph>Some text to translate.</paragraph>
<paragraph its:translate="no">Some text not to translate</paragraph>
</document>
Listing 3 shows the use of the ITS translate attribute. ITS comes with implementations in the three popular schema languages: XML DTD, XML Schema, and RELAX NG. Schema designers can use the ITS markup in their own schemas, as can content designers in their XML documents.
Marking up your XML files with the ITS translate attribute solves one of the biggest internationalization problems. ITS-aware tools can already extract the translatable text from your XML documents and output it using the XML Localization Interchange File Format (XLIFF), an intermediary format for use in other translation tools.
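As a simplified picture of what such an extraction tool does, the sketch below walks a document and collects only the text whose effective its:translate value is "yes". It is an illustration under stated assumptions, not an ITS implementation: it honors only the local its:translate attribute with simple inheritance, ignores ITS global rules, and uses a hypothetical translatable_text helper. The sample declares the ITS namespace URI that appears in Listing 6.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ITS_TRANSLATE = "{" + ITS_NS + "}translate"

SAMPLE = f"""<document xmlns:its="{ITS_NS}" its:version="1.0" xml:lang="en">
<paragraph>Some text to translate.</paragraph>
<paragraph its:translate="no">Some text not to translate</paragraph>
</document>"""


def translatable_text(element, inherited=True):
    """Collect text from elements whose effective its:translate is 'yes'."""
    flag = element.get(ITS_TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    found = []
    if translate and element.text and element.text.strip():
        found.append(element.text.strip())
    for child in element:
        found.extend(translatable_text(child, translate))
        # Text following a child belongs to the parent element.
        if translate and child.tail and child.tail.strip():
            found.append(child.tail.strip())
    return found


if __name__ == "__main__":
    print(translatable_text(ET.fromstring(SAMPLE)))
```

A real tool would go on to write the collected strings to an interchange format such as XLIFF; that step is omitted here.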
Facilitating translation with localization notes
A translator is often fairly disconnected from your project and domain and you might never meet them face-to-face. To get the best possible translation, it's important to provide the translator with additional information that might help them better understand the context their translation appears in. ITS provides support for such localization notes.
Listing 4. Example use of ITS localization notes
<document xmlns:its="http://www.w3.org/2005/11/its" xml:lang="en"
its:version="1.0">
<paragraph its:locNote="This is a description note to the translator.">Some
text to translate.</paragraph>
<paragraph its:locNote="This is an alert note to the translator."
its:locNoteType="alert">Some text to translate.</paragraph>
</document>
Listing 4 shows the use of the locNote attribute from the ITS namespace. This attribute provides a hint that the translator either reads directly from your XML documents or, more likely, sees displayed by a translation tool as they work. ITS supports two note types: description and alert. The default is description; alert underscores the importance of a given note so that, for example, a translation tool might highlight it or otherwise alert the translator to make sure they see it.
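A translation tool might gather these notes along the following lines. This is a minimal sketch: the sample document, its note texts, and the localization_notes helper are all invented for illustration, and only the local its:locNote and its:locNoteType attributes are considered.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
LOC_NOTE = "{" + ITS_NS + "}locNote"
LOC_NOTE_TYPE = "{" + ITS_NS + "}locNoteType"

# Hypothetical sample document with one note of each type.
SAMPLE = f"""<document xmlns:its="{ITS_NS}" its:version="1.0" xml:lang="en">
<paragraph its:locNote="Shown on the login screen.">Sign in</paragraph>
<paragraph its:locNote="Keep under 20 characters."
  its:locNoteType="alert">Forgot password?</paragraph>
</document>"""


def localization_notes(root):
    """Return (note type, note, text) tuples, alerts first."""
    notes = []
    for element in root.iter():
        note = element.get(LOC_NOTE)
        if note is not None:
            # "description" is the default type when none is given.
            note_type = element.get(LOC_NOTE_TYPE, "description")
            notes.append((note_type, note, (element.text or "").strip()))
    # Stable sort so that alert notes are listed before description notes.
    return sorted(notes, key=lambda item: item[0] != "alert")


if __name__ == "__main__":
    for note_type, note, text in localization_notes(ET.fromstring(SAMPLE)):
        print(f"[{note_type}] {text!r}: {note}")
```

Listing alerts first is just one possible way for a tool to draw the translator's attention to them.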
Dealing with dates
Handle dates and numbers carefully when you deal with multiple locales. The best strategy is to choose a neutral representation for your data and format it as appropriate when you display it to a user in a specific locale. Fortunately, the representation of date and time information in strings suitable for inclusion in XML files has already been solved in a standard way, so you don't need to invent anything new. ISO 8601 defines the formats that appear in Listing 5.
Listing 5. The ISO 8601 date format standard
Year:
YYYY (eg 1997)
Year and month:
YYYY-MM (eg 1997-07)
Complete date:
YYYY-MM-DD (eg 1997-07-16)
Complete date plus hours and minutes:
YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00)
Complete date plus hours, minutes and seconds:
YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
Complete date plus hours, minutes, seconds and a decimal fraction of a second:
YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00)
where:
YYYY = four-digit year
MM = two-digit month (01=January, etc.)
DD = two-digit day of month (01 through 31)
hh = two digits of hour (00 through 23) (am/pm NOT allowed)
mm = two digits of minute (00 through 59)
ss = two digits of second (00 through 59)
s = one or more digits representing a decimal fraction of a second
TZD = time zone designator (Z or +hh:mm or -hh:mm)
One approach you might take in your documents, then, is to represent your dates in ISO 8601 format, which lets you represent dates and times with as much precision as you need in a format that's standardized and easy to work with.
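The store-neutral, format-late strategy might look like the sketch below, which parses an ISO 8601 string with Python's standard datetime module and formats it only at display time. The DATE_PATTERNS table and the display_date helper are assumptions made for this example; a real application would take its patterns from a locale data library instead of a hand-written table.

```python
from datetime import datetime, timedelta

# Hypothetical display patterns per locale, for illustration only.
DATE_PATTERNS = {
    "en-US": "%m/%d/%Y %H:%M",
    "sv-SE": "%Y-%m-%d %H:%M",
    "es-MX": "%d/%m/%Y %H:%M",
}


def display_date(iso_text, locale):
    """Parse a neutral ISO 8601 string and format it for one locale."""
    moment = datetime.fromisoformat(iso_text)
    return moment.strftime(DATE_PATTERNS[locale])


if __name__ == "__main__":
    stored = "1997-07-16T19:20:30+01:00"  # the value kept in the XML file
    for locale in DATE_PATTERNS:
        print(locale, display_date(stored, locale))
    # The time zone designator survives parsing as a UTC offset.
    print(datetime.fromisoformat(stored).utcoffset() == timedelta(hours=1))
```

The stored value never changes; each locale simply gets its own view of it.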
Numbers and currency values are dealt with using a similar approach: pick one currency and unit of measure that your application will use natively and localize only when you present information to your users. Depending on your application, localizing such values in your XML files might be unavoidable, though generally you can produce XML files that use your application's native representations and localize them only when they're displayed in a user interface.
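A similar sketch for numeric values follows. It assumes the application stores amounts natively as US dollars in a Decimal and changes only the separators and symbol placement at display time; it performs no currency conversion. The NUMBER_FORMATS table and display_amount helper are hypothetical and cover just two locales for illustration.

```python
from decimal import Decimal

# Hypothetical separators and symbol placement, for illustration only.
NUMBER_FORMATS = {
    "en-US": {"group": ",", "decimal": ".", "pattern": "${amount}"},
    "de-DE": {"group": ".", "decimal": ",", "pattern": "{amount} $"},
}


def display_amount(amount, locale):
    """Format a natively stored amount (assumed US dollars) for one locale."""
    rules = NUMBER_FORMATS[locale]
    # Format with neutral separators first, then swap in the local ones.
    neutral = f"{amount:,.2f}"
    local = (
        neutral.replace(",", "\0")
        .replace(".", rules["decimal"])
        .replace("\0", rules["group"])
    )
    return rules["pattern"].format(amount=local)


if __name__ == "__main__":
    stored = Decimal("1234567.89")  # the value kept in the XML file
    for locale in NUMBER_FORMATS:
        print(locale, display_amount(stored, locale))
```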
Unicode and internationalization
Unicode is an important part of the internationalization process because it allows you to treat the details of working with different character sets and languages as a problem external to the real workings of your applications. As the Unicode FAQ puts it, "With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time".
Of course, for Unicode to succeed in this, it has to provide solutions to difficult language problems like bidirectionality: the fact that some languages read from left to right and others from right to left. Unicode does in fact support bidirectionality, and using the ITS dir attribute you can declare the intended directionality of a piece of text. Listing 6 shows an example from the W3C of the ITS dir attribute that changes the nested element's directionality. It's important to note that you must use the dir attribute even when you specify languages like Arabic or Hebrew, which you know read from right to left.
Listing 6. An example of the ITS dir attribute used to specify the direction of a given bit of content
<text xmlns:its="http://www.w3.org/2005/11/its" xml:lang="en"
its:version="1.0">
<body>
<par>In Arabic, the title
<quote xml:lang="ar"
its:dir="rtl">نشاط التدويل،W3C</quote>
means <quote>Internationalization Activity, W3C</quote>.
</par>
</body>
</text>
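Unicode's support for bidirectionality starts with a directional class assigned to every character, which Python exposes through the standard unicodedata module. The sketch below uses those classes to guess a string's base direction from its first strongly directional character. The base_direction helper is a hypothetical name, and this first-strong heuristic is only a small part of the full Unicode bidirectional algorithm, so explicit markup such as its:dir remains the way to state your intent.

```python
import unicodedata

# Bidirectional classes that Unicode assigns to strongly directional characters.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def base_direction(text):
    """Guess direction from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class in LTR_CLASSES:
            return "ltr"
    return None  # no strong character found, e.g. digits or punctuation only


if __name__ == "__main__":
    print(base_direction("نشاط التدويل"))
    print(base_direction("Internationalization Activity"))
    print(base_direction("123"))
```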
Summary
This article has presented the important issues of internationalization and localization. Internationalization is a design approach which anticipates the adaptation of a product to multiple different geographic regions and cultures. Localization is the act of specializing a product for a specific locale, a task that is much easier if it follows an internationalization effort. You saw that XML was designed to support international use and thrives because of its support for multiple character encodings and Unicode, and because of the xml:lang attribute, which can be used to identify the language used in a given document. Recent developments in XML internationalization include the Internationalization Tag Set (ITS), which provides standard markup for identifying the portions of a document that need to be translated, along with additional tools that enable internationalization of XML documents.
As the world shrinks, it becomes more difficult to ignore the potential users of your software who speak different languages and live in different cultures. Making your software adapt to these different locales isn't terribly difficult; it just requires a little bit of discipline throughout the software development process.
About the author
Hernan Silberman is an enterprise software consultant specializing in distributed programming using Java technology. He received a BS in Computer Science and a BA in Communications from California Polytechnic State University in Pomona, CA. He works as an enterprise systems engineer at a large entertainment company in the San Francisco Bay area.