Level: Intermediate
Robert Bradley (robert.bradley@gmail.com), Consultant, Freelance
16 Jan 2007
Localizing an application can be planned, or it can happen as a rushed afterthought. Discover techniques and tools such as gettext, XML, XSLT, and design patterns that can help when retrofitting localization into a mature product or planning for localization up front.
Assessing an application
The requirements for localizing can be as vague as "get this application ready for Germany." But even when the requirements seem detailed, you may discover things the product manager didn't consider.
For example, take the bare-bones Yahoo! RSS news reader application shown in Listing 1. When the page is first invoked, a default list of headlines is displayed, and there's a form field in which you can type your choice of news category before resubmitting the page. The application validates the category you typed and either displays an error message if the category isn't valid or displays headlines for the category you requested.
Listing 1. RSS news from Yahoo!
<?php
require_once "XML/RSS.php";
# Normalize and validate user input
$newstype = strtolower(isset($_POST['newstype']) ? $_POST['newstype'] : "tech");
$error = null;
switch($newstype) {
case 'world':
case 'tech':
break;
default:
$error = 'Invalid news type.';
break;
}
?>
<html>
<head>
<title>RSS Feed from Yahoo News</title>
</head>
<body>
<h1>RSS Feed from Yahoo News</h1>
<p>
Welcome to the Yahoo RSS news reader. To view headlines for different kinds of news
enter the news type in the text box below and then submit the form. Valid news types
are 'tech' and 'world'.
</p>
<p>
<form method = "POST">
<?php
if (isset($error)) {
echo '<br/><font color = "red">' . $error . '</font><br/>';
}
?>
<input type = "text" name = "newstype" value = "<?php echo htmlspecialchars($newstype) ?>" />
<br>
<input type = "submit">
</form>
</p>
<?php
$url = "http://rss.news.yahoo.com/rss/" . $newstype;
$rss = new XML_RSS($url);
$rss->parse();
echo "<ul>\n";
foreach ($rss->getItems() as $item) {
echo "<li><a href=\"" . $item['link'] . "\">" . $item['title'] . "</a></li>\n";
}
echo "</ul>\n";
?>
</body>
</html>
As simple as it is, this application illustrates many issues that arise during localization. It also reflects some of the pros and cons of any scripting language.
You can always get a product out the door quickly -- at a cost. The user interface (UI), business logic, and configuration, or lack thereof, are mixed together with no thought to maintaining or extending the page with new features.
In localizing this page for German, and probably other languages, several hurdles become apparent:
- You should refactor the code to separate the various application layers.
- UI text, such as headings, content, and error messages, must be extracted and translated.
- Form controls such as the Submit button must be localized.
- The application needs a way of defining the target locale.
- User input requirements and validation vary with the locale.
- Consider locale-dependent business rules, such as the URLs for Yahoo! RSS feeds; the targets and syntax vary from country to country.
- A framework to support configuration and control of application domain behavior in each locale becomes more urgent.
The effort involved in doing all this can easily meet or exceed the cost of the original application -- a bitter pill to swallow for all concerned. Your team should therefore plan for this cost, whether localization is designed in up front or retrofitted after the application is finished.
Localization terminology
If you're going to be dealing with translators in your project, it's good to make sure you understand some of the terminology you're likely to encounter in your conversations with them. Table 1 defines a few of the common terms used by linguists and translation specialists.
Table 1. Localization terminology
| Term | Meaning |
|---|---|
| Locale | A combination of language and country or region -- for example, en_US, en_GB, de_DE |
| Internationalization (I18N) | Planning, designing, and preparing applications to be localized |
| Localization (L10N) | Identifying and translating (T9N) an application's UI for a target locale |
| Externalization (string extraction) | The process of extracting program strings for translation |
Retrofitting localization
Listing 2 shows how the RSS application looks after refactoring and retrofitting it to support localization. The functional logic and localization framework are abstracted into a handler class. All that is left in the main line code is the construction of a page handler object, the HTML template, and calls to the handler's accessor methods to populate the dynamic portions of the HTML template.
Listing 2. RSS news localized
<?php
# news2.php
require_once "RSSHandler.php";
$handler = new RSSHandler;
?>
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<title><?php echo $handler->getTitle() ?></title>
</head>
<body>
<h1><?php echo $handler->getTitle() ?></h1>
<p>
<?php echo $handler->getGreeting() ?>
</p>
<?php $handler->showerror() ?>
<form method = "POST">
<input type = "text" name = "newstype"
value = "<?php echo $handler->getNewstype() ?>" />
<br/>
<input type = "submit"
value="<?php echo $handler->getSubmitbutton() ?>" />
</form>
<?php $handler->showHeadlines() ?>
</body>
</html>
A benefit of refactoring is that the code is probably easier for programmers to maintain. That said, it's arguable whether the UI is more maintainable now that it lacks linguistic context, and significant portions of the HTML remain embedded in the PHP code.
Will the PHP programmers trust a Web developer with these kinds of changes? What is the cost of having back-end developers deal with minor UI change requests? Later, I show how Extensible Stylesheet Language Transformations (XSLT) can help separate the application layers better.
The constructor
Let's start with the constructor for the handler class (see Listing 3) to see how the page has been retrofitted for localization. When the page handler is instantiated, the constructor determines the appropriate locale, retrieves configuration settings for that locale, computes all formerly static content based on the current locale of the user, then normalizes and validates any user input. (See Downloads for the source code.)
Listing 3. RSSHandler.php
public function __construct() {
    # set up locale
    putenv("LANG=de_DE");        # for the configuration
    setlocale(LC_ALL, 'de_DE');  # for gettext

    # Get locale-dependent configuration
    $this->conf = new RSSConfig;
    $this->configdecorator = new Configdecorator($this->conf);

    # set up the text extraction context
    bindtextdomain($this->conf->getTextdomainmessage(),
                   $this->conf->getTextdomainpath());
    textdomain($this->conf->getTextdomainmessage());

    # Extract the strings for translation.
    $errormsg = gettext("News type is not valid.");
    $this->title = gettext("Yahoo RSS News");
    $this->submitbutton = gettext("Get the news.");
    $lgreeting =
        gettext("Welcome to the Yahoo RSS news reader. ") . " " .
        gettext("Enter a news type, then submit the form.") . " " .
        gettext("Valid news types are: %s.");
    $decoratedtypes = $this->configdecorator->getNewsTypes();
    $this->greeting = sprintf($lgreeting, $decoratedtypes);

    # Normalize and validate
    $this->newstype = $this->normalize();
    $this->error = $this->validate() ? '' : $errormsg;
}
In this example, the locale is hardcoded to de_DE for the sake of simplicity. In a real-life scenario, you would need to set the locale based on business requirements. Two common ways of determining the right locale are to inspect the request URL (for example, www.ebay.de, amazon.fr) or simply to ask the user and store the preference in a cookie in the user's browser.
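The two strategies can be combined. The following is a minimal sketch, not part of the sample application: the function name chooseLocale(), the cookie name "locale", and the mapping from top-level domain to locale are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch: prefer a stored user preference, then the request
// host name, then a default. None of these names come from the sample code.

function chooseLocale(array $cookies, string $host, array $supported, string $default): string
{
    // 1. An explicit user preference wins, but only if we support that locale.
    if (isset($cookies['locale']) && in_array($cookies['locale'], $supported, true)) {
        return $cookies['locale'];
    }

    // 2. Otherwise infer the locale from the top-level domain of the host.
    //    This mapping is an illustrative assumption.
    $byTld = array('de' => 'de_DE', 'fr' => 'fr_FR', 'com' => 'en_US');
    $host = preg_replace('/:\d+$/', '', strtolower($host)); // drop any port
    $pos = strrpos($host, '.');
    $tld = $pos === false ? '' : substr($host, $pos + 1);
    if (isset($byTld[$tld]) && in_array($byTld[$tld], $supported, true)) {
        return $byTld[$tld];
    }

    // 3. Nothing matched: use the default locale.
    return $default;
}

$supported = array('en_US', 'de_DE');
echo chooseLocale(array(), 'news.example.de', $supported, 'en_US'), "\n"; // prints de_DE
```

In a real page, you would probably pass $_COOKIE and $_SERVER['HTTP_HOST'] to such a function and hand the result to putenv() and setlocale(), as the constructor in Listing 3 does with its hardcoded value.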
Retrieving the configuration parameters
After the locale is determined, the handler retrieves configuration parameters for Germany. In this example, configuration involves using a PEAR package (Config.php), an XML configuration file (RSSConfig.xml), an application domain-specific wrapper (RSSConfig.php), and a simple decorator class to add HTML markup for some of the configuration parameters (Configdecorator.php).
The configuration file, shown in Listing 4, stores its information in XML format. It specifies the context for string extraction (//conf/textdomain) -- the location of the locale-dependent components and the name of the message file.
Note: If you download and try to run the samples, be sure to change the path to your extracted and translated strings to fit with your environment.
The configuration file also tells the application how to construct a correct Yahoo! newsfeed URL (//conf/de_DE/baseurl), how to validate the choices the user makes for newsfeed types (//conf/de_DE/newstypes), and which type of news should be the default if the user makes no choice.
Listing 4. The configuration file, RSSConfig.xml
<?xml version="1.0" encoding="UTF-8"?>
<conf>
<textdomain>
<path>/home/bbradley/Sites</path>
<messagefile>messages</messagefile>
</textdomain>
<en_US>
<baseurl>http://rss.news.yahoo.com/rss/</baseurl>
<newstypes>
<type>
<default>true</default>
<name>tech</name>
<url>tech</url>
</type>
<type>
<default/>
<name>world</name>
<url>world</url>
</type>
</newstypes>
</en_US>
<de_DE>
<baseurl>http://de.news.yahoo.com/</baseurl>
<newstypes>
<type>
<default>true</default>
<name>Ausland</name>
<url>politik/ausland.html.xml</url>
</type>
<type>
<name>Technik</name>
<url>technik/index.html.xml</url>
</type>
</newstypes>
</de_DE>
</conf>
You could make the configuration more robust by adding a "default" locale. Then, if the application tries to render an invalid locale, some reasonable application behavior could always be guaranteed. You could also extend the configuration to accommodate other application modalities, such as co-branding or application restrictions based on user privileges.
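As a rough illustration of the "default" locale idea, the sketch below reads a trimmed copy of the configuration with SimpleXML instead of the PEAR Config package. The `<default>` element directly under `<conf>` and the helper names localeSection() and feedUrl() are assumptions; they are not part of the sample's RSSConfig class.

```php
<?php
// Sketch only: a <default> element names the locale to fall back to.
$xml = <<<'XML'
<conf>
  <default>en_US</default>
  <en_US>
    <baseurl>http://rss.news.yahoo.com/rss/</baseurl>
    <newstypes>
      <type><default>true</default><name>tech</name><url>tech</url></type>
      <type><name>world</name><url>world</url></type>
    </newstypes>
  </en_US>
  <de_DE>
    <baseurl>http://de.news.yahoo.com/</baseurl>
    <newstypes>
      <type><default>true</default><name>Ausland</name><url>politik/ausland.html.xml</url></type>
      <type><name>Technik</name><url>technik/index.html.xml</url></type>
    </newstypes>
  </de_DE>
</conf>
XML;

// Return the section for $locale, or the default locale's section when
// $locale is not configured.
function localeSection(SimpleXMLElement $conf, string $locale): SimpleXMLElement
{
    if (isset($conf->{$locale})) {
        return $conf->{$locale};
    }
    $fallback = (string) $conf->default;
    return $conf->{$fallback};
}

// Build the feed URL for the user's choice. An empty choice selects the
// default news type; an unknown choice yields null (a validation failure).
function feedUrl(SimpleXMLElement $section, string $choice): ?string
{
    foreach ($section->newstypes->type as $type) {
        $isDefault = ((string) $type->default) === 'true';
        $matches = strcasecmp((string) $type->name, $choice) === 0;
        if ($matches || ($choice === '' && $isDefault)) {
            return (string) $section->baseurl . (string) $type->url;
        }
    }
    return null;
}

$conf = simplexml_load_string($xml);
// fr_FR is not configured, so the en_US section and its default type apply.
echo feedUrl(localeSection($conf, 'fr_FR'), ''), "\n";
```

Keeping the fallback in the configuration file rather than in code means a new locale can be added, or the default changed, without touching the handler.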
Most of the rest of the handler object involves deriving the textual portions of the page appropriately for each locale and providing public accessor methods that make these bits of translated text available to the HTML template.
The gettext extraction framework
The application uses an open source string extraction framework called gettext to externalize the strings and retrieve their translations at runtime. This framework has several virtues:
- It is free.
- It runs on a variety of platforms.
- It works with many programming languages, including PHP.
- It is particularly suited to retrofitting efforts.
To use gettext:
- Create a directory structure for all locales you must support.
- Mark up strings by moving them into arguments to the special function gettext().
- Run the xgettext command-line utility to extract the strings into a messages file.
- Copy the resulting messages file into each locale-specific directory.
- Edit the local versions of the messages file and translate the text.
- Run the msgfmt command-line utility to create the localized message databases used at runtime.
It's important to realize that the gettext() function serves the obvious purpose of retrieving localized text at runtime. But equally important is the fact that it is also a marker for text extraction. The xgettext utility looks for strings that this function marks.
All the software components for the sample application, including the directory structure defined in the application configuration, appear in Listing 5. Every gettext implementation requires a directory structure similar to this. You don't need to create a directory tree or message database for the default locale; the strings for that locale are already embedded in the code. But you will need to edit the RSSConfig.xml file to correctly identify the full path to these files for the gettext framework.
Listing 5. Directory structure
ConfigDecorator.php         -- Decorator class
de_DE                       -- Directory for Germany
    LC_MESSAGES             -- Messages sub-directory
        messages.mo         -- Compiled messages database
        messages.po         -- Translated messages
messages.po                 -- Generated by xgettext
news2.php                   -- Main application / HTML template
RSSConfig.php               -- Application configuration class
RSSConfig.xml               -- Application configuration file
RSSHandler.php              -- HTTP request handler / business logic
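The layout in Listing 5 follows the rule gettext uses to find a compiled catalog: the directory passed to bindtextdomain(), then the locale name, then LC_MESSAGES, then the domain name with an .mo extension. The helper below is a hypothetical sketch (catalogPath() is not part of the sample code) that you could use to check that the path in RSSConfig.xml matches your environment. Some gettext implementations may also try variants of the locale name, which this sketch ignores.

```php
<?php
// Sketch: compute where gettext expects the compiled catalog for a locale.
// The base path and domain mirror the sample configuration in Listing 4.

function catalogPath(string $basePath, string $locale, string $domain): string
{
    return rtrim($basePath, '/') . '/' . $locale . '/LC_MESSAGES/' . $domain . '.mo';
}

$path = catalogPath('/home/bbradley/Sites', 'de_DE', 'messages');
echo $path, "\n";

if (!is_file($path)) {
    // Without a catalog, gettext() returns the original string unchanged.
    echo "No catalog found; the page would show the untranslated source strings.\n";
}
```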
Translating the message file
In the German message file, the translator fills in the blanks: the character set in the header and the msgstr value for each message. When the translator first opens the message file, those parts are not yet filled in. The translator must specify the character encoding he or she will use -- in this case, Latin-1 (ISO-8859-1) -- and supply the translations. The message ID (msgid) is the original text that was marked for extraction with a call to the gettext() function. The message string (msgstr) is the string as it must appear in the target language. Listing 6 shows the translation for the error message when the user specifies an incorrect news type.
Listing 6. The German message file, messages.po.de
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2006-11-05 11:05-0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ISO-8859-1\n"
"Content-Transfer-Encoding: 8bit\n"
#: RSSHandler.php:31
msgid "News type is not valid."
msgstr "Die Nachrichtkategorie ist falsch."
If the text of any of the extracted strings ever changes, a new message ID will be created the next time a message file is generated. Every time strings are added or changed, the message file should be regenerated, or the new text will appear untranslated in the target locales.
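To see why a reworded string shows up untranslated, it helps to remember that the message ID is the lookup key. The sketch below only imitates that behavior: a plain PHP array stands in for the compiled catalog, and lookup() is a hypothetical helper, not a gettext function.

```php
<?php
// Sketch only: an array stands in for the compiled message catalog.
// Real catalogs are binary .mo files; the lookup rule is what matters here.

$catalog = array(
    'News type is not valid.' => 'Die Nachrichtkategorie ist falsch.',
);

// Mimic gettext(): return the translation for $msgid, or $msgid itself
// when the catalog has no entry for it.
function lookup(array $catalog, string $msgid): string
{
    return isset($catalog[$msgid]) ? $catalog[$msgid] : $msgid;
}

echo lookup($catalog, 'News type is not valid.'), "\n";      // found: German text
echo lookup($catalog, 'The news type is not valid.'), "\n";  // reworded: no match
```

Even a one-word change to the source string produces a different key, so the second call falls back to the English text until the message file is regenerated and translated again.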
In a production environment, the L10N team typically performs all these tasks (other than preparing the source code). But this doesn't mean that you can be totally insensitive to linguistic issues. Keep the following points in mind when localizing an application:
- Do not embed functional information in the strings.
- Use printf tokens for dynamic content in a full sentence rather than breaking up the sentence into multiple strings. Bear in mind that word order differs from language to language; you do not want to take away the translator's ability to control word order. If your string contains more than one dynamic element, printf-style format strings support a positional syntax (for example, %1$s and %2$s) that lets the translator specify the position of each dynamic element in the string, because the order of the elements can easily change during translation. You can learn more about the positional syntax in any of the online manuals for gettext (see Resources).
- Ensure that the strings have enough context to be meaningful. Keep in mind that translators will only see the strings -- not the running application.
Note: See Resources for more advice on the linguistic aspects of localizing software.
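Here is a small sketch of the positional syntax using PHP's sprintf(). Both format strings are made up for illustration; in the application, the German string would come back from gettext() rather than being written in the code.

```php
<?php
// Sketch: the same two arguments, placed in a different order per language.
// Single quotes keep PHP from treating "$s" and "$d" as variables.

$english = 'Showing %1$d headlines for %2$s.';
$german  = 'In der Kategorie %2$s gibt es %1$d Schlagzeilen.';

echo sprintf($english, 5, 'tech'), "\n";
echo sprintf($german, 5, 'Technik'), "\n";
```

The calling code passes the arguments in one fixed order, and the translator decides where each one lands in the sentence.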
When you add new text or reword existing text, the localization team can use other gettext utilities, such as msgmerge, to merge the new or changed strings into the existing messages files. Existing translations are preserved, so only the new or changed strings need to be translated.
Designing for localization
Approaching localization as outlined above is likely good enough in many cases. But, as the scope increases, it makes sense to consider designing the application using design patterns. One of the most widely used patterns is Model-View-Controller (MVC). This pattern encourages you to treat an application in layers -- presentation, domain, data access. Another nice thing about the MVC pattern is that it makes it easier to extend the controller and view to handle other types of HTTP requests either as a SOAP service or as an XML API.
The RSS MVC pattern
Struts, Ruby on Rails, and the Zend Framework all embrace this layered design philosophy. If you find the MVC approach attractive, investigate the Zend Framework for PHP. But, to reduce the number of extra packages the sample application requires, I roll my own MVC pattern, shown in Listing 7.
The main line code is reduced even more in size now. It functions as a controller that delegates most of the work done to service the HTTP request. It constructs a domain object that imposes the same business rules already discussed, retrieves a model of the session data from the domain, then hands that model off to a view class to transform it into HTML, which it sends back to the client.
Listing 7. The RSS news controller, news3.php
<?php
# news3.php
require_once "View.php";
require_once "RSSDomain.php";
putenv("LANG=de_DE");
$domain = new RSSDomain;
$model = $domain->getModel();
$view = new View($domain->getTemplate(), $model);
echo $view->asHTML();
?>
The same application configuration scheme devised previously still controls the business rule behavior. What has changed is the way the new presentation layer handles the UI. This layer no longer uses PHP's HTML template support. Instead, it relies on XSL transformations. The domain model is a recursive data structure of nested objects that can be serialized as XML. The view causes an XSL template to transform the XML representation of the session data into localized HTML returned to the client browser.
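To make the idea of a recursive model serialized as XML concrete, here is a minimal sketch that uses a nested array and the DOM extension. It is an assumption-laden stand-in: the sample's model is built from objects, and the function name appendModel() and the placeholder URLs are invented for this example.

```php
<?php
// Sketch: walk a nested array recursively and emit one XML element per key.
// A list (sequential numeric keys) repeats the element once per entry.

function appendModel(DOMDocument $doc, DOMNode $parent, array $data): void
{
    foreach ($data as $name => $value) {
        if (is_array($value) && array_keys($value) === range(0, count($value) - 1)) {
            foreach ($value as $entry) {
                appendModel($doc, $parent, array($name => $entry));
            }
        } elseif (is_array($value)) {
            $child = $parent->appendChild($doc->createElement($name));
            appendModel($doc, $child, $value);
        } else {
            $child = $parent->appendChild($doc->createElement($name));
            // createTextNode() takes care of escaping characters such as "&".
            $child->appendChild($doc->createTextNode((string) $value));
        }
    }
}

$model = array(
    'userinput' => array('newstype' => 'Ausland'),
    'headlinelist' => array(
        'headline' => array(
            array('hlink' => 'http://example.com/a.html', 'htitle' => 'Titel A & B'),
            array('hlink' => 'http://example.com/b.html', 'htitle' => 'Titel C'),
        ),
    ),
);

$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('rss'));
appendModel($doc, $root, $model);
echo $doc->saveXML();
```

Building the XML through the DOM rather than by string concatenation keeps the output well formed even when headlines contain markup characters.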
The presentation layer
The XML model and XSL template forming the core of the presentation layer are shown in Listing 8. The model contains configuration information, such as the news types and how these choices can be displayed to the user (//rss/newstypes/displaytypes). The model also returns the user's news type choice (//rss/userinput/newstype) and the retrieved news headlines and URLs (//rss/headlinelist/headline). If the user's choice of news type is invalid, the model returns an error code (//rss/userinput/error/code) rather than the headlines. The XSL template accesses portions of the returned data model to construct the HTML displayed on the client browser.
Listing 8. The sample XML model and XSL template (RSSView.xsl)
<rss>
<newstypes>
<displaytypes>Ausland, Technik</displaytypes>
<typelist>
<type>Technik</type>
<type>Ausland</type>
</typelist>
</newstypes>
<userinput>
<newstype>Ausland</newstype>
</userinput>
<headlinelist>
<headline>
<hlink>http://de.news.yahoo.com/01112006/12/guatemala.html</hlink>
<htitle>Guatemala und Venezuela halten Besprechungen</htitle>
</headline>
<headline>
<hlink>http://de.news.yahoo.com/01112006/12/mandat.html</hlink>
<htitle>Mandat einer Regierung in Afrika</htitle>
</headline>
</headlinelist>
</rss>
---------------------------------------------------------------
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:variable name="pagetitle">Yahoo Schlagzeilen</xsl:variable>
<xsl:variable name="submitbutton">Hol Schlagzeilen</xsl:variable>
<xsl:variable name="displaytypes">
<xsl:value-of select="/rss/newstypes/displaytypes"/>
</xsl:variable>
<xsl:variable name="newstype">
<xsl:value-of select="/rss/userinput/newstype"/>
</xsl:variable>
<xsl:template match="/">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>
<xsl:value-of select="$pagetitle"/>
</title>
</head>
<body>
<h1><xsl:value-of select="$pagetitle"/></h1>
<p>
Willkommen! Um den Yahoo Nachrichtleser zu gebrauchen, gib einen von den
folgenden Nachrichttypen ein:
<xsl:value-of select="/rss/newstypes/displaytypes"/>!
</p>
<!-- errors -->
<xsl:choose>
<xsl:when test="rss/userinput/error/code">
<p><b><font color="red">Falscher Typ</font></b></p>
</xsl:when>
</xsl:choose>
<p>
<form method = "POST">
<input type = "text" name = "newstype" value = "{$newstype}" />
<br/>
<input type = "submit" value="{$submitbutton}" />
</form>
</p>
<!-- headlines -->
<xsl:for-each select="/rss/headlinelist/headline">
<a><xsl:attribute name="href">
<xsl:value-of select="hlink"/>
</xsl:attribute><xsl:value-of select="htitle"/></a>
<br/>
</xsl:for-each>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
With this approach, the localized strings are not extracted by a tool like gettext. Rather, each locale has its own XSL template. Because the linguistic context is no longer abstracted, it should be easier to distribute the workload for maintaining the application. Web developers, the localization team, and even product managers can assume responsibility for implementing modest bug fixes and feature requests restricted to the UI.
The view class uses the XSL support that comes with PHP V5, the data model, and the XSL template to construct the HTML. PHP V4 handles XSLT differently; that approach is shown as a comment for those cases where it is necessary to localize an older application. The XSL extension ships with PHP V5, but you may need to enable it in your PHP installation for this all to work. Listing 9 shows the XSL transformation.
Listing 9. The XSL transformation (View.php)
public function asHTML() {
    # Load the XML representation of the model
    $xml = new DOMDocument;
    $modelxml = $this->model->asXML();
    #system('echo " ' . $modelxml . ' " > /tmp/l.txt');
    $xml->loadXML($modelxml);

    # Load the locale-specific XSL template
    $xsl = new DOMDocument;
    $xsl->load($this->template);

    # Transform the model into HTML
    $proc = new XSLTProcessor;
    $proc->importStylesheet($xsl);
    $html = $proc->transformToXML($xml);

    # Following is XSL transformation syntax for PHP4
    # $xh = xslt_create();
    # $arguments = array('/_xml' => $this->model->asXML(),
    #                    '/_xsl' => $this->template);
    # $html = xslt_process($xh, 'arg:/_xml', $this->template,
    #                      NULL, $arguments, $this->parameters);
    # xslt_free($xh);
    return $html;
}
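If you are not sure whether XSLT is enabled, a self-contained check like the following sketch may help. It applies a tiny inline stylesheet to a fragment of the model and returns null when the XSLTProcessor class is missing. The function name transformIfAvailable() and the inline stylesheet are illustrative assumptions, not part of View.php.

```php
<?php
// Sketch: run a minimal transformation, or report that XSLT is unavailable.

function transformIfAvailable(string $xmlSource, string $xslSource): ?string
{
    if (!class_exists('XSLTProcessor')) {
        return null; // the XSL extension is not enabled in this installation
    }
    $xml = new DOMDocument();
    $xml->loadXML($xmlSource);
    $xsl = new DOMDocument();
    $xsl->loadXML($xslSource);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    $result = $proc->transformToXml($xml);
    return $result === false ? null : $result;
}

$modelXml = '<rss><newstypes><typelist>'
          . '<type>Technik</type><type>Ausland</type>'
          . '</typelist></newstypes></rss>';

$template = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <ul>
      <xsl:for-each select="/rss/newstypes/typelist/type">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
XSL;

$html = transformIfAvailable($modelXml, $template);
echo $html === null ? "XSLT is not available.\n" : $html;
```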
Where larger teams are involved, it might still make sense to extract the strings and generate the XSL templates from a master template. This is typically performed by a globalization management system (GMS). Then Web developers work only on the default XSL templates in the preferred language of the enterprise, and the translators use the facilities of the management system, and possibly computer-aided translation systems like SYSTRAN, to do their work.
After development ends and it's time for a release to production, a typical localization process might look like this:
- The GMS extracts all text from the core XSL templates and updates a translation database.
- The L10N team uses the translation database to translate new strings and retranslate changed strings.
- The GMS reconstructs localized XSL templates for each locale from the translation database.
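The extraction step can be approximated in a few lines of PHP. This is only a sketch of the idea, under the assumption that every non-blank text node in a template is translatable; a real GMS also handles attributes, context, and change tracking. The function name extractStrings() and the shortened template are made up for this example.

```php
<?php
// Sketch: collect the translatable text nodes from an XSL template.

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:variable name="pagetitle">Yahoo Schlagzeilen</xsl:variable>
  <xsl:template match="/">
    <html><body>
      <h1><xsl:value-of select="$pagetitle"/></h1>
      <p><b>Falscher Typ</b></p>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
XSL;

function extractStrings(string $xslSource): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xslSource);
    $xpath = new DOMXPath($doc);

    $strings = array();
    // Whitespace-only nodes are layout, not content, so skip them.
    foreach ($xpath->query('//text()[normalize-space() != ""]') as $node) {
        // Collapse runs of whitespace so line wrapping does not create new strings.
        $strings[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    return array_values(array_unique($strings));
}

print_r(extractStrings($xslSource));
```

The resulting list is what would be loaded into the translation database; the reverse step substitutes the translated strings back into a copy of the template for each locale.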
While there is benefit to using a GMS, it may not make sense to do that with a small team or a mature application that is not changing much. As is true of all the tools mentioned, you should understand both the benefits and the costs to acquire, implement, and support what you decide to use.
Conclusion
A short article like this cannot cover every challenge you're likely to encounter. For example, I haven't discussed how to deal with character encodings, dates, currency, numbers in general, addresses, and telephone numbers. In most cases, you must properly format and display dates and currency for a locale. You must also understand how to validate and interpret currency amounts and dates when an international user types them on a form.
But the techniques presented will allow you to do 80 percent of what needs to be done to localize an application. If you're working on a pre-existing application and there are time and budget constraints, focus on a less-ambitious retrofitting approach. If you're developing a new application or have ample resources to localize an existing application, by all means, don't be afraid to be more aggressive if it is warranted. Not only will you end up with a nicely localized product but you will make the work of application maintainers much easier down the road regardless of the bug to be fixed or the feature to be added.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample scripts (1) | os-php-intl.zip | 12KB | HTTP |
Note 1: The archive has complete source code minus the following prerequisite PEAR modules: XML_RSS and Config. You must also make sure that XSLT has been enabled on your PHP installation.
Resources
Get products and technologies
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
Discuss
- The developerWorks PHP Developer Forum provides a place for all PHP developer discussion topics. Post your questions about PHP scripts, functions, syntax, variables, PHP debugging and any other topic of relevance to PHP developers.
- Get involved in the developerWorks community by participating in developerWorks blogs.
About the author
With more than 20 years of software engineering experience, Robert Bradley is semi-retired and spends his time writing and consulting. He used his rusty German for the sample translations.