Internationalization road hazards

Avoid subtle hindrances to software globalization

Level: Intermediate

Taylor Cowan (taylor_cowan@yahoo.com), Senior Developer, Travelocity

16 Aug 2005

Support in the Java™ language for multilingual and multicountry environments is strong, but it's not foolproof. If you're not careful, mistaken assumptions in three key areas can make their way into your code and cause it to be U.S.-centric. This article identifies these internationalization gotchas and gives you some techniques to help your applications become more usable across the globe.

Don't let the inherent locale support in the JDK fool you into letting your guard down. Even though the Java language is full of localization features, your applications can still become U.S.-centric. Many internationalization problems stem from invalid assumptions that developers make about free-text user input, currency display, and date/time parsing. This article will show you how these assumptions can trip you up, and then help you put your applications on the road to better usability worldwide.

Internationalization hazard #1

Never assume text-field input will always be in US-ASCII. Even if your application is strictly for one locale, your users should be allowed to enter a wider range of text. Consider the case of someone's legal name containing characters outside the application's default language.

Error: Your input of “España” may only contain letters

Java developers are familiar with resource bundling but often overlook reading input as an internationalization-sensitive aspect of applications. To be truly international, your applications should be able to accept input in various languages and character sets. Never assume text-field input will always be in US-ASCII.

Internationalization-savvy regular expressions

Since JDK 1.4, the Java language has provided long-overdue regular expression support. Regular expressions used for input validation have found their way into many common frameworks, such as Struts, and are now supported directly in java.lang.String. But the same power that makes pattern matching and input validation simple can also inhibit internationalization. Consider this common regular expression:

/[a-zA-Z0-9 ]*/

The expression is clearly an alphanumeric mask intended to keep out special characters, which in turn can protect your application from unexpected input. But this kind of strict matching has unintended consequences. Not only does this regular expression reject unwanted symbols, it also rejects many words that contain characters outside the basic Latin alphabet. For example, it rejects many proper nouns in their native spellings, such as España (Spain) or München (Munich). Surprisingly, it even rejects the name of Washington D.C.'s planner, Pierre L'Enfant, because it doesn't allow the apostrophe. International applications need broad input masks, not narrow ones. Because of their ASCII limitation, traditional regular expressions tend to work against internationalization, as in the following example:

if (inputString.matches("\\w*"))

The standard expression symbol \w (word character) is nearly equivalent to the first example: by default it matches only [a-zA-Z_0-9], which adds the underscore and drops the space. In this case, word really means English words only. Support for international input requires going beyond the standard regexp match specifiers.

Unicode support in regular expressions

Unicode Technical Standard #18 defines a standard for Unicode regular expressions (see Resources). Support for Unicode is challenging for two reasons. First, it has a much larger character set than US-ASCII. Second, many of the supported languages have different characteristics from English. Since JDK 1.4, the Java language has supported this specification at Level 1 (basic Unicode support).

When I first encountered this problem I was happy to find that it was already well known and was being addressed. The two ways to specify broader matches are Unicode character blocks and Unicode categories. You specify them as \p{block | category}, the same syntax that java.util.regex uses for POSIX character classes such as \p{Alpha}, which match US-ASCII only. For instance, \p{L} matches any Unicode letter. In this case letter has a much broader sense and includes Latin characters as well as Japanese Katakana, Korean Hangul, and many more character sets. Table 1 shows some examples of Unicode regular-expression categories.

Table 1. Unicode regexp category examples

Expression    Matches
\p{Lu}        Uppercase letter
\p{Ll}        Lowercase letter
\p{P}         Punctuation

Categories are good for general case matches, but if you need to be more specific you can make use of character blocks. They let you explicitly include or reject characters in certain regions of Unicode. Table 2 shows some examples of Unicode character blocks.

Table 2. Unicode character block examples

Expression                                     Matches
\p{InKatakana}                                 Any Katakana character
[\p{InBasic Latin}\p{InLatin-1 Supplement}]    Basic and supplemental Latin characters

You must specify character blocks with the correct block names. Unfortunately, the JDK doesn't define any convenient constants, and the javadoc doesn't itemize a list of all the possibilities. The block names are taken from the Unicode standard and are listed in a file on the Unicode site (see Resources).
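One way to find out which block names a given piece of sample text needs is to ask the JDK itself. The following is a minimal sketch, assuming Java 5 or later; the class name BlockNames and the helper blocksOf are hypothetical and not part of the article's sample code. It relies on Character.UnicodeBlock.of, which reports the block a code point belongs to.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class; the names are illustrative only.
class BlockNames {

    // Returns the Unicode blocks that the characters of s fall into,
    // in order of first appearance.
    static Set<Character.UnicodeBlock> blocksOf(String s) {
        Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            // of() returns null for code points outside any known block
            blocks.add(Character.UnicodeBlock.of(codePoint));
            i += Character.charCount(codePoint);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with diaeresis, as in the German spelling of Munich
        System.out.println(blocksOf("M\u00fcnchen")); // [BASIC_LATIN, LATIN_1_SUPPLEMENT]
        // \u30ab is the Katakana letter KA
        System.out.println(blocksOf("\u30ab"));       // [KATAKANA]
    }
}
```

The printed constant names should also be usable after the In prefix (for example \p{InLATIN_1_SUPPLEMENT}), because the Pattern documentation states that it accepts the block names understood by Character.UnicodeBlock.forName.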

The best way to get started using Unicode regular expressions is to experiment with simple matches in different languages. The following sample code tests standard and Unicode regular expressions with text in several languages. If you want to run this example you must set the VM default encoding to UTF-8 (-Dfile.encoding=UTF-8).

// Hypothetical class name, added only so the sample compiles as one file
public class UnicodeRegexpTest
{
  public static void main(String[] args)
  {
    // category examples
    doMatch("ü", "\\p{Ll}"); // lowercase Unicode letter: matches
    doMatch("ü", "\\p{Lu}"); // uppercase Unicode letter: doesn't match

    // character block examples
    doMatch("한글", "\\p{InHangul Syllables}*"); // Korean
    doMatch("カタカナ", "\\p{InKatakana}*"); // Japanese

    // English and German spellings of Munich;
    // "München" matches only the last two expressions
    String[] s = {"Munich", "München"};
    for (int i = 0; i < s.length; i++) {
      doMatch(s[i], "[a-zA-Z0-9]*"); // explicit ASCII ranges
      doMatch(s[i], "\\w*"); // word character
      doMatch(s[i], "\\p{Alpha}*"); // POSIX alphabetic character (US-ASCII only)
      doMatch(s[i], "[\\p{InBasic Latin}\\p{InLatin-1 Supplement}]*");
      doMatch(s[i], "\\p{L}*"); // Unicode letter
    }
  }

  // Prints whether the whole string s matches the regular expression
  public static void doMatch(String s, String regexp) {
    if (s.matches(regexp))
      System.out.println(s + " matches " + regexp);
    else
      System.out.println(s + " doesn't match " + regexp);
  }
}
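Applied to the input mask from the start of this section, a broader validator might look like the following sketch. The class name NameValidator and the exact set of permitted punctuation are assumptions made for illustration; a real application would choose its own list. Non-ASCII sample text is written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the class name and the allowed punctuation
// are assumptions, not part of the article's sample code.
class NameValidator {

    // Letters and combining marks from any script, decimal digits,
    // spaces, apostrophes, periods, and hyphens.
    private static final Pattern NAME =
        Pattern.compile("[\\p{L}\\p{M}\\p{Nd} '.\\-]+");

    static boolean isValidName(String input) {
        return input != null && NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Espa\u00f1a",      // Spanish spelling of Spain
            "M\u00fcnchen",     // German spelling of Munich
            "Pierre L'Enfant",  // apostrophe is allowed
            "<script>"          // angle brackets are still rejected
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValidName(sample));
        }
    }
}
```

The mask stays strict about symbols such as angle brackets while no longer rejecting names on the basis of their script.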





Show me the money

Currency display seems trivial, yet it's often overlooked as an area to consider in globalization. Scale, decimal formatting, currency-symbol placement, and disambiguation are all factors in proper currency display.

Decimal scale

One mistaken currency assumption is that all amounts should be represented with two decimal places. $1.25 is roughly equal to 1,314.92 Korean won, but you'd never get that amount in exchange. The reason is simple. It's impossible to give someone 0.92 won because the smallest South Korean denomination is the won. Won (KRW) and Yen (JPY) are normally displayed without any decimal places. The JDK is helpful in this respect by way of the java.util.Currency class. To determine the conventional number of decimal places for a currency, use the getDefaultFractionDigits() method:

Currency c = Currency.getInstance("KRW");
int i = c.getDefaultFractionDigits();
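The fraction-digit count can then drive rounding. This is a minimal sketch; the class name CurrencyScale, the helper toCurrencyScale, and the choice of RoundingMode.HALF_EVEN are assumptions, and a real application would follow its own business rules for rounding.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;

// Hypothetical helper class; names and rounding mode are illustrative only.
class CurrencyScale {

    // Rounds an amount to the conventional number of decimal places
    // for the currency with the given ISO 4217 code.
    static BigDecimal toCurrencyScale(BigDecimal amount, String isoCode) {
        Currency currency = Currency.getInstance(isoCode);
        int digits = currency.getDefaultFractionDigits();
        if (digits < 0) {
            // Pseudo-currencies report -1; leave those amounts untouched
            return amount;
        }
        return amount.setScale(digits, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        System.out.println(toCurrencyScale(new BigDecimal("1314.92"), "KRW")); // 1315
        System.out.println(toCurrencyScale(new BigDecimal("1.25"), "USD"));    // 1.25
    }
}
```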

Grouping separators

Another mistake is to assume that a period (.) is always the decimal separator and a comma (,) is always the grouping symbol. Unlike the number of fraction digits, which depends on the currency, decimal formatting depends on the person viewing the amount. In some countries a comma marks the decimal places, and periods or spaces serve as the grouping separator, as the examples in Table 3 show.

Table 3. Currency separators

German    1.234.567,25
French    1 234 567,25
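The JDK's locale data already knows these conventions. The following sketch, with a hypothetical class name Separators, formats the same number for several viewer locales. Note that the French grouping separator is a no-break space whose exact code point depends on the JDK's locale data, so the example prints it rather than comparing it to a literal.

```java
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class; names are illustrative only.
class Separators {

    // Formats a number using the conventions of the viewer's locale
    static String format(double value, Locale viewerLocale) {
        return NumberFormat.getNumberInstance(viewerLocale).format(value);
    }

    // Returns the decimal separator used by the given locale
    static char decimalSeparator(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
    }

    public static void main(String[] args) {
        double value = 1234567.25;
        System.out.println(format(value, Locale.US));      // 1,234,567.25
        System.out.println(format(value, Locale.GERMANY)); // 1.234.567,25
        System.out.println(format(value, Locale.FRANCE));  // grouped with a no-break space
        System.out.println(decimalSeparator(Locale.FRANCE)); // ,
    }
}
```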

Using the NumberFormat class

Internationalization hazard #2

Java's NumberFormat class maps currencies to locales. This assumption is invalid for several reasons. First, a locale's official currency can change unexpectedly. Second, many times the "official" currency isn't the primary currency. Third, global applications cannot assume any single currency and instead must handle several currencies simultaneously. Avoid letting your Java code assign a default currency to your application.

The JDK provides for both decimal formatting rules and scale with the NumberFormat class. If used with care, NumberFormat can simplify currency handling (see Resources). It can also introduce new problems because it makes some extremely broad assumptions. One such assumption is made in this brief example:

DecimalFormat format = 
   (DecimalFormat)NumberFormat.getCurrencyInstance();
String amount = format.format(1.25);

What currency will the amount be formatted to? Without knowing something about the system the code is running on, it's impossible to predict what the amount variable contains. Behind the scenes NumberFormat is making the currency decision for you. It assumes a currency based on the locale. This seemingly convenient assumption is perilous because the relationship between the locale and a currency is weak at best. At any given time two currencies might be valid in a given locale. And software applications might need to work with more than one currency at a time. To remedy this problem you must apply a Currency instance to the format object:

DecimalFormat format = 
   (DecimalFormat)NumberFormat.getCurrencyInstance();
format.setCurrency(amountCurrency);
String amount = format.format(1.25);

By being explicit about the Currency type you'll avoid problems when the application is redeployed in a different locale or when the application needs to support more than one currency. This also protects your code from real-world currency changes, which can happen unexpectedly and invalidate the JDK's latest rule set for mapping locales to currencies.
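A fuller sketch of this idea passes both the viewer's locale and the amount's currency explicitly. The class name MoneyFormatter and its format method are hypothetical. One detail worth knowing: the NumberFormat.setCurrency documentation says the call does not update the number of fraction digits, so this sketch sets them from the currency as well.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Hypothetical helper class; names are illustrative only.
class MoneyFormatter {

    // Formats an amount in an explicit currency, using the separator
    // and symbol-placement conventions of the viewer's locale.
    static String format(BigDecimal amount, Currency currency, Locale viewerLocale) {
        NumberFormat format = NumberFormat.getCurrencyInstance(viewerLocale);
        format.setCurrency(currency);
        // setCurrency leaves the fraction digits at the locale's default,
        // so align them with the currency's own convention.
        int digits = currency.getDefaultFractionDigits();
        if (digits >= 0) {
            format.setMinimumFractionDigits(digits);
            format.setMaximumFractionDigits(digits);
        }
        return format.format(amount);
    }

    public static void main(String[] args) {
        // Yen shown to a U.S. viewer: no decimal places, comma grouping
        System.out.println(format(new BigDecimal("1314.92"),
            Currency.getInstance("JPY"), Locale.US));
        // U.S. dollars shown to a German viewer: comma as decimal separator
        System.out.println(format(new BigDecimal("1.25"),
            Currency.getInstance("USD"), Locale.GERMANY));
    }
}
```

The currency symbol that is printed (for example a symbol or the ISO code) depends on the JDK's locale data, so the sketch makes no promise about it.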





It's 5 o'clock somewhere

Phileas Fogg, the protagonist of Around the World in 80 Days, nearly lost his entire fortune on a mistaken assumption about time. Traveling eastward, he dutifully moved his watch forward to match local time. Upon his return to England he failed to account for this artificial aspect of local time and mistakenly believed that 80 full days had gone by. A similar hazard awaits any Java developer who isn't fully aware of the implications that time zone can have on an application.

Wall time and implicit time zones

Internationalization hazard #3

Take care to note when date/time values are relative to a physical location. If they are, avoid letting DateFormat apply a default time zone. Instead be specific. This will prevent unexpected problems when your servers are relocated.

Consider an application that notifies customers two hours before their rental cars are due back at the rental location. The logic is fairly simple: Keep a record of the drop-off time and notify the customer when it falls within the two-hour window. To compute the notification window you need two comparable dates -- the drop-off time and the current system time. Assume the user specified the desired drop-off time via drop-down or free-text entry. Either way the data must be parsed to give you a comparable instance of java.util.Date. Typically in the Java language a date is parsed using an instance of DateFormat:

// sDropOff formatted as hh:mm
Date dropOff = dateFormat.parse(sDropOff);

The parse() method makes a hidden assumption. Unless a time zone is set explicitly, sDropOff is parsed with respect to the system's default time zone. The Java language needs the time zone because it stores a Date internally as an instant relative to Greenwich Mean Time (GMT), not as a wall-clock time. This means there are at least 24 different versions of 5:00 p.m., four of them in the continental United States alone. If the drop-off location and the system are located in different time zones, your calculations will be off. DateFormat allows for an explicit time zone:

// sDropOff formatted as hh:mm
dateFormat.setTimeZone(dropOffTimeZone);
Date dropOff = dateFormat.parse(sDropOff);
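To see the size of the error, the same wall-clock text can be parsed for two locations. This is a sketch under stated assumptions: the class name DropOffParser, the pattern "yyyy-MM-dd HH:mm" (a full date plus a 24-hour time, rather than the article's time-only string), and the sample date are all illustrative.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class; names and pattern are illustrative only.
class DropOffParser {

    // Parses a wall-clock time as seen at the drop-off location,
    // never in the server's default time zone.
    static Date parse(String text, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.US);
        format.setLenient(false);
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable drop-off time: " + text, e);
        }
    }

    public static void main(String[] args) {
        Date chicago = parse("2005-08-16 17:00", "America/Chicago");
        Date newYork = parse("2005-08-16 17:00", "America/New_York");
        // 5:00 p.m. in Chicago happens one hour after 5:00 p.m. in New York
        long minutes = (chicago.getTime() - newYork.getTime()) / (60 * 1000);
        System.out.println("Difference in minutes: " + minutes); // 60
    }
}
```

A notification that is meant to go out two hours early would shrink to one hour, or grow to three, if the wrong one of these two zones were applied.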

Time-zone support in the Java language

A time zone has two main attributes. The first is its offset, either positive or negative, from GMT. The second is its daylight saving time (DST) rule set. These rules indicate if the time zone participates in DST, and if so when DST starts and ends. The rules can be extensive, depending on how far back you need to go, and they vary from state to state and country to country. To demonstrate how tricky DST rules can be, try running this code snippet:

Calendar cal = Calendar.getInstance();
cal.setTimeZone(TimeZone.getTimeZone(
  "America/Chicago"));
cal.clear();
cal.set(Calendar.YEAR, 1985);
cal.set(Calendar.MONTH, Calendar.APRIL);
cal.set(Calendar.DATE, 15);
// HOUR_OF_DAY is the 24-hour field; HOUR is the 12-hour field
cal.set(Calendar.HOUR_OF_DAY, 8);
// 8:00 a.m. on April 15, 1985, Chicago wall time
System.out.println(
  cal.getTime().toGMTString());

// same wall time, 20 years later
cal.set(Calendar.YEAR, 2005);
System.out.println(cal.getTime().toGMTString());

Notice that in 1985, 8:00 a.m. shows as 14:00 GMT, but in 2005 it's 13:00 GMT. The discrepancy was caused by Public Law 99-359, passed by the U.S. Congress in 1986, which moved the start of DST from the last Sunday in April to the first Sunday in April. As a result, April 15 fell under standard time in 1985 but under daylight saving time in 2005. Many examples of this type of DST rule change exist, and there may be more to come. The good news is that the Java language has a comprehensive database of these rules, and you can take advantage of it provided you know the names of the time zones you're dealing with.
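The rule database can also be queried directly. The following sketch asks the time zone for its total offset from GMT on the same wall-clock date in both years; the class name DstOffsets and the helper offsetHours are hypothetical.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper class; names are illustrative only.
class DstOffsets {

    // Returns the zone's offset from GMT, in whole hours, at the given
    // local date and hour. The month is a Calendar constant such as Calendar.APRIL.
    static int offsetHours(String zoneId, int year, int month, int day, int hourOfDay) {
        TimeZone zone = TimeZone.getTimeZone(zoneId);
        Calendar cal = new GregorianCalendar(zone);
        cal.clear();
        cal.set(year, month, day, hourOfDay, 0, 0);
        // getOffset includes the DST adjustment in effect at that instant
        return zone.getOffset(cal.getTimeInMillis()) / (60 * 60 * 1000);
    }

    public static void main(String[] args) {
        System.out.println(offsetHours("America/Chicago", 1985, Calendar.APRIL, 15, 8)); // -6
        System.out.println(offsetHours("America/Chicago", 2005, Calendar.APRIL, 15, 8)); // -5
    }
}
```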

Time-zone names

The source for the Java language's time-zone rules is a public-domain time-zone database (see Resources). (HP-UX, Solaris, and Mac OS X also use this database.) Valid time zones are normally named after the continent and largest city in the zone. EST (Eastern Standard Time), CST (Central Standard Time), MST (Mountain Standard Time), and PST (Pacific Standard Time) aren't valid time-zone specifiers but are supported for JDK 1.0 backward compatibility. Table 4 shows some common time-zone specifiers.

Table 4. Common time-zone specifiers

America/Chicago     Same as CST with DST rules
America/New_York    Same as EST with DST rules
Asia/Tokyo          Covers Japan
Europe/Berlin       Covers Germany
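Getting the specifier exactly right matters because TimeZone.getTimeZone does not reject an ID it cannot understand; it quietly returns GMT. A minimal guard, assuming a hypothetical class ZoneIds, checks the ID against the list the JDK reports before using it.

```java
import java.util.Arrays;
import java.util.TimeZone;

// Hypothetical helper class; names are illustrative only.
class ZoneIds {

    // True if the JDK's time-zone database lists this ID.
    // Custom IDs such as "GMT-06:00" are not in that list.
    static boolean isKnownZoneId(String id) {
        return Arrays.asList(TimeZone.getAvailableIDs()).contains(id);
    }

    // Returns the zone, or fails loudly instead of falling back to GMT
    static TimeZone requireZone(String id) {
        if (!isKnownZoneId(id)) {
            throw new IllegalArgumentException("Unknown time-zone ID: " + id);
        }
        return TimeZone.getTimeZone(id);
    }

    public static void main(String[] args) {
        System.out.println(requireZone("America/Chicago").getID()); // America/Chicago
        // A mistyped ID silently becomes GMT when passed straight to getTimeZone
        System.out.println(TimeZone.getTimeZone("America/Chicgo").getID()); // GMT
    }
}
```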





Conclusion

Familiarity with currency, time, and text globalization will help you avoid problems, but the most powerful globalization tool available to you is testing. Many of the issues I’ve discussed here can be spotted quickly if you test your applications with them in mind. Be sure to test input from multiple character blocks. (A simple way to test East Asian character sets is to copy-and-paste from a Web site.) Test your applications with the server and client in different time zones by assigning either the server or client different operating-system-level date/time properties. Finally, test that currency amounts and other numbers can be configured to display with non-U.S. conventions. Familiarity with each and every set of international conventions and characters would be helpful, but it’s not necessary. You only need to provide for the possibility of configuration when the time comes, thereby saving time and money when you want to deploy an application for a new international market.



Resources



About the author

Taylor Cowan is a software engineer and occasional freelance author specializing in J2EE. He received a Masters Degree in Computer Science, as well as a Bachelor of Music in Jazz Arranging, from the University of North Texas.



