 | Level: Intermediate Laura Werner (lwerner@us.ibm.com.), Senior Software Engineer and Project Leader, IBM
01 Apr 1999 Text searching and sorting is one of the most well researched areas in computer science. It is covered in an introductory algorithms course in nearly every engineering school, and there are entire books devoted to the subject. Why, you might ask, am I writing yet another article about searching?
The answer is that most of the well-known, efficient search algorithms don't
work very well in Unicode, which includes the char type in Java. Algorithms such
as Knuth-Morris-Pratt and Boyer-Moore utilize tables that tell them what to do
when a particular character is seen in the text being searched. That's fine
for a traditional character set such as ASCII or ISO Latin-1 where there are
only 128 or 256 possible characters.
Java, however, uses Unicode as its character set. In Unicode, there are 65,535
distinct characters that cover all modern languages of the world, including
ideographic languages such as Chinese. In general, this is good; it makes the
task of developing global applications a great deal easier. However, algorithms
like Boyer-Moore that rely on an array indexed by character codes are very
wasteful of memory and take a long time to initialize in this environment.
And it gets worse. Sorting and searching non-English text presents a number of
challenges that many English speakers are not even aware of. The primary source
of difficulty is accents, which have very different meanings in different
languages, and sometimes even within the same language:
-
Many accented letters, such as "é" in "café", are treated as minor variants on the letter that is accented, in this case "e".
-
Sometimes the accented form of a letter is treated as a distinct letter for the
purposes of comparison. For example, "Å" in Danish is treated as a
separate letter that sorts near the end of the alphabet after "Z" and "Æ".
-
In some cases, an accented letter is treated as if it were two letters.
In traditional German, for example, "ä" is compared as if it were "ae".
Other difficulties arise when multiple characters compare as if they were one,
such as in traditional Spanish where "ch" is treated as a single
letter that sorts just after "c" and before "d", or when
single characters such as the "Æ" ligature or the German "ß"are treated as if they were spelled out as "AE" or "ss".
All of the above might make the problem of sorting and searching international
text seem hopeless. While not impossible, it is a difficult problem. Do
not despair, however. Starting in JDK 1.1, the class java.text.Collator is here
to help.
Collators Made Simple
For JDK 1.1, Taligent, which has since been absorbed into IBM, contributed a
number of international support classes to the Java Class Libraries. These
classes are now maintained and enhanced by engineers at the IBM Center for Java
Technology in Cupertino, California. In combination with work done by Sun, they
provide a powerful foundation for developing truly global applications.
Among the new classes introduced in JDK 1.1 were Collator and RuleBasedCollator ,
both in the java.text package. Collator provides an abstract interface for comparing text, while RuleBasedCollator is a concrete subclass that implements the comparison using a rule-driven algorithm.
Using these classes to compare two strings is quite straightforward:
Collator c = Collator.getInstance();
if (c.compare(string1, string2) == 0) {
// Strings are the same
} |
In a real program, of course, you would only call Collator.getInstance once and
would then use the same collator object to perform as many comparisons as you
wanted. Behind the scenes, getInstance determines the current default Locale,
loads the appropriate collation rules, and constructs a Collator object that it
returns. Currently, this object is always a RuleBasedCollator . However, in the
future different locales may require different classes with customized behavior,
which is why you use a factory method to create instances of the more abstract Collator
class.
Under the Hood
Internally,
RuleBasedCollator.compare
has
an awful lot of bookkeeping to do. A byte-by-byte string comparison function like C's
strcmp
can walk though strings one character
at a time and compare the bytes, but if Collator did something that simple it would
rapidly get out of sync the first time it saw a contracting character like the Spanish
"ch" or an expanding character like "Æ". To keep track of this,
RuleBasedCollator
first translates strings into a series of collation elements, which correspond to
single entities in the input string. In English, each character in the input maps to a
collation element, but "Æ" produces two elements and the Spanish "ch"
produces just one. This translation is done by the utility class CollationElementIterator, which uses a set of mapping tables built
from the rules passed to the collator's constructor.
CollationElementIterator is a public class, and you can use it yourself to do
searches, as we will see below. As an introduction, let's use it to iterate
over the elements for a simple string:
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
System.out.println("Collation element is: " +
Integer.toString(e,16) );
} |
As you can see from the above example, a collation element is a fairly simple
creature; it's an int that describes where a character or group of
characters falls in the sorting sequence. Higher-numbered elements are sorted
after lower-numbered ones.
Of course, it's really a bit more complicated than that. Each collation
element can be broken down into three components (also known as weights or
orders): primary, secondary, and tertiary. The
primary component of a collation element corresponds to which base letter the
order represents, so "A" and "B" will have different primary
weights. The secondary components typically correspond to accents, so "á"
and "é" will have the same secondary weight, which is different from
the secondary weight of "a". The tertiary components usually represent
the case of a character, so "a" and "A" will have different
tertiary weighting. (There is a fourth level, the normalized original string
itself, which can be used for a final comparison, but you usually don't need to
worry about this level.)
|
Character
|
Primary
|
Secondary
|
Tertiary
| |
a
|
1
|
0
|
0
| |
á
|
1
|
1
|
0
| |
A
|
1
|
0
|
1
| |
...
| | | | |
B
|
2
|
0
|
1
| |
...
| | | | |
é
|
5
|
1
|
0
|
The above table uses simple, made-up numbers to illustrate the components of a
collation element. In practice, the numbers are usually larger, since most
collators have many more possible elements. To see what real collation orders
look like, you can modify the last code example as follows. The following code
will print out the collation elements for the string "Foo":
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator iter = c.getCollationElementIterator("Foo");
int element;
while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
System.out.println("Collation element is: " + element);
System.out.println(" primary: " + Integer.toString(
CollationElementIterator.primaryOrder(element), 16) );
System.out.println(" secondary: " + Integer.toString(
CollationElementIterator.secondaryOrder(element), 16) );
System.out.println(" tertiary: " + Integer.toString(
CollationElementIterator.tertiaryOrder(element), 16) );
} |
While this notion of three ordering levels seems complicated at first, it
actually makes some tasks easier. If you want to do a case-insensitive
comparison, you simply ignore the tertiary component of each collation element.
If you also don't want to include accents, you can ignore the secondary
component too. Collator handles this by allowing you to set different strength
levels using the setStrength method and constants such as Collator.PRIMARY .
Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY); // Ignore case and accents
if (c.compare(string1, string2) == 0) {
... // Strings matched
} |
Text Searching in JDK 1.1
Now that you're familiar with the concepts behind collators and collation
elements, we can put some of that knowledge to use. The same collation elements
that RuleBasedCollator
uses to perform string comparisons
can be used to do string searching as well. The basic concept is quite simple: instead of
searching through the characters in the string, we'll search through its collation
elements. Fast string-searching algorithms such as KMP and Boyer-Moore require the ability to
back up or perform random access in the text being searched. Unfortunately, you can't
do this with international text in JDK 1.1, because CollationElementIterator does not
allow random access. It only has a
next
method
and is lacking
setOffset
and previous . This means that Boyer-Moore searching
cannot be implemented without a complicated buffering scheme that is very tricky
to get right.
However, a traditional, brute-force string search is quite possible using the
JDK 1.1 API. Essentially this comes down to comparing the search pattern against
each individual character position in the text being searched. If the collation
elements for the search pattern match the collation elements for that substring
of text being searched, we've found a match. The outer loop looks like this:
String pattern = "for"; // What to search for
String text = "Now is the time for all good men"; // Text being searched
RuleBasedCollator c = (RuleBasedCollator)collator.getInstance();
CollationElementIterator patIter =
c.getCollationElementIterator(pattern);
for (int i = 0; i < text.length(); i++) {
String substr = text.substring(i);
CollationElementIterator textIter =
c.getCollationElementIterator(substr);
patIter.reset();
if (match(patIter, textIter)) {
// They matched! Do something.
}
} |
Of course, I left out the hard part, the function match that decides whether two
sequences of collation elements are equivalent. A simple, naive implementation
would loop through both iterators and ensure that the elements they return are
the same:
boolean match(CollationElementIterator text,
CollationElementIterator pattern)
{
while (true) {
int patternElement = pattern.next();
int targetElement = text.next();
if (patternElement = CollationElementIterator.NULLORDER) {
break; // End of the pattern
} else if (patternElement != targetElement) {
return false; // Mismatch
}
}
return true; // No mismatches
} |
This will work, but only if you want to treat any difference, be it
primary, secondary, or tertiary, as significant. In most applications, that is
not enough; users will want an "Ignore Case" option, and possibly an "Ignore
Case and Accents" option as well. Fortunately, this is not very hard to do. Collator
provides the constants PRIMARY , SECONDARY , and TERTIARY that you can use
to represent the level of comparison you want, and CollationElementIterator
provides methods to break down a collation order into its three components.
All we need to do is create a variable, e.g. weight , that stores the desired
level of comparison. If weight is PRIMARY , we check only the primary component
of each collation element. If weight is SECONDARY , we check both the primary
and secondary components, and if it is TERTIARY we check all three. Fortunately,
these values of these constants are in ascending numerical order, so we can use
simple comparisons such as: "if (weight > Collator.PRIMARY) ... "
To add this extra functionality we have to modify our match function a bit, but
it is still fairly simple. Since the documentation for CollationElementIterator
promises that "the first 16 bits of a collation order is its primary order;
the next 8 bits is the secondary order and the last 8 bits is the tertiary order,"
we can simply mask away the portions of the collation element that we're
not interested in.
// Return a mask for the part of the order we're interested in
static final int getMask(int weight) {
switch (weight) {
case Collator.PRIMARY:
return 0xFFFF0000;
case Collator.SECONDARY:
return 0xFFFFFF00;
default:
return 0xFFFFFFFF;
}
}
boolean match(CollationElementIterator text,
CollationElementIterator pattern)
{
int mask = getMask(weight);
int done = CollationElementIterator.NULLORDER & mask;
while (true) {
int patternElement = pattern.next() & mask;
int targetElement = text.next() & mask;
if (patternElement == done) {
break; // End of pattern
} else if (patternElement != targetElement) {
return false; // Mismatch
}
}
return true; // No mismatches
} |
Ignore That Character!
As you've probably guessed, I left something out again. There's one
last complication: ignorable characters. In Unicode, an accented
character can be represented in two different ways. The single Unicode character
\u00e1 represents "á", but the pair of Unicode values \u0061\u0301
also represents "á". The \u0061 is just a lowercase "a", but
the \u0301 is special. It's a "combining acute accent" that
combines with the value before it to create the single "user-level"
character "á".
These combining characters need special processing during a comparison. Since \u0301
is only an accent, its collation element has a secondary component but no
primary or tertiary component. In a comparison that does not involve accents, we
must ignore this element entirely. If we did not, we might end up comparing an
accent in one string to a base letter in another string, which would give
invalid results. For example, when doing an accent-insensitive comparison "a\u0301b"
and "ab", we want to skip the "\u0301" and go on to the next
character; otherwise we'd compare "\u0301" and "b".
This logic is relatively straightforward, but it does make the code a bit more
complicated. The boolean variables getTarget
and
getPattern
are used to decide whether to fetch the next collation element in the text and the pattern
each time through the loop. Normally both variables are true, but one or the other of them can be set to
false
if we want to skip an element. For example, setting getPattern
to
false
and
getTarget
to
true
causes the current pattern element to be re-used and compared with
the next text element, thus skipping the current text element.
boolean match(CollationElementIterator text,
CollationElementIterator pattern)
{
int mask = getMask(weight);
int done = CollationElementIterator.NULLORDER & mask;
boolean getPattern = true, getTarget = true;
int patternElement = 0, targetElement = 0;
while (true) {
if (getPattern) patternElement = pattern.next() & mask;
if (getTarget) targetElement = text.next() & mask;
getTarget = getPattern = true; // By default get both
if (patternElement == done) {
break; // End of pattern
} else if (targetElement == 0) {
getPattern = false; // skip targetElement
} else if (patternElement == 0) {
getTarget = false; // skip patternElement
} else if (targetElement != patternElement) {
return false; // Mismatch
}
}
return true; // No mismatches
}
|
This is about the best you can do with the JDK 1.1 API. You can add bells and whistles,
such as searching backward though text by reversing the order of the outer loop that calls
match
, but you can't really implement a
more efficient search without a lot of work.
It's Better in 1.2
In JDK 1.2, we have made quite a few improvements to the international classes in java.text and java.util . Among them are
enhancements to CollationElementIterator that make it possible to write faster
and more powerful search routines.
There are two major problems with the searching mechanism outlined above. First,
it uses an inefficient algorithm that can, at worst, compare every character of
the pattern against every character of the target, requiring a number of
comparisons that is proportional to the size of the text being searched
multiplied by the size of the pattern. (In practice, it's usually not quite
that bad, however.) In computer science terms, if the size of the text is T and
the size of the pattern is P, the search time is proportional to T·P, or is O(TP).
Modern searching algorithms can do much better.
The second, and more obvious problem is that there is an awful lot of object
creation going on in the last few examples. Every time through the outer loop we
call substring
, which creates a new String
object, and then we create a new CollationElementIterator. This happens at every
single position in the target string, which is woefully inefficient given the cost of
object creation in Java (and in most other languages, for that matter). This second problem is solved by two new CollationElementIterator methods that we have
added in JDK 1.2:
public void setOffset(int newOffset)
public int getOffset()
|
These methods allow you to change the text position to which an iterator points and to
retrieve its current position. With this flexibility, we can avoid all of the calls to substring and all of the extra iterator objects that
we were creating before. The outer searching loop now looks like this:
String pat = "for"; // What to search for
String text = "Now is the time for all good men"; // Text to search in
RuleBasedCollator c = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator patIter = c.getCollationElementIterator(pat);
CollationElementIterator targIter = c.getCollationElementIterator(text);
for (int i = 0; i < text.length(); i++) {
targIter.setOffset(i);
patIter.reset(); // Or setOffset(0)
if (match(patIter, targIter)) {
// They matched! Do something.
}
} |
This will be much faster, because we're no longer creating new
objects each time through the loop. The algorithm is still O(TP), but the
overhead per iteration is considerably lower, so the running time will a lot
better.
Optimized Searching
We've solved the easier of our two efficiency problems; now it's time
for the hard one. As I explained above, the brute-force algorithm we're
using is O(TP). String searching is a well-researched area, and there are
algorithms that can do considerably better. Perhaps the best is the Boyer-Moore
method, which is never worse than O(T+P) and in practice is often
proportional to T/P. That's right: the size of the text divided by
the size of the pattern. Rather than forcing us to examine characters in the
text multiple times, this algorithm actually lets us skip characters.
Boyer-Moore can be a little bit tricky to explain, but once you "get"
it, it seems almost too obvious. The trick is that instead of comparing the
strings starting at the beginning of the pattern, you compare them starting at
the end. If the characters don't match, we've still gleaned a bit of
useful information: we now know what character occurs at that position in the
text being searched. Often, we can take advantage of that information to skip
several characters in the target text rather than simply sliding the pattern
along by one position and trying again.
An example will make this more clear. Imagine that you're searching for the
word "string" inside the phrase "silly_spring_string". To
start off, you line up the beginning of the pattern with the beginning of the
target, but you start comparing at the end, like so:
silly_spring_string
string
|
We compare the g in the pattern with a space character in the target, and there's
no match. So far, there's nothing special. However, we know something else
as well: the character at index 5 in the target is a space, and there are no
space characters anywhere in the pattern we're searching for. Combining
these two facts, we can slide the pattern over by six characters and try again:
silly_sp
ring_string
st
ring
|
This time, there is a match between the pattern and the text. Since we're
going backwards, we now compare the n , i, and
r and find that they match too. However, the
p and the t do not. We know that there is not a p anywhere in the pattern, so we can slide it over again:
silly_spring_ string
string
|
silly_spring_string
string
|
To implement this efficiently, you need to have a table that, for each possible
character in the text, tells you how far from the end of the pattern that
character first occurs. This is the distance by which the pattern can be shifted
when that particular character is seen in the input. For the above example, the
table would look like this:
| Character: | s | t | r | i | n | g | others | | Shift: | 5 | 4 | 3 | 2 | 1 | 0 | 6 |
This table can be computed once, before the search is started, by making a
single pass through the pattern. After that, it can be used each time we search
for that pattern, leading to a huge performance win.
Boyer-Moore and Unicode
But wait! At the very beginning of this article, I said that this kind of
algorithm doesn't work well with Unicode because it has 65,535 possible
character values, which would make the table too large. Actually, it's
worse, because we're concerned with collation elements, which are 32-bit
integers, not with the Unicode values themselves. That's true, but (of
course) there's another trick...
First, consider what happens when a letter occurs twice in the search pattern.
There are two possible shift distances for that letter, one for each occurrence.
To make the Boyer-Moore algorithm work, we always want to enter the smaller of
the two shift distances in the table. If we used the larger one, we might shift
the pattern too far and miss a match. In a sense, the shift table is not
required to be perfectly accurate, and conservative estimates of shift distances
are OK. As long as we don't shift the pattern too far, we're fine.
This realization leads to a simple technique for applying the algorithm to Java
collation elements: simply map all possible elements down to a much smaller set
of shift table indices (say, 256). If two or more elements in your pattern
happen to collide and end up with the same index, it's not a problem as
long as you enter the smaller of the shift distances in the table.
A simple way to map the collation elements to 256 values is to use the low byte
of the element's primary component. There are other approaches that will
lead to a better distribution of values throughout the shift table and will thus
give slightly better performance, but in practice this approach is usually good
enough.
To implement Boyer-Moore searching with JDK 1.2, we first need to construct a
shift table that tells us how far to shift the pattern when a particular
collation element is seen in the text. The hash function is used to map from a
32-bit collation order down to an index in the 256-element array.
// Member variables for storing precomputed pattern data
private int patLen;
private int[] patElem;
private int[] shifts;
// Map a collation element to an array index
int hash(int order) {
return CollationElementIterator.primaryOrder(order) % 256;
}
// Initialize the Boyer-Moore shift tables
void initialize(RuleBasedCollator c, String pat)
{
// First find out how many elements we're dealing with
patLen = 0;
CollationElementIterator iter = c.getCollationElementIterator(pat);
while (iter.next() != CollationElementIterator.NULLORDER)
patLen++;
// Allocate space to store the pattern elements and the shift tables
patElem = new int[patLen];
shifts = new int[256];
// Elements not found in the pattern get the maximum shift distance
for (int i = 0; i < 256; i++) {
shifts[i] = patLen;
}
// Now compute the shift distances for the elements in the pattern.
// While we're at it, save the elements themselves for quick access.
// The "-1" is in the calculation because Java indices are 0-based.
iter.reset();
for (int i = 0; i < patLen; i++) {
patElem[i] = iter.next();
int index = hash(patElem[i]);
shifts[index] = Math.min(shifts[index], patLen - i - 1);
}
} |
Once we have the tables, the search routine is straightforward. It uses another
new JDK 1.2 method: CollationElementIterator.previous . Also note that there is
no longer an outer loop that calls a separate match
method, since that only worked well when we
were marching through the text one character at a time. Now that we can skip ahead an
arbitrary distance through the text, it is easier to combine all of the logic into one
method.
public int find(String text, String pattern)
{
RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance();
CollationElementIterator targIter =
coll.getCollationElementIterator(text);
// build the shift table and the constants we need
initialize(coll, pattern);
int mask = getMask(weight);
int done = CollationElementIterator.NULLORDER & mask;
// Start at the text position corresponding to the end of the pattern
int textIndex = pattern.length();
while (textIndex <= text.length()) {
boolean getPattern = true, getTarget = true;
int targetElement=0, patternElement=0;
iter.setOffset(textIndex);
int patIndex = pattern.length();
// Iterate backward until we hit the beginning of the pattern
while (patIndex > 0)
{
if (getTarget) targetElement = targIter.previous() & mask;
if (getPattern) patternElement = patElem[--patIndex] & mask;
getTarget = getPattern = true;
if (targetElement == 0) {
getPattern = false; // skip targetElement
} else if (patternElement == 0) {
getTarget = false; // skip patternElement
} else if (targetElement != patternElement) {
// There is a mismatch at this position. Decide how far
// over to shift the pattern, then try again.
textIndex = iter.getOffset() +
shifts[hash(targetElement)];
break;
}
}
if (patIndex == 0) {
// We made it back to the beginning of the pattern,
// which means we matched it all. Return the location.
return targIter.getOffset();
}
// Otherwise, we're here because of a mismatch, so keep going....
}
return -1; // No match.
}
|
There you have it: a way to do fast, linear-time, international-friendly string
searching in Java.
The Real Thing
I hope this article has given you a good idea of how you can use collators to add
language-sensitive sorting and searching to your own Java applications. It is not that
hard, and the benefits can be enormous, because global markets are becoming increasingly
more important to the computer industry. For example, according to IBM's first
quarter 1998 financial statement over half of IBM's revenue came from outside North
America. Using the international features of the JDK can help you begin to tap into this
huge market. The code examples in this article were intended primarily for their educational value,
but they do work. However, for clarity I have ignored a few remaining issues to make the
code easier to understand. Expanding ("ä" to "ae") characters in the
pattern are the chief difficulty. If the shorter, non-expanded version of the character
occurs in the text being searched, you can end up shifting too far and missing a possible
match. However, it is not too hard to compensate for this by using the getMaxExpansion method (also new in JDK 1.2) of CollationElementIterator
to decrease the shift
distances when expanded characters are seen in the pattern. The other major feature I have left out is the ability to tell how much of the text
matched the pattern. All of the code examples search for location in the text where a
match starts, but they do not return the length of the text that matched. You would need
to know this if you were writing a search and replace function in an editor, for example.
In JDK 1.2, it is easy to tell where the match ends: just call the iterator's getOffset method. With the JDK 1.1 API, it is harder;
you basically have to resort to the brute-force technique of comparing longer
and longer substrings of the text to the pattern and stopping when the collator
says that they are equal.
If you want to see a real, production-quality package that solves these last few
problems, visit www.alphaworks.ibm.com
and download a trial copy of StringSearch, our fully-functional text searching
class based on the JDK collation API. It supports case- and accent-insensitive
searches, backward searches, and "whole word" searching, among other
features. alphaWorks also contains several other Java internationalization
utilities that you might find useful, as well as a large number of JavaBeans,
utility classes, and even a free XML parser written in Java.
Acknowledgements
The patented technique for applying the Boyer-Moore search algorithm to
collation elements was developed by Dr. Mark Davis of IBM. Kathleen Wilson, the
manager of the text and international groups at IBM's Center for Java
Technology in Silicon Valley, was very indulgent of the time I spent working on
this article and the accompanying code. I would also like to thank Mark,
Kathleen, Michael Pogue, John Raley, and Helena Shih for reviewing drafts of
this article.
This article previously appeared in the February, 1999, issue of Java
Report magazine and was presented at the 14th
International Unicode Conference in March, 1999.
About the author  | |  |
Laura Werner is a Senior Software Engineer and Project Leader for the Java
Internationalization effort at the IBM Center for Java Technology in Cupertino,
CA. After receiving Bachelor's degrees in Geological Sciences and
Integrated Science from Northwestern University, she worked at SPSS, Inc. and UC
Berkeley before joining Taligent in 1994. Now at IBM, she is the project lead
for IBM's international contributions to the JDK and an architect for other
international projects, and occasionally has time to work on side projects such
as this paper. Laura can be reached at lwerner@us.ibm.com.
|
Rate this page
|  |