Level: Intermediate
Markus Lorch, Software Engineer, IBM
12 Apr 2007
Get an introduction to "Easy Semantic Search," a new functionality introduced in
IBM® OmniFind® Enterprise Edition V8.4 that enables end-users to leverage
the power of semantic search to query for concepts through the well-accepted keyword
query paradigm. Follow the steps outlined in this article to enable the Easy Semantic
Search functionality with an example processing of telephone numbers. Then customize the Easy Semantic Search functionality by extending the sample configuration.
Introduction
Searching for concepts instead of or in addition to keywords is a powerful means to
improve efficiency and usability of enterprise search solutions. With semantic
understanding a query for "phone number" can, in addition to documents containing
the keywords "phone" and "number," also return documents that contain actual
telephone numbers and highlight the occurrences for easy reference in the search
results. "Easy Semantic Search" is a new functionality introduced in IBM OmniFind
Enterprise Edition 8.4 that enables end-users to leverage the power of semantic
search to query for concepts through the well-accepted keyword query paradigm. No
complex query language has to be learned or special-purpose user interfaces
developed for the end-user to make use of this functionality. Semantic search
requires text analysis components to analyze documents for concepts of relevance and
create searchable metadata. OmniFind Enterprise Edition 8.4 ships with a powerful
text analysis module that can be configured to detect a wide variety of concepts in
processed documents. In this article, get an introduction to semantic search and
follow the necessary steps to enable the Easy Semantic Search functionality with an
example processing of telephone numbers. Furthermore, learn how to customize the Easy Semantic Search functionality by extending the sample configuration.
Overview – OmniFind processing and custom text analysis
OmniFind Enterprise Edition V8.4 provides a configurable text analysis module that
can detect expressions and enumerable entities (like people, places, and things) when
documents are indexed for enterprise search. This module is called the Regular Expression Annotator. Furthermore, a novel concept called "semantic synonyms" radically simplifies searching for concepts extracted from unstructured text by transforming the user's keyword query. Together, these mechanisms empower enterprise search users to discover information more effectively by searching for both keywords and semantic concepts through the familiar keyword-based query interface. Neither complicated query syntax nor tedious option forms are needed.
For example, a search query for "laptop product number" will not only discover
documents containing the stated keywords, but will also locate documents that contain
"laptop" together with a detected product number such as 2668-H2G. The detected
product numbers in this example are also highlighted in the search result. The
functionality can also be used to detect and later search for enumerable sets of
named entities, such as product, country, or company names. A query for "G8 country climate regulations" would discover documents that contain "climate" and "regulations" together with the name of a country that is part of the group of G8 countries.
The Easy Semantic Search functionality is provided by three distinct components in
the OmniFind system:
First, the regular expression annotator is added to OmniFind's Unstructured
Information Management Architecture (UIMA) processing pipeline to detect instances of concepts in the document text, based on a set of extensible rules. The discovered meta information is stored in the search index.
Second, OmniFind's synonym dictionary mechanism is used to define semantic synonyms
for keywords. The semantic synonyms take the form of XML fragment query expressions
and can be used together with ordinary synonyms. For example, the keywords "phone
number" can have the synonym "telephone number" as well as the XML fragment query @xmlf2::'<#phonenumber/>'.
Third, a semantic synonym-expansion code located in the search application retrieves
synonyms for the user-provided keywords from the OmniFind back end and transforms the
original query into an XML fragment query that combines original keywords with
regular and semantic synonyms. The expansion logic does not replace the original
keyword but rather creates an OR expression to locate either the keyword or a concept
instance relating to the keyword. For example, the query 'IBM phone number'
becomes 'ibm <.or> phone <#phonenumber/> </.or> <.or> number
<#phonenumber/> </.or>'. Because the original keyword in the above
example consists of two terms (phone number), for which a semantic synonym exists,
the expansion logic follows Boolean algebra rules to create two OR parts connected by
AND. (A ^ B) v C becomes (A v C) ^ (B v C), which can easily be represented in an XML fragment expression.
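The expansion step described above can be sketched in a few lines of Python. This is only an illustration of the Boolean rewrite, not the code that ships in the OmniFind search application: the function name expand_query and the dictionary layout are assumptions made for this example, and regular (non-semantic) synonyms are left out for brevity.

```python
def expand_query(query, semantic_synonyms):
    """Expand a keyword query into an XML fragment query string.

    semantic_synonyms is a hypothetical mapping from a lowercase keyword
    phrase (one or more terms) to the XML fragment of its concept,
    for example "phone number" -> "<#phonenumber/>".
    """
    terms = query.lower().split()
    parts = []
    i = 0
    while i < len(terms):
        # Prefer the longest phrase starting at position i that has a synonym.
        for length in range(len(terms) - i, 0, -1):
            phrase = " ".join(terms[i:i + length])
            concept = semantic_synonyms.get(phrase)
            if concept is not None:
                # (A and B) or C  ==  (A or C) and (B or C):
                # emit one OR group per term; adjacent groups are ANDed.
                for term in terms[i:i + length]:
                    parts.append(f"<.or> {term} {concept} </.or>")
                i += length
                break
        else:
            # No phrase starting here has a synonym; keep the plain keyword.
            parts.append(terms[i])
            i += 1
    return " ".join(parts)


synonyms = {"phone number": "<#phonenumber/>"}
expanded = expand_query("IBM phone number", synonyms)
print(expanded)
```

For the query 'IBM phone number' the sketch produces the same expanded string as the example in the text: the keyword "ibm" is kept as is, and each term of "phone number" gets its own OR group containing the concept.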
Enable Easy Semantic Search
OmniFind Enterprise Edition, Version 8.4 ships with three files required to enable
the Easy Semantic Search functionality in a sample configuration. The files are located under the OmniFind Enterprise Edition installation root directory in the packages/uima/regex subdirectory. The procedure is described in detail in the Text Analysis Integration manual section "Easy semantic search using the regular expression annotator" (see Resources) and outlined here:
- The first step enables the text analytics: Upload the PEAR file containing the
pre-configured regular expression annotator (of_regex.pear) into
the OmniFind Enterprise Edition system using the OmniFind administrative console (system tab, edit
parser). Then associate the annotator with the collection it is to be used with (collection's parser settings).
- In step two, the text analysis results are mapped to the index: For this, upload
the common analysis structure to index mapping file (of_sample_regex_cas2index.xml)
and associate it with the collection (also in parser settings). With both the
annotator and the mapping file linked to the collection, documents can be crawled, parsed, and indexed.
- Step three configures the semantic synonym expansion of keywords: For this,
upload a provided sample synonym dictionary (of_sample_synonym_dic.dic) to the
OmniFind system (system tab, edit search) and associate it with the search component of this collection (collection's search settings).
By enabling the option
"automatically search for synonyms using semantic expansion" in the OmniFind search
application preferences screen, the collection can now be searched with semantic
synonyms. With the functionality enabled, basic keyword queries for "phone number"
will be expanded to the semantic concept <phonenumber/>, in addition to a few
regular synonyms like "telephone number." When an actual telephone number is present
in a digested document, this phone number is annotated with the concept
<phonenumber/> during the text analysis step, and the document is returned to a
matching query with the phone number itself highlighted in the result set. Documents
containing the words "phone" and "number," as well as documents containing one of the regular synonyms, are also found. The semantic synonym expansion mechanism is limited to expanding simple keyword queries. Fielded queries or queries with advanced query terms are currently not expanded. The additional text analysis performed by the regular expression annotator does have a performance impact and will reduce the maximum parser throughput.
Figure 1. Search result with semantic search for telephone numbers on a Web collection
Customizing the regular expression annotator
Discovering telephone numbers and URLs in documents is a great way to get an idea of
what is possible with the Easy Semantic Search functionality of OmniFind Enterprise
Edition 8.4. In production scenarios, other concepts may be of importance. The
flexibility of the system allows the detection of a large variety of concepts or entities. To detect additional concepts, the rules that govern the text analysis of the regular expression annotator, the index mapping file that instructs the system how to store discovered facts in the index, and the synonym dictionary that defines the binding between keywords and semantic search concepts must be adapted.
The remainder of this section explains how to extend the example rules, mappings,
and synonyms. It also discusses a rule evaluation approach. The examples provided
show how you can configure OmniFind Enterprise Edition to also discover and search
for IBM laptop product numbers when the keywords "IBM laptop" or "thinkpad" are given and to also be able to detect and search for occurrences of country names in texts.
Customizing the rules
The regular expression annotator is configured by means of an XML file. This file
contains a set of rules that define on what type of character or number sequences the
annotator should act and how it should act. The detailed description of the file and
rule format can be found in the OmniFind Enterprise Edition 8.4 Text Analysis
Integration guide (see Resources). The rule file is part of
the annotator PEAR file. A PEAR file is actually a ZIP archive file that contains the
annotator code and configuration in a well-defined directory structure. The XML rule
file resides in a subdirectory named "xml" within the PEAR file and can be extracted
from there with any ZIP file tool (for example, the jar command provided with a Java
SDK). For convenience, the sample rule file that is part of the regular expression
annotator PEAR (named of_sample_regex_rules.xml) is also provided in the subdirectory
packages/uima/regex under the OmniFind Enterprise Edition installation root directory. This directory also contains the XML schema definition (ruleSet.xsd) that can be used to validate changes against the schema.
The sample rule file contains four rules: phonenumber, potential_phonenumber, url,
and email. The URL and e-mail rules are fairly simple regular expressions, while the
two phone number rules are more complex, as they aim to detect a multitude of alternative representations of international telephone numbers.
A simple approach to customizing the regular expression rules is to copy the
complete rule definition of, for example, the URL rule and modify it. Simple rules require changes to the regular expression, the annotation id to be created (unique for each rule), and the type of the annotation to be created.
In Listing 1, a simple rule to match product numbers of
Thinkpad laptops is created from the URL rule definition. The regular expression is
changed to locate a four-digit number, followed by a dash, followed by a three-digit
alphanumerical sequence. For example, the sequence 2668-H2G identifies a Thinkpad
T43p. Furthermore, the id of the annotation to be created is changed to "thinkpad" with a type of com.ibm.es.uima.Thinkpad.
Listing 1. Sample rule to detect IBM laptop product numbers
<!-- IBM Laptop Product Number e.g. 2668-H2G -->
<rule regEx="([0-9]{4}\x20?-\x20?[A-Z0-9]{3})"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
<createAnnotation id="thinkpad" type="com.ibm.es.uima.Thinkpad">
<begin group="0"/>
<end group="0"/>
</createAnnotation>
</rule>
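Before putting a product number pattern into the rule file, it can help to try the expression on a few strings. The following Python sketch does that with the re module; the annotator itself runs on Java, so this is only a quick approximation, although the constructs used here behave the same in both regex dialects. The character class is written as [A-Z0-9]; note that a comma inside a character class matches a literal comma rather than acting as a separator. The helper name find_product_numbers and the sample sentence are made up for this example.

```python
import re

# Four digits, optional space, dash, optional space,
# then three uppercase letters or digits.
THINKPAD_RE = re.compile(r"[0-9]{4}\x20?-\x20?[A-Z0-9]{3}")


def find_product_numbers(text):
    """Return (begin, end, covered_text) for every match in text."""
    return [(m.start(), m.end(), m.group(0)) for m in THINKPAD_RE.finditer(text)]


sample = "The T43p (2668-H2G) replaces model 2373 - 4TG."
matches = find_product_numbers(sample)
for begin, end, covered in matches:
    print(begin, end, covered)
```

The begin and end offsets correspond to what group 0 of the rule would cover. The pattern has no boundary checks, so it would also match inside longer digit or letter runs; whether that matters depends on the documents being indexed.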
In Listing 2, a few countries are detected through a simple
list of names that comprise the regular expression. Regular expression operators are
used to allow for long and short versions (for example, People's Republic of China
versus China) and to govern that trailing punctuation characters are allowed.
Note that some countries are detected in the first rule and an annotation of type
"com.ibm.es.uima.Country" is created. Countries that are also members of the G8
countries are not detected by the first rule, but rather by a second rule that similarly creates the country annotation but also sets a feature of that annotation to identify these countries as belonging to the group of G8 countries. This enables queries for arbitrary countries where all named countries (from both rules) will be found, but queries for countries that are part of the G8 group are also possible. Simple hierarchies (another example is products that belong to a specific brand) can be implemented this way.
Listing 2. Two rules to detect countries
<!-- Country -->
<rule regEx="(Australia|Brazil|Spain|(People's Republic of )?e?China)s?(?!\w)"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
<createAnnotation id="country" type="com.ibm.es.uima.Country">
<begin group="0"/>
<end group="0"/>
</createAnnotation>
</rule>
<!-- G8 Country -->
<rule regEx="(Canada|France|Germany|Italy|Japan|Russian Federation|United Kingdom|United States of America|USA|U.S.A.)s?(?!\w)"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
<createAnnotation id="country-G8" type="com.ibm.es.uima.Country">
<begin group="0"/>
<end group="0"/>
<setFeature name="Group" type="String">G8</setFeature>
</createAnnotation>
</rule>
When copying these example rules into the of_sample_regex_rules.xml file, take care not to introduce line breaks into the regular expressions.
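The interplay of the two country rules can be tried out with a short Python sketch. The patterns are adapted from Listing 2 with two adjustments made for this sketch only: the e? before China is left out, and the dots in U.S.A. are escaped so that they match literal dots. The function name annotate and the dictionary keys are assumptions for illustration and do not reflect the annotator's internal data model.

```python
import re

# Adapted from Listing 2 (see the adjustments named in the text above).
COUNTRY_RE = re.compile(
    r"(Australia|Brazil|Spain|(People's Republic of )?China)s?(?!\w)"
)
G8_RE = re.compile(
    r"(Canada|France|Germany|Italy|Japan|Russian Federation|United Kingdom"
    r"|United States of America|USA|U\.S\.A\.)s?(?!\w)"
)


def annotate(text):
    """Return country annotations from both rules, sorted by begin offset.

    Matches of the G8 rule carry group="G8"; all others carry group=None.
    """
    found = []
    for pattern, group in ((COUNTRY_RE, None), (G8_RE, "G8")):
        for m in pattern.finditer(text):
            found.append({"begin": m.start(), "end": m.end(),
                          "text": m.group(0), "group": group})
    return sorted(found, key=lambda a: a["begin"])


sample = ("Delegates from Germany, Italy and the People's Republic of China "
          "met in Chinatown.")
annotations = annotate(sample)
for a in annotations:
    print(a["text"], a["group"])
```

The negative lookahead (?!\w) is what keeps "Chinatown" from being annotated, while trailing punctuation after "Germany" is still allowed. Every annotation is a country, and only those from the second rule carry the group value, which mirrors the simple hierarchy described above.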
Extending the type system
The annotation types used in the new rules must exist as part of the UIMA type
system. The regular expression annotator's type system file is named
of_sample_regex_typesystem.xml (also present in the xml subdirectory of the annotator
PEAR file, as well as in the packages/uima/regex directory of the OmniFind
installation) and must be extended with the additional types. The type system already
contains types for phonenumber and the other annotation types needed by the sample
regular expression annotator rules. Listing 3 illustrates the two new type definitions:
Listing 3. Type system definitions that need to be added for the two new annotation types
<!-- Thinkpad Annotation -->
<typeDescription>
<name>com.ibm.es.uima.Thinkpad</name>
<description>IBM laptop product numbers</description>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
<!-- Country Annotation -->
<typeDescription>
<name>com.ibm.es.uima.Country</name>
<description>Countries of the world</description>
<supertypeName>uima.tcas.Annotation</supertypeName>
<features>
<featureDescription>
<name>Group</name>
<description/>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
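A rule that refers to a type missing from the type system is an easy mistake to make when extending both files by hand. The following Python sketch cross-checks the two: it collects the type attribute of every createAnnotation element and compares it with the declared typeDescription names. The element and attribute names are taken from Listings 1 to 3; the enclosing root elements (rules and types below) are placeholders, because the full file layout is not shown in this article, and XML namespaces are ignored for the same reason.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def declared_types(typesystem_xml):
    """Names of all typeDescription elements in a type system document."""
    names = set()
    for elem in ET.fromstring(typesystem_xml).iter():
        if local(elem.tag) == "typeDescription":
            for child in elem:  # direct children only, so feature names are skipped
                if local(child.tag) == "name" and child.text:
                    names.add(child.text.strip())
    return names


def used_types(rules_xml):
    """Annotation types referenced by createAnnotation elements in the rules."""
    return {e.get("type") for e in ET.fromstring(rules_xml).iter()
            if local(e.tag) == "createAnnotation" and e.get("type")}


def missing_types(rules_xml, typesystem_xml):
    return sorted(used_types(rules_xml) - declared_types(typesystem_xml))


# Placeholder documents: the root elements are hypothetical.
RULES_XML = """<rules>
  <rule regEx="([0-9]{4}-[A-Z0-9]{3})" matchStrategy="matchAll">
    <createAnnotation id="thinkpad" type="com.ibm.es.uima.Thinkpad"/>
  </rule>
  <rule regEx="(Canada|France)" matchStrategy="matchAll">
    <createAnnotation id="country-G8" type="com.ibm.es.uima.Country"/>
  </rule>
</rules>"""

TYPESYSTEM_XML = """<types>
  <typeDescription>
    <name>com.ibm.es.uima.Thinkpad</name>
    <supertypeName>uima.tcas.Annotation</supertypeName>
  </typeDescription>
</types>"""

missing = missing_types(RULES_XML, TYPESYSTEM_XML)
print(missing)
```

In this placeholder input the Country type is used by a rule but not declared, so it is reported as missing. Run against the real rule and type system files, an empty result would suggest that the two files agree.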
Application, evaluation, and refinement
Modifying and extending the rules of the regular expression annotator to detect
additional concept instances in unstructured texts often requires an iterative
approach -- an initial set of rules is written and tested on sample documents, then
the rules may be refined and retested. When the outcome is satisfactory, the rules in
the original annotator PEAR file can be updated with the new rules and the resulting
version installed in OmniFind Enterprise Edition and put to use. The UIMA SDK can be used to run the regular expression annotator outside of OmniFind on selected texts in order to evaluate modified or newly created rules.
The UIMA SDK is available for download from IBM developerWorks for Windows and UNIX operating systems for deployment on a developer workstation. The PEAR installer
utility (runPearInstaller command) that is part of the SDK can be used to install the
regular expression annotator PEAR that shipped with OmniFind Enterprise Edition 8.4
(of_regex.pear found in the subdirectory packages/uima/regex under the OmniFind
Enterprise Edition
installation root directory) into a local directory and to run the annotator in the
CAS Visual Debugger tool (CVD). CVD runs the annotator on sample texts that can be
copied into the CVD interface. After a successful run, the created annotations are
listed and the corresponding original text sections highlighted. Figure 2 illustrates the CVD interface:
Figure 2. The regular expression annotator running in the CAS Visual Debugger (CVD)
The example provided in Figure 2 hints at the power of semantic
search. The sample document could be located with a query for "IBM laptop" even though these keywords never appear in the document text directly. Also, a search for documents talking about G8 countries would return the document due to a match on Germany, Russia, and Italy.
Five easy steps to investigate annotations using CVD:
- Execute the runPearInstaller shell or batch script.
- Specify the PEAR file to be installed and provide a destination directory, then
click on Install.
- Once installation has finished, click on Run your AE in the CAS Visual
Debugger.
- Enter some sample text, including a telephone number like (800) 555-1234, in the
right text window, and select Run > Run Regular Expression Annotator.
- Now browse the annotations in the "Analysis Results" window: Select
AnnotationIndex, then select the detected instance of
com.ibm.es.uima.PhoneNumber in the window below.
The rule and type system changes discussed in the previous section must be applied
to the XML rule and type system files in the installed version of the regular
expression annotator PEAR. During the installation process performed by the UIMA SDK PEAR installer tool, the PEAR file is extracted to a directory specified in the installation wizard. The rule and type system files are located in the subdirectory named jedii_an_regex/xml. When modifying these files, it is a good idea to make backup copies of the working versions and to make incremental changes and evaluate often.
After the rules and the type system have been modified, the CAS Visual Debugger tool
(CVD) from the UIMA SDK can be used to validate the changes. Before starting CVD from
the command line, the Java classpath environment variable must be augmented with the
list of resources provided in the jedii_an_regex/metadata/setenv.txt file.
(Note: When CVD is started by the PEAR installer, the classpath is
automatically set, but not when CVD is started directly from the command line.) With
the augmented classpath in place, CVD is started using the cvd command. To load the annotator, select Run >
Load TAE from the CVD menu, then browse for the descriptor file of the Regular Expression Annotator jedii_an_regex/desc/jregex.xml.
Sample text can be typed or copied into the content field on the right-hand side of
the CVD window and the processing by the regular expression annotator started ("Run",
"Run Regular Expression Annotator"). The left-hand Analysis Result window then
displays the annotation instances created by the annotator on the sample text, sorted
by annotation type. If a particular annotation is selected, then the original text
covered by this annotation is highlighted. This is particularly useful when small
differences have to be investigated (for example, is the trailing white space part of
the annotation?). The rules can now be modified or augmented as needed in order to
receive the desired result. More information on regular expressions is available in
several online resources, including many tutorials. Rules can be changed and
evaluated without restarting CVD -- simply reload the annotator descriptor, and rerun the annotator.
PEAR packaging
The installed version of the regular expression annotator now has the desired set of
rules configured. In order to deploy this configuration, you must build a PEAR file with the new
rules. The simplest way to do this is to make a copy of the original regular
expression annotator PEAR and insert the two XML files into the subdirectory xml of
the archive. Any ZIP archive tool that is capable of replacing files in a ZIP archive
can be used (for example, the jar utility that is provided with most Java SDK
distributions). When inserting the new file versions of_sample_regex_typesystem.xml
and of_sample_regex_rules.xml, it is important that the file names are not altered
and that the original versions are overwritten. You can now use the administrative
console to upload the new PEAR file to the OmniFind system and assign it to collections.
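If no ZIP tool with in-place replacement is at hand, the repackaging step can also be scripted. The sketch below uses Python's standard zipfile module, which cannot replace an entry in place, so it writes a new archive and swaps in the edited files while copying. The function name repackage_pear is made up for this example; the entry names follow from the text, which places both files in the xml subdirectory of the PEAR.

```python
import zipfile


def repackage_pear(original_pear, new_pear, replacements):
    """Copy original_pear to new_pear, overwriting selected entries.

    replacements maps archive entry names, such as
    "xml/of_sample_regex_rules.xml", to paths of edited files on disk.
    Entry names are kept unchanged so the annotator still finds its files.
    """
    with zipfile.ZipFile(original_pear) as src, \
         zipfile.ZipFile(new_pear, "w", zipfile.ZIP_DEFLATED) as dst:
        missing = set(replacements) - set(src.namelist())
        if missing:
            # Refuse to add files the original archive did not contain.
            raise KeyError(f"not in archive: {sorted(missing)}")
        for info in src.infolist():
            if info.filename in replacements:
                # Store the edited file under the original entry name.
                dst.write(replacements[info.filename], arcname=info.filename)
            else:
                # Copy every other entry unchanged.
                dst.writestr(info, src.read(info.filename))
```

A call might look like repackage_pear("of_regex.pear", "of_regex_custom.pear", {"xml/of_sample_regex_rules.xml": "of_sample_regex_rules.xml", "xml/of_sample_regex_typesystem.xml": "of_sample_regex_typesystem.xml"}), where the output file name of_regex_custom.pear is only an example. Working on a copy leaves the shipped PEAR untouched as a fallback.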
Annotations to index mapping
In order for OmniFind to not only create but also store the new annotations in the
search index, the Common Analysis Structure to Index mapping file also needs to be
augmented with rules for the new annotation types. You need to add an
<indexBuildItem>, that maps the annotation to either a searchable span
(annotation style), or a field (field style), or both for each annotation. The
example in Listing 4 creates a searchable span for every
"thinkpad" annotation as well as a field that will hold all product numbers that were
located in a particular document. You need to add both examples in Listings 4 and 5 to the of_sample_regex_cas2index.xml (found in the packages/uima/regex directory of the OmniFind installation).
Listing 4. Mapping of the Thinkpad annotation to a searchable span as well as a field in the index
<indexBuildItem>
<name>com.ibm.es.uima.Thinkpad</name>
<indexRule>
<style name="Annotation">
<attribute name="fixedName" value="thinkpad"/>
</style>
<style name="Field">
<attribute name="fixedName" value="thinkpad"/>
<attribute name="fieldSearchable" value="true"/>
<attribute name="returnable" value="true"/>
</style>
</indexRule>
</indexBuildItem>
The second example in Listing 5 creates searchable spans for
"country" annotations, including a mapping of the value provided by the "Group" feature.
Listing 5. Mapping of the Country annotation to a searchable span in the index
<indexBuildItem>
<name>com.ibm.es.uima.Country</name>
<indexRule>
<style name="Annotation">
<attribute name="fixedName" value="country"/>
<attributemappings>
<mapping>
<feature>Group</feature>
<indexName>group</indexName>
</mapping>
</attributemappings>
</style>
</indexRule>
</indexBuildItem>
You need to upload the extended of_sample_regex_cas2index.xml file to the parser
configuration of each collection that has the new regular expression annotator
activated using the administrative console. Now the collection is ready to process
documents with the customized annotator and store the results in its search index.
This is a good time to perform end-to-end tests by processing a few documents that
are known to contain detected product numbers and groups (for example, from the tests
with the CVD tool during rule refinement). Once the documents are indexed, an XML
fragments query, like one of the following examples, can be used to validate that
annotations were created and indexed:
- @xmlf2::'<#phonenumber/>' to search for annotations from the sample telephone number rule
- @xmlf2::'<#country/>' to search for a country
- @xmlf2::'<#country group="G8"/>' to search only for countries belonging to the G8 group
- @xmlf2::'<#thinkpad/>' to search for thinkpad annotations
As thinkpad annotations are also mapped to returnable and field-searchable fields,
you can issue a fielded search for "thinkpad:" to receive all documents with a
thinkpad annotation. If the detailed view ("Show Details" link above the search results) is activated, the field values (the detected product numbers) are also visible.
Writing semantic synonyms
To make the new annotations available with keyword search, the synonym dictionary
for this collection must be extended. The XML source for the sample synonym
dictionary (of_sample_synonym_dic.xml) is also provided with the OmniFind
installation and can be used as a starting point. The set of <synonymgroups>
in the file needs to be extended with three more synonym groups, as illustrated in Listing 6.
Listing 6. Definition of semantic synonyms.
<synonymgroup>
<synonym>thinkpad</synonym>
<synonym>laptop</synonym>
<synonym>@xmlf2::'&lt;#thinkpad/&gt;'</synonym>
</synonymgroup>
<synonymgroup>
<synonym>country</synonym>
<synonym>nation</synonym>
<synonym>@xmlf2::'&lt;#country/&gt;'</synonym>
</synonymgroup>
<synonymgroup>
<synonym>g8 country</synonym>
<synonym>g8 nation</synonym>
<synonym>@xmlf2::'&lt;#country group="G8"/&gt;'</synonym>
</synonymgroup>
Once the synonym file is extended, a binary dictionary needs to be built from it. It
is important to keep the brackets of the XML fragment expression encoded with
&lt; and &gt; in the dictionary XML file. The essyndictbuilder tool available
on the OmniFind server must be used to create the dictionary (for example, the command essyndictbuilder.sh of_sample_synonym_dic.xml of_sample_synonym_dic.dic on a Linux or AIX OmniFind server). The resulting dictionary can then be uploaded to the OmniFind system using the administrative console and associated with the search function.
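One way to avoid escaping mistakes in the XML fragment expressions is to generate the synonym groups with an XML library, which encodes the angle brackets as character entities automatically. The Python sketch below builds the three groups with xml.etree.ElementTree. The element names synonymgroup and synonym come from Listing 6; the helper name synonym_group is made up, and the enclosing elements of the dictionary file are not generated here because they should be taken from the shipped sample file.

```python
import xml.etree.ElementTree as ET


def synonym_group(*synonyms):
    """Build one <synonymgroup> element from plain, unescaped strings."""
    group = ET.Element("synonymgroup")
    for text in synonyms:
        ET.SubElement(group, "synonym").text = text
    return group


groups = [
    synonym_group("thinkpad", "laptop", "@xmlf2::'<#thinkpad/>'"),
    synonym_group("country", "nation", "@xmlf2::'<#country/>'"),
    synonym_group("g8 country", "g8 nation", "@xmlf2::'<#country group=\"G8\"/>'"),
]

# Serializing escapes the angle brackets inside the element text.
serialized = [ET.tostring(g, encoding="unicode") for g in groups]
for fragment in serialized:
    print(fragment)
```

The serialized fragments contain the brackets in encoded form and can be pasted into the dictionary XML source before the binary dictionary is built.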
Summary
This article provides an overview of the steps necessary to enable and customize the
Easy Semantic Search functionality that allows end-users to issue powerful semantic
queries through the well-known keyword search paradigm. Many text analysis tasks can
be addressed by extending the rules of the regular expression annotator. For more
complex text analysis tasks, the development of a custom annotator may be of
interest. The tutorial "Semantic search in IBM OmniFind Enterprise Edition, Part 2: Semantic search with UIMA and OmniFind"
(developerWorks, December 2006) provides step-by-step instructions on how to develop
a custom annotator. It also covers the incorporation of custom annotators into the OmniFind system and the process of mapping annotations to the index in more depth. The Easy Semantic Search functionality introduced in this article can be used together with any annotator, including custom developments. A custom-developed annotator would replace or be deployed together with the provided regular-expression annotator.
Resources
Get products and technologies
- UIMA SDK, Version 1.4.4: Download, run, and evaluate the regular expression annotator rules.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
About the author
Markus Lorch has led a development team to implement and improve text analysis and semantic search functions for OmniFind Enterprise Edition 8.4. Previously he has been involved with performance and scalability engineering for IBM enterprise search products. Before joining IBM in 2005 he was researching and developing authorization mechanisms for Grid computing environments and was actively involved in the standardization of Grid security architectures and protocols. Dr. Lorch holds a Ph.D. from Virginia Tech.