Level: Advanced
Michael Baessler (mbaessle@de.ibm.com), Software Engineer, IBM
22 Feb 2007
Get hints and tips about how to use the Unstructured Information Management Architecture (UIMA) in IBM® OmniFind™ Enterprise Edition 8.4.
This article targets advanced developers who use UIMA for building applications on top of OmniFind Enterprise Edition.
Practice some of the described techniques for an advanced use of UIMA inside of IBM OmniFind Enterprise
Edition 8.4.
Content
The topics in this article are independent of each other. It is not necessary to read the whole
article; just read the sections you are interested in.
Using custom CAS consumers in OmniFind
Besides running custom annotators in OmniFind, it is also possible to run custom CAS consumers.
To run such an analysis component in OmniFind, it must be packaged into a PEAR file in the same
way as custom annotators. You can upload the PEAR file to OmniFind as an analysis engine and
then later associate it with one or more collections.
This article does not explain the details of the uploading process to OmniFind. For more
information about this, please refer to the OmniFind documentation book
Administering Enterprise Search, in the section on "Custom Text Processing," in the main
chapter "Working with collections and external sources/Enterprise Search parser
administration."
Package a CAS consumer into a PEAR file
When packaging a CAS consumer component as a PEAR file, you have to wrap the CAS consumer component into an
aggregate analysis engine descriptor. You always have to do that,
even when you only have a single CAS consumer in the PEAR package. This step is necessary so that the UIMA
components can call and handle the CAS consumer descriptor. A sample aggregate analysis engine descriptor
containing only a single CAS consumer component is shown in Listing 1 below:
Listing 1. Aggregate analysis engine descriptor containing a single CAS consumer component
<?xml version="1.0" encoding="UTF-8" ?>
<taeDescription xmlns="http://uima.watson.ibm.com/resourceSpecifier">
  <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="myCASConsumer">
      <import location="./casConsumer.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>CAS consumer wrapper</name>
    <description>sample CAS consumer wrapper</description>
    <version>1.0</version>
    <vendor>IBM Corporation</vendor>
    <flowConstraints>
      <fixedFlow>
        <node>myCASConsumer</node>
      </fixedFlow>
    </flowConstraints>
    <operationalProperties>
      <modifiesCas>false</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
    </operationalProperties>
  </analysisEngineMetaData>
</taeDescription>
For the CAS consumer to be called correctly, you have to use a fixedFlow as flow constraint (as shown
in the sample descriptor above). If you don't use such a flow, the CAS consumer isn't called because
there are no capabilities that a CAS consumer can provide. For more information about the different
flow constraints that are available for UIMA, please refer to the
"UIMA SDK User's Guide Reference," chapter 4.5.5: "Result Specification Setting."
Other important parts of the sample aggregate descriptor above are the operational properties.
These properties must be set according to the settings of the CAS consumer descriptor.
For more information about operational properties, please refer to the "UIMA SDK User's
Guide Reference,"
chapter 18.3.1: "Primitive Analysis Engine Descriptors."
In Listing 1 above, the aggregate analysis engine contains only a single CAS consumer component, but it is possible
to add annotator components to the same aggregate analysis engine. The important points are that the CAS consumer is wrapped
in an aggregate analysis engine descriptor and that the flow in which the CAS consumer is
called is a fixedFlow.
For examples where an aggregate analysis engine contains CAS consumers and annotators, please
refer to the "UIMA SDK User's Guide Reference," chapter 4.3.2: "Aggregate Engines can
also contain CAS Consumers."
Note: The optional CAS consumer methods collectionProcessComplete() and batchProcessComplete() are
not called for custom CAS consumers that are deployed in OmniFind as a PEAR file and run in the
fenced box.
Access OmniFind document metadata in custom CAS consumers
A custom CAS consumer may need to access the OmniFind document metadata called
esDocumentMetaData. Listing 2, below, shows how to do that using
the UIMA CAS API. In Listing 2, the document URL and the document ID are retrieved from
the OmniFind document metadata. For more information about the esDocumentMetaData feature,
please refer to the OmniFind type system extension section of
this article.
Listing 2. CAS consumer accessing OmniFind document metadata
try {
    // get DocumentAnnotation from the CAS
    AnnotationFS documentAnnotation = aCAS.getTCAS().getDocumentAnnotation();
    // get DocumentAnnotation type to retrieve the esDocumentMetaData feature
    Type documentAnnotType = documentAnnotation.getType();
    // get esDocumentMetaData feature
    Feature esDocumentMetaDataFeature = documentAnnotType
            .getFeatureByBaseName("esDocumentMetaData");
    // get esDocumentMetaData feature structure for the current document
    FeatureStructure esDocumentMetaData = documentAnnotation
            .getFeatureValue(esDocumentMetaDataFeature);
    String documentURL = null;
    int documentId = -1;
    if (esDocumentMetaData != null) {
        // to retrieve some features from the esDocumentMetaData feature structure
        // we need the esDocumentMetaData type
        Type esDocumentMetaDataType = esDocumentMetaData.getType();
        // get url feature
        Feature urlFeature = esDocumentMetaDataType.getFeatureByBaseName("url");
        // get url feature value
        documentURL = esDocumentMetaData.getStringValue(urlFeature);
        // get document ID feature
        Feature documentIdFeature = esDocumentMetaDataType
                .getFeatureByBaseName("documentId");
        // get document ID feature value
        documentId = esDocumentMetaData.getIntValue(documentIdFeature);
    }
} catch (Exception e) {
    // handle or rethrow the exception as appropriate for your CAS consumer
}
The lines in the code sample relating to creating type and feature objects can be
moved to the typeSystemInit() method
of your analysis component because they do not change with each document.
The URL retrieved in Listing 2 is the same URL that is used by the SIAPI (Search and Index API)
and shown in the OmniFind search application. It can be used to link document results from text analysis with search results.
A simple way to do this is to issue the query url:<url string returned from sample>.
Externalized messages in custom analysis components
When writing a custom analysis component for OmniFind, you can decide if you would like to externalize the log and exception
messages for your component or not. Both ways are possible and work fine in OmniFind.
When you decide to externalize your messages, the messages are only referenced by key and resource bundle
in your source code, while the real messages appear in the resource file outside the code.
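As a rough illustration of what "referenced by key and resource bundle" means, the following sketch resolves a message key against a bundle using only the JDK. It is not OmniFind or UIMA code: the class name ExternalizedMessageSketch, the message text, and the {0} placeholder (in java.text.MessageFormat style) are assumptions made for this example, and the bundle is built from an in-memory string instead of a .properties file on the class path.

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper for illustration; not part of UIMA or OmniFind.
final class ExternalizedMessageSketch {

    // Stands in for a resource file such as ExceptionMessages.properties.
    static ResourceBundle sampleBundle() {
        String fileContent =
                "SAMPLE_EXCEPTION_MESSAGE_KEY=The text of document {0} is null.\n";
        try {
            return new PropertyResourceBundle(new StringReader(fileContent));
        } catch (IOException e) {
            throw new IllegalStateException("cannot read in-memory bundle", e);
        }
    }

    // Looks up the message pattern by key and fills in the arguments.
    static String resolve(ResourceBundle bundle, String key, Object... args) {
        return MessageFormat.format(bundle.getString(key), args);
    }

    public static void main(String[] args) {
        System.out.println(
                resolve(sampleBundle(), "SAMPLE_EXCEPTION_MESSAGE_KEY", "doc-42"));
    }
}
```

In a real component the bundle is loaded by name rather than built in memory, which is why the class loader used for the lookup matters.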
In some cases, when using externalized exception messages, it can happen that the messages cannot be resolved
from the resource bundle file as expected. This is caused by the class loading strategy used in OmniFind
for the fenced box process that dynamically loads the custom analysis components. To
avoid this issue, follow the guideline below and your component will work fine inside and outside of OmniFind.
Note: When you only use externalized log messages and no externalized exception
messages,
you don't have to follow the guideline below. In that case, the UIMA framework can resolve the externalized messages
without any problems.
How to use externalized exception messages
The message resolution error occurs when you use the approach shown in Listing 3 in your analysis component source code to throw an exception
with an externalized exception message text. The example throws an exception in the annotator
process() method using the standard exception type defined by the UIMA framework.
Listing 3. Throwing an exception with an externalized exception message text
public void process(TCAS aTCAS, ResultSpecification aResultSpec)
        throws AnnotatorProcessException {
    String text = aTCAS.getDocumentText();
    // check if the document text is null
    if (text == null) {
        // if the text is null throw an exception
        // AnnotatorProcessException(ResourceBundle, MessageKey, MessageArguments)
        throw new AnnotatorProcessException("com.ibm.uima.annotator.ExceptionMessages",
                "SAMPLE_EXCEPTION_MESSAGE_KEY", null);
    }
}
To solve this issue, just define your own exception class that extends the UIMA exception class
thrown in your method. For the sample in Listing 3, you can create a SampleAnnotatorProcessException
that extends AnnotatorProcessException, which is defined by the UIMA framework.
Listing 4. Custom AnnotatorProcessException implementation
public class SampleAnnotatorProcessException extends AnnotatorProcessException {
    public SampleAnnotatorProcessException(String aResourceBundleName,
            String aMessageKey, Object[] aArguments) {
        super(aResourceBundleName, aMessageKey, aArguments);
    }
}
When you throw the SampleAnnotatorProcessException instead of the
AnnotatorProcessException, as shown in Listing 5, all exception message texts are
resolved correctly.
Listing 5. Exception message texts resolved correctly
public void process(TCAS aTCAS, ResultSpecification aResultSpec)
        throws AnnotatorProcessException {
    String text = null;
    // do any kind of processing with the text string
    processText(text);
    // check after text processing whether the text is null
    if (text == null) {
        // if the text is null throw an exception
        // SampleAnnotatorProcessException(ResourceBundle, MessageKey, MessageArguments)
        throw new SampleAnnotatorProcessException(
                "com.ibm.uima.annotator.ExceptionMessages",
                "SAMPLE_EXCEPTION_MESSAGE_KEY", null);
    }
}
Modifying PEAR timeout values
When running custom annotators in OmniFind, they are executed in a separate process
called the "annotator fenced box."
Processing in the fenced box is subject to a time-out so that the base OmniFind text analysis can detect whether the custom annotator
runs as expected. If the annotator does not respond within the given time frame, the document is dropped.
In some cases, it is necessary to modify the standard time-out configuration setting. For example,
if an annotator performs very complex text analysis, the default time-out value
of 30 seconds might be too low. To change the time-out value, edit the custom annotator settings
in the file EsCpeDescriptor.xml, as shown in Listing 6 below. The file is located in the
<ES_NODE_ROOT>/master_config/<COLID>.parserdriver/specifiers directory.
Search for the CAS processor with deployment="remote" and with the name of
your custom annotator.
Listing 6. Custom annotator settings in the EsCpeDescriptor.xml file
<casProcessor deployment="remote" name="<MyCustomAnnotator>">
  <descriptor>
    <include href="/home/esadmin/config/col1.parserdriver/specifiers/EsSocketService.xml"/>
  </descriptor>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="continue" value="0/100"/>
    <maxConsecutiveRestarts action="continue" value="3"/>
    <timeout max="30000"/>
  </errorHandling>
  <checkpoint batch="1"/>
  <deploymentParameters>
    <parameter name="transport" type="string"
        value="com.ibm.es.control.casprocessor.server.CasProcessorSocketTransport"/>
  </deploymentParameters>
</casProcessor>
The time-out value <timeout max="30000"/> is specified in milliseconds in the
error handling section of the casProcessor.
If the annotator processing takes longer than that, increase the time-out value in order to trigger a time-out event later.
After increasing the time-out value for the custom annotator, it is also necessary to increase the time-out
value for the CPM output queue. The necessary setting is at the end of the EsCpeDescriptor.xml file.
Listing 7. Sample fragment of a Collection Processing Manager to increase the dequeueTimeout value
<cpeConfig>
  <numToProcess>-1</numToProcess>
  <deployAs>single-threaded</deployAs>
  <checkpoint batch="1000" file="" time="100s"/>
  <outputQueue dequeueTimeout="100000"
      queueClass="com.ibm.uima.reference_impl.collection.cpm.engine.SequencedQueue"/>
  <timerImpl/>
</cpeConfig>
Increase the dequeueTimeout value (dequeueTimeout="100000" in Listing 7)
by the same amount that you added to the time-out value of the custom annotator.
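The two changes described above can be sketched with the JDK's DOM parser. This is only an illustration under stated assumptions: the class CpeTimeoutSketch and its method names are hypothetical, the sketch changes an in-memory document and does not write EsCpeDescriptor.xml back to disk, and it assumes that both attributes hold plain millisecond numbers as in Listings 6 and 7.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of OmniFind.
final class CpeTimeoutSketch {

    // Parses descriptor XML from a string; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse descriptor", e);
        }
    }

    // Adds extraMillis to a numeric attribute, if the element exists.
    private static void addMillis(Element element, String attribute, long extraMillis) {
        if (element != null) {
            long current = Long.parseLong(element.getAttribute(attribute));
            element.setAttribute(attribute, Long.toString(current + extraMillis));
        }
    }

    // Raises the time-out of the named remote CAS processor and the
    // dequeueTimeout of the CPM output queue by the same amount.
    static void raiseTimeouts(Document cpe, String annotatorName, long extraMillis) {
        NodeList processors = cpe.getElementsByTagName("casProcessor");
        for (int i = 0; i < processors.getLength(); i++) {
            Element processor = (Element) processors.item(i);
            if ("remote".equals(processor.getAttribute("deployment"))
                    && annotatorName.equals(processor.getAttribute("name"))) {
                addMillis((Element) processor.getElementsByTagName("timeout").item(0),
                        "max", extraMillis);
            }
        }
        addMillis((Element) cpe.getElementsByTagName("outputQueue").item(0),
                "dequeueTimeout", extraMillis);
    }

    // Reads an attribute of the first element with the given tag name.
    static String attribute(Document cpe, String tag, String attribute) {
        return ((Element) cpe.getElementsByTagName(tag).item(0)).getAttribute(attribute);
    }
}
```

Keeping both values in one method makes it harder to raise the annotator time-out and forget the output queue.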
Changing the heap size for custom annotators
When running custom annotators in OmniFind, they are executed in a separate process called the annotator fenced box or
CAS processor.
In some cases, it is necessary to increase the heap size for this process (for example, when the annotator needs a lot
of memory, or when the collection containing a custom annotator is running
multi-threaded). In OmniFind 8.4,
the default memory size for the fenced box depends on the installed OmniFind memory model. The memory size values are
shown in Table 1 below.
Table 1. Memory sizes for fenced box based on OmniFind memory model
| OmniFind memory model | Fenced box memory size |
|---|---|
| Small | 100MB |
| Medium | 450MB |
| Large | 750MB |
In order to change the heap size for the fenced box, you have to modify the following configuration file:
<ES_NODE_ROOT>/master_config/<COLID>_config.ini
Within the file, search for an expression such as
session<N>.type=casprocessor
to get the session number <N> for the current collection's CAS processor.
After you have the session number, change the heap size in the following setting:
session<N>.max_heap=<size in MB>
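The lookup and change described above can be sketched as follows. The class FencedBoxHeapSketch is hypothetical, and the sketch assumes that the configuration file consists of simple key=value lines; it works on a list of lines and leaves reading and writing the file to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OmniFind.
final class FencedBoxHeapSketch {

    private static final Pattern CAS_PROCESSOR_SESSION =
            Pattern.compile("\\s*session(\\d+)\\.type\\s*=\\s*casprocessor\\s*");

    // Returns the session number <N> of the CAS processor session.
    static String findSessionNumber(List<String> lines) {
        for (String line : lines) {
            Matcher matcher = CAS_PROCESSOR_SESSION.matcher(line);
            if (matcher.matches()) {
                return matcher.group(1);
            }
        }
        throw new IllegalArgumentException("no casprocessor session found");
    }

    // Returns a copy of the lines with session<N>.max_heap set to heapMb.
    static List<String> setMaxHeap(List<String> lines, int heapMb) {
        String key = "session" + findSessionNumber(lines) + ".max_heap";
        Pattern heapLine = Pattern.compile("\\s*" + Pattern.quote(key) + "\\s*=.*");
        List<String> result = new ArrayList<>();
        boolean replaced = false;
        for (String line : lines) {
            if (heapLine.matcher(line).matches()) {
                result.add(key + "=" + heapMb);
                replaced = true;
            } else {
                result.add(line);
            }
        }
        if (!replaced) {
            result.add(key + "=" + heapMb); // setting was missing, so append it
        }
        return result;
    }
}
```

Matching the full key, including the dot after the session number, keeps a setting such as session12.max_heap untouched when the CAS processor is session 1.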
OmniFind must be restarted so the changes become effective. Be careful with increasing the heap size. In some
cases the JVM cannot start properly if the heap size is too big.
For additional help, see the memory recommendations in the OmniFind installation guide.
For details about how to restart the OmniFind system, please refer to the OmniFind
documentation book Administering Enterprise Search, "Starting and stopping an
enterprise search system."
Viewing custom annotator log messages in OmniFind
When running a custom annotator in OmniFind, the custom annotator log messages are written to the OmniFind audit log.
The log messages can be found in the audit log file at
<ES_NODE_ROOT>/logs/audit/<COLID>.casprocessor_audit_<current_date>.log.
The OmniFind log level for audit log messages is set to
Informational and cannot be changed to another value. This means that all audit log messages are written to the
log file.
The OmniFind logging system provides three different log levels; these are: Error,
Warning, and Informational.
Within the UIMA logging architecture, there are seven log levels available:
Severe, Warning, Info,
Config, Fine, Finer, and
Finest.
By default, only some of the UIMA log levels are mapped to the OmniFind logging system.
Please see Table 2, below, for details:
Table 2. OmniFind and UIMA log level mapping
| OmniFind log level | UIMA log level |
|---|---|
| Error | Severe |
| Warning | Warning |
| Informational | Info |
| not mapped | Config, Fine, Finer, Finest |
With the default level mapping, the custom annotator
log messages with the log levels Info, Warning,
and Severe are written to the log file. To change the
default mapping behavior,
additional UIMA log levels below Info can be mapped to the OmniFind
Informational log level. The necessary steps that must be
performed are as follows:
1. Locate the configuration file tokenizer.properties in the
<ES_NODE_ROOT>/master_config/parserservice/ directory.
2. Search inside the file for the log level configuration setting, as shown below.
If no such setting exists, create one according to step 3.
trevi.tokenizer.jedii.casprocessor.InformationalLevelMapping=Info
3. In order to see more detailed log messages than UIMA
Info messages in the log file,
replace the log level mapping value with the desired UIMA log level.
For example, use the following setting in order to see all UIMA annotator log
messages in the OmniFind audit log:
trevi.tokenizer.jedii.casprocessor.InformationalLevelMapping=Finest
Note: The log level mapping for Warning and Error
messages cannot be changed.
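One way to read the mapping in Table 2 together with the InformationalLevelMapping setting is as a threshold: Info and every finer UIMA level down to the configured level are written as Informational. The sketch below encodes that reading. It is an interpretation of the text above, not OmniFind code, and the class and enum names are hypothetical.

```java
import java.util.Optional;

// Hypothetical model of the level mapping for illustration; not part of OmniFind.
final class LogLevelMappingSketch {

    // UIMA log levels, ordered from most severe to most detailed.
    enum UimaLevel { SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST }

    // Returns the OmniFind level for a UIMA level, or empty if it is not mapped.
    // informationalMapping is the value of the InformationalLevelMapping setting.
    static Optional<String> omniFindLevel(UimaLevel level, UimaLevel informationalMapping) {
        if (level == UimaLevel.SEVERE) {
            return Optional.of("Error");   // fixed mapping
        }
        if (level == UimaLevel.WARNING) {
            return Optional.of("Warning"); // fixed mapping
        }
        // Assumption: levels from Info down to the configured level are logged.
        if (level.ordinal() <= informationalMapping.ordinal()) {
            return Optional.of("Informational");
        }
        return Optional.empty();
    }
}
```

With the default setting (Info), a Fine message is not mapped; with the setting Finest, every UIMA level is mapped to some OmniFind level.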
OmniFind collection-specific UIMA descriptors
Each collection created in OmniFind has its own set of UIMA descriptors. These descriptors
are necessary to start up and run the text analysis components used in the OmniFind parser and runtime component.
The set of descriptors is located in a collection-specific directory that
is called specifiers and is
located at <ES_NODE_ROOT>/master_config/<COLID>.parserdriver/ or
<ES_NODE_ROOT>/master_config/<COLID>.runtime.node<N>/. The first
location (.parserdriver) is used in the OmniFind parser component for document
analysis; the second location
(.runtime.node<N>) is used in the OmniFind runtime for query analysis.
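For scripts that need to locate these directories, the two path patterns can be written down as a small helper. This is a sketch only: the class SpecifierPathsSketch is hypothetical, the node root and collection ID are supplied by the caller, and treating the node number <N> as an integer is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of OmniFind.
final class SpecifierPathsSketch {

    // Specifiers used by the parser component for document analysis.
    static Path parserSpecifiers(String esNodeRoot, String collectionId) {
        return Paths.get(esNodeRoot, "master_config",
                collectionId + ".parserdriver", "specifiers");
    }

    // Specifiers used by the runtime for query analysis on node <N>.
    static Path runtimeSpecifiers(String esNodeRoot, String collectionId, int node) {
        return Paths.get(esNodeRoot, "master_config",
                collectionId + ".runtime.node" + node, "specifiers");
    }

    // Location of the CPE descriptor of a collection.
    static Path cpeDescriptor(String esNodeRoot, String collectionId) {
        return parserSpecifiers(esNodeRoot, collectionId).resolve("EsCpeDescriptor.xml");
    }
}
```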
When a new collection is created, all necessary descriptors are copied from the
<EsInstallRoot>/default_config/parserdriver/specifiers directory to
the collection-specific specifier directories, where they are updated with
the current collection settings specified during collection creation.
When collection settings are changed from the admin UI, the descriptors are updated accordingly.
Note: If descriptors are modified directly without using the admin UI, the changes are
overwritten when the collection settings are changed in the admin UI.
So if you need to manually change descriptors, you should either no longer use the admin
UI to change collection settings for that collection, or have a way to re-apply
your modifications after admin UI changes.
For a collection, the following specifiers are relevant:
Table 3. Specifiers
| Specifiers | Description |
|---|---|
| EsCpeDescriptor.xml | Main text-analysis-processing descriptor used for the document processing; contains the information to start up the Collection Processing Manager (CPM) by referring to some of the descriptors below |
| EsCollectionReader.xml | Describes the OmniFind document collection reader configuration settings |
| EsLanguageIdentifier.xml | Describes the configuration settings for the language identification annotator |
| EsIndexCasConsumer.xml | Describes the configuration settings for the OmniFind index CAS consumer component that prepares the documents for indexing |
| EsCasInitializer.xml | Describes the configuration settings for the OmniFind document parser |
| es_tok_no_stw.xml | Main descriptor for the document-text-analysis processing. The descriptor refers to the text-analysis components used for document processing and contains common settings. |
| es_tok_with_stw.xml | Main descriptor for the search-query-analysis processing. The descriptor refers to the text-analysis components used for query-analysis processing and contains common settings. |
| es_backend_specifier_with_rbcat.xml | Main descriptor for the document-text-analysis processing with additional rule-based categorization. The descriptor refers to the text-analysis components used for document processing and contains common settings and additional parameters for the rule-based categorization. |
| es_backend_specifier_with_mbcat.xml | Main descriptor for the document-text-analysis processing with additional model-based categorization. The descriptor refers to the text-analysis components used for document processing and contains common settings and additional parameters for the model-based categorization. |
| j_mbcategorizer_es.xml | Analysis component descriptor for model-based categorization |
| j_rbcategorizer_es.xml | Analysis component descriptor for rule-based categorization |
| cas2jdbc.xml | Describes the configuration settings for the OmniFind CAS to JDBC CAS consumer component that prepares the documents for storing in an external database |
| jfrost.xml | Analysis-component descriptor for dictionary-based text segmentation |
| jfrost_ngram.xml | Analysis-component descriptor for dictionary-based text segmentation without CJK languages |
| jfrost_dict_lookup.xml | Analysis component for pretokenized-dictionary lookup of synonyms and boost terms used in query analysis |
| jtok.xml | Analysis-component descriptor for non-dictionary-based text and ngram segmentation |
| tt_core_typesystem.xml | UIMA core type system; defines the basic types for text analytics |
| of_typesystem.xml | OmniFind type system extension; defines additional types for OmniFind based on the UIMA core type system |
| tt_extension_typesystem.xml | UIMA extension type system; defines additional text analytics types for advanced text analytics |
| dlt_extension_typesystem.xml | Advanced type system for dictionary-based text segmentation |
OmniFind type system extension
The IBM OmniFind Enterprise Edition has extended the UIMA type system with specific types and features.
For details about the UIMA type system, which is the base for the OmniFind type
system, please refer to
the OmniFind documentation "Custom Text Analysis." The main types and features of
the extended OmniFind type
system are shown in Table 4, below. For a complete list of types, please refer to
the OmniFind type system
definition file of_typesystem.xml, which is located in the directory
<EsInstallRoot>/default_config/parserdriver/specifiers.
Table 4. Main types and features of the extended OmniFind type system
| Types and features | Description |
|---|---|
| uima.tcas.DocumentAnnotation | The default UIMA DocumentAnnotation has been extended in OmniFind with an additional feature |
| esDocumentMetaData | Contains document metadata of the type com.ibm.es.tt.DocumentMetaData. |
| com.ibm.es.tt.DocumentMetaData | The features of the com.ibm.es.tt.DocumentMetaData are connected to the DocumentAnnotation feature esDocumentMetaData |
| crawlerId | The crawler name; the feature value is of type uima.cas.String |
| dataSource | Contains the data source type of the document; the feature value is of type uima.cas.String |
| dataSourceName | The name of the crawler (data source); the feature value is of type uima.cas.String |
| docType | Contains the document type of the document; the feature value is of type uima.cas.String |
| date | Contains the document date; the feature value is of type uima.cas.String |
| baseUri | Contains the base URI of the document; the feature value is of type uima.cas.String |
| metaDataFields | The feature value is of type uima.cas.FSArray; each element in this array is of type com.ibm.es.tt.MetaDataField |
| documentName | The name of the document if available; the feature value is of type uima.cas.String |
| documentId | The unique and sortable document ID of the document; the feature value is of type uima.cas.Integer |
| title | The title of the document; the feature value is of type uima.cas.String |
| redirectUrl | Contains the redirected URL; the feature value is of type uima.cas.String |
| mimeType | Mime type or document type of the document (for example, XML); the feature value is of type uima.cas.String |
| url | The URL of the document; the feature value is of type uima.cas.String |
| com.ibm.es.tt.CommonFieldParameters | |
| searchable | A flag indicating if the field is free-text searchable; the feature value is of type uima.cas.Integer |
| fieldSearchable | A flag indicating if the field is searchable as a field; the feature value is of type uima.cas.Integer |
| parametric | A flag indicating parametric search; the feature value is of type uima.cas.Integer |
| showInSearchResult | A flag indicating if annotated data is included in the search result details; the feature value is of type uima.cas.Integer |
| name | The name of the field (you can search for this field using the field name); the feature value is of type uima.cas.String |
| sortable | A flag indicating that the field is string searchable; the feature value is of type uima.cas.Integer |
| exactMatch | A flag indicating that the search must be an exact match; the feature value is of type uima.cas.Integer |
| com.ibm.es.tt.ContentField | The default UIMA DocumentAnnotation has been extended in OmniFind with an additional feature |
| parameters | The content field parameters are of type com.ibm.es.tt.CommonFieldParameters |
| com.ibm.es.tt.MetaDataField | MetadataField data is not part of the document content but is stored in the text feature |
| parameters | MetadataField parameters of type com.ibm.es.tt.CommonFieldParameters |
| text | The metadata text is stored in this feature; the feature value is of type uima.cas.String |
| com.ibm.es.tt.Anchor | The anchor annotation for anchor text in HTML documents |
| uri | The target URI of the anchor text; the feature value is of type uima.cas.String |
| com.ibm.es.tt.MarkupTag | The markup information annotations (for example, of an XML tag); the markup information is stored in the features |
| name | The name of the markup tag; the feature value is of type uima.cas.String |
| depth | The nesting depth; the feature value is of type uima.cas.Integer |
| attributeName | The name for the feature attribute; the feature value is of type uima.cas.StringArray |
| attributeValues | A string of values for the attribute; the feature value is of type uima.cas.StringArray |
About the author
Since joining IBM in 2003, Michael Baessler has worked in the Enterprise Search Development team located in Boeblingen, Germany.
Michael works on the integration of the Unstructured Information Management Architecture (UIMA) into OmniFind. Within OmniFind, UIMA
is used for the base linguistic analysis and enables customers to plug in custom analysis components. Besides his development effort
for OmniFind, Michael also works on new features and requirements for the UIMA framework itself.