Hints and tips for using the Unstructured Information Management Architecture SDK with OmniFind

Advanced UIMA configuration in IBM OmniFind Enterprise Edition 8.4

Level: Advanced

Michael Baessler (mbaessle@de.ibm.com), Software Engineer, IBM

22 Feb 2007

Get hints and tips about how to use the Unstructured Information Management Architecture (UIMA) in IBM® OmniFind™ Enterprise Edition 8.4. This article targets advanced developers who use UIMA to build applications on top of OmniFind Enterprise Edition, and it describes techniques for advanced use of UIMA inside IBM OmniFind Enterprise Edition 8.4.

Using custom CAS consumers in OmniFind

Besides running custom annotators in OmniFind, it is also possible to run custom CAS consumers. To run such an analysis component in OmniFind, it must be packaged into a PEAR file in the same way as custom annotators. You can upload the PEAR file to OmniFind as an analysis engine and then later associate it with one or more collections. This article does not explain the details of the uploading process to OmniFind. For more information about this, please refer to the OmniFind documentation book Administering Enterprise Search, in the section on "Custom Text Processing," in the main chapter "Working with collections and external sources/Enterprise Search parser administration."

Package a CAS consumer into a PEAR file

When packaging a CAS consumer component as a PEAR file, you have to wrap the CAS consumer component into an aggregate analysis engine descriptor. You always have to do that, even when you only have a single CAS consumer in the PEAR package. This step is necessary so that the UIMA components can call and handle the CAS consumer descriptor. A sample aggregate analysis engine descriptor containing only a single CAS consumer component is shown in Listing 1 below:


Listing 1. Aggregate analysis engine descriptor containing a single CAS consumer component

<?xml version="1.0" encoding="UTF-8" ?>
<taeDescription xmlns="http://uima.watson.ibm.com/resourceSpecifier">

   <frameworkImplementation>com.ibm.uima.java</frameworkImplementation>
   <primitive>false</primitive>
   <delegateAnalysisEngineSpecifiers>
       <delegateAnalysisEngine key="myCASConsumer">
          <import location="./casConsumer.xml"/>
       </delegateAnalysisEngine>
    </delegateAnalysisEngineSpecifiers>

   <analysisEngineMetaData>
      <name>CAS consumer wrapper</name>
      <description>sample CAS consumer wrapper</description>
      <version>1.0</version>
      <vendor>IBM Corporation</vendor>

      <flowConstraints>
         <fixedFlow>
             <node>myCASConsumer</node>
         </fixedFlow>
      </flowConstraints>

      <operationalProperties>
          <modifiesCas>false</modifiesCas>
          <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      </operationalProperties>

   </analysisEngineMetaData>
</taeDescription>
      

For the CAS consumer to be called correctly, you have to use a fixedFlow as the flow constraint (as shown in the sample descriptor above). If you don't use such a flow, the CAS consumer isn't called, because a CAS consumer declares no capabilities that the framework could use to schedule it. For more information about the different flow constraints that are available for UIMA, please refer to the "UIMA SDK User's Guide Reference," chapter 4.5.5: "Result Specification Setting."

Other important parts of the sample aggregate descriptor above are the operational properties. These properties must be set according to the settings of the CAS consumer descriptor. For more information about operational properties, please refer to the "UIMA SDK User's Guide Reference," chapter 18.3.1: "Primitive Analysis Engine Descriptors."

In Listing 1 above, the aggregate analysis engine contains only a single CAS consumer component, but it is possible to add annotator components to the same aggregate analysis engine. What matters is that the CAS consumer is wrapped in an aggregate analysis engine descriptor and that the flow in which the CAS consumer is called is a fixedFlow. For examples where an aggregate analysis engine contains CAS consumers and annotators, please refer to the "UIMA SDK User's Guide Reference," chapter 4.3.2: "Aggregate Engines can also contain CAS Consumers."

Note: The optional CAS consumer methods collectionProcessComplete() and batchProcessComplete() will not be called for custom CAS consumers in OmniFind that are deployed as a PEAR file and running in the fenced box.

Access OmniFind document metadata in custom CAS consumers

A custom CAS consumer may want to access the OmniFind document metadata called esDocumentMetaData. Listing 2, below, shows how that can be accomplished using the UIMA CAS API. In Listing 2, the document URL and the document ID are retrieved from the OmniFind document metadata. For more information about the esDocumentMetaData feature, please refer to the OmniFind type system extension section of this article.


Listing 2. CAS consumer accessing OmniFind document metadata

try {
   // get DocumentAnnotation from the CAS
   AnnotationFS documentAnnotation = aCAS.getTCAS().getDocumentAnnotation();
     
   // get DocumentAnnotation type to retrieve the esDocumentMetaData feature
   Type documentAnnotType = documentAnnotation.getType();
     
   // get esDocumentMetaData feature
   Feature esDocumentMetaDataFeature = documentAnnotType
       .getFeatureByBaseName("esDocumentMetaData");
     
   // get esDocumentMetaData feature structure for the current document
   FeatureStructure esDocumentMetaData = documentAnnotation
       .getFeatureValue(esDocumentMetaDataFeature);

   String documentURL = null;
   int documentId = -1;

   if (esDocumentMetaData != null) {
      // to retrieve some features from the esDocumentMetaData feature structure
      // we need the esDocumentMetaData type
      Type esDocumentMetaDataType = esDocumentMetaData.getType();

      // get url feature
      Feature urlFeature = esDocumentMetaDataType.getFeatureByBaseName("url");
      // get url feature value
      documentURL = esDocumentMetaData.getStringValue(urlFeature);

      // get document ID feature
      Feature documentIdFeature = esDocumentMetaDataType
          .getFeatureByBaseName("documentId");
      // get document ID feature value
      documentId = esDocumentMetaData.getIntValue(documentIdFeature);
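   }
} catch (Exception e) {
   // handle or rethrow the failure here, for example by wrapping it in the
   // exception type that your CAS consumer method declares
}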

The lines in the code sample relating to creating type and feature objects can be moved to the typeSystemInit() method of your analysis component because they do not change with each document.

The URL that is retrieved in Listing 2 is the same one used by the SIAPI (Search and Index API) and shown in the OmniFind search application. It can be used to link document results from text analysis with search results. A simple way to do this is to issue a query of the form url:<url string returned from sample>.





Externalized messages in custom analysis components

When writing a custom analysis component for OmniFind, you can decide if you would like to externalize the log and exception messages for your component or not. Both ways are possible and work fine in OmniFind.

When you decide to externalize your messages, the messages are only referenced by key and resource bundle in your source code, while the real messages appear in the resource file outside the code. In some cases, when using externalized exception messages, it can happen that the messages cannot be resolved from the resource bundle file as expected. This is caused by the class loading strategy used in OmniFind for the fenced box process that dynamically loads the custom analysis components. To avoid this issue, follow the guideline below and your component will work fine inside and outside of OmniFind.

Note: When you only use externalized log messages and no externalized exception messages, you don't have to follow the guideline below. In that case, the UIMA framework can resolve the externalized messages without any problems.
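To make the idea of "referenced by key and resource bundle" concrete, the following is a minimal sketch that uses only the JDK and no UIMA classes. The bundle content, the message text, and the class name ExternalizedMessageSketch are assumptions made for illustration. In a real component, the bundle would normally be a properties file that is looked up by name through the class loader, which is exactly the step that can fail in the fenced box.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class ExternalizedMessageSketch {

    // Hypothetical bundle content; normally stored in a .properties file
    // outside the source code.
    static final String BUNDLE_TEXT =
            "SAMPLE_EXCEPTION_MESSAGE_KEY=Document text is null for document {0}.\n";

    // Builds the bundle from the inline text so that the sketch is self-contained.
    static ResourceBundle loadInlineBundle() {
        try {
            return new PropertyResourceBundle(new StringReader(BUNDLE_TEXT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the message pattern by key and fills in the message arguments.
    static String resolve(ResourceBundle bundle, String key, Object[] arguments) {
        return MessageFormat.format(bundle.getString(key), arguments);
    }

    public static void main(String[] args) {
        String message = resolve(loadInlineBundle(),
                "SAMPLE_EXCEPTION_MESSAGE_KEY", new Object[] { "doc-1" });
        System.out.println(message);
    }
}
```

The source code only carries the key and the arguments; the readable text lives in the bundle and can be changed or translated without recompiling the component.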

How to use externalized exception messages

The message resolution error occurs when you use the approach shown in Listing 3 in your analysis component source code to throw an exception with an externalized exception message text. The example throws an exception in the annotator process() method using the standard exception type defined by the UIMA framework.


Listing 3. Throwing an exception with an externalized exception message text

public void process(TCAS aTCAS, ResultSpecification aResultSpec)
  throws AnnotatorProcessException {

  String text = aTCAS.getDocumentText();

  //check if the document text is null
  if(text == null) {
    //if the text is null throw an exception
    //AnnotatorProcessException(ResourceBundle, MessageKey, MessageArguments)
    throw new AnnotatorProcessException("com.ibm.uima.annotator.ExceptionMessages",
      "SAMPLE_EXCEPTION_MESSAGE_KEY", null);
  }
}
	

To solve this issue, just define your own exception class that extends the UIMA exception class thrown in your method. For the sample in Listing 3, you can create a SampleAnnotatorProcessException that extends AnnotatorProcessException, which is defined by the UIMA framework.


Listing 4. Custom AnnotatorProcessException implementation

public class SampleAnnotatorProcessException extends AnnotatorProcessException {
   
   public SampleAnnotatorProcessException(String aResourceBundleName,
     String aMessageKey, Object[] aArguments) {
     super(aResourceBundleName, aMessageKey, aArguments);   
   }
}
	

When you use the SampleAnnotatorProcessException in your sample, instead of the AnnotatorProcessException, all works fine, and all exception message texts can be resolved correctly.


Listing 5. Exception message texts resolved correctly

public void process(TCAS aTCAS, ResultSpecification aResultSpec)
  throws AnnotatorProcessException {	
  String text = null;
     
  //do any kind of processing with the text string
  processText(text);
     
  //check after text processing if the text is not null
  if(text == null) {
    //if the text is null throw an exception
    //SampleAnnotatorProcessException(ResourceBundle, MessageKey, MessageArguments)
    throw new 
      SampleAnnotatorProcessException("com.ibm.uima.annotator.ExceptionMessages",
      "SAMPLE_EXCEPTION_MESSAGE_KEY", null);
  }
}	
	





Modifying PEAR timeout values

When custom annotators run in OmniFind, they are executed in a separate process called the "annotator fenced box." Processing in this process is subject to a time-out, so that the base OmniFind text analysis can detect whether the custom annotator runs as expected and can drop the document if the annotator does not respond within the given time frame. In some cases, it is necessary to modify the default time-out setting. For example, if an annotator performs very complex text analysis, the default time-out value of 30 seconds may be too low. The time-out value is part of the custom annotator settings in the file EsCpeDescriptor.xml, shown in Listing 6 below. The file is located in the <ES_NODE_ROOT>/master_config/<COLID>.parserdriver/specifiers directory. Search for the CAS processor with deployment="remote" and with the name of your custom annotator.


Listing 6. Custom annotator settings in the EsCpeDescriptor.xml file

<casProcessor deployment="remote" name="<MyCustomAnnotator>">
  <descriptor>
    <include href="/home/esadmin/config/col1.parserdriver/specifiers/EsSocketService.xml"/>
  </descriptor>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="continue" value="0/100"/>
    <maxConsecutiveRestarts action="continue" value="3"/>
    <timeout max="30000"/>
  </errorHandling>
  <checkpoint batch="1"/>
  <deploymentParameters>
    <parameter name="transport" type="string"
      value="com.ibm.es.control.casprocessor.server.CasProcessorSocketTransport"/>
  </deploymentParameters>
</casProcessor>

The time-out value <timeout max="30000"/> is specified in milliseconds in the error handling section of the casProcessor. If the annotator processing takes longer than that, increase the time-out value in order to trigger a time-out event later. After increasing the time-out value for the custom annotator, it is also necessary to increase the time-out value for the CPM output queue. The necessary setting is at the end of the EsCpeDescriptor.xml file.


Listing 7. Sample fragment of a Collection Processing Manager to increase the dequeueTimeout value

<cpeConfig>
  <numToProcess>-1</numToProcess>
  <deployAs>single-threaded</deployAs>
  <checkpoint batch="1000" file="" time="100s"/>    
  <outputQueue dequeueTimeout="100000" 
   queueClass="com.ibm.uima.reference_impl.collection.cpm.engine.SequencedQueue"/>
  <timerImpl/>
</cpeConfig>

Increase the dequeueTimeout value (dequeueTimeout="100000" in Listing 7) by the same amount that you added to the time-out value of the custom annotator.
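If you prefer to script the two changes instead of editing the file by hand, the following sketch shows one possible way to do it with the XML parser that ships with the JDK. It is an illustration under assumptions, not an OmniFind tool: the class name TimeoutUpdateSketch, the reduced sample descriptor, and its cpeDescription root element are made up for the example, and writing the modified document back to EsCpeDescriptor.xml is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class TimeoutUpdateSketch {

    // Reduced, hypothetical descriptor that contains only the relevant elements.
    static final String SAMPLE =
            "<cpeDescription>"
            + "<casProcessor deployment=\"remote\" name=\"MyCustomAnnotator\">"
            + "<errorHandling><timeout max=\"30000\"/></errorHandling>"
            + "</casProcessor>"
            + "<cpeConfig><outputQueue dequeueTimeout=\"100000\"/></cpeConfig>"
            + "</cpeDescription>";

    // Adds extraMillis to the time-out of the named remote CAS processor
    // and to the dequeueTimeout of the CPM output queue.
    static Document raiseTimeouts(String descriptorXml, String annotatorName, long extraMillis) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(descriptorXml.getBytes(StandardCharsets.UTF_8)));
            NodeList processors = doc.getElementsByTagName("casProcessor");
            for (int i = 0; i < processors.getLength(); i++) {
                Element processor = (Element) processors.item(i);
                if ("remote".equals(processor.getAttribute("deployment"))
                        && annotatorName.equals(processor.getAttribute("name"))) {
                    Element timeout = (Element) processor.getElementsByTagName("timeout").item(0);
                    addTo(timeout, "max", extraMillis);
                }
            }
            Element queue = (Element) doc.getElementsByTagName("outputQueue").item(0);
            addTo(queue, "dequeueTimeout", extraMillis);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not update the descriptor", e);
        }
    }

    // Reads a numeric attribute, adds the given amount, and writes it back.
    private static void addTo(Element element, String attribute, long extraMillis) {
        long current = Long.parseLong(element.getAttribute(attribute));
        element.setAttribute(attribute, Long.toString(current + extraMillis));
    }

    public static void main(String[] args) {
        Document doc = raiseTimeouts(SAMPLE, "MyCustomAnnotator", 30000L);
        Element timeout = (Element) doc.getElementsByTagName("timeout").item(0);
        Element queue = (Element) doc.getElementsByTagName("outputQueue").item(0);
        System.out.println("timeout max = " + timeout.getAttribute("max"));
        System.out.println("dequeueTimeout = " + queue.getAttribute("dequeueTimeout"));
    }
}
```

Both values are raised by the same amount so that the output queue does not time out before the annotator does.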





Changing the heap size for custom annotators

When custom annotators run in OmniFind, they are executed in a separate process called the annotator fenced box or CAS processor. In some cases, it is necessary to increase the heap size for this process (for example, when the annotator needs a lot of memory, or when the collection containing a custom annotator is running multi-threaded). In OmniFind 8.4, the default memory size for the fenced box depends on the installed OmniFind memory model. The memory size values are shown in Table 1 below.


Table 1. Memory sizes for fenced box based on OmniFind memory model
OmniFind memory model | Fenced box memory size
Small | 100 MB
Medium | 450 MB
Large | 750 MB


In order to change the heap size for the fenced box, you have to modify the following configuration file:

<ES_NODE_ROOT>/master_config/<COLID>_config.ini

Within the file, search for an expression such as

session<N>.type=casprocessor

to get the session number <N> for the current collection's CAS processor. After you have the session number, change the heap size in the following setting:

session<N>.max_heap=<size in MB>

OmniFind must be restarted so the changes become effective. Be careful with increasing the heap size. In some cases the JVM cannot start properly if the heap size is too big. For additional help, see the memory recommendations in the OmniFind installation guide. For details about how to restart the OmniFind system, please refer to the OmniFind documentation book Administering Enterprise Search, "Starting and stopping an enterprise search system."
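The two-step edit described above (find the session whose type is casprocessor, then set its max_heap value) can be sketched as follows. This is only an illustration that works on the lines of the file held in memory. The class name HeapSizeSketch and the sample lines, including the parserdriver session, are assumptions; reading and writing <COLID>_config.ini and restarting OmniFind are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeapSizeSketch {

    // Matches a line such as "session2.type=casprocessor" and captures "session2".
    private static final Pattern CAS_PROCESSOR_SESSION =
            Pattern.compile("(session\\d+)\\.type=casprocessor\\s*");

    // Returns a copy of the lines with the CAS processor max_heap set to sizeInMb.
    static List<String> setMaxHeap(List<String> iniLines, int sizeInMb) {
        String session = null;
        for (String line : iniLines) {
            Matcher matcher = CAS_PROCESSOR_SESSION.matcher(line);
            if (matcher.matches()) {
                session = matcher.group(1);
                break;
            }
        }
        if (session == null) {
            throw new IllegalArgumentException("No session with type casprocessor found");
        }
        String key = session + ".max_heap=";
        List<String> result = new ArrayList<>();
        boolean replaced = false;
        for (String line : iniLines) {
            if (line.startsWith(key)) {
                result.add(key + sizeInMb);
                replaced = true;
            } else {
                result.add(line);
            }
        }
        if (!replaced) {
            // The setting did not exist yet, so append it.
            result.add(key + sizeInMb);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical file content for illustration only.
        List<String> lines = Arrays.asList(
                "session1.type=parserdriver",
                "session1.max_heap=400",
                "session2.type=casprocessor",
                "session2.max_heap=450");
        for (String line : setMaxHeap(lines, 600)) {
            System.out.println(line);
        }
    }
}
```

Only the max_heap entry of the CAS processor session is touched; the settings of other sessions are copied unchanged.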





Viewing custom annotator log messages in OmniFind

When running a custom annotator in OmniFind, the custom annotator log messages are written to the OmniFind audit log. The audit log file where the log messages can be found is provided at <ES_NODE_ROOT>/logs/audit/<COLID>.casprocessor_audit_<current_date>.log. By default, the OmniFind audit log level for audit log messages is set to Informational and cannot be changed to another value. This means that by default all audit log messages are logged to the log file.

The OmniFind logging system provides three different log levels; these are: Error, Warning, and Informational. Within the UIMA logging architecture, there are seven log levels available: Severe, Warning, Info, Config, Fine, Finer, and Finest. By default, only some of the UIMA log levels are mapped to the OmniFind logging system. Please see Table 2, below, for details:


Table 2. OmniFind and UIMA log level mapping
OmniFind log level | UIMA log level
Error | Severe
Warning | Warning
Informational | Info
not mapped | Config, Fine, Finer, Finest


With the default level mapping, the custom annotator log messages with the log levels Info, Warning, and Severe are written to the log file. To change the default mapping behavior, additional UIMA log levels below Info can be mapped to the OmniFind Informational log level. The necessary steps that must be performed are as follows:

  1. Locate the configuration file tokenizer.properties in the <ES_NODE_ROOT>/master_config/parserservice/ directory.
  2. Search inside the file for the log level configuration setting, as shown below. If no such setting exists, create one according to step 3.

    trevi.tokenizer.jedii.casprocessor.InformationalLevelMapping=Info

  3. To see more detailed log messages than UIMA Info messages in the log file, replace the log level mapping value with the desired UIMA log level. For example, use the following setting to see all UIMA annotator log messages in the OmniFind audit log:

    trevi.tokenizer.jedii.casprocessor.InformationalLevelMapping=Finest

Note: The log level mapping for Warning and Error messages cannot be changed.
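The effect of the InformationalLevelMapping setting can be modeled with a few lines of Java. The sketch below is not OmniFind code, and the class and method names are made up. It assumes that the configured value acts as a threshold, so that every UIMA level from Info down to the configured level is written as Informational; the article states this only for the values Info and Finest.

```java
class LogLevelMappingSketch {

    // UIMA log levels, ordered from most severe to most detailed.
    enum UimaLevel { SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST }

    // Returns the OmniFind log level for a UIMA message, or null when the
    // message is not written to the audit log. informationalMapping stands
    // for the value of the InformationalLevelMapping setting (assumed to
    // be INFO or a more detailed level).
    static String toOmniFindLevel(UimaLevel level, UimaLevel informationalMapping) {
        switch (level) {
            case SEVERE:
                return "Error";      // fixed mapping, cannot be changed
            case WARNING:
                return "Warning";    // fixed mapping, cannot be changed
            default:
                // Assumed threshold semantics: INFO and every level down to
                // the configured one are logged as Informational.
                return level.ordinal() <= informationalMapping.ordinal()
                        ? "Informational" : null;
        }
    }

    public static void main(String[] args) {
        for (UimaLevel level : UimaLevel.values()) {
            System.out.println(level + " -> default: "
                    + toOmniFindLevel(level, UimaLevel.INFO)
                    + ", with Finest: " + toOmniFindLevel(level, UimaLevel.FINEST));
        }
    }
}
```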





OmniFind collection specific UIMA descriptors

Each collection created in OmniFind has its own set of UIMA descriptors. These descriptors are necessary to start up and run the text analysis components used in the OmniFind parser and runtime component. The set of descriptors is located in a collection-specific directory that is called specifiers and is located at <EsNodeRoot>/master_config/<COLID>.parserdriver/ or <EsNodeRoot>/master_config/<COLID>.runtime.node<N>/. The first location (.parserdriver) is used in the OmniFind parser component for document analysis; the second location (.runtime.node<N>) is used in the OmniFind runtime for query analysis.

When a new collection is created, all necessary descriptors are copied from the <EsInstallRoot>/default_config/parserdriver/specifiers directory to the collection-specific specifier directories, where they are updated with the current collection settings specified during collection creation. When collection settings are changed from the admin UI, the descriptors are updated accordingly.

Note: If descriptors are modified directly without using the admin UI, the changes may be lost or overwritten when the collection settings are changed in the admin UI. So if you need to change descriptors manually, you should either no longer use the admin UI to change collection settings for that collection, or have a way to re-apply your modifications after admin UI changes.

For a collection, the following specifiers are relevant:


Table 3. Specifiers
Specifier | Description
EsCpeDescriptor.xml | Main text-analysis-processing descriptor used for the document processing -- contains the information to start up the Collection Processing Manager (CPM) by referring to some of the descriptors below
EsCollectionReader.xml | Describes the OmniFind document collection reader configuration settings
EsLanguageIdentifier.xml | Describes the configuration settings for the language identification annotator
EsIndexCasConsumer.xml | Describes the configuration settings for the OmniFind index CAS consumer component that prepares the documents for indexing
EsCasInitializer.xml | Describes the configuration settings for the OmniFind document parser
es_tok_no_stw.xml | Main descriptor for the document-text-analysis processing. The descriptor refers to the text-analysis components used for document processing and contains common settings.
es_tok_with_stw.xml | Main descriptor for the search-query-analysis processing. The descriptor refers to the text-analysis components used for query-analysis processing and contains common settings.
es_backend_specifier_with_rbcat.xml | Main descriptor for the document-text-analysis processing with additional rule-based categorization. The descriptor refers to the text-analysis components used for document processing and contains common settings and additional parameters for the rule-based categorization.
es_backend_specifier_with_mbcat.xml | Main descriptor for the document-text-analysis processing with additional model-based categorization. The descriptor refers to the text-analysis components used for document processing and contains common settings and additional parameters for the model-based categorization.
j_mbcategorizer_es.xml | Analysis-component descriptor for model-based categorization
j_rbcategorizer_es.xml | Analysis-component descriptor for rule-based categorization
cas2jdbc.xml | Describes the configuration settings for the OmniFind CAS to JDBC CAS consumer component that prepares the documents for storing in an external database
jfrost.xml | Analysis-component descriptor for dictionary-based text segmentation
jfrost_ngram.xml | Analysis-component descriptor for dictionary-based text segmentation without CJK languages
jfrost_dict_lookup.xml | Analysis component for pretokenized-dictionary lookup of synonyms and boost terms used in query analysis
jtok.xml | Analysis-component descriptor for non-dictionary-based text and ngram segmentation
tt_core_typesystem.xml | UIMA core-type system -- defines the basic types for text analytics
of_typesystem.xml | OmniFind-type system extension -- defines additional types for OmniFind based on the UIMA core-type system
tt_extension_typesystem.xml | UIMA extension-type system -- defines additional text analytics types for advanced text analytics
dlt_extension_typesystem.xml | Advanced type system for dictionary-based text segmentation






OmniFind type system extension

IBM OmniFind Enterprise Edition extends the UIMA type system with specific types and features. For details about the UIMA type system, which is the base for the OmniFind-type system, please refer to the OmniFind documentation "Custom Text Analysis." The main types and features of the extended OmniFind-type system are shown in Table 4, below. For a complete list of types, please refer to the OmniFind-type system definition file of_typesystem.xml, which is located in the directory <EsInstallRoot>/default_config/parserdriver/specifiers.


Table 4. Main types and features of the extended OmniFind-type system
Types and features | Description
uima.tcas.DocumentAnnotation | The default UIMA DocumentAnnotation has been extended in OmniFind with an additional feature
  esDocumentMetaData | Contains document metadata of the type com.ibm.es.tt.DocumentMetaData.
com.ibm.es.tt.DocumentMetaData | The features of the com.ibm.es.tt.DocumentMetaData are connected to the DocumentAnnotation feature esDocumentMetaData
  crawlerId | The crawler name; the feature value is of type uima.cas.String
  dataSource | Contains the data source type of the document; the feature value is of type uima.cas.String
  dataSourceName | The name of the crawler (data source); the feature value is of type uima.cas.String
  docType | Contains the document type of the document; the feature value is of type uima.cas.String
  date | Contains the document date; the feature value is of type uima.cas.String
  baseUri | Contains the base URI of the document; the feature value is of type uima.cas.String
  metaDataFields | The feature value is of type uima.cas.FSArray; each element in this array is of type com.ibm.es.tt.MetaDataField
  documentName | The name of the document if available; the feature value is of type uima.cas.String
  documentId | The unique and sortable document ID of the document; the feature value is of type uima.cas.Integer
  title | The title of the document; the feature value is of type uima.cas.String
  redirectUrl | Contains the redirected URL; the feature value is of type uima.cas.String
  mimeType | Mime type or document type of the document (for example, XML); the feature value is of type uima.cas.String
  url | The URL of the document; the feature value is of type uima.cas.String
com.ibm.es.tt.CommonFieldParameters |
  searchable | A flag indicating if the field is free-text searchable; the feature value is of type uima.cas.Integer
  fieldSearchable | A flag indicating if the field is searchable as a field; the feature value is of type uima.cas.Integer
  parametric | A flag indicating parametric search; the feature value is of type uima.cas.Integer
  showInSearchResult | A flag indicating if annotated data is included in the search result details; the feature value is of type uima.cas.Integer
  name | The name of the field -- you can search for this field using the field name; the feature value is of type uima.cas.String
  sortable | A flag indicating that the field is string searchable; the feature value is of type uima.cas.Integer
  exactMatch | A flag indicating that the search must be an exact match; the feature value is of type uima.cas.Integer
com.ibm.es.tt.ContentField | The default UIMA DocumentAnnotation has been extended in OmniFind with an additional feature
  parameters | The content field parameters are of type com.ibm.es.tt.CommonFieldParameters
com.ibm.es.tt.MetaDataField | MetadataField data is not part of the document content but is stored in the text feature
  parameters | MetadataField parameters of type com.ibm.es.tt.CommonFieldParameters
  text | The metadata text is stored in this feature; the feature value is of type uima.cas.String
com.ibm.es.tt.Anchor | The anchor annotation for anchor text in HTML documents
  uri | The target URI of the anchor text; the feature value is of type uima.cas.String
com.ibm.es.tt.MarkupTag | The markup information annotations (for example, of an XML tag); the markup information is stored in the features
  name | The name of the markup tag; the feature value is of type uima.cas.String
  depth | The nesting depth; the feature value is of type uima.cas.Integer
  attributeName | The name for the feature attribute; the feature value is of type uima.cas.StringArray
  attributeValues | A string of values for the attribute; the feature value is of type uima.cas.StringArray





About the author

Michael Baessler

Since joining IBM in 2003, Michael Baessler has worked in the Enterprise Search Development team located in Boeblingen, Germany. Michael works on the integration of the Unstructured Information Management Architecture (UIMA) into OmniFind. Within OmniFind, UIMA is used for the base linguistic analysis and enables customers to plug in custom analysis components. Besides his development work for OmniFind, Michael also works on new features and requirements for the UIMA framework itself.



