Level: Intermediate
Jochen Dörre (doerre@de.ibm.com), Software Engineer, IBM
Josemina Magdalen (josemina@il.ibm.com), Software Engineer, IBM
Wendi Pohs (wpohs@infoclearonline.com), Principal, InfoClear Consulting
Bob St. Clair (bobs@schemalogic.com), Senior Product Manager, SchemaLogic Inc.
08 Feb 2007
Employ professional tools
for taxonomy management and auto-classification to enhance
enterprise search solutions built with IBM® OmniFind™
Enterprise Edition. Use SchemaLogic's
Enterprise Suite to centrally maintain a consolidated enterprise
taxonomy and the IBM Classification Module for automatic
classification of documents into taxonomic categories. Consider this article your
step-by-step guide to the practical
integration of the three applications.
Introduction
As there is an ever-increasing volume of online
documents in the enterprise, a systematic organization of
enterprise content through classification becomes more and more
important. This need is amplified by the advancing integration
of formerly separate systems and comprehensive enterprise search
systems, like IBM OmniFind Enterprise Edition, that allow
end-users to retrieve documents from disparate sources through a
single point of access. Using an enterprise taxonomy (that is, a hierarchically
organized set of relevant categories) together with the
classification of document content is a powerful approach to
address this need.
This article leads you through the individual tasks required to
successfully create and deploy a high-performance enterprise
taxonomy using the following tools:
- SchemaLogic Enterprise Suite to manage, model, and
maintain a consolidated enterprise taxonomy
- IBM Classification Module for OmniFind Discovery Edition (ICM)
to automatically classify documents
- IBM OmniFind Enterprise Edition (henceforth called OmniFind)
to exploit category information in document search
This article assumes you have a basic knowledge of the functionality
of an enterprise search system and of OmniFind in particular. It
focuses on how OmniFind can be integrated with
SchemaLogic Suite and ICM into a true enterprise search system
that leverages the power of taxonomies and auto-classification.
Uses for an enterprise taxonomy
An enterprise taxonomy is a set of terms and the associated
relationships that exist between them. It can be as simple
as a list of products, or it can be a complex structure that
supports the relationships between companies, their suppliers,
and their customers. Enterprise taxonomies are used to
describe the content that is generated by a business. Using the
standard language that is present in these taxonomies, both
business users and the software that supports their work can
describe similar content the same way.
Broken down into its component parts, an enterprise taxonomy
is:
- A multi-purpose, often hierarchical, list of terms that
describes content
- Centrally managed or distributed with a strong governance
model
It typically includes a combination of the following:
- Controlled vocabularies
- Allowed values for defined metadata fields
- Preferred terms
- Synonyms, acronyms, and abbreviations
- Attributes
- Standard thesaurus relationships
- Other named relationships
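As a rough illustration, the components listed above can be pictured as a small data structure. The following Python sketch is not tied to any product; the class name `Term`, its fields, and the sample values are assumptions made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One entry in a controlled vocabulary (hypothetical structure)."""
    preferred: str                                   # preferred term
    synonyms: set = field(default_factory=set)       # synonyms, acronyms, abbreviations
    attributes: dict = field(default_factory=dict)   # for example owner or creation date
    broader: list = field(default_factory=list)      # thesaurus "broader term" links
    related: list = field(default_factory=list)      # other named relationships

    def matches(self, text: str) -> bool:
        """True if text is the preferred term or one of its synonyms."""
        labels = {self.preferred, *self.synonyms}
        return text.casefold() in {label.casefold() for label in labels}


software = Term("Software")
was = Term(
    "WebSphere Application Server",
    synonyms={"WAS"},
    attributes={"owner": "product-management"},   # illustrative attribute
    broader=[software],
)
```

Because both a business user and a program can call `matches`, the same preferred term is resolved no matter which synonym or abbreviation was typed.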
Enterprise taxonomies provide the following benefits:
- Taxonomies can describe content across applications
- Taxonomies are focus-agnostic; they have no explicit "point of view"
- Taxonomies can be used to control the values in metadata
- Taxonomies can enhance search queries by adding synonyms, acronyms, and abbreviations
- Taxonomies can assist site navigation
More recently, enterprise taxonomies are also being used to
enhance corporate search. New search techniques, like semantic
and actionable search, are greatly improved by adding the
knowledge that is already built into an enterprise taxonomy
structure.
Typical examples include:
- Faceted navigation: An active interface -- a dynamic
combination of search and taxonomy browse
- Search results clustering: Automatic grouping of
documents into spontaneously labeled categories
- Categorized search results: Search results are grouped
into a meaningful and stable classification defined by the
taxonomy
- Actionable search: Allows users to do something
directly from a search result
- Semantic search: Search enhanced by relevant data from
different sources; this data is described by using an enterprise
taxonomy
Taxonomy management
Taxonomies are designed by looking at content and talking to
subject matter experts so that an appropriate model of the data
can be created. Typically, taxonomists look at existing
databases, Web sites, organizational charts, product lines, other legacy
databases, term lists, and product documentation to tease out
existing categories. Often a good, representative category
system can be drawn from other systems.
Once the majority of this up-front analysis has been completed
and the requirements for the enterprise taxonomy have been
determined, a taxonomy-modeling software system must be put in
place. Up until the late 1990s, large organizations either
built their own applications for managing taxonomies, relied on
modeling packages that shipped with individual systems, such as
search engines or auto-classification systems, or used simple
desktop applications, such as spreadsheets. Each approach presents
challenges to the enterprise.
Home-grown systems can be very expensive to develop and
maintain, particularly in environments that rely on large,
complex, and dynamic taxonomies that are central to the
business, such as driving high quality search results or high
quality auto-classification. As a result, organizations either
continue to invest in their own systems or,
in non-critical environments, see the high cost contribute to the
demise of taxonomy projects.
Modeling tools that are part of a system, such as a search or
auto-classification system, are built to interact specifically
with that system. As such, the taxonomy models managed in these
system-specific tools are not easily integrated with other
systems without extensive custom software development
efforts. Additionally, these tools are often simple and cannot
be used in more complex modeling, large scale, or distributed
ownership scenarios.
Recently, however, another option has become available to
organizations. A number of software vendors have developed and
released taxonomy modeling applications that are designed for
generic enterprise use and are not tied to specific
systems. These systems run the gamut of cost and capabilities,
from single-user desktop modeling applications, to robust
enterprise-grade semantic management systems. These systems are
designed for modeling and are generally more usable than
system-specific tools. The real power of these applications is to
provide a centralized modeling environment that multiple users,
user types, and systems throughout the enterprise can access,
both for modeling and consuming models.
Choosing the right system for your enterprise requires
mapping the requirements gathered in the up-front analysis to
the features, capabilities, and costs of each option.
Taxonomy system features
To decide how to manage your taxonomy
(whether to build a system, license a commercial system, or use a
taxonomy package that ships with an existing system), you need to weigh
a number of considerations. The primary drivers for your
decision should be based on the needs of the organization and
how you plan to manage and use your taxonomy.
General requirements
The general requirements include technological requirements,
user and usability requirements, and taxonomic requirements,
such as the size and level of activity of the taxonomy. Some of
these key requirements include:
- Level of activity: This describes how dynamic the
taxonomies will be. Taxonomies that change very little typically
have few editors or owners, and the model updates are not
required to flow immediately to other systems. In these cases, a
simpler modeling application may suffice. However, highly
dynamic taxonomies, ones that are changing daily or even hourly,
will require a system with enough performance to support rapid changes
to models from multiple users, multiple systems, or both.
- Size and complexity of vocabularies: Taxonomies
with a large number of terms or a large number of
relationships between those terms will require a modeling system
that can scale with the growth of the taxonomy.
- Taxonomy integration: The more systems that
centralized taxonomy models can be leveraged in, the more
powerful and cost effective the taxonomy creation and
maintenance process will be. Broadly, other systems interact
with the modeling system in one or both of two ways:
- Subscribing systems: These are systems that consume
the whole or a sub-set of the taxonomy. How a taxonomy can be
used is usually limited by how subscribing systems can use it
and what structures they can utilize. These systems are on the
receiving end of the taxonomy and may consume anything from
flat lists to complex hierarchies all the way to complex
ontologies.
- Publishing systems: In some environments, other
systems may publish to the modeling application. In one case,
the "taxonomy of record" for certain subsets of
the overall enterprise taxonomy might be managed in another
system. For example, a product list may be managed in an ERP
system, but that list can be utilized by other systems (such as
an ECM, auto-classification, or search system). In this case,
the modeling application may be used as a "clearing
house," receiving the product list from the ERP system,
then re-distributing all or sub-sets to different systems.
In another case, another application might generate terms
that should be incorporated into the enterprise taxonomy. These
types of systems include advanced natural language processing
systems that can discover new terms and relationships by
analyzing content (such as document text in ECM systems). Other
examples include terms generated from social tagging systems
and terms generated from search analytic systems.
- Subscribing and publishing systems: In some
cases, a system can both subscribe and publish to the taxonomy
modeling system. An example of this is an auto-categorization
system that can consume a taxonomy for the purpose of
categorization and can also discover new terms and
relationships to feed back into the model. This iterative
maintenance loop helps auto-categorization systems become
progressively more accurate.
Active enterprise environments, with multiple subscribing systems,
publishing systems, or both require a modeling application that
can readily and reliably connect to multiple systems in such a
manner.
- Distribution of ownership: How broadly model ownership is
distributed throughout an enterprise, both geographically and
organizationally, will dictate how robust the
modeling tool must be to support a diverse population of
users. This includes the ability for users to connect to the
system easily and for client-side application management to be
minimal.
- Number and types of users: In addition to model
owners and taxonomy editors, an enterprise may have many users
who are stakeholders and interested parties to the taxonomy or
taxonomy sub-sets and may be contributors to the taxonomy
development process. These users need to be able to access and
view the models in ways that fit with their needs and
authority. They may need to contribute to the taxonomy
development process directly in-situ within a subscribing
system. For example, an author tagging a document in an ECM
system may need to suggest a new term for a particular option
list managed in the taxonomy modeling tool. Ideally, that
process would be seamlessly integrated with the ECM UI so that
users would not even realize they are interacting with the
taxonomy management system.
- How users will use the tools: Different types of
users will use the taxonomy modeling system in different
ways. Taxonomy editors need powerful editing capabilities to
quickly and easily make individual and bulk model changes. They
need powerful search and browse capabilities to quickly locate
terms and taxonomy branches. Business owners of subsets may
need to view just their portion of the taxonomy and not need as
powerful capabilities. Finally, stakeholders and users who
occasionally suggest new terminology may need an even smaller
subset of capabilities.
- Usability: The modeling graphical user
interfaces (GUIs) must be easy for editors and contributors to use, and
the capabilities must match the user's
role. Capabilities to search and navigate the taxonomy, create,
edit, and delete terms and branches must be easy and intuitive
to use.
- Technical architecture: The system must fit within
enterprise technical architecture guidelines, such as supported
platforms and databases. Additionally, if custom connections to
the modeling system are to be developed and maintained by the
enterprise, the system must have well-documented APIs
and a comprehensive SDK.
- Multilingual capability: Many organizations need
to maintain their taxonomies in many languages. Simple
multi-lingual environments may need the ability to map languages
one to one. More complex environments may need to accurately
model complex interrelationships amongst languages, such as the
ability to map a single complex term in one language to multiple
terms or even a Boolean type statement in another language.
- Scalability: The modeling system may need to
scale in a number of different ways, such as the volume of
terms, the number of users, and geographical distribution of
servers.
- Security and permissions: The security model of
the system must meet the needs of the organization, including
the ability to tie into enterprise security systems such as LDAP
systems. The permissions model must allow for sufficient levels
of ownership, viewing, editing, and collaboration rights.
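The subscribing and publishing roles described under "Taxonomy integration" above can be sketched as a tiny in-memory clearing house. This is only an illustration of the pattern; the class `ClearingHouse`, its methods, and the sample product names are hypothetical and do not correspond to any vendor API.

```python
from collections import defaultdict


class ClearingHouse:
    """Hypothetical hub: receives term lists and redistributes subsets."""

    def __init__(self):
        self.vocabularies = {}                 # vocabulary name -> list of terms
        self.subscribers = defaultdict(list)   # vocabulary name -> (callback, filter)

    def subscribe(self, vocabulary, callback, keep=lambda term: True):
        """Register a subscribing system; `keep` selects the subset it wants."""
        self.subscribers[vocabulary].append((callback, keep))

    def publish(self, vocabulary, terms):
        """Accept a list from a publishing system and push it to subscribers."""
        self.vocabularies[vocabulary] = list(terms)
        for callback, keep in self.subscribers[vocabulary]:
            callback([t for t in terms if keep(t)])


hub = ClearingHouse()
search_terms, ecm_terms = [], []
hub.subscribe("products", search_terms.extend)   # wants the whole list
hub.subscribe("products", ecm_terms.extend, keep=lambda t: t.startswith("Widget"))

# An ERP system acts as the publishing system for the (made-up) product list.
hub.publish("products", ["Widget A", "Gadget B", "Widget C"])
```

After the `publish` call, the search system holds all three terms while the ECM system holds only the subset it asked for.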
Modeling capabilities
A second category of requirements has to do with the actual
logical modeling capabilities and the type and complexity of
the models. These considerations include:
- Term relationships: The types of taxonomies and
the complexity of the term relationships need to be
determined. These relationships can run the gamut from flat
lists (no relationships), to simple hierarchies
(for example, parent/child relationships), to thesaurus relationships
(for example, those conforming to the NISO construction standards; see Resources), all
the way to highly specified, defined relationships, and
ontologies.
- Sub-setting taxonomies: The ability to define
subsets of the taxonomy, often based on facets, term
relationships, or other attributes is often needed for
integrating with downstream systems, which have their own
constraints on required terminology or complexity of model.
- System- and user-defined attributes: Attributes
of terms in your system, such as when a term was created or
modified, by whom, and so on, are often required for managing a
taxonomy. Most of these attributes will come with the modeling
system "out of the box." However, particularly
for integrating taxonomies with other systems, creation and
management of user-defined attributes is necessary. User-defined
attributes are those attributes that can be established and
configured by the customer. Often when integrating with other
systems, information specific to that system must be published
along with the terms or the taxonomy. Being able to record,
store, and manage that information in the modeling system is
highly advantageous.
- Other modeling capabilities: Additionally, there
are a number of other considerations to be made concerning
modeling capabilities. These should be based on known
requirements and include:
- Polyhierarchy: The ability for terms to have multiple parents
- Topic maps: The ability to model topic maps
(for example, those conforming to ISO/IEC 13250:2003)
- RDF: The ability to model information in a manner
compliant with RDF (Resource Description Framework, a family of
World Wide Web Consortium specifications)
- OWL (Web Ontology Language): The ability to render a
taxonomic model in an OWL-compliant format (see Resources)
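Two of the capabilities above, polyhierarchy and sub-setting, are easy to picture with a parent map. The sketch below is a minimal illustration under assumed data; the term names and the helper functions `ancestors` and `subset_under` are invented for this example.

```python
# child -> list of parents; a term with two parents is polyhierarchical
parents = {
    "Laptops": ["Hardware"],
    "Tablets": ["Hardware", "Mobile devices"],
    "Hardware": ["Products"],
    "Mobile devices": ["Products"],
    "Products": [],
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def subset_under(root):
    """Return the sub-taxonomy consisting of root and all its descendants."""
    return {t for t in parents if t == root or root in ancestors(t)}
```

Here "Tablets" appears in both the "Hardware" subset and the "Mobile devices" subset, which is what a downstream system with its own narrower scope would receive.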
Taxonomy deployment
Integrating the taxonomy with other systems
The technical method for integrating taxonomies and taxonomy
subsets is typically dictated by the capabilities of both the
taxonomy modeling system and the target system.
Typically, integration falls into one of two types, with
different variations on the theme. At the simplest or shallowest
level, the integration can occur by transforming and
transferring files, such as XML files. Most systems can now
readily export taxonomies in an XML format, and subscribing
systems can import taxonomies as XML.
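A file-based exchange of this kind might look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree` module. The XML layout (a `taxonomy` root with nested `term` elements carrying a `label` attribute) is an assumption made for illustration; real products define their own schemas.

```python
import xml.etree.ElementTree as ET


def export_taxonomy(tree: dict, name: str) -> str:
    """Serialize a nested dict {term: {child: {...}}} to an XML string."""
    def build(parent_el, children):
        for label, grandchildren in children.items():
            el = ET.SubElement(parent_el, "term", {"label": label})
            build(el, grandchildren)

    root = ET.Element("taxonomy", {"name": name})
    build(root, tree)
    return ET.tostring(root, encoding="unicode")


def import_taxonomy(xml_text: str) -> dict:
    """Parse the XML produced by export_taxonomy back into a nested dict."""
    def read(el):
        return {child.get("label"): read(child) for child in el.findall("term")}

    return read(ET.fromstring(xml_text))


# Made-up sample hierarchy for the round trip.
geography = {"Europe": {"Germany": {}, "France": {}}, "Asia": {"Japan": {}}}
xml_text = export_taxonomy(geography, "Geography")
```

A round trip through `export_taxonomy` and `import_taxonomy` returns the original hierarchy, which is the minimum a subscribing system needs from a shallow, file-based integration.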
At the deepest level, the modeling system can be directly
connected to subscribing systems using APIs. On the
modeling system side, this allows the exposure of the modeling
system's capabilities directly to other systems,
including their user interfaces. Additionally, this allows the
subscribing system to use the models stored in the modeling
system without having to store another representation of it, such
as in an XML file or in a database structure. Whenever the
subscribing system needs the taxonomy or a subset, either for
display or other purposes such as building a search index or
analyzing a query string, it accesses the data in real time from
the modeling system. This prevents the problem of having
multiple versions of the taxonomy in different systems and the
versions getting out of synchronization.
Synchronizing the taxonomy with other systems
In all but the deepest API-to-API integrations, processes
must be put in place to synchronize changes made in the modeling
system to subscribing systems. Synchronization methods typically
fall into one or more of the following categories:
- Manual: A user-activated batch process,
typically executed through a UI or a command line interface
- Scheduled: An automated synchronization set up
to occur on a specified schedule
- Event driven: An automated process that occurs
whenever specified changes are made to the model
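Whatever the trigger, synchronization comes down to working out what changed and passing it on. The sketch below shows a set-based diff plus an event-driven notification; the names `diff_terms` and `Model` are hypothetical, and a manual or scheduled process could call the same diff function from a command line or a timer instead.

```python
def diff_terms(old: set, new: set) -> dict:
    """Compute what a subscribing system must add or remove to catch up."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}


class Model:
    """Hypothetical taxonomy model that notifies listeners on every change."""

    def __init__(self, terms):
        self.terms = set(terms)
        self.listeners = []

    def update(self, new_terms):
        changes = diff_terms(self.terms, set(new_terms))
        self.terms = set(new_terms)
        if changes["added"] or changes["removed"]:
            for listener in self.listeners:   # event driven: fire immediately
                listener(changes)
        return changes


received = []
model = Model({"Europe", "Asia"})
model.listeners.append(received.append)
model.update({"Europe", "Asia", "Africa"})   # triggers one notification
model.update({"Europe", "Asia", "Africa"})   # no change, so no notification
```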
SchemaLogic Enterprise Suite
One such enterprise-grade taxonomy management system is the
SchemaLogic Enterprise Suite.
At the core of the SchemaLogic Enterprise Suite is
SchemaServer, a central repository where business model
standards are gathered, created, refined, and reconciled, and
from which the standards are distributed to subscribing
systems. The Suite is capable of managing both semantic
(including taxonomic) and structural models.
For semantic models, SchemaServer allows an organization to
capture and manage the standard business terminology, code
systems, and semantics that must be used consistently across the
enterprise as vocabularies, terms, and relationships. This
powerful and flexible system can model simple but critical lists
like sales regions or marketing segmentation models up to
complex multi-faceted taxonomies, thesauri, or ontologies
describing complex business systems or product and services
networks with hundreds of thousands of terms.
Additionally, SchemaServer describes the structural models
used to store and exchange information as a hierarchy of
information classes. This logical modeling capability allows you
to capture a consistent and easy-to-understand model of all of
the information systems in the enterprise and easily view and
understand how they interrelate with each other. Relational,
Object-Oriented, XML, SOA, and other technical models can be
unified and brought under a business-oriented management model
appropriate for business and technical participants. The
relationships between semantic and structural models are also
captured, enabling comprehensive governance and impact analysis
to be implemented.
SchemaServer is accessed through powerful GUIs that
allow you to quickly and easily create and maintain
taxonomies. Workshop is a powerful modeling application used by
expert modelers and taxonomy editors to administer corporate
taxonomies. It includes powerful editing tools for importing,
exporting, and making large-scale or bulk changes to the
model.
Workshop Web is a zero footprint, browser-based UI to
manage the objects in SchemaServer. It is designed for everyday
business users of the system.
Figure 1. Geography hierarchy in Workshop Web
The SchemaLogic solution is built around four key
capabilities to enable organizations to manage business
semantics and taxonomies within the context of everyday
operations:
- Model the structure and information relationships
- Govern and manage the changes
- Publish to subscribing systems
- Collaborate to expand and maintain
These key capabilities enable participation and contribution
across organizational, corporate, and industry boundaries to
facilitate the development of business semantics in a dynamic,
constantly changing environment.
SchemaLogic Enterprise Suite: Key features and capabilities
The SchemaLogic Enterprise Suite includes many of the key
features required for taxonomy development and deployment
projects:
- Ease of use: Workshop Web is highly graphical and emphasizes
ease of use for general business users, who typically do not have deep
taxonomy development experience. Workshop, in addition to the capabilities found in
Workshop Web, provides additional powerful editing and
administrative capabilities.
- Import/export: Workshop users can perform manual
file-based imports and exports utilizing XML or CSV-formatted
files. Imports and exports can be performed against any defined
subset or the entire model. Non-file-based imports and exports
can be accomplished with the API or through product Adapters.
- Collaboration: The SchemaLogic Suite includes a
customizable governance system of contracts,
permissions, and rights that allows all users to collaborate on
the development of the semantics model.
- Permissions model: The permissions model allows
user roles to be applied to any object in the system. Changes can be
suggested but not committed until the appropriate owners and
stakeholders have approved the changes.
- Impact analysis: The impact analysis
feature allows users to graphically see which objects would
be affected by a proposed change as well as all the owners,
stakeholders, and subscribers affected by a change.
- Defined relationships: Customers can define any
number of custom term relationships,
allowing organizations to model a full range of semantic
relationships, including flat lists, simple hierarchical,
thesaurus, and ontological.
- Configurable attributes: Customers can define
any number of custom attributes for any object in the
system. This is useful when integrating
taxonomies with other systems, as those systems often require
specific term attributes in order to integrate taxonomies.
- SDK: The SchemaLogic Suite includes a fully
documented Web services SOAP API and a Java API to allow
customers to write custom applications against the modeling
server. This allows the modeling server to be
integrated with existing line-of-business applications, exposing
the modeling capabilities to users through their
line-of-business UIs.
- Integration service and adapters: The
SchemaLogic Suite includes a small-footprint integration server:
an integration framework, accessible through Web
services, that is designed to let organizations quickly
build and deploy API-to-API adapters that publish taxonomies or
taxonomy subsets to subscribing systems. Additionally,
SchemaLogic has pre-built product adapters for systems such as
IBM Content Management suite and IBM
OmniFind suite of systems.
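The idea behind impact analysis, mentioned in the feature list above, can be illustrated as a walk over a dependency graph. This is a generic sketch and not a description of how the SchemaLogic Suite implements the feature; the graph contents and the function name `impacted` are invented for the example.

```python
from collections import deque

# object -> objects or systems that depend on it (all names are illustrative)
depends_on_me = {
    "vocab:Regions": ["class:Customer", "system:Search"],
    "class:Customer": ["system:CRM", "system:ECM"],
    "system:Search": [],
    "system:CRM": [],
    "system:ECM": [],
}


def impacted(changed):
    """Breadth-first walk collecting everything downstream of a change."""
    seen, queue = set(), deque([changed])
    while queue:
        for dependent in depends_on_me.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

Calling `impacted("vocab:Regions")` lists every class and subscribing system that a proposed change to the vocabulary would reach, which is the set of owners and subscribers an approval workflow would need to consult.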
Integrating SchemaLogic Suite with OmniFind Enterprise Edition, Version 8.4
OmniFind Enterprise Edition V8.4 exposes a number of
capabilities that can be leveraged by an organization's
taxonomy to finely tune and significantly enhance search
results.
- Rule-based taxonomy: To simplify enterprise search
deployments, OmniFind Enterprise Edition V8.4 provides the
ability to configure a taxonomy of categories and category
rules. The taxonomy serves two purposes. First, when the search
index is created, taxonomy categories are applied to documents
based on whether a document satisfies the category's rule. Second, once
categories are applied to documents, the taxonomy can be used to
create a browsing interface to the collection. Unlike many
navigation-only solutions, OmniFind Enterprise Edition does not
require a pre-defined taxonomy in order to deliver highly
relevant search results. However, it can take advantage of
taxonomy tags to influence both the results and interface of a
search application.
Using the SchemaLogic Suite, organizations can apply
OmniFind-specific rules to existing taxonomic terms and publish those
taxonomies, or subsets, to OmniFind. This allows organizations
to leverage existing taxonomies in OmniFind and to manage their
OmniFind taxonomy as part of their overall taxonomy management
system and processes.
- Linguistic dictionaries: OmniFind allows
organizations to manage a number of different dictionaries to
fine tune results. In the SchemaLogic Suite, each of these
dictionaries can be managed seamlessly within the overall
taxonomy of the organization and periodically published to
OmniFind.
- Synonyms: This dictionary can be used to expand terms in the
query string sent to the search engine to include specified
synonyms. For example, this allows organizations to tune the
search engine to search for the complete spellings of common
enterprise acronyms. If a user searches on "WAS," the search
engine can also automatically return results for "WebSphere
Application Server," and the other way around.
- Boost words: This dictionary specifies terms and
phrases that raise or lower the rank value of the document in
which the term appears. This allows organizations to manipulate
the ranking of search results to provide more highly relevant
documents to users.
- Stop words: This dictionary specifies a list of enterprise-specific
terms that are removed from query strings to improve
the relevancy of search results. Typically, stop words are
commonly occurring words or phrases whose inclusion in a query
string may cause a large quantity of poor results.
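To make the roles of category rules and the three dictionaries concrete, here is a toy sketch in Python. It does not use or imitate the OmniFind configuration format or API; the rule syntax (simple phrase matching), the multiplicative boost factors, and every dictionary entry are assumptions chosen for illustration.

```python
import re

# All rule and dictionary contents below are made-up examples.
CATEGORY_RULES = {"Servers": ["websphere", "application server"],
                  "Search": ["omnifind", "search index"]}
SYNONYMS = {"was": ["websphere application server"]}
STOP_WORDS = {"the", "a", "of", "for"}
BOOST_WORDS = {"installation": 2.0, "draft": 0.5}   # >1 raises, <1 lowers rank


def categorize(text):
    """Assign every category that has at least one rule phrase in the text."""
    lowered = text.lower()
    return sorted(c for c, phrases in CATEGORY_RULES.items()
                  if any(p in lowered for p in phrases))


def prepare_query(query):
    """Drop stop words, then add synonyms for the remaining query terms."""
    terms = [t for t in re.findall(r"\w+", query.lower()) if t not in STOP_WORDS]
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def boosted_score(base_score, text):
    """Multiply the base score by the factor of each boost word present."""
    words = set(re.findall(r"\w+", text.lower()))
    for word, factor in BOOST_WORDS.items():
        if word in words:
            base_score *= factor
    return base_score
```

Note that this sketch expands synonyms in one direction only; a two-way mapping like the "WAS" example above would need an entry for the long form as well.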
Automatic text classification
Background
There are various approaches to building document
taxonomies. Some of the approaches are rule-based, mostly created and
maintained by human experts. Others rely on automatic text
classification techniques. Various text classification
methodologies may yield different types of "models," or
statistical descriptions of the world. Models may be complex or
simple in the sense that the "classifiers," or the software
components that determine if a text belongs to a category, can
be architecturally simple or complex. Typically there is a
trade-off between sophisticated models and simple models, or between
variance and bias.
Techniques such as Bayesian networks or neural networks use
highly expressive models, which try to produce a non-biased
classifier in order to "describe" a corpus of documents. Their
results tend to have very high variance, which can be
reduced only by large training sets and very static data. But in
most real-world situations, and always in the customer
interaction space, the databases, or corpora, are small,
heterogeneous, and tend to change rapidly. Therefore, it cannot
be assumed that there will be enough data to train these
"complex" structures and reduce the variance. This typically
results in what is called an over-fitted system, perhaps
performing well in artificial tests, but not in the real world.
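The trade-off referred to above is commonly formalized, for squared-error loss, by the standard bias-variance decomposition. The notation is an assumption of this note rather than something taken from the ICM documentation: the target is $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat{f}$ is the model learned from a randomly drawn training set, over which the expectations are taken.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```

Highly expressive models shrink the bias term but inflate the variance term when training data is scarce, which is the over-fitting situation described above. For classification the decomposition is only an analogy, since the loss is usually not squared error.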
The ICM RME approach
ICM relies on a proprietary and unique algorithm
(Relationship Modeling Engine, RME) to create an optimal
trade-off between variance and bias. This approach is superior
to the apparently "complex" methods such as Bayesian networks or
neural networks. ICM RME's sophistication is not in the
architecture of the classifiers, but rather in how these
classifiers are fine-tuned and built in real time.
The ICM RME advantages
- Robust and accurate across noisy, imperfect,
multi-intent, and ambiguous content
- Semantic understanding of text results in high accuracy and
the ability to serve multiple channels and learn across them
- Supports both statistical and rule-based
classification; rules may be applied as required to guide the
Concept Modeling phase (for example, identify different intents based
on channels of communication)
- Easy bootstrapping techniques and configuration tools
make it simple to deploy
- Elegant and embeddable architecture makes it easy to
integrate
- Multiple languages supported, including a language
identification module
Most of the
algorithmic effort in the development of ICM RME was invested in
how to automatically create and tune classifiers. As a result,
ICM RME classifiers provide superior accuracy and the ability
to generalize and learn from small training sets. In addition,
these classifiers are highly intelligent in the way they are
created dynamically and tuned, with either training or
incremental learning.
ICM RME's algorithmic infrastructure is a unique
self-learning engine, capable of classifying textual information,
even in imperfect and noisy situations. It incorporates new
knowledge on the fly, without the need to reconfigure or
re-train the system. ICM RME's technology is different
from standard classification techniques; it emphasizes cleanness
and transparency. Using Concept Modeling techniques, ICM RME
has the unique capability of serving multiple applications from
a single knowledge base. This is the original premise of
knowledge management -- push knowledge from the more human-intensive channels to the more automated or unattended channels
automatically. ICM RME provides a mature technology that can
adapt to real-world changes and continuously provide accuracy
levels that make it valuable in a variety of real-world, mission-critical applications.
ICM RME provides services to applications that need to
understand text or correlate between text and certain objects
(for example, personalization or general data classification
applications).
Typically, an application sends raw data to ICM RME for
analysis and expects to receive a quick and accurate response
based on the data content. Note that about half of the content
is irrelevant to the actual analysis of the message and that
the message contains shorthand (abbreviations), potential
spelling errors, and other imperfect characteristics.
ICM RME receives this message and processes it in two main
phases: a multilingual Natural Language Processing (NLP) phase,
and a language-independent statistical Concept Modeling phase. The
first step of NLP processing primarily consists of finding
portions of the text that contain relevant data, and extracting
key features or linguistic events from the text that will be
used later by the Concept Modeling engine (and possibly by the
calling application directly).
The NLP engine processes input text, regardless of channel,
and creates a Concept Model. A Concept Model is a
computer-readable data structure containing the primary concepts
that appear in the original text and some of their
relationships. This structure is then fed into the Concept
Modeling engine for pattern matching.
ICM RME employs these breakthrough technologies
- ICM RME classifiers are built on a proprietary algorithm
that performs a unique bias-variance trade-off
- A special processing phase at the end of the semantic
modeling process uses nonlinear warping techniques in order to
express classifier results directly as actual statistical
probabilities
- Real-time learning: ICM RME has a unique capability to
learn on the fly. The learning process is adaptive and
constantly changes its own characteristics based on feedback and
the knowledge it gains. It uses an adaptive variable memory
based on automatically-collected characteristics and knowledge
of the world.
- Generic multilingual support: The engine was built in a
language-independent way, capturing the true semantic
characteristics of a category, regardless of the language
- Robust NLP analysis of imperfect language
|
|
NLP processing addresses the fact that many different
variants of the same "word" can appear in the text. Some of
these variations are morphological variants (for example, "go,"
"goes," and "went" are linked to the same concept), and some are
due to spelling errors or other naturally-occurring variations
in expression. Concepts can be words, short sentences,
multi-word tokens, numbers, dates, URLs, e-mail addresses, or any
other meaningful patterns that appear in the document.
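The idea of mapping surface variants to concepts can be illustrated with a minimal Python sketch. It is not the ICM RME implementation: the `VARIANTS` table, the regular expressions, and the function names are all hypothetical, and a real engine would rely on linguistic resources rather than a hand-written lookup.

```python
import re

# Hypothetical lookup of morphological variants; a real engine would use
# linguistic resources rather than a hand-written table.
VARIANTS = {"goes": "go", "went": "go", "going": "go"}

# Simplified patterns, for illustration only.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL = re.compile(r"^https?://\S+$")
NUMBER = re.compile(r"^\d+(\.\d+)?$")


def to_concept(token):
    """Map one raw token to a (kind, value) concept pair."""
    if EMAIL.match(token):
        return ("email", token.lower())
    if URL.match(token):
        return ("url", token)
    if NUMBER.match(token):
        return ("number", token)
    # Plain words: lowercase, strip punctuation, fold known variants.
    word = token.lower().strip(".,;:!?\"'")
    return ("word", VARIANTS.get(word, word))


def extract_concepts(text):
    """Split text on whitespace and map every token to a concept."""
    return [to_concept(tok) for tok in text.split()]
```

With this sketch, "goes" and "went" both map to the concept `("word", "go")`, while an e-mail address or URL is kept as a single concept instead of being broken into words.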
This highlights one of ICM RME's differentiators: The
system is looking for patterns in higher-level semantic
structures, leveraging automatically collected domain and
language knowledge rather than finding text patterns directly.
The result of ICM RME processing is a list of categories or
intents embedded in the original text. The system may also
extract certain features or patterns and create metadata fields
if configured to do so. The system can also flag all messages
with certain categories over a pre-defined threshold for special
processing. It is important to note that the certainty factor is
actually an estimate of the statistical likelihood that the
category was identified correctly. This is another unique
feature of ICM RME, which makes its configuration much more
straightforward and provides companies with much greater
control over how and when fully-automated actions are taken by
applications -- a critical requirement in most
environments, but absolutely essential in customer
interactions.
Note that ICM RME performs very well in multi-intent
scenarios; the feedback can be provided as a list of
categories. In addition, the feedback process is very simple for
the calling application; it only has to tell ICM RME which
categories were correct -- the system does the rest. There
is no need to say why or express a degree of confidence in the
feedback. ICM RME automatically finds out why the feedback was
given, and, in case of erroneous feedback, it will quickly
nullify its effect (or "unlearn" it).
ICM server and tools
IBM Classification Module for OmniFind Discovery Edition
IBM Classification Module for OmniFind Discovery Edition is
a cross-platform server application for writing client
applications that interact with the Relationship Modeling
Engine. The Relationship Modeling Engine is a full suite of
language processing technologies targeted at analyzing,
understanding, and classifying the natural language of customer
interactions and other types of everyday communication. This
functionality is easily embedded with the Classification Module,
which exposes all the
functionality necessary to develop applications that harness the
power of the Relationship Modeling Engine. The Classification
Module provides several client
API libraries to enable rapid development of various client
applications in several programming languages, in particular in
Java. Ease of use and maintenance are combined with high
availability and scalability. It is designed to run on multiple
machines and provides the ability to scale according to customer
load by making optimal use of hardware and software
resources. The system is configured and maintained using the
Classification Manager application.
About the Relationship Modeling Engine (RME)
The Relationship Modeling Engine uses natural language
processing and sophisticated semantic analysis techniques to
analyze and categorize text. When an application sends input
text to the Relationship Modeling Engine for analysis, the
system identifies the categories that are most likely to match
this text. The Relationship Modeling Engine works together with
an adaptive knowledge base -- a set of collected data used
to analyze and categorize texts. The knowledge base reflects the
kinds of text that the system is expected to
handle. Relationship Modeling Engine-enabled applications use
categories to denote the intent of texts. When text is sent to
the Relationship Modeling Engine for matching, the knowledge
base data is used to select the category that is most likely to
match the text. Before the knowledge base can analyze texts, it
must be trained with a sufficient number of sample texts that
are properly classified into categories. A trained knowledge
base can take a text and compute a numerical measure of its
relevancy to each category. This process is called matching or
categorization. The numerical measure is called relevancy or
score. The accuracy of a knowledge base can be maintained and
improved over time by providing it with
feedback -- confirmation or correction of the current
categorization. The feedback is used to automatically update and
improve the knowledge base. This process of automatic
self-adjustment is called learning.
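The cycle of training, matching, and feedback described above can be sketched in a few lines of Python. This is only a toy stand-in built on word counts, not the Relationship Modeling Engine algorithm; the class `ToyKnowledgeBase` and its methods are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict


class ToyKnowledgeBase:
    """Toy stand-in for a knowledge base: per-category word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def learn(self, text, category):
        """Training and feedback both add evidence for a category."""
        self.word_counts[category].update(text.lower().split())

    def match(self, text):
        """Return (category, score) pairs, best first; scores sum to 1."""
        words = text.lower().split()
        raw = {}
        for category, counts in self.word_counts.items():
            total = sum(counts.values())
            # Sum the category's counts for each word of the text,
            # relative to the size of the category's training data.
            raw[category] = sum(counts[w] for w in words) / total if total else 0.0
        norm = sum(raw.values())
        if norm == 0:
            return []  # no category shares any word with the text
        scored = [(c, s / norm) for c, s in raw.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, the application first calls `learn` with pre-categorized samples, then calls `match` on new text to obtain ranked scores. Feedback is a further call to `learn` with the confirmed category, which shifts later scores toward that category.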
Classification Workbench
Classification Workbench is an application that allows you to
create a knowledge base (KB) for use with IBM Classification
Module for OmniFind Discovery Edition (ICM), analyze the KB,
and evaluate its accuracy using reports and graphical
diagnostics. The result is a KB that can be used in conjunction
with applications powered by ICM.
Prior to using Classification Workbench, you'll collect
pre-categorized sample data (for example, documents) representative of
the data you expect to classify using ICM. You'll import this
data into Classification Workbench to create a corpus
file. Classification Workbench provides a variety of features
and techniques that allow you to fine-tune the corpus to
optimize KB accuracy. Using the corpus as input, you can
create and test the KB. Then you can evaluate the KB using
Classification Workbench reports and graphical diagnostics and
improve its accuracy by editing the corpus you use to create
the KB. The final product is a production-ready KB, for use with
ICM-based applications.
An ICM RME KB is represented as a tree of nodes, with each node
containing statistical knowledge or rules that assist the system
in classifying text. Categories are the names of the nodes in
the KB. The simplest way to organize nodes in a KB is a flat
knowledge base structure, so that all nodes are on the same
level. Classification Workbench builds such KBs automatically
from a categorized corpus, and you do not have to explicitly
specify its structure. In some cases, you may want to build a
hierarchical knowledge base, consisting of nodes at
multiple levels in the hierarchy.
One important advantage of an ICM RME KB is the ability to mix
rules and statistics. This way you can effectively
apply business logic, external non-statistical information
usually defined through metadata, in the classification
process. You can easily craft such a KB using the
Classification Workbench interactive KB Editor. Alternatively,
the KB structure can be specified in an external textual format
and imported to Classification Workbench.
Figure 2 illustrates a possible hierarchical KB
structure. Squares represent rule nodes that work on metadata
(for example, "language = French" or "Products = Servers"). Ellipses
represent statistical nodes.
Figure 2. KB structure example
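A hierarchical KB that mixes rule nodes and statistical nodes can be pictured as a small tree structure. The following Python sketch is an illustration under stated assumptions, not the ICM KB format: the `Node` class, the keyword sets standing in for statistical knowledge, and the category names in `example_kb` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    # Rule nodes test metadata; statistical nodes leave this as None.
    rule: Optional[Callable[[Dict[str, str]], bool]] = None
    # Statistical nodes score text; here a toy keyword set stands in.
    keywords: frozenset = frozenset()
    children: List["Node"] = field(default_factory=list)


def classify(node, text, metadata):
    """Return names of statistical nodes reached and matched, depth-first."""
    if node.rule is not None and not node.rule(metadata):
        return []  # rule failed: prune this whole subtree
    hits = []
    if node.keywords and node.keywords & set(text.lower().split()):
        hits.append(node.name)
    for child in node.children:
        hits.extend(classify(child, text, metadata))
    return hits


def example_kb():
    """Build a small tree: rule nodes on metadata above statistical nodes."""
    french = Node(
        "French",
        rule=lambda m: m.get("language") == "French",
        children=[Node("Facturation", keywords=frozenset({"facture"}))],
    )
    servers = Node(
        "Servers",
        rule=lambda m: m.get("Products") == "Servers",
        children=[Node("Server outage", keywords=frozenset({"outage", "down"}))],
    )
    return Node("Root", children=[french, servers])
```

Calling `classify(example_kb(), "the server is down", {"Products": "Servers"})` only descends into the subtree whose rule accepts the metadata, which is how business logic can restrict the statistical nodes that get a chance to match.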
A typical workflow of using Classification Workbench to
create a KB would be:
- Gather pre-categorized data that will form the basis of a
corpus.
- Convert this data into a format recognized by
Workbench (for example, Workbench recognizes CSV or XML obeying a
certain pattern). Writing an
application that will already produce the format recognized by
Workbench can be a good option.
- Create the KB structure. Workbench recognizes an XML
format for KB.
Then you'll use Workbench to:
- Import the data and create a corpus file
- Import the KB (if available)
- Edit and categorize corpus items, as required
- Create and analyze a KB, and generate analysis results
- Evaluate KB accuracy by viewing summary reports and
graphs. The best way to evaluate the KB accuracy is using the
"KB Tune-Up Wizard."
- As required, improve KB accuracy by editing the corpus
and retraining
- Export the KB to the IBM Classification Module for OmniFind
Discovery Edition (ICM) Server
For the training task, the Classification Workbench reports
present a lot of
information both on the overall KB accuracy and on a
per-category basis. The evaluation should start with verification
of the overall KB accuracy, by generating the "KB Data Sheet,"
"KB Summary," and "Cumulative Success" reports. The "KB Data
Sheet" highlights potential
problems. Measures like "Total cumulative success," "Top
performing categories," "Poorest performing categories,"
"Categories that may be determined by external factors," and
"Pairs of categories with overlapping intents" are very
informative for the general KB accuracy measurement, but they
are only indications. The final decision
must be made by the KB administrator, who understands the data
and the business logic of the project.
"Categories that may be determined by external factors" may
indicate that the user should add external information to the
documents as metadata, and add rules to the KB that refer to
that metadata.
"Pairs of categories with overlapping intent" may indicate
that categories should be redefined, either by splitting the
"overlapping" categories into several non-overlapping ones or
by combining several categories into
one. These are possible indications, but the decision has to be
made according to the project data and business logic needs.
If the nature of the data changes over time, verify the KB accuracy
periodically using the Classification Workbench reporting
tools, and retrain the KB if needed.
To conclude, IBM Classification Module for OmniFind Discovery
Edition (ICM) is a powerful tool that uses natural language
processing and sophisticated semantic analysis techniques to
analyze and categorize text. ICM works together with an adaptive
knowledge base/taxonomy (KB) that uses categories to denote the
intent of texts. When text is sent to ICM for matching, the
knowledge base data is used to select the category that is most
likely to match the text. The KB can be hierarchical and can
combine rule-based and statistical information. The
Classification Workbench tool allows easy creation, analysis, and
tuning of a knowledge base from representative data. Its
reporting tools are very powerful, allowing editing and tuning
of the data and of the KB to increase the accuracy of the
classification.
Integrating SchemaLogic Suite with ICM
Using the SchemaLogic Suite, organizations can publish
existing taxonomic terms to the Classification Workbench so that
the subset of the taxonomy that is imported into the
Workbench becomes the KB structure, and hence the set of
categories on which the classifier is trained.
Thus, the auto-categorizer will tag documents or text streams with
enterprise-specific categories that are actively managed within
the organization. This ensures that a consistent set of approved
terminology is used for auto-categorization.
Integrating classification and search
Search and classification are often integrated in a
single system. They fit together well for several
reasons.
First, they provide complementary mechanisms for
describing documents. Search describes the document based on a
small set of words supplied by the user (such as the query
"fat"), whereas
classification attempts to describe the overall document based
on a set of descriptors supplied by the taxonomy (for example,
in a subject taxonomy, one of the subjects). This means that if
a search engine supplies the category to the user, it can be
extremely easy for the user to distinguish which search results
are really relevant. For example, if the user query is "fat,"
some of the results will be marked as "dieting" or "nutrition,"
but others will be marked as "file systems" (because FAT is also
File Allocation Table, used by the DOS operating system). A user
seeing this mixture of topics can then refine the query to
select just the ones intended by this ambiguous query.
Second, the processing of the data required by search and
classification (in other words, document fetching, tokenization,
lemmatization, and so on) is the same to a large extent. Hence, a
system that couples them together can take advantage of common
processing steps.
Search and classification can be paired in a number of ways:
- Search within a category: You can select a
category and then search only documents that are both within the
category and that match your query.
- Faceted search: In this method, you are allowed to
specify several different facets (or characteristics) of a
document to a search engine (for example, "search for all PDF
documents about databases from last year"). This is actually a
generalization of "search within a category," where multiple
criteria that may or may not be categories from a taxonomy can
be combined.
- Taxonomy browsing: Some or all of the documents on a Web
site are displayed as a taxonomy that can be navigated, with
each document assigned to one or more nodes of the taxonomy.
- Classifying search results: The results of a
search are displayed together with their assigned
categories. Categories can be used to group or sort result sets.
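Two of the pairings listed above, searching within a category and classifying search results, can be sketched with a toy in-memory document set. The Python below is only an illustration: `DOCS`, `search`, and `group_by_category` are hypothetical names, and the keyword matching stands in for a real search engine such as OmniFind.

```python
from collections import defaultdict

# Hypothetical documents, each already assigned to taxonomy categories.
DOCS = [
    {"id": "d1", "text": "low fat diet tips", "categories": ["dieting"]},
    {"id": "d2", "text": "fat content of common foods", "categories": ["nutrition"]},
    {"id": "d3", "text": "recovering a fat partition", "categories": ["file systems"]},
    {"id": "d4", "text": "journaling in modern storage", "categories": ["file systems"]},
]


def search(query, docs, category=None):
    """Keyword search, optionally restricted to one category."""
    terms = query.lower().split()
    return [
        d for d in docs
        if all(t in d["text"].lower().split() for t in terms)
        and (category is None or category in d["categories"])
    ]


def group_by_category(results):
    """Classify search results: map each category to matching document IDs."""
    groups = defaultdict(list)
    for d in results:
        for c in d["categories"]:
            groups[c].append(d["id"])
    return dict(groups)
```

For the ambiguous query "fat," `group_by_category(search("fat", DOCS))` separates the dieting, nutrition, and file system hits, and `search("fat", DOCS, category="file systems")` shows the refinement a user could then apply.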
Integrating the three applications
To address these usage scenarios in an optimal way,
organizations can leverage the power of all three
applications together by using the SchemaLogic Suite to
centrally manage the enterprise taxonomy and publish
appropriate subsets to both OmniFind and the Classification
Workbench. This results in the use of a consistent, actively
managed set of semantics for auto-categorization and search,
significantly enhancing results and ensuring that these systems
are automatically kept up to date with the ever-evolving
enterprise taxonomy.
The following section demonstrates how the integration works
in practice.
How to fit the systems together step by step
This section gives detailed instructions for using
the three systems in concert in the following scenario:
- A taxonomy is centrally managed using the SchemaLogic Suite
- Based on the taxonomy, a KB is trained for
auto-classification using ICM
- The taxonomy is deployed within OmniFind, which uses a
plug-in in its document-processing pipeline to connect to the
Classification Module server and receive classifications from the
taxonomy for each document it processes
The description assumes the following software versions: ICM Version 8.3
(previous name: "IBM Classification Module for WebSphere® Content
Discovery Version 8.3") and OmniFind Version 8.4.
Please note that this section focuses on the steps that realize
the integration of the three systems and does not present details of
the tasks that are accomplished within the tools.
Setting up the integration
Step 1: Create an OmniFind collection on which you want to
employ auto-classification
Create an OmniFind collection with "rule-based categorization."
The configuration option "rule-based
categorization" is needed to allow categories obtained later from
ICM to be stored in the OmniFind index and to allow the Search
Application to browse the category tree.
Step 2: Deploy BNSCategoryAnnotator in OmniFind
As mentioned above, the integration of the ICM server with
OmniFind requires an extension module to be loaded into
OmniFind. The extensibility of OmniFind is based on the
Unstructured Information Management Architecture (UIMA) (see
the Resources section for more information). In this architecture, extension modules
(UIMA plug-ins, in other words) are also called annotators. The annotator
used here is contained in the
UIMA PEAR package BNSCategoryAnnotator.pear
(a simplified version of it is attached in the
Download section). Figure 3 gives an
architectural overview of this integration:
Figure 3. BNSCategoryAnnotator provides the bridge between OmniFind and ICM
The package contains a configuration file
BNS.xml (BNSSample.xml in the downloadable version),
which contains a number of configuration
parameters that need to be set before deploying the plug-in. The
most important parameters are listed in Table 1.
Table 1. BNSCategoryAnnotator configuration parameters
| Parameter | Meaning | Example |
|---|---|---|
| ServerURL | The URL of the ICM server | http://127.0.0.1:8081/Listener/mod_gsoap.dll |
| KBName | The name of the KB in ICM | EnterpriseTaxonomy |
| DefaultBodyFieldName | The KB field in ICM that is expected to contain the document body | text |
| MinRelevanceScore | A float between 0 and 1; categories with a relevancy score below this threshold are ignored | 0.5 |
| MaxCategories | The maximum number of categories that may be assigned to a document | 3 |
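The combined effect of MinRelevanceScore and MaxCategories can be pictured with a short Python sketch. It assumes that the threshold is applied first and that the best-scoring remaining categories are then kept up to the maximum; the annotator's actual order of operations is not documented here, and `select_categories` is a hypothetical helper, not part of BNSCategoryAnnotator.

```python
def select_categories(scored, min_relevance_score=0.5, max_categories=3):
    """Keep categories scoring at or above the threshold, best first, capped.

    `scored` is a list of (category_name, relevancy_score) pairs.
    The defaults mirror the example values in Table 1.
    """
    kept = [(name, score) for name, score in scored if score >= min_relevance_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept[:max_categories]]
```

Raising the threshold trades recall for precision in the categories stored with each document, while the cap limits how many categories a single document can carry.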
We recommend that you use the Eclipse-based Configuration
Description Editor that comes with the UIMA SDK to adapt these
parameters as required by your application (a description of how to
install the UIMA SDK Eclipse
tooling and how to use the editor is contained in the
UIMA SDK User's Guide and Reference; see also the Resources section). At a minimum, the
ServerURL parameter needs to
be adapted to your ICM server installation so that the annotator
can connect to the ICM server. Also, in the simplified version, the
parameter CategoryDirectory needs to be
set to the following path which contains the CategoryTree.xml file on the OmniFind
controller node: <ES_NODE_ROOT>/master_config/<CollectionId>.parserdriver/
(replace <ES_NODE_ROOT> by the value of the respective
environment variable when logged on as the OmniFind administrator; to
find the <CollectionId>, go to the collection's General tab).
For parameter DefaultBodyFieldName, you can choose some name, like
"text," "body," or "contents." This name must be used again in
the Classification Workbench, where you have to choose a field that contains
the document text of your training data. Finally, the value for
parameter KBName should be left at the
default (empty string). This ensures that the name of the root
node of the taxonomy is taken as the KBName, which is true for KBs
developed with Workbench.
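The CategoryDirectory value follows a fixed pattern, so it can be assembled from the two variable parts. The Python sketch below assumes a POSIX-style path on the controller node; the function name `category_directory` and the sample root directory in the usage note are hypothetical, while the collection ID col_tax1 is the one used as an example later in this article.

```python
import posixpath


def category_directory(es_node_root, collection_id):
    """Build <ES_NODE_ROOT>/master_config/<CollectionId>.parserdriver/."""
    # The trailing empty component makes join() end the path with a slash.
    return posixpath.join(
        es_node_root, "master_config", collection_id + ".parserdriver", ""
    )
```

For example, `category_directory("/home/esadmin/esdata", "col_tax1")` yields `/home/esadmin/esdata/master_config/col_tax1.parserdriver/`. On the controller node itself, the first argument would come from the ES_NODE_ROOT environment variable of the OmniFind administrator.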
Figure 4. Editing parameters in BNS.xml using UIMA SDK's Component Descriptor Editor
Then the PEAR package must be uploaded onto the OmniFind controller
node. In the OmniFind administration console, use the System:Parse page
in Edit mode to add the PEAR package as a new text analysis
engine. Please refer to the
tutorial "Semantic
Search with UIMA and OmniFind"
(developerWorks, December 2006) for details about
deploying and using custom analysis engines with OmniFind.
Step 3: Associate the custom analysis engine with your
OmniFind collection
To have the collection use the
auto-classifier, it needs to be associated with the new text
analysis engine. This setting is available in the Text Processing
Options page for your collection (see the collection's Parse page
in Edit mode). More details about configuring text processing can
be found in the tutorial "Semantic
Search with UIMA and OmniFind."
Step 4: Create and publish a taxonomy with SchemaLogic
Enterprise Suite
In the described setup, you need to publish a taxonomy both to ICM
and OmniFind. Publication to ICM is done using the SchemaLogic Adapter for
ICM, and publication to OmniFind is done using the SchemaLogic
Adapter for OmniFind. The configuration and running of those
adapters is done through the
Workshop UI and the Integration Service. Use CSV format in the ICM adapter
to publish a taxonomy (subset) for ICM and publish the same taxonomy
subset directly to OmniFind.
Configuration of the adapters includes specifying:
- The directory in which to write the CSV file for the ICM server, and the
connection information for the OmniFind controller node
- The taxonomy or taxonomy subset in the SchemaLogic modeling
server to be published to ICM and OmniFind
- Any terms that should be excluded or included based on term
attributes or term relationship types
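The effect of publishing a taxonomy subset with exclusions can be sketched as pruning a tree. The Python below is a conceptual illustration only and does not reflect the SchemaLogic adapter configuration or API: the dictionary layout of a term, the `status` attribute in the usage note, and the function `publish_subset` are all hypothetical.

```python
def publish_subset(term, is_excluded):
    """Return a pruned copy of the subtree at `term`, or None if excluded.

    A term is a dict with a "name", optional "attributes", and optional
    "children". Excluding a term also drops everything beneath it.
    """
    if is_excluded(term):
        return None
    children = [publish_subset(c, is_excluded) for c in term.get("children", [])]
    return {
        "name": term["name"],
        "attributes": dict(term.get("attributes", {})),
        "children": [c for c in children if c is not None],
    }
```

A predicate such as `lambda t: t.get("attributes", {}).get("status") == "draft"` would keep draft terms out of the published subset, so that the classifier and the search category tree only see approved terms.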
Figure 5 shows how the SchemaLogic Adapter for OmniFind can be
configured:
Figure 5. Editing configuration settings in the SchemaLogic Adapter for OmniFind
The adapters can be run by any of the following methods:
- A manual process, where an administrator executes the publication
from the Workshop UI
- A scheduled process, where publication is configured to occur with
a specified frequency
- A Web services call to the Adapter made by another application or system
After successful publication, the taxonomy is available as a KB
(structure only) for ICM and as a category tree that can be
browsed in OmniFind.
Step 5: In Classification Workbench, import the taxonomy as a KB structure and train it for auto-classification
Using the Import Wizard, in what to import, select the
option Knowledge base, and in what type of knowledge base, select the
option KB configuration. Provide the path of the configuration file
describing the KB structure in the following screen, and click Finish
(for details, please refer to section "Importing and Exporting a KB
Structure" in the Classification Workbench User's Guide).
To get a high-quality classifier, it is important to
carefully select enough training samples for each category that
should be recognized in the taxonomy. A training sample needs to
be pure text, extracted from a sample document of the category in
question.
Because the OmniFind/ICM integration for the
auto-categorization runtime requires that all document text is
provided within a single field (of NLP usage type Body), each
training sample should also be formed in this way: all document
text is contained within a fixed single field of type Body.
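Preparing training samples so that all document text sits in one body field can be sketched as follows. The column layout used here (a `category` column followed by one body column) is an assumption for illustration; the CSV pattern that Workbench actually expects is described in the Workbench documentation. The function `write_training_csv` is hypothetical, and the default field name "text" is one of the names suggested for DefaultBodyFieldName.

```python
import csv
import io


def write_training_csv(samples, body_field="text"):
    """Serialize (category, document_text) pairs as CSV with one body column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["category", body_field])
    for category, document_text in samples:
        # Collapse whitespace so each sample's text sits in a single field.
        writer.writerow([category, " ".join(document_text.split())])
    return buffer.getvalue()
```

Whatever layout is used, the body field name should match the value chosen for DefaultBodyFieldName, so that training data and runtime documents reach the KB in the same field.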
To simplify the extraction of document text, the OmniFind/ICM
integration can be run in "training mode" (not included in the sample annotator). This mode simplifies
the task of collecting and preprocessing training data for KB
training with the Workbench considerably because you can use
OmniFind crawlers to fetch training documents and the OmniFind parser
for document preprocessing and content extraction in the same
way documents would be preprocessed for categorization.
For the training itself, import the training samples into
Workbench and make sure that the categories associated with the
samples are correct. Please refer to the Workbench documentation
for the details on how to train a KB.
Step 6: Export the taxonomy to the ICM server
Alternative setup not involving SchemaLogic Suite
If you do not use SchemaLogic
for taxonomy maintenance, you can still integrate ICM
auto-classification with OmniFind. In that case, you will be
maintaining the taxonomy within Workbench (ignore step 4 above, and
in step 5, define a KB from scratch inside Workbench). Hence, you
will need to
export the KB structure to OmniFind whenever you modify
the KB in Workbench. The following special export step of the
KB structure is required for that case to produce a category tree
file for downstream use in OmniFind.
Important: This step
needs to be done before you export the KB to the ICM
server (step 6).
In the Workbench main window:
- Select Export (the Export Wizard), then click Next.
- Check Knowledge base, then click Next.
- Check KB XML file, and click Next.
- Enter a path for export, then click Finish.
- Answer Yes to the question of whether you want to
overwrite the User Field properties of each node.
Then, transfer the exported file to the OmniFind controller
node, and use the following OmniFind administration commands to
import it into OmniFind (assuming your collection ID is col_tax1 and the exported KB structure is
/tmp/taxonomy.xml):
>esadmin taxonomy add -cid col_tax1 -fname /tmp/taxonomy.xml
>esadmin configmanager syncComponent -sid col_tax1.parserdriver
When satisfied with the classification quality of the
trained KB, you need to export the KB to the ICM server.
You deploy the KB with the ICM server using the Export
Wizard: Select Knowledge base in what to export, and use the KB
format "IBM Classification Module."
This export step needs to be repeated each time you
change anything in
the KB structure, such as when you add a category or change a
name. When you maintain the taxonomy within SchemaLogic
Suite, you will not perform such changes locally within Workbench,
but rather on the original taxonomy that you then re-import to
Workbench before re-training.
Now, OmniFind is ready to process documents. Start crawling and
parsing and build an index. The sample OmniFind Search Application
lets you browse through the taxonomy to view the documents
associated with any given category. You can also use the Search
and Indexing API (SIAPI) to enhance sophisticated queries with
restrictions to categories. Note that category constraints need to
specify the string "rulebased" as a taxonomy ID in that case.
Completing the development cycle
Whenever the taxonomy changes, steps 4 (publish to ICM and
OmniFind), 5 (import into Workbench and re-train), and 6 (export
to ICM server) must be repeated.
Note that taxonomy changes may invalidate any
categorization of documents processed by OmniFind
previously. Hence, whenever
you update the taxonomy, categories stored in the OmniFind index
for a document may be wrong until you re-process (in other words, re-crawl,
re-parse, and re-index) that document.
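A simple first check after a taxonomy update is to list the documents whose stored categories no longer exist in the taxonomy. The Python sketch below assumes you can obtain a mapping from document IDs to their indexed categories; that mapping and the function `documents_to_reprocess` are hypothetical and not an OmniFind API. Note that it only finds documents that are certainly stale: a retrained KB may also assign different categories to documents whose old categories still exist.

```python
def documents_to_reprocess(indexed_categories, current_taxonomy):
    """Return IDs of documents tagged with categories no longer in the taxonomy.

    `indexed_categories` maps a document ID to the category names stored
    for it; `current_taxonomy` is an iterable of valid category names.
    """
    valid = set(current_taxonomy)
    return sorted(
        doc_id
        for doc_id, categories in indexed_categories.items()
        if not set(categories) <= valid
    )
```

Documents returned by this check are the most urgent candidates for re-crawling, re-parsing, and re-indexing.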
Conclusion
This article has motivated the use of
- Centrally maintained and consolidated taxonomies and
- Automatic text classification
for enterprise search applications. It has shown how to set up and
use the three-fold integration of OmniFind combined with both
SchemaLogic Enterprise Suite to address the first item and IBM
Classification Module for OmniFind Discovery Edition to address the
second. This integration exploits the UIMA plug-in architecture that is built
into OmniFind, and a version of the required plug-in is provided as
sample code.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample annotator to connect OmniFind to ICM | BNSCategoryAnnotatorSample.zip | 3.7MB | HTTP |
Resources
Learn
- SchemaLogic® home page: Find more information on SchemaLogic.
- OmniFind Enterprise Edition product home page: Find more information on OmniFind
Enterprise Edition.
- IBM Classification Module for OmniFind Discovery Edition home page: Find more
information on Classification Module for OmniFind Discovery Edition.
- Online documentation for OmniFind products: Find information about installing, administering, and developing content integration and enterprise search and discovery solutions.
- "Semantic Search with UIMA and OmniFind" (developerWorks, December 2006): This
tutorial is a good starting point
for learning how to use custom text analysis and semantic search in
IBM OmniFind Enterprise Edition.
- ANSI/NISO standard for thesauri
- Resource Description Framework (RDF) Standards of the World Wide
Web Consortium (W3C): An integration of a variety of applications from library catalogs and world-wide directories to syndication and aggregation of news, software, and content to personal collections of music, photos, and events using XML as an interchange syntax.
- OWL Web Ontology Language: A W3C Recommendation for describing
Web content.
- developerWorks resource
page for IBM OmniFind: Find articles and tutorials and connect to other resources to expand your OmniFind skills.
- Unstructured
Information Management Architecture SDK: Learn more about the Unstructured Information
Management Architecture (UIMA). This Java SDK supports the implementation, composition,
and deployment of applications working with unstructured information.
- developerWorks Information Management
zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
Get products and technologies
- Download
UIMA SDK: The free UIMA SDK comes as a self-extracting
installer for Windows and Linux or a zip file for all other
platforms.
- The full BNSCategoryAnnotator.pear includes
a "training mode" and is not limited to only one collection. It is
available from the OmniFind EMEA Center of Excellence as part of a
service engagement, which you can inquire about by e-mail.
About the authors
Dr. Jochen Dörre is a Software Engineer at IBM Böblingen Laboratory with a background in text search and text mining technology. He joined IBM in 1997 and has worked on several software development projects in those fields, specializing in text categorization, text analytics integration, search over XML documents, as well as core search engine design and performance issues. Prior to joining IBM, Jochen worked in natural language processing research for several years. He received his PhD from the University of
Stuttgart. Jochen is a member of the World-Wide Web Consortium (W3C) XQuery Working Group, where he co-develops the extension of the XML query language XQuery with full-text search operations.
Josemina Magdalen is a Software Development Team Leader at
Israel Software Group (ILSL). She has a background in Natural Language
Processing (text classification and search, as well as text mining
technologies). Josemina joined IBM in 2005 and has worked in the
Content Discovery Engineering Group doing software development projects
in text categorization and search, as well as text analytics. Prior to
joining IBM, Josemina worked in Natural Language Processing
research and development (Machine Translation, Text Classification and
Search, Data Mining) for over ten years. Josemina is working on her PhD at
the Hebrew University in Jerusalem.
Wendi Pohs has designed and developed taxonomy and search
applications for large organizations for the past 20 years. She has
served on development teams for Lotus Development Corporation's
Notes/Domino and Discovery Server products, and most recently
managed Search and Taxonomy Integration for IBM's Corporate
Intranet, w3. Author of a book on knowledge management practices,
she specializes in advanced taxonomy applications, built with an
experienced practitioner's point of view. Currently, as CTO of
Infoclear Consulting, Wendi has provided taxonomy consulting services
to a large government contractor, a major news provider, a leading
financial institution, and an innovative public health Web site.
Bob St. Clair is a Senior Product Manager for SchemaLogic
responsible for products integrating the SchemaLogic Enterprise Suite
with other enterprise systems. Since joining SchemaLogic in 2005, he
has designed and built several taxonomy and metadata integration
solutions with Search, Portal, and Enterprise Content Management
products. Prior to joining SchemaLogic, he worked for Corbis, one of
the largest stock photo companies in the world, where he designed
thesaurus construction, content cataloging, and Media Asset
Management systems. Bob holds a Masters of Library and Information
Science degree from the University of Washington in Seattle.