Level: Intermediate
Greg Flurry (flurry@us.ibm.com), Senior Technical Staff Member, IBM
Jim Conallen (jconallen@us.ibm.com), Senior Solution Engineer, IBM
Kyle Brown (brownkyl@us.ibm.com), Distinguished Engineer, IBM
Dr. Guenter Sauter (gsauter@us.ibm.com), Senior IT Architect and Manager, IBM
Mei Selvage (meis@us.ibm.com), SOA Data Architect, IBM
Eoin Lane (eoinlane@us.ibm.com), Senior Solution Engineer, IBM
10 May 2007
In this article, the authors look closely at the Preferred Data Source
Pattern, a Service-Oriented Architecture (SOA) pattern that allows a client to
retrieve information from a set of information sources, without knowing (at least at
a high level) that multiple sources exist.
Value proposition and context
The Preferred Data Source Pattern, or Preferred Source Pattern, is a microflow
pattern for service aggregation. The pattern allows a client to retrieve
information from a group of information sources without the need to understand, at
least at a high level, that multiple sources exist.
Consider the following situations where multiple data sources must appear as one:
- A company has multiple sources of information, some of which are more
expensive to access than others (for example, a local parts database and a
remote parts database).
- A company upgrades its IT systems and, in doing so, introduces new sources of
information that it must use in conjunction with old sources (for example,
customers).
- Two or more similar businesses merge, and all have somewhat dissimilar data
representing the same entities, such as customers.
- Any individual entity has some enterprise-unique identifier that's part of the
record (for example, a customer number or SKU).
Assume that the above scenarios are integrated in the context of information
management in an SOA Web services environment.
Problem
How can a client retrieve information from a set of disparate information sources
without the need to understand that multiple sources exist?
Solution
The Preferred Data Source Pattern identifies one of the data sources as the
preferred source and considers the others alternate sources, used only when the
preferred source can't provide the desired information. Figure 1 shows the
relationship between the facade and the adapters.
Figure 1. Relationship of facade and adapters
The pattern assumes that information obtained from any source comes in the form
of records that describe entities, such as customers or parts. Further, it
assumes that any individual entity has some enterprise-unique identifier that's
part of the record, such as a customer number or SKU.
The pattern contains a facade that hides the fact that multiple sources exist;
the client interacts only with this facade. The facade interface matches that of
the preferred source, and the preferred interface contains one or more operations
that allow the client to find (read) information matching various criteria. A find
operation returns 0..n records that match the criteria.
It's important to understand that no matter which source provides the
information, none of the returned records may be the desired record. Consider a
scenario in which a store clerk searches in a nationwide company database for
customers with the name John Smith. The find operation could return 20 John
Smiths, but none of them represent the John Smith standing in front of the clerk.
The client must depend on additional interactions with the user to determine
whether any of the returned records are the desired one.
The Preferred Source Pattern assumes that an information source has one or more
find operations that return zero or more instances of the entity record, or
perhaps a subset of the entity record. The information source may have one or more
write operations that allow a client to create and update entity records.
Find operations
Figure 2 shows a sequence diagram for a find operation in the pattern. The client
invokes the facade, which then invokes the preferred information source. If that
source provides no matches, the facade invokes the alternate information sources
in a predefined order until matches are found or until it exhausts all the
alternate sources. After it finds a match or exhausts all sources, the facade
returns to the client. Note: For the sake of clarity, we haven't shown the
synchronous returns.
Figure 2. Find operations
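The find sequence above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: each source is modeled as a plain callable rather than a Web service, and the names facade_find and make_source, as well as the sample records, are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
# Assumption: a source is modeled as a callable mapping criteria to records.
FindOperation = Callable[[Dict[str, str]], List[Record]]


def facade_find(criteria: Dict[str, str],
                sources: Sequence[FindOperation]) -> List[Record]:
    """Try each source in order; sources[0] is the preferred source."""
    for find in sources:
        matches = find(criteria)
        if matches:
            return matches  # stop at the first source that yields matches
    return []  # every source was tried and none had a match


def make_source(records: List[Record]) -> FindOperation:
    """Hypothetical in-memory stand-in for a real service endpoint."""
    def find(criteria: Dict[str, str]) -> List[Record]:
        return [r for r in records
                if all(r.get(k) == v for k, v in criteria.items())]
    return find


preferred = make_source([{"id": "C1", "name": "Ann Lee"}])
legacy = make_source([{"id": "C7", "name": "John Smith"}])
external = make_source([{"name": "John Smith"}])

# The preferred source has no John Smith, so the legacy source answers
# and the external source is never invoked.
found = facade_find({"name": "John Smith"}, [preferred, legacy, external])
```

Because the loop stops at the first source that returns anything, the order of the list determines both cost and which records the client sees.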
In its simplest form, the preferred source, and thus the pattern, supports only
find (read) operations. A virtual catalog capability might leverage such a
read-only pattern, as there's no need (or perhaps no ability) to update the
preferred source.
The description for the simplest form must include a Web Services Description
Language (WSDL) document that describes the preferred source and all alternate
sources. The facade and all alternate sources use the preferred source's interface
(port type). If an alternate source doesn't natively expose the same interface,
you can apply a transform pattern to the source; however, this pattern is out of
this article's scope. The WSDL for the alternate sources must differ from that of the
preferred source, at least in the endpoint address; with a bit more work, it may also
differ in the binding(s). The interface uses the schema describing the
entity record and any other parameters. Note: The WSDL document will define
or import the schema.
As indicated earlier, assume that an entity record includes a unique ID. The
description must identify the find operations to which the pattern will be
applied; treat all other operations as pass-through operations. It must also
include a list that shows the order in which the alternate sources are invoked.
You can, of course, have a single list of WSDL documents for the services; the
first in the list is the preferred source.
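One way to picture the description for the simplest form is as a small data structure. The sketch below is an assumption about how you might capture it; the class name PatternDescription, its field names, the operation names, and the WSDL file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PatternDescription:
    """Hypothetical container for the description of the read-only form."""
    id_field: str                     # name of the unique ID in the entity record
    find_operations: Tuple[str, ...]  # operations the pattern applies to
    source_wsdls: Tuple[str, ...]     # ordered; the first is the preferred source

    @property
    def preferred_wsdl(self) -> str:
        return self.source_wsdls[0]

    @property
    def alternate_wsdls(self) -> Tuple[str, ...]:
        # Alternates are invoked in exactly this order.
        return self.source_wsdls[1:]

    def is_pass_through(self, operation: str) -> bool:
        # Anything that is not an identified find operation goes
        # straight to the preferred source.
        return operation not in self.find_operations


description = PatternDescription(
    id_field="customerNumber",
    find_operations=("findByName", "findByPhone"),
    source_wsdls=("StoreCustomer.wsdl", "LegacyCustomer.wsdl",
                  "ExternalCustomer.wsdl"),
)
```

Keeping a single ordered list, as the text suggests, means the preferred source is simply the first entry and no separate setting is needed.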
In a more general case, the preferred source interface may contain additional
operations that allow the client to create, update, or delete (in some cases
delete may take the form of deactivate). Obviously, the pattern facade must also
support the additional operations.
When information resulting from a read operation doesn't come from the preferred
source, you may need to add the information to the preferred source. The pattern
should support efficient updates of the preferred source, but this is somewhat
problematic. Consider this customer information scenario: If the desired customer
record doesn't exist in the preferred source (the local store database), it may be
located in some legacy database at the store, or it may be undefined in the
enterprise so that you must use an external information source, such as Acxiom,
which finds information based on a phone number.
The facade's actions depend on the IDs in the entity records returned from the
source. The real alternate source may provide valid IDs, invalid IDs, or no IDs. A
valid ID is one that is acceptable as an ID in the preferred source. Assume the facade finds
the information in the legacy database, and four records match the search
criteria. Further assume that all records have a valid ID. In this case, the
facade should add none, one, or all four records to the preferred source,
depending on the circumstances. If none of the records represents the person in
front of the clerk, obviously the facade wouldn't save those in the preferred
source. If one of the records does represent the person in front of the
clerk, then it's highly likely that the facade should save the record in the
preferred source. But when should you do this?
Certainly the facade has no way of knowing which record is right, so the client
must initiate the create or update (write) operation. However, in this case, if
there's no new information about the customer, the client may not perform any
write operations at all! At best, the client might do something like write a new
time stamp that indicates the last time the identified customer visited. If there
is a migration relationship between the preferred source and one or more alternate
sources, you might want to automatically add all the records from the alternate
source to the preferred source without explicit action on the part of the client.
In a migration scenario, legacy sources might contain entity records with IDs that
are referenced in other parts of the enterprise. Thus, when you put the legacy
record into the preferred source, you should use the same ID to preserve the
integrity of references. To do this, you need an operation to create an entity
record with an existing ID.
You may require a different set of actions when the matching records from an
alternate source don't have a valid ID or have no ID; this would be the case for
external sources like Acxiom. If the records have no IDs and are returned to the
client, the client can easily determine that a record matching the person in front
of the clerk must be created in the preferred source, without assistance from the
facade. The client can add the record to the preferred source (through the facade,
of course) using an operation to create an entity record without an ID; you can
assume that such an operation exists as a pattern requirement. If the returned
records have invalid IDs and are returned to the client, the client cannot easily
determine that a record matching the person in front of the clerk must be created
in the preferred source. In fact, the idea of giving the client invalid IDs seems
flawed. Because the client can't detect that the ID is invalid, it can use the ID
in another context to link to the entity record. This means that an alternate
source used by the facade must return either valid IDs or no IDs. You may have to
produce either a valid ID or no ID in the transform pattern wrapper if the real
source doesn't do so. Creating a valid ID may require some sort of ID correlation
service that allows you to create valid IDs in multiple environments; that is out
of this pattern's scope.
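The requirement that an alternate source return either valid IDs or no IDs could be enforced in the wrapper along the following lines. This is a minimal sketch: the function name normalize_ids is hypothetical, and the validity rule used here (a "C" followed by digits) is purely an assumption for illustration, since the real rule depends on the preferred source.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def normalize_ids(records: List[Record],
                  id_field: str,
                  is_valid_id: Callable[[str], bool]) -> List[Record]:
    """Return copies of the records in which any invalid ID is removed."""
    normalized = []
    for record in records:
        copy = dict(record)  # never mutate what the real source returned
        if id_field in copy and not is_valid_id(copy[id_field]):
            del copy[id_field]  # an invalid ID becomes "no ID"
        normalized.append(copy)
    return normalized


def looks_like_customer_number(value: str) -> bool:
    # Assumed validity rule for this example only.
    return value.startswith("C") and value[1:].isdigit()


raw = [
    {"id": "C42", "name": "John Smith"},       # valid ID, kept
    {"id": "ext-9f3", "name": "John Smith"},   # invalid ID, stripped
    {"name": "John Smith"},                    # no ID, unchanged
]
clean = normalize_ids(raw, "id", looks_like_customer_number)
```

Stripping an invalid ID is the simpler of the two options mentioned above; producing a valid ID instead would need the ID correlation service that the article leaves out of scope.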
It might be interesting to define a service provider interface for an ID creation
service or function that the facade can optionally call to create valid IDs. The
facade can use that service in the case where an alternate source provides no IDs
and no create method returns the ID of newly inserted records. This can be even
more useful in the On Write policy described later. This discussion highlights
the need to:
- Identify the ID field in the entity record.
- Understand whether an alternate source provides valid IDs or no IDs.
The information helps drive the actions that the facade can take under various
circumstances. Another important aspect of the relationship between find and
update operations is the nature of the value type(s) returned from the find
operation(s) as well as the value type(s) used to drive the create operation(s).
The current thinking is that the pattern's initial implementation requires that
the value type always equal the entity record. This eliminates the need for
additional information on how to identify the entity record's subsets and
minimizes mapping code.
Updates lead to the need for policies, defined per alternate source, that govern
how entity records returned by read (find) operations on an alternate source
relate to write (create or update) operations on the preferred source. To drive
these per-source policies, you must identify the find, create, and update
operations in the preferred source interface. The following subsections describe
some of these policies.
Nothing policy
For find operations, the facade only returns the entity records; it does nothing
on its own for write operations. The client must detect the ID's validity to
deduce when it needs a create operation to insert a record into the preferred
source, and it must explicitly invoke that create operation on the facade. The
facade, in turn, invokes the create operation on the preferred source (see
Figure 3).
Figure 3. Nothing policy
This may be the best policy when the alternate source provides no IDs. For this
policy, all read operations are simply passed on to the sources, and results are
returned. All other operations are passed through to the preferred source.
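The division of labor under the Nothing policy might look like the sketch below. It assumes an alternate source that provides no IDs, so the client can treat a missing ID as the signal to create the record. The classes PreferredSource and NothingPolicyFacade, the function external_find, and the generated ID format are hypothetical stand-ins; only the create_noID operation name comes from the pattern description.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class PreferredSource:
    """Hypothetical in-memory stand-in for the preferred source."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}
        self._next_number = 1

    def find(self, criteria: Record) -> List[Record]:
        return [r for r in self.records.values()
                if all(r.get(k) == v for k, v in criteria.items())]

    def create_noID(self, record: Record) -> str:
        new_id = f"P{self._next_number}"  # assumed ID format
        self._next_number += 1
        self.records[new_id] = {**record, "id": new_id}
        return new_id


class NothingPolicyFacade:
    def __init__(self, preferred: PreferredSource,
                 alternate_find: Callable[[Record], List[Record]]) -> None:
        self.preferred = preferred
        self.alternate_find = alternate_find

    def find(self, criteria: Record) -> List[Record]:
        # Return whatever is found; never write as a side effect.
        return self.preferred.find(criteria) or self.alternate_find(criteria)

    def create_noID(self, record: Record) -> str:
        return self.preferred.create_noID(record)  # plain pass-through


def external_find(criteria: Record) -> List[Record]:
    """Stand-in for an external source that returns records without IDs."""
    return [{"name": criteria["name"], "phone": "555-0100"}]


facade = NothingPolicyFacade(PreferredSource(), external_find)
candidates = facade.find({"name": "John Smith"})
chosen = candidates[0]  # in practice the clerk confirms the right record
if "id" not in chosen:
    # The client, not the facade, decides that a create is needed.
    chosen = {**chosen, "id": facade.create_noID(chosen)}
second_lookup = facade.find({"name": "John Smith"})
```

After the explicit create, the second lookup is answered by the preferred source and the external source is no longer consulted for this customer.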
Add All policy
The facade adds all the entity records to the preferred source as a side effect
of the find operation. The client behaves identically to the way it does when the
records come from the preferred source. There are two subpolicies:
- Where the entity records have a valid ID.
- Where the entity records don't have a valid ID (have no ID).
For the first subpolicy (valid ID), the facade creates the records in the
preferred source using the identified create_withID operation and returns the
records from the alternate source to the client. For this subpolicy to be valid,
the preferred source must support a create operation that accepts existing IDs.
For the second subpolicy (no ID), the facade creates the records in the preferred
source using the identified create_noID operation. To avoid performing additional
read operations, the facade places the IDs created by the preferred source in the
records obtained from the alternate source; these are subsequently returned to the
client. For this subpolicy to be valid, the preferred source must support a create
operation that returns the ID of each newly created record (see Figure 4).
Figure 4. Add All policy
Alternatively, if no such operation exists, you can perform the create operations
on the preferred source, then perform another read on the preferred source using
the original criteria. This returns only the entities just added, because nothing
matching the criteria was initially found in the preferred source. These entities
can be returned to the client. A possibly valuable variant of the second (no ID)
subpolicy is to allow a choice between inserting all the records, as described, or
inserting none of the records; the result would be a policy that behaves as Add All
for sources with valid IDs and as Nothing for sources without valid IDs.
For this policy, you must identify the find operations. You don't need to
identify write operations, which are simply passed on to the preferred source, and
results are returned. You can determine which subpolicy to use (the first or the
second) by knowing whether or not the alternate source returns valid IDs.
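The two Add All subpolicies can be sketched as follows. This is an illustration under assumptions: the preferred source is an in-memory stand-in, the function name add_all and the generated ID format are hypothetical, and create_withID and create_noID are modeled after the operation names used above.

```python
from typing import Dict, List

Record = Dict[str, str]


class PreferredSource:
    """Hypothetical in-memory stand-in for the preferred source."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}
        self._counter = 0

    def create_withID(self, record: Record) -> None:
        # First subpolicy: keep the ID the alternate source supplied.
        self.records[record["id"]] = dict(record)

    def create_noID(self, record: Record) -> str:
        # Second subpolicy: the preferred source assigns and returns the ID.
        self._counter += 1
        new_id = f"P{self._counter}"  # assumed ID format
        self.records[new_id] = {**record, "id": new_id}
        return new_id


def add_all(preferred: PreferredSource,
            alternate_records: List[Record],
            source_has_valid_ids: bool) -> List[Record]:
    """Add every alternate record to the preferred source; return them."""
    returned = []
    for record in alternate_records:
        if source_has_valid_ids:
            preferred.create_withID(record)
            returned.append(dict(record))
        else:
            new_id = preferred.create_noID(record)
            # Put the new ID into the record so no extra read is needed.
            returned.append({**record, "id": new_id})
    return returned


store = PreferredSource()
from_legacy = add_all(store, [{"id": "C7", "name": "John Smith"}], True)
from_external = add_all(store, [{"name": "Jane Doe"}], False)
```

The boolean flag stands in for the per-source knowledge of whether the alternate source returns valid IDs; in a deployment that would more likely be part of the source's configuration.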
Note: This may be a deployment time problem in that a source could return
valid IDs, but not support the appropriate create operation (with IDs); you can
use the alternative for the second subpolicy in this case. This policy may be most
useful in migration scenarios when coupled with some sort of clean-up mechanism
associated with the alternate source. The mechanism (out of scope for this
pattern) would run periodically to remove entity records that shouldn't be
migrated from an alternate source to the preferred source. For example, customers
who haven't visited a store for six months might be removed from the legacy store
database. The drawback, of course, is that if the customer returns to the store,
the record won't be found anywhere in an existing database. Thus the customer will
appear as a new customer, and you will likely have to re-enter information about
him or her.
On Write policy
If the alternate source provides valid IDs, or the facade can create valid IDs
for the records, the facade remembers or caches the entity records retrieved from
the alternate source using the ID as the key. The facade then returns the records
to the client. If the source doesn't provide valid IDs, the facade simply returns
the records to the client. If the client invokes any operation that updates a
record that has a valid ID and is in the cache, the operation may provide only
partial information. The facade matches the ID, which must be part of the
parameters for the operation, against the cached records. It then merges the
information from the client with the cached record and invokes the create
operation (with ID) on the preferred source (see Figure 5).
Figure 5. On Write policy
This policy suffers from a set of problems:
- To cache the records from the alternate source, you need a cache per value
type returned.
- You must specify some cache characteristics, like number of entries or
lifetime, which you could derive from the caching pattern. You can consider the
Requester Side Caching Pattern in this case.
- The timing between putting an entry in the cache and removing it due to
timeout may be problematic in some situations. This might result in the client
updating an ID that has been removed from the cache. There seems to be no
recourse except to cause an exception or set the timeout so long that you risk a
cache blowup.
- Under normal circumstances, a read from the preferred source succeeds, and you
don't need to use the cache; but, when an update operation occurs, you need the
ID to look in the cache anyway, and a cache miss should occur. The update
operation would proceed as if it were a simple pass-through. The caching in this
case impacts performance.
- For this policy, you must identify the find and update operations in the
interface (you must apply the policy to all update and create operations).
Developers must also supply operation-specific merge logic on update operations
as custom code.
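A minimal sketch of the On Write flow follows, with several simplifying assumptions: the cache is an unbounded dictionary with no timeout, the merge logic is a plain field-by-field overwrite, the preferred source is an in-memory dictionary, and the cached entry is dropped once the record has been created. The class OnWriteFacade and its method names are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


class OnWriteFacade:
    def __init__(self, preferred: Dict[str, Record]) -> None:
        self.preferred = preferred          # stand-in: ID -> record
        self.cache: Dict[str, Record] = {}  # records seen in alternate sources

    def remember(self, records: List[Record]) -> None:
        """Cache records returned by an alternate source, keyed by ID."""
        for record in records:
            if "id" in record:  # records without an ID are not cached
                self.cache[record["id"]] = dict(record)

    def update(self, partial: Record) -> Record:
        """Handle a client update that may carry only some fields."""
        record_id = partial["id"]
        cached = self.cache.pop(record_id, None)
        if cached is not None:
            # Cache hit: merge, then create (with ID) in the preferred source.
            self.preferred[record_id] = {**cached, **partial}
        elif record_id in self.preferred:
            # Cache miss: ordinary pass-through update.
            self.preferred[record_id].update(partial)
        else:
            # Unknown ID, for example an entry that left the cache too early.
            raise KeyError(record_id)
        return self.preferred[record_id]


facade = OnWriteFacade(preferred={})
facade.remember([{"id": "C7", "name": "John Smith", "phone": "555-0100"}])
merged = facade.update({"id": "C7", "last_visit": "2007-05-10"})
```

The KeyError branch corresponds to the timing problem listed above: if a cached entry expires before the client writes, the facade has nothing to merge with and can only report a failure.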
Custom policy
It's unlikely that the three policies above cover all situations; when they
don't, you must use a Custom policy. You can start from the Nothing policy and
build on it to create a policy that works for your situation.
Forces at work
Consider some of the forces at work when dealing with these patterns:
- Generally, the various data sources provide all the persistence; this is
certainly true for read-only situations. When write is allowed, you can employ a
temporary cache.
- Two important performance considerations are the order in which you access
data sources and the number of those sources. In most situations, to maximize
performance it makes sense to make the preferred source the most likely source
of requested data. In other situations, such as in a data migration, the new
source (which may be the least likely to contain the requested data, at least
initially) is the preferred source.
Related patterns
The most common pattern used with the
Preferred Data Source Pattern is a wrapper pattern that makes disparate sources of
information look the same; it presents the same WSDL port type.
In some situations, you can use the Preferred Data Source Pattern recursively.
For example, a service in a store may be implemented with the Preferred Data
Source Pattern, and one of the alternate sources is a service at the enterprise.
You may implement that enterprise service with the Preferred Data Source Pattern.
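Recursive use falls out naturally if a facade exposes the same find interface as any other source. The sketch below assumes sources are modeled as callables; make_facade, make_source, and the sample parts data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
FindOperation = Callable[[Record], List[Record]]


def make_source(records: List[Record]) -> FindOperation:
    """Hypothetical in-memory stand-in for a real service."""
    def find(criteria: Record) -> List[Record]:
        return [r for r in records
                if all(r.get(k) == v for k, v in criteria.items())]
    return find


def make_facade(sources: Sequence[FindOperation]) -> FindOperation:
    """Build a facade whose find has the same shape as a source's find."""
    def find(criteria: Record) -> List[Record]:
        for source in sources:  # sources[0] is the preferred source
            matches = source(criteria)
            if matches:
                return matches
        return []
    return find


enterprise_db = make_source([{"sku": "42", "part": "gasket"}])
supplier_catalog = make_source([{"sku": "77", "part": "valve"}])
store_db = make_source([{"sku": "1", "part": "bolt"}])

# The enterprise service is itself a facade over two sources...
enterprise_service = make_facade([enterprise_db, supplier_catalog])
# ...and the store service uses it as one of its alternate sources.
store_service = make_facade([store_db, enterprise_service])

valve = store_service({"sku": "77"})
```

The store service does not know, and does not need to know, that its alternate source fans out to further sources.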
You can use the Requester Side Caching Pattern to implement the On Write policy.
Summary
The Preferred Data Source Pattern is considered an Enterprise Application
Integration (EAI) pattern in an SOA context; thus the strengths and weaknesses of
typical EAI patterns apply here. The Preferred Data Source Pattern works best when
the data in multiple sources are relatively consistent, clean, and
straightforward, and the returned result sets are small to medium. It provides
many advantages, including flexibility, extensibility, implementation simplicity, and
cost savings. However, you should be cautious about performance
implications, because the Preferred Data Source Pattern doesn't use parallel processing or query optimization.
About the authors
Greg Flurry is a senior technical staff member in the IBM Enterprise Integration Solutions group. His responsibilities include working with customers on service-oriented solutions and advancing the IBM service-oriented products.
Jim Conallen is a software engineer in IBM Software Group's Rational® Model-Driven Development Strategy team, where he is actively involved in applying the Object Management Group's (OMG) Model-Driven Architecture (MDA) initiative to the IBM Rational model tooling. Jim is a frequent conference speaker and article writer. His areas of expertise include Web application development, where he developed the Web Application Extension (WAE) for UML, an extension to the UML that lets developers model Web-centric architectures with UML at appropriate levels of abstraction and detail. This work served as the basis for IBM Rational Rose® and IBM Rational XDE™ Web Modeling functionality.
Kyle Brown is a distinguished engineer with IBM Software Services for WebSphere. Kyle is the author and contributor to several books and articles on patterns, including the Design Patterns Smalltalk Companion (Addison-Wesley, 1998) and Enterprise Integration Patterns (Addison-Wesley, 2003). Kyle was a program chair of the Pattern Language of Programs Conference and actively promotes and teaches pattern applications within IBM and to IBM customers.
Dr. Guenter Sauter, senior IT architect and manager, leads a team that is working on information service patterns, which address the linkage between information management and SOA. He is also the demo architect for information management, demonstrating capabilities across the complete IBM Information Management portfolio.
Mei Selvage is an SOA data architect with extensive hands-on experience in various information management areas and SOA. Her mission is to bridge the gap between SOA and information management. Her research interests include information management and integration patterns (both structured and unstructured data), data modeling, metadata, faceted search, human collaboration, and SOA.
Dr. Eoin Lane, senior solution engineer, is the lead for harvesting and developing an application pattern from key IBM SOA engagements and driving those patterns through IBM pattern governance process to accelerate adoption. Eoin also specializes in Model-Driven Development (MDD), asset-based development, and Reusable Asset Specification (RAS) to facilitate SOA development.