Build a life sciences collaboration network with LSID

A common protocol provides the foundation

Level: Advanced

Ben Szekely (bhszekel@us.ibm.com), Software Engineer, IBM

15 Aug 2003

If widely adopted, the Life Sciences Identifier protocol will enable scientists and researchers across multiple organizations to share data and collaborate in ways never before considered. You can build services that implement the LSID protocol using a combination of J2EE components that abstract away the protocol handling itself, leaving only the necessity of writing the service logic.

Life Sciences Identifier (LSID) is a new naming standard and data-access protocol being developed by the Interoperable Informatics Infrastructure Consortium (I3C.org) with help from IBM and other organizations such as Oracle, Sun Microsystems, and the Massachusetts Institute of Technology. A client application resolves an LSID against a special server called an authority to discover data and information about the data (metadata).

The admittedly idealistic goal of LSID Resolution is for all biotech, pharmaceutical, and other life sciences organizations to build LSID Resolution Services in front of their data. With a common standard for data retrieval, scientists across these organizations may then easily share data, facilitating collaboration on such vital projects as drug discovery and disease research. The LSID Server Framework enables this LSID utopia by allowing organizations to provide their data using a service implementation that best matches their data source. Certain data sources will require only mapping from LSID to URL, if each piece of data has a URL that can retrieve it in a standard format. If the data source is a relational database, a more complex service will need to be written.

This article shows how to build Resolution Services using Java 2 Enterprise Edition (J2EE)-based components. We'll look at the LSID Client Stack, which provides LSID connectivity within applications; the LSID Server Framework, which enables rapid development and deployment of LSID Resolution Services; and select resolution service implementations. Finally, we'll see how enterprises might integrate these components to form an Enterprise LSID Resolution Network. The figures in this article illustrate the architecture of individual components as well as the interaction between them. The red text and arrows show the involvement of key Java classes.

The LSID Client Stack

The LSID Client Stack is a simple yet crucial component of the LSID Resolution Network. It allows life sciences applications in the network to easily consume data provided via LSID. In addition to Java, the LSID Client Stack has C++/COM and Perl implementations, allowing integration with virtually any application. The three APIs expose the same functionality using programming methodologies and design patterns specific to the host language. Figure 1 shows the LSID Client Stack embedded in an application.


Figure 1. The LSID Client Stack architecture

Given an LSID of the form urn:lsid:authority:namespace:object, the stack resolves the actual host of "authority" first against a local list of authority endpoint URLs and then against DNS if necessary. When requested, the stack retrieves the WSDL document containing the data and metadata locations via the getAvailableOperations Web service call against the resolved authority. It then parses metadata and data locations from the WSDL and uses them for retrieval via HTTP, FTP, or SOAP. These locations do not need to be on the same server as the authority, and in general they will not be. The user may choose a specific location, specify only a certain protocol, or allow the stack to perform the entire selection. This flexibility abstracts the WSDL from the user, but provides the user with more granular control if necessary.
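As a rough illustration of the LSID form described above, the following sketch splits an LSID into its authority, namespace, and object parts. The class name LsidParts and its members are hypothetical and are not part of the LSID Client Stack API; the sketch handles only the five-part form shown in this article.

```java
// Hypothetical helper for illustration; not part of the LSID Client Stack API.
final class LsidParts {
    final String authority;
    final String namespace;
    final String object;

    private LsidParts(String authority, String namespace, String object) {
        this.authority = authority;
        this.namespace = namespace;
        this.object = object;
    }

    // Splits urn:lsid:authority:namespace:object into its three variable parts.
    static LsidParts parse(String lsid) {
        String[] parts = lsid.split(":", -1);
        if (parts.length != 5
                || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID of the expected form: " + lsid);
        }
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                throw new IllegalArgumentException("Empty component in LSID: " + lsid);
            }
        }
        return new LsidParts(parts[2], parts[3], parts[4]);
    }

    public static void main(String[] args) {
        LsidParts p = parse("urn:lsid:pdb.org:pdb:1aft");
        System.out.println(p.authority + " / " + p.namespace + " / " + p.object);
    }
}
```

The authority part is what a client would look up in its local list of authority endpoints before falling back to DNS.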

The client stack contains a file-based caching module that drastically improves response times for repeated data, metadata, and WSDL requests. When the user makes a request, the client stack first checks the file cache for a response before going over the network. After a network request has been completed, the stack writes the response to the cache. The lifecycle of a cache item is governed by local policy and response expiration. The local policy defines how long an item may live and the maximum size of a cache directory. In addition, metadata services and authorities may return expiration headers to advise the client of the expiration of metadata and WSDL. The data itself is immutable, per the LSID specification, so cached data never expires but may be removed by local cache policy enforcement.
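The lifecycle rules above can be sketched as a single decision function. This is only a model of the expiry decision, not the file-based cache module itself: the class name CachePolicySketch, the millisecond timestamps, and the way local policy and server expiration are combined are all assumptions made for illustration.

```java
// Hypothetical model of the cache expiry decision; not the actual cache module.
final class CachePolicySketch {
    enum Kind { DATA, METADATA, WSDL }

    // Local policy (assumed here to be a simple maximum age for any item).
    private final long maxAgeMillis;

    CachePolicySketch(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // serverExpiresAt is the expiration advised by the service, or null if none was sent.
    boolean isUsable(Kind kind, long storedAt, Long serverExpiresAt, long now) {
        // Local policy may remove any item, including immutable data.
        if (now - storedAt > maxAgeMillis) {
            return false;
        }
        // Data is immutable, so a server-side expiration never applies to it.
        if (kind == Kind.DATA) {
            return true;
        }
        // Metadata and WSDL honor the expiration advised by the service, if any.
        return serverExpiresAt == null || now < serverExpiresAt.longValue();
    }

    public static void main(String[] args) {
        CachePolicySketch policy = new CachePolicySketch(1000L);
        System.out.println(policy.isUsable(Kind.METADATA, 0L, Long.valueOf(10L), 500L));
    }
}
```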





The LSID Server Framework

Building LSID Resolution Services from scratch can be difficult and time consuming. Furthermore, the work of parsing requests and marshalling responses is common to all implementations. The LSID Server Framework facilitates development of LSID services by providing common protocol handling on the server and separating the LSID protocol from service implementation.

The system has three components: the Authority Service, the Data Service, and the Metadata Service. Each component defines a Java interface containing the applicable methods. In the case of the Authority Service, the methods are getKnownURIs and getAvailableOperations. getKnownURIs returns a list of LSIDs that the authority knows about. getAvailableOperations returns a WSDL document describing locations at which data and metadata for the LSID may be retrieved. Note that the third authority operation, getAuthorityVersion, is not included in this interface, because the authority version refers to the version of the protocol. The protocol version is abstracted from the service implementation, because it is based on hidden details such as HTTP/SOAP headers and SOAP return types. The Data Service and the Metadata Service interfaces contain methods for retrieving data (getData) and metadata (getMetaData), respectively.
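To give a feel for the three interfaces, here is a minimal sketch of what they might look like. Only the method names come from the description above; the interface names, parameter lists, and return types are assumptions, and the framework's real interfaces may well differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical shapes only; signatures are assumed, not taken from the framework.
interface AuthorityServiceSketch {
    List<String> getKnownURIs();
    String getAvailableOperations(String lsid); // would return a WSDL document
}

interface DataServiceSketch {
    byte[] getData(String lsid);
}

interface MetadataServiceSketch {
    String getMetaData(String lsid, String hint); // hint may be null
}

// A toy authority that knows only the LSIDs added to it.
final class InMemoryAuthoritySketch implements AuthorityServiceSketch {
    private final List<String> known = new ArrayList<String>();

    void add(String lsid) {
        known.add(lsid);
    }

    public List<String> getKnownURIs() {
        return Collections.unmodifiableList(known);
    }

    public String getAvailableOperations(String lsid) {
        if (!known.contains(lsid)) {
            throw new IllegalArgumentException("Unknown LSID: " + lsid);
        }
        // Placeholder text; a real authority would build a WSDL document here.
        return "<!-- WSDL for " + lsid + " would be generated here -->";
    }
}
```

Note that nothing in these interfaces mentions the protocol version, which matches the separation described above.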

Each service is driven by a corresponding HTTP servlet: AuthorityServlet, DataServlet, and MetaDataServlet. Each servlet parses the method, arguments, and target LSID (which may be a SOAP parameter or part of the URL) from a request. Using the target LSID, the servlet looks in the service registry to determine which configured service implementation to invoke. The lookup procedure compares the authority and namespace components of the LSID to mappings in the registry. For example, urn:lsid:pdb.org:pdb:1aft might be handled by PDBAuthorityImpl, whereas urn:lsid:swiss-prot.org:swiss-id:hv20_mouse-sprot might be handled by SWISSAuthorityImpl.
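The registry lookup can be pictured as a map keyed on authority and namespace. The sketch below is an assumption about how such a lookup might work, not the framework's registry: the class ServiceRegistrySketch is hypothetical, it stores implementation names as plain strings, and it matches components exactly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration; not the framework's actual registry.
final class ServiceRegistrySketch {
    private final Map<String, String> mappings = new HashMap<String, String>();

    void register(String authority, String namespace, String implementationName) {
        mappings.put(authority + ":" + namespace, implementationName);
    }

    // Returns the registered implementation name, or null if no mapping matches.
    String lookup(String lsid) {
        String[] parts = lsid.split(":", -1);
        if (parts.length != 5
                || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID of the expected form: " + lsid);
        }
        // Only the authority (parts[2]) and namespace (parts[3]) take part in the lookup.
        return mappings.get(parts[2] + ":" + parts[3]);
    }

    public static void main(String[] args) {
        ServiceRegistrySketch registry = new ServiceRegistrySketch();
        registry.register("pdb.org", "pdb", "PDBAuthorityImpl");
        registry.register("swiss-prot.org", "swiss-id", "SWISSAuthorityImpl");
        System.out.println(registry.lookup("urn:lsid:pdb.org:pdb:1aft"));
    }
}
```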

The AuthorityServlet exposes its operations only by HTTP SOAP. For getAuthorityVersion, the servlet immediately returns the version of the protocol it is using. In the current release, the protocol version is 3. For getKnownURIs, the servlet invokes getKnownURIs on each authority implementation for which a mapping is registered, to provide a complete list of all LSIDs it knows about. For getAvailableOperations, the servlet uses the LSID argument to look up the authority implementation from which to retrieve a WSDL of data and metadata locations.

The DataServlet exposes its single getData operation via HTTP GET and HTTP SOAP. For SOAP, the LSID is contained in the single SOAP parameter. For HTTP GET, the LSID is expected in the query string, for example: http://www.myappserver.com/lsid/data?lsid=urn:lsid:foo:bar. The MetaDataServlet works the same way but has an optional argument that the metadata service implementation can use as a hint to retrieve the metadata. For SOAP, this argument is embedded in the URL; for example: http://www.myappserver.com/lsid/metadata/metadata-hint. For HTTP GET, we use two request parameters for the LSID and the hint; for example: http://www.myappserver.com/lsid/metadata?lsid=urn:lsid:foo:bar&hint=metadata-hint.
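For the HTTP GET case, extracting the LSID and the hint amounts to reading two query parameters. A real servlet would normally get these from the servlet API; the sketch below does the same thing with only the standard library so that it stands alone. The class name GetRequestSketch is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; a servlet would use its request object instead.
final class GetRequestSketch {
    // Returns the decoded query parameters of the given URL, in order of appearance.
    static Map<String, String> queryParameters(String url) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        try {
            String query = new URI(url).getRawQuery();
            if (query == null) {
                return params;
            }
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                String[] keyValue = pair.split("=", 2);
                String value = keyValue.length == 2 ? keyValue[1] : "";
                params.put(URLDecoder.decode(keyValue[0], "UTF-8"),
                           URLDecoder.decode(value, "UTF-8"));
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + url, e);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = queryParameters(
            "http://www.myappserver.com/lsid/metadata?lsid=urn:lsid:foo:bar&hint=metadata-hint");
        System.out.println(params.get("lsid") + " with hint " + params.get("hint"));
    }
}
```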

A given LSID Resolution Service need only consist of an Authority Service. The Data and Metadata Services are, in a sense, only utilities for providing data and metadata endpoints. A given data provider, for example, might already have convenient HTTP locations for the data, which can be referenced directly in the getAvailableOperations WSDL.

Security

Security in the LSID Resolution Network is handled at the protocol level. For HTTP services, HTTP Basic Authentication is used. For SOAP services, authentication for the underlying transport protocol is used (HTTP, in current implementations). The LSID Client Stack and Server Framework handle these two cases.
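For readers unfamiliar with HTTP Basic Authentication, the credential is sent in an Authorization header as the Base64 encoding of "user:password". The sketch below shows only how that header value is formed; it assumes Java 8 or later for java.util.Base64, and the class name BasicAuthSketch is hypothetical rather than part of the Client Stack or Server Framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; assumes Java 8+ for java.util.Base64.
final class BasicAuthSketch {
    // Builds the value of the HTTP Authorization header for Basic Authentication.
    static String authorizationValue(String user, String password) {
        String credentials = user + ":" + password;
        byte[] bytes = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println("Authorization: " + authorizationValue("Aladdin", "open sesame"));
    }
}
```

Because Base64 is an encoding and not encryption, Basic Authentication is usually paired with a secured transport.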





LSID Resolution Service Implementations

The following examples of LSID Resolution Services show how the LSID Server Framework can be used.

The Caching Proxy Resolution Service

The Caching Proxy Resolution Service -- the LSID analogue of an HTTP proxy -- is a server that sits on the edge of a network (lab, department, organization, etc.) and proxies all LSID traffic. Furthermore, the proxy can cache all of its requests so that scientists using the same LSID working set will experience rapid response times. The Caching Proxy Resolution Service also serves as a way to monitor LSID traffic in an organization. Figure 2 shows the Caching Proxy Resolution Service.


Figure 2. The Caching Proxy Resolution Service architecture

The Caching Proxy Resolution Service uses the client stack to proxy requests to other services. The caching functionality of the client enables the proxy to respond quickly to requests it has cached. The caching proxy can process any LSID that is resolvable via DNS, so its list of known URIs is technically the global space of LSIDs. However, to pare this down, we return only LSIDs for which we have WSDL cached when getKnownURIs is invoked.

The Caching Proxy Resolution Service is composed of an Authority Service, a Data Service, and a Metadata Service. For getAvailableOperations, the proxy uses the client stack to call getAvailableOperations itself and builds another WSDL based on the response it receives. This new WSDL contains locations of data and metadata services in the proxy. When the proxy receives a request for getData, it makes a request to an arbitrary data location from the original WSDL, since the data itself is identical across all locations. However, each metadata location may contain different metadata, and so each original location must be exposed through the proxy. We encode the metadata port name in the hint in the URL we return in our WSDL. Thus, when we receive a request for getMetaData, we can relay the call to a specific metadata location.
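One way to picture the hint mechanism is as a round trip: the proxy appends the original metadata port name to its own metadata URL, and later recovers it from an incoming request. The article does not say how the port name is encoded, so the URL-encoding scheme, the class name MetadataHintSketch, and the example proxy URL below are all assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical encoding scheme for illustration; the proxy's real scheme is not documented here.
final class MetadataHintSketch {
    // Builds the proxy metadata location that carries the original port name as the hint.
    static String proxyLocation(String proxyMetadataBase, String portName) {
        try {
            return proxyMetadataBase + "/" + URLEncoder.encode(portName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Recovers the original port name from the last path segment of a proxy location.
    static String portNameFrom(String proxyLocation) {
        String hint = proxyLocation.substring(proxyLocation.lastIndexOf('/') + 1);
        try {
            return URLDecoder.decode(hint, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String location = proxyLocation("http://proxy.example.org/lsid/metadata", "pdbMetadataHttpPort");
        System.out.println(location + " relays to port " + portNameFrom(location));
    }
}
```

Encoding the port name keeps any slashes or spaces in it from being mistaken for URL structure.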

The Gateway Resolution Service

The Gateway Resolution Service provides an XML-based language, the Authority Service Description Language (ASDL), for explicitly describing the behavior of an Authority Service. An ASDL document contains a list of available LSIDs and their corresponding data and metadata locations. Currently, this document should be auto-generated from a relational database or flat-file store. A version is in development that allows mappings to be specified via Java-based regular expressions so that an entire authority may be described in a few lines of hand-written XML.
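The idea behind regular-expression mappings can be shown in plain Java, independent of ASDL. The sketch below is not ASDL and does not reflect its syntax, which is not shown in this article; the class name RegexMappingSketch, the pattern, and the data.example.org URL prefix are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of mapping LSIDs to URLs with a regular expression.
final class RegexMappingSketch {
    private final Pattern lsidPattern; // must contain one capturing group for the object id
    private final String urlPrefix;

    RegexMappingSketch(String lsidRegex, String urlPrefix) {
        this.lsidPattern = Pattern.compile(lsidRegex);
        this.urlPrefix = urlPrefix;
    }

    // Returns the data location for the LSID, or null if this mapping does not cover it.
    String dataLocation(String lsid) {
        Matcher matcher = lsidPattern.matcher(lsid);
        return matcher.matches() ? urlPrefix + matcher.group(1) : null;
    }

    public static void main(String[] args) {
        RegexMappingSketch mapping = new RegexMappingSketch(
            "urn:lsid:pdb\\.org:pdb:([A-Za-z0-9]+)", "http://data.example.org/pdb/");
        System.out.println(mapping.dataLocation("urn:lsid:pdb.org:pdb:1aft"));
    }
}
```

A single rule of this kind covers every object in a namespace, which is why a whole authority could fit in a few lines.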

The gateway has two use cases. The first, illustrated in Figure 3, is to provide an LSID-based view of a local data store. The developer must provide data and metadata service implementations that scrape the local data store. For example, these implementations might utilize JDBC to read the tables of a relational database. The entries in the ASDL file will reference the location of the DataServlet and MetaDataServlet.


Figure 3. The Local Gateway Resolution Service

The second use case, shown in Figure 4, is perhaps more powerful. If a life sciences data provider exposes data by static or dynamic URLs, as many do, a third-party developer (with permission of course) may create an ASDL document that assigns virtual LSIDs to the data. The data and metadata locations in the ASDL will point to pre-existing URLs.


Figure 4. The Remote Gateway Resolution Service

In practice, a combined approach may be used for providing third-party data. The ASDL file may contain both URLs pointing to the original data source and URLs pointing to a Data Service and/or a Metadata Service. In general, these services will relay the requests to the original data source URLs. This architecture will ensure that the data is provided over both SOAP and HTTP. This might be necessary in case the data provider allows only FTP access. FTP is not likely to be supported by all clients. However, the relays may need to do more than bridge two protocols. For example, the original URLs of the third-party data source may reference formatted HTML pages. The relay services might have to scrape the actual data from these pages in order to provide it in a standard format.

Custom Resolution Services

If ASDL is not descriptive enough to completely describe data and metadata mappings, a developer may provide a custom implementation as illustrated in Figure 5.


Figure 5. Architecture of a Custom Resolution Service

The most involved aspect of building an Authority Service is creating the WSDL in response to getAvailableOperations. The server framework provides simpler interfaces via the abstract classes SimpleAuthority and SimpleResolutionService. Using SimpleAuthority, the developer need only implement methods that return the locations of metadata and data for a given LSID. This information is used to construct WSDL. In Figure 5, LSIDAuthorityImpl could be written to extend SimpleAuthority. SimpleResolutionService provides a further abstraction for the common use case in which the Metadata Service and Data Service are hosted together with the Authority Service. LSIDMetaDataServiceImpl and LSIDDataServiceImpl could be merged into a single class that extends SimpleResolutionService. This derived class is also an Authority Service implementation that directs data and metadata requests to itself.
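The division of labor described for SimpleAuthority resembles the template method pattern: the subclass supplies locations, and the base class assembles the response. The sketch below illustrates that pattern only. It does not reproduce SimpleAuthority's actual methods, the class names are hypothetical, and the plain-text output is a stand-in for real WSDL generation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical base class illustrating the pattern; not the framework's SimpleAuthority.
abstract class LocationBasedAuthoritySketch {
    // Subclasses answer only these two questions.
    protected abstract List<String> dataLocations(String lsid);
    protected abstract List<String> metadataLocations(String lsid);

    // Stand-in for WSDL construction: lists the locations as plain text.
    final String describeOperations(String lsid) {
        StringBuilder out = new StringBuilder("operations for ").append(lsid).append('\n');
        for (String location : dataLocations(lsid)) {
            out.append("data: ").append(location).append('\n');
        }
        for (String location : metadataLocations(lsid)) {
            out.append("metadata: ").append(location).append('\n');
        }
        return out.toString();
    }
}

// A toy subclass with made-up URLs on example.org.
final class ExampleAuthoritySketch extends LocationBasedAuthoritySketch {
    protected List<String> dataLocations(String lsid) {
        String object = lsid.substring(lsid.lastIndexOf(':') + 1);
        return Arrays.asList("http://data.example.org/ns/" + object);
    }

    protected List<String> metadataLocations(String lsid) {
        String object = lsid.substring(lsid.lastIndexOf(':') + 1);
        return Arrays.asList("http://data.example.org/ns/" + object + "/metadata");
    }
}
```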

The Protein Data Bank

An example Custom Resolution Service is the Protein Data Bank (PDB) Authority. The PDB authority returns a mix of HTTP and FTP data locations, as well as SOAP endpoints that proxy data from those locations. The PDB authority also offers a metadata service that generates comprehensive RDF that relates LSIDs to each other and provides links to external resources.

Figure 6 shows the architecture of the PDB Authority. For convenience, a complete resolution service is often referred to simply as an authority. The WSDL returned by this authority provides both FTP and HTTP direct data locations and SOAP data locations via the PDB Data Service.


Figure 6. The Protein Data Bank Authority

Hybrid authorities

Because many authority implementations may be hosted by a single servlet, a host of hybrid authorities are possible. Consider a Caching Proxy Resolution Service that is also the direct authority for a certain set of LSIDs. From the perspective of a client application in a biology lab, the data discovery and retrieval process will appear uniform and seamless, regardless of whether the service has to go outside to resolve an LSID. Such a service could be used as the central authority for an enterprise, such as a pharmaceutical company, where researchers need to integrate data from internal as well as external sources. As the service handles more requests for external LSIDs, the cache grows, allowing increasingly rapid access to external data.

Figure 7 illustrates this Hybrid Resolution Service. The curved arrows show how the servlets dispatch requests to different service implementations. All LSIDs with authority myauth will be handled by the Local LSID Services. All other LSIDs will be handled by the Caching Proxy.
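The dispatch rule for the hybrid case reduces to a check on the authority component with a fallback. The following sketch shows that rule in isolation; the class and enum names are hypothetical, and the authority name myauth is taken from the description above.

```java
// Hypothetical dispatch rule for illustration; not the servlets' actual code.
final class HybridDispatchSketch {
    enum Target { LOCAL_SERVICES, CACHING_PROXY }

    private final String localAuthority;

    HybridDispatchSketch(String localAuthority) {
        this.localAuthority = localAuthority;
    }

    // LSIDs under the local authority stay in-house; everything else goes to the proxy.
    Target targetFor(String lsid) {
        String[] parts = lsid.split(":", -1);
        if (parts.length != 5
                || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID of the expected form: " + lsid);
        }
        return localAuthority.equals(parts[2]) ? Target.LOCAL_SERVICES : Target.CACHING_PROXY;
    }

    public static void main(String[] args) {
        HybridDispatchSketch dispatch = new HybridDispatchSketch("myauth");
        System.out.println(dispatch.targetFor("urn:lsid:myauth:results:42"));
        System.out.println(dispatch.targetFor("urn:lsid:pdb.org:pdb:1aft"));
    }
}
```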


Figure 7. Hybrid Resolution Service architecture




The Enterprise LSID Resolution Network

The LSID utopia described at the beginning of this article could be realized if each organization builds an Enterprise Resolution Network based on the Hybrid Caching Resolution Service with the assumption that the internal services are available to the outside world as well. Consider two research labs, each with such a network. Client applications in both laboratories have the same view of the federated data. In addition to this symmetry, the client applications can access external LSID-based data sources such as PDB and Gateway-based services. Independent external users may also access the resolution network with a client application. These relationships are illustrated in Figure 8.


Figure 8. Enterprise LSID Resolution Network




The future

Utopian visions aside, LSID Resolution will become more useful as more providers use LSIDs to expose their data. LSID Resolution Services exist for many well-known life sciences data sources including PDB, GenBank, PubMed, Swiss-Prot, Gene Ontology, LocusLink, and Ensembl. With a bit of work or outsourcing, any organization could provide its existing databases via LSID.

The remaining question to be answered is, how do scientists and researchers dynamically assign LSIDs to their personal data, results, papers, lab reports, and other documents? This capability is a prerequisite to fully achieving the level of collaboration and data federation enabled by the LSID Resolution Network. The LSID development team in Cambridge, Mass., is currently working to solve this new problem.






About the author

Ben Szekely is a Software Engineer with IBM Internet Technology in Cambridge, Mass. Ben participated in the Extreme Blue internship program in 2000 before graduating from Cornell University in May 2002. Ben is responsible for the LSID client and server design and its implementation in Java. You can contact Ben at bhszekel@us.ibm.com.



