Level: Intermediate
Mine Altunay (maltuna@unity.ncsu.edu), Student, North Carolina State University
Daniel Colonnese (dcolonn@ncsu.edu), Student, North Carolina State University
Chetna Warade (warade@us.ibm.com), Developer, IBM Healthcare & Life Sciences
25 May 2004
Current bioinformatics workflows require screen-scraping the results of different bioinformatics tools on several Web sites. High-throughput services integrated with Web services allow researchers to access a virtual organization by providing seamless access to vast computational and storage resources. In this article you can learn the details of integrating Open Grid Services Architecture (OGSA), Web services, and the NC BioGrid.
High-throughput services
There are considerable costs associated with running a high-throughput application, including hardware, storage, maintenance, and bandwidth. Researchers are now taking advantage of economies of scale by building large shared systems for bioinformatics processing. Some researchers have invested in special-purpose hardware or configurable Field Programmable Gate Arrays (FPGAs) for specific applications. The methods of submitting a job from within a grid are well established. The process of consuming Web services from within a grid is addressed in the Open Grid Services Architecture (OGSA). This article explains the process of accessing a high-throughput application remotely via Web services.
The convergence of several trends, including grid technologies and Web services, has made a new model for bioinformatics possible. Processing power, storage, and network bandwidth have all advanced to the point where it is now feasible to provide high-throughput applications as Web services.
The size of XML output reports is a hindrance to the integration of high-throughput bioinformatics applications with Web services. Since processing power is usually less scarce than bandwidth, most high-throughput applications benefit from file compression. XML documents are particularly well-suited for compression. For example, a series of BLAST output reports can be reduced to less than 1 percent of the original size using the zlib lossless data-compression library. The representation of DNA sequence data takes up the most space. The genetic coding data is represented as a combination of the characters A, C, G, and T. You can compress each 8-bit character down to 2 bits with the conversion (A=00, C=01, G=10, T=11). Since most existing programs expect character input, the packed data must be expanded again before use, so you are trading CPU time for bandwidth.
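The two techniques described above can be sketched as follows. This is a minimal Python illustration, not the code deployed on the NC BioGrid: the function names are hypothetical, the 2-bit mapping is the one given in the text, and the sketch assumes the sequence contains only the characters A, C, G, and T (ambiguity codes such as N would need a different encoding).

```python
import zlib

# Mapping taken from the text: A=00, C=01, G=10, T=11.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_dna(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_dna(packed, length):
    """Reverse pack_dna; the original length must travel with the data."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def compress_report(xml_text):
    """Compress an XML report with zlib at the highest compression level."""
    return zlib.compress(xml_text.encode("utf-8"), 9)


def decompress_report(blob):
    """Restore the XML text produced by compress_report."""
    return zlib.decompress(blob).decode("utf-8")
```

Because the last byte may be only partly filled, the sender has to transmit the sequence length along with the packed bytes. The actual compression ratio depends on the reports being compressed.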
Security can also be a major concern, with regard to both protecting sensitive data and preventing abuse of computational resources. Encrypting the network traffic over the Secure Sockets Layer (SSL) is usually sufficient to protect sensitive data. Our Web services use HTTP basic authentication as described in the section Secure Web services and Globus Security Infrastructure.
Running genomics applications on the NC BioGrid
The NC BioGrid (see Resources) has been built by a consortium consisting of over 70 organizations, including universities and colleges, biomedical, biotechnology and information technology companies, nonprofit institutions, and foundations. It allows for terascale computer operations, petascale data storage, and high-speed networking for consortium members. Since Globus Toolkit 2.0 for the computational grid and Avaki for the data grid are already installed on the NC BioGrid, our design followed the specifications of this underlying middleware.
There are two main challenges associated with running an application on the grid: submitting jobs remotely via Web services and integrating secure Web services standards with Globus Security Infrastructure (GSI).
When a request arrives from the Web service, the XML request is parsed and the necessary information, such as the nucleotide chain and the specifications of a sequence, is compiled to match the basic local alignment search tool (BLAST) application interface. Since there might be more than one sequence inside one XML document, each of the submitted sequences is parsed separately and a FASTA format file is created for each of them under the specific username and job identification number. Also, the statistical parameters required for the BLAST program are collected from the XML document and passed to the BLAST executables on the computational nodes through Resource Specification Language (RSL). Below is an example of how the globusrun command might be used.
Remote job submission
globusrun -r bluejay002.ncbiogrid.org -f submit.rsl
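The request-parsing step that precedes job submission, in which each submitted sequence becomes its own FASTA file, might look like the Python sketch below. The article does not show the request schema, so the element names (blastRequest, program, database, sequence), the id attribute, and the file-naming layout are all hypothetical. The function returns file contents rather than writing them, so the caller decides where they go.

```python
import xml.etree.ElementTree as ET


def request_to_fasta(xml_text, user, job_id):
    """Return (params, files): BLAST parameters and one FASTA record per sequence.

    The element names used here are hypothetical; the real request
    schema is not shown in the article.
    """
    root = ET.fromstring(xml_text)
    params = {
        "program": root.findtext("program"),
        "database": root.findtext("database"),
    }
    files = {}
    for index, node in enumerate(root.findall("sequence")):
        seq_id = node.get("id", "seq%d" % index)
        # Drop any whitespace or line breaks embedded in the sequence text.
        residues = "".join((node.text or "").split())
        # File name keyed by user, job id, and sequence index (assumed layout).
        name = "%s_%s_%d.fasta" % (user, job_id, index)
        # FASTA: a ">" header line, then the sequence wrapped at 60 columns.
        lines = [residues[i:i + 60] for i in range(0, len(residues), 60)]
        files[name] = ">%s\n%s\n" % (seq_id, "\n".join(lines))
    return params, files
```

Keying each file name by user and job identifier keeps simultaneous requests from overwriting one another's input files.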
The RSL invocation script must also set the environment variables that the Globus Toolkit 2.0 requires. Variables such as the name of the authentication server, the MyProxy server, and User Proxy servers are all configured at runtime by the Web service provider before the actual job submission. The following code is an example of how to set those variables.
Setting environment variables on the Grid
setenv MYPROXY_SERVER_DN "/O=Grid/OU=NCBioGrid/OU=TestBed/CN=host/bluejay015.ncbiogrid.org"
setenv MYPROXY_SERVER bluejay015.ncbiogrid.org
To fully exploit the processing power of the grid, users may submit several jobs simultaneously. For this reason, RSL is used in a completely generic manner, so that at runtime the master node decides how many sequences to submit and produces the required RSL code with different specifications for each job. To make the best use of resources for each submitted job, different resource managers are assigned to each job at runtime. Below is the code for generating RSL at runtime.
Generate RSL code at runtime
# Build one multi-request RSL string: "+" followed by one "&(...)" clause per sequence.
my $rsl_string = "+";
# Loop over the input files by index so there is exactly one clause per file.
foreach my $i (0 .. $#inputFiles) {
    $rsl_string .= "(&(executable=/ncbg/apps/blast/bin/blastall)
(directory = /home/$user/grid-blast)
(arguments =
-p \"$program\" -d \"$db\" -i \"$inputFiles[$i]\" -m 7 -o \"$outputFiles[$i]\")
(environment = (BLASTDB /ncbg/data/blast/current/n/ecoli.nt))
(count = 1)
(resourceManagerContact=bluejay002.ncbiogrid.org))";
}
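For readers who want to experiment outside Perl, the same idea can be sketched in Python. This is an illustration under assumptions, not the production code: the function name and parameters are hypothetical, the paths and the resource manager contact are copied from the listing above, and every job is sent to the same contact for simplicity.

```python
def build_multi_rsl(user, program, db, input_files, output_files,
                    contact="bluejay002.ncbiogrid.org"):
    """Build a multi-request RSL string with one '&' clause per input file."""
    if len(input_files) != len(output_files):
        raise ValueError("each input file needs a matching output file")
    clauses = []
    for infile, outfile in zip(input_files, output_files):
        clauses.append(
            "(&(executable=/ncbg/apps/blast/bin/blastall)"
            "(directory=/home/%s/grid-blast)"
            '(arguments=-p "%s" -d "%s" -i "%s" -m 7 -o "%s")'
            "(environment=(BLASTDB /ncbg/data/blast/current/n/ecoli.nt))"
            "(count=1)"
            "(resourceManagerContact=%s))"
            % (user, program, db, infile, outfile, contact)
        )
    # The leading "+" marks a multi-request made of the clauses that follow.
    return "+" + "".join(clauses)
```

Checking that the parentheses balance before writing the string to submit.rsl is a cheap way to catch template mistakes early.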
Most bioinformatics applications use databases to search for similarities between different species and their genomes. The National Center for Biotechnology Information (NCBI) maintains a separate database for each species' genome. Based on a bioinformaticist's requirements, the proper database must be installed and made available to all the computational nodes that participate in the computation. The Avaki data grid provides virtual shared directories, so each participating node can easily access a local replica of the required database. At runtime, a single primary node generates the RSL code and specifies the location of the virtual directory that contains the necessary database.
Secure Web services and Globus Security Infrastructure
Another challenge of running Web-service-enabled applications on a grid is the integration of Globus Security Infrastructure (GSI) with secure Web services. When you deploy a grid application as a Web service on an HTTP server, such as Apache, three separate security domains must interact. First, users need to have username/password credentials to access the Web server. Second, users need separate credentials to access the specific Web service, or any operations within it. Finally, to run applications on the grid, users must have proper Globus credentials, which consist of username, public/private key, hostname, and the digital signature of the certificate authority.
In the Globus Toolkit GSI, you log into a host machine participating in the grid with a set of credentials. Then you are authenticated based on these credentials. If the authentication is successful, a server-side user proxy is created on your behalf. Whenever you want to get access to resources on the grid, your specific user proxy talks to the resource proxy and confirms your identity. This functionality saves you from typing your password each time you need to access different resources. In other words, single-sign-on is provided through user and resource proxy interaction.
However, exposing an application outside the grid, via Web services, brings some complications to the GSI system. From the perspective of the Globus environment, the user Apache is the job owner, so Globus expects the user Apache to have the proper credentials for job submission. However, only the end user, the bioinformaticist, has such credentials. Therefore, the credentials must be propagated to the Globus environment.
This propagation of credentials may be summarized as two different steps. The first step is retrieving the username and password from the Web service provider, and the second step is passing those credentials from Apache to Globus.
HTTP basic authentication is the least common denominator for security among Web service providers. While basic authentication is not necessary to secure the Web service, it adds both speed and standardization to the security system. There are several ways to map HTTP basic authentication to grid users. In most systems, valid grid users also have regular user accounts on the machines that participate in the grid. Preferably, these user accounts are stored in a standard identity registry such as LDAP. Most often, one machine is running an LDAP server, such as iPlanet or OpenLDAP. Several Apache modules map valid users from an LDAP directory to valid users in Apache. For example, Apache can use mod_pam, Auth_ldap, LDAP_auth, mod_authz_ldap, and mod_ldap, all of which map system user permissions to Apache permissions. While every grid user is a user on a participating machine, the reverse is not true. Therefore, security policies are defined in a .htconf file for both the groups and users. For most of these modules, everything that can be said about individual users also applies to groups of users.
In Perl, you can retrieve the username and password (at the Web service provider) by calling these functions from a CGI script with mod_perl.
Retrieve authentication information
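my $r = shift;                                 # Apache request object passed to the handler
my ($ret, $password) = $r->get_basic_auth_pw;  # status code and the password the client sent
my $user = $r->user;                           # user name; read it after get_basic_auth_pw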
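To make clear what the server extracts in this step, the Python sketch below decodes an HTTP basic authentication header by hand. With basic authentication, the client sends the user name and password joined by a colon and Base64-encoded in the Authorization header. The function name is hypothetical, and in practice the Web server does this work for you.

```python
import base64
import binascii


def parse_basic_auth(header_value):
    """Return (user, password) from an Authorization header value,
    or None if the header is missing or malformed."""
    if not header_value:
        return None
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic" or not token:
        return None
    try:
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # Split on the first colon only; the password itself may contain colons.
    user, sep, password = decoded.partition(":")
    if not sep:
        return None
    return user, password
```

Base64 is an encoding, not encryption, so the credentials are readable by anyone who can see the traffic. That is one reason the traffic is sent over SSL, as noted earlier.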
Once you have the username and password, the next step is to retrieve the grid credentials necessary to run a job.
An open source project called MyProxy from the National Center for Supercomputing Applications (NCSA) provides a mechanism to map users to their grid credentials. These credentials consist of a certificate and private key. With MyProxy, the certificate and private key files need not be stored on the same machine as the Web service consumer. This provides greater security and allows trusted parties to renew credentials so that long-running tasks do not fail because of an expired proxy credential. The NCSA runs a public MyProxy server and the software is available from the Partnership for Advanced Computational Infrastructure and the NSF Middleware Initiative (see Resources). MyProxy APIs are also available through the Globus Commodity Grid Kits (see Resources).
To delegate the proper credentials to an Apache server on behalf of an end user, Apache must authenticate the user to the MyProxy server and retrieve the user's grid credentials into a specified directory under the appropriate user ID. An example for the user maltuna is given below:
Delegating credentials to an Apache server
myproxy-get-delegation -s bluejay015.ncbiogrid.org -l maltuna -o /tmp/x509up_maltuna
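A Web service provider would typically assemble this command per request. The Python sketch below builds the argument list for an arbitrary user, using only the flags shown in the command above. The function names and the user-name pattern are assumptions, and the sketch leaves out how the passphrase is supplied.

```python
import re

# Conservative pattern for account names; an assumption, adjust to site policy.
_SAFE_USER = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]*$")


def proxy_path(user, base_dir="/tmp"):
    """Per-user proxy file path, following the /tmp/x509up_<user> pattern."""
    if not _SAFE_USER.match(user):
        raise ValueError("unsafe user name: %r" % user)
    return "%s/x509up_%s" % (base_dir, user)


def myproxy_command(user, server="bluejay015.ncbiogrid.org"):
    """Argument list for myproxy-get-delegation, using the flags shown above."""
    return ["myproxy-get-delegation", "-s", server, "-l", user,
            "-o", proxy_path(user)]
```

Validating the user name before it becomes part of a file path matters here, because the name comes from the HTTP request and ends up in a location under /tmp. Passing the result as a list to a process launcher, rather than as a shell string, also avoids shell quoting problems.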
When many users access the Apache server simultaneously, each credential is stored under the specific user name, so no end user can access or overwrite someone else's credential file. By default, this credential is stored under the Apache user ID, such as /tmp/x509up_u6448.
When the job is submitted to Globus, GSI checks the default user proxy server and credentials under the job owner's name. Since MyProxy is not a part of the GSI system, the Globus environment is not aware of any rights delegated to Apache. To link them together, the Apache server needs to set the X509_USER_PROXY environment variable to the location of the delegated credential file.
Linking delegated rights and the Globus environment
setenv X509_USER_PROXY /tmp/x509up_$3
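If the provider launches globusrun from a script rather than from a shell, the same link can be made by passing a modified environment to the child process. The Python sketch below shows one way to do that. The function names are hypothetical, and the proxy path follows the per-user /tmp/x509up_ pattern used earlier.

```python
import os


def globus_env(user, base_env=None):
    """Copy the environment and point X509_USER_PROXY at the user's proxy file."""
    env = dict(os.environ if base_env is None else base_env)
    env["X509_USER_PROXY"] = "/tmp/x509up_%s" % user
    return env


def globusrun_command(rsl_file, contact="bluejay002.ncbiogrid.org"):
    """Argument list matching the globusrun invocation shown earlier."""
    return ["globusrun", "-r", contact, "-f", rsl_file]
```

A caller could then run something like subprocess.run(globusrun_command("submit.rsl"), env=globus_env("maltuna"), check=True). Setting the variable on a copy of the environment, rather than on the server process itself, keeps one user's proxy from leaking into another user's job when requests are handled concurrently.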
An example: Web service for BLAST
The NC BioGrid is now hosting a Web service for BLAST that lets you execute thousands of BLAST operations in parallel.
The BLAST Web service is a SOAP::Lite service using SOAP over HTTP transport. Authentication for the Web service is handled with HTTP basic authentication from Apache, bound to an iPlanet LDAP server with the mod_pam module. The Globus framework handles authorization. To use GridBlast, you must be a member of the NC BioGrid.
The GridBlast service on the NC BioGrid provides a model for how document-style services can allow existing applications to access remote processing power. Currently the NC BioGrid has Globus Toolkit 2.0 installed. Although our existing infrastructure does not support grid services, the document-style service currently deployed will become a grid service when the grid is upgraded to Globus Toolkit 3.0.
Summary
This article addressed several issues relating to high-throughput Web services, such as large data sets and security. It demonstrated the BLAST Web service deployed on the NC BioGrid, describing various problems and possible solutions.
Acknowledgements
This paper describes the joint work of the Extreme Blue team (Summer 2003), the Fungal Genomics Lab at NC State University, and the North Carolina BioGrid. Our team has set up a framework for deploying bioinformatics applications as high-throughput Web services on the North Carolina BioGrid. The intern team consists of: Mine Altunay (maltuna@unity.ncsu.edu), Daniel Colonnese (dcolonn@ncsu.edu), Chetna Warade (warade@us.ibm.com), and Lindsay Wilber (WilberL04@darden.virginia.edu). The team was advised by members of the IBM Life Sciences Group, including Virinder Batra (batra@us.ibm.com), Madhu Gombar (mgombar@us.ibm.com), Rick Runyan (runyan@us.ibm.com), Prasad Vadlamudi (prasadv@us.ibm.com) and Doug Brown (debrown@unity.ncsu.edu).
Resources
- Get more information on Apache AXIS.
- Read Part 1 and Part 3 of the "Web services for bioinformatics" series (developerWorks, May & June 2004).
- Read over the BioPerl 1.2 Module Documentation.
- Check out the JAX-RPC Specification v1.0.
- Visit the NC BioGrid.
- Take a look at a "Web Service for Bioinformatic Analysis Workflow" on alphaWorks.
- See a demo or download the "Bioinformatic Workflow Builder Interface" on alphaWorks.
- Read the article Web Services for Life Sciences, which has an example set of Web services that offers standard bioinformatics applications and demonstrates the technology (alphaWorks, February 2003).
- Find the data compression library, zlib Canonical, at the zlib homepage.
- Download the Globus Toolkit from Globus.org.
- Find out what the Globus Commodity Grid Kits are all about.
- Store your Grid credentials in the MyProxy Online Credential Repository.
- Find the MyProxy related software at the Partnership for Advanced Computational Infrastructure Web site and the NSF Middleware Initiative site.
- Read the article "Reap the benefits of the document-style Web services" (developerWorks, June 2002).
- Browse through the PDF by I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "A Security Architecture for Computational Grids." In Proceedings of the 5th ACM Conference on Computer and Communications Security, pages 83-92, November 1998.
- Check out the Globus Security Infrastructure (GSI).
About the authors
Mine Altunay: Mine is currently pursuing her PhD at the Computer Engineering Department of North Carolina State University. Her studies focus on grid computing and workflow management in OGSA, with a strong emphasis on authorization and trust management issues. She is also a member of the Fungal Genomics Laboratory, where she has worked on several bioinformatics projects, as well as the establishment and integration of their computational and data grids with North Carolina BioGrid. You can contact Mine at maltuna@unity.ncsu.edu.
Daniel Colonnese: Daniel has recently completed his master’s degree in computer science from NC State University. He has worked on a number of projects in ecommerce, life sciences, and grid computing. His interests include software reliability and service-oriented architectures. He will be joining Lotus/Portal technical sales in June 2004. You can contact Daniel at dcolonn@ncsu.edu.
Chetna Warade: Since 1999, Chetna has worked on a wide range of projects varying from systems programming to bioinformatics. She has a strong interest and aptitude in software architecture and development, systems programming, and various emerging technologies such as Web services, life sciences, and the new breed of Internet technologies. You can contact Chetna at warade@us.ibm.com.