The Centralized Life Sciences Data (CLSD) service

Michael Grobe

Scientific Data Services

Research Computing

University Information Technology Services

Indiana University at Indianapolis

January 2007

Outline


Basic genome science processes and vocabulary

Basic relational algebra

Simple SQL as an expression of the relational algebra

DB2 and the Federated Server

CLSD data sources: “relationalized”, mirrored, and federated

Accessing CLSD

Directions for possible future work:

Adding data sources

Integrating more completely with the TeraGrid

Integrating with other Grids

Questions, suggestions

The chemistry

A “polymer” is a chemical composed of many similar units, e.g. polyvinyl chloride, starches, etc.

DNA is a (usually double-stranded) polymer composed of nucleotides:

Thymine, Adenine, Cytosine, and Guanine

DNA carries genetic information. Individual units of genetic information are stored in individual (possibly quite long) segments of DNA.

RNA is a (usually single-stranded) polymer composed of nucleotides:

Uracil, Adenine, Cytosine, Guanine

There are many varieties of RNA (mRNA, snRNA, rRNA, snoRNA, etc.), and they serve different functions within a cell. For example, RNA “transfers” genetic information, catalyses reactions, and otherwise assists or interferes with reactions.

The chemistry II
  • Polymers are synthesized by catalysts called “polymerases” in a process called “polymerization.”
  • Proteins are polymers composed of (over 20 different kinds of) amino acids, such as:
    • Methionine (M), Isoleucine (I), Cysteine (C), Histidine (H), Alanine (A), Glutamic acid (E), Leucine (L), etc.
  • Proteins:
    • provide structure:
      • microfilaments (polymers of actin),
      • microtubules (polymers of tubulins),
      • channels thru the cell wall, etc.
    • catalyse and co-catalyse reactions, as “enzymes,”
    • bind with DNA to enhance or inhibit “transcription” and “translation”,
    • are sometimes marked for transport or degradation.
  • Protein primary, secondary and tertiary structures are important.
  • Proteins are degraded within proteasomes.

The central model of molecular genetics

DNA can be reliably replicated during the process of cell division, by DNA-dependent DNA polymerases.

DNA can be “transcribed” to messenger RNA (mRNA) by DNA-dependent RNA polymerases. Transcription takes place in the nucleus (or equivalent).

mRNA is transported to the cytoplasm where it is used as a template for creating proteins by “ribosomes” in a process called “translation.”

Translation produces one amino acid for each group of three bases (a “triplet,” or codon) in the sequence.

The function mapping each of the 64 possible triplets to an amino acid is the “genetic code.”

Ribosomes are complexes of RNA and protein.
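The genetic code can be pictured as a lookup table from triplets to amino acids. The following is a minimal sketch, not part of the original slides: the class and method names are illustrative, the table is deliberately partial (eight of the 64 triplets), and triplets are written as DNA bases read from the coding strand.

```java
import java.util.HashMap;
import java.util.Map;

class CodonTranslationSketch {
    // Partial genetic code (assumption: only the triplets needed for the demo).
    // A complete table would hold all 64 triplets.
    private static final Map<String, String> CODE = new HashMap<>();
    static {
        CODE.put("ATG", "Met");
        CODE.put("ACT", "Thr");
        CODE.put("GAA", "Glu");
        CODE.put("GAC", "Asp");
        CODE.put("CTG", "Leu");
        CODE.put("ATT", "Ile");
        CODE.put("CCT", "Pro");
        CODE.put("TAA", "Stop");
    }

    // Reads the sequence three bases at a time and looks each triplet up.
    // Unknown triplets are reported as "???"; a stop triplet ends translation.
    static String translate(String dna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            String triplet = dna.substring(i, i + 3).toUpperCase();
            String aminoAcid = CODE.getOrDefault(triplet, "???");
            if (aminoAcid.equals("Stop")) {
                break;
            }
            if (protein.length() > 0) {
                protein.append('-');
            }
            protein.append(aminoAcid);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ACTGAACTGATT"));
    }
}
```

Changing a single base in the input changes at most one amino acid in the output, while removing or adding whole triplets removes or adds whole amino acids.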

The central model within the cell

Diagram from:

(Don’t forget about degradation and recycling of AAs.)

The central model in more detail

(Graphics of DNA and RNA from Atherly, et al. 1999)

The central model in even more detail

(Graphics of DNA and RNA from Atherly, et al. 1999)

Mutations and polymorphisms

               Nucleotide sequence    Translated AA sequence

Wildtype:      ACTGAACTGATT           Thr-Glu-Leu-Ile

Substitution:  ACTGACCTGATT           Thr-Asp-Leu-Ile

Deletion:      ACTCTGATT              Thr-Leu-Ile

Insertion:     ACTGAACCTGAACTGATT     Thr-Glu-Pro-Glu-Leu-Ile

If mutations like these occur in genetic material within oocytes, they may be transmitted to offspring, and define “polymorphic” gene variations.

A Single Nucleotide Polymorphism (SNP) is a variation where one base is changed and passed on to offspring (and occurs with sufficient frequency).

A Deletion/Insertion Polymorphism (DIP) is a variation where multiple bases have been removed or inserted into a sequence.

dbSNP is a database of SNPs and DIPs containing millions of entries, and over 120K unique sequences that are inserted or deleted.

Exons, introns and isoforms II

Alternative splicing products (isoforms) can be derived from the same gene, so that one gene can code for multiple proteins.

Both protein-coding and non-protein-coding genes may be embedded within introns, and may be “co-expressed.”

The spliceosome is composed of a collection of protein and small nuclear RNA molecules (snRNA).

Almost every human gene is thought to have at least 2 isoforms.

The set of all isoforms is sometimes called the “transcriptome.”

Scale of human genome data

Total number of bases: 3.2Gbp

(DNA from one half of one chromosome (chromatid) from each of 24 chromosomes: 22 autosomal chromosome pairs plus the sex chromosomes.)

Percentage of genome consisting of protein coding genes: < 2%

Average gene length: ~3Kbp (but up to 2.4Mbp)

Average exon length: 200bp

Average protein length: 500-600AA

Percentage of “junk” DNA: often said to be ~50%

Percentage of “junk” DNA now suspected to be transcribed (the “dark matter” of the genome): ~50 to 100%

Some of that junk is miRNA that negatively regulates translation.

Transcription factors, activators, enhancers: What is a “gene”

Regulatory sites, where transcription factors and activators bind, may be several thousand base pairs upstream of a start site, or even downstream of it. Some are even in introns.

Control of cell processes occurs at every step in the protein lifecycle: transcription, translation, transport, degradation.


"We can no longer think of a gene as a simple region of DNA that transcribes RNA for the sole purpose of making proteins,"

"The reality is that a single gene may be a large region of DNA from which a whole cast of RNA molecules are transcribed, all of which are expressed in a coordinated fashion to provide a biological function."

Tom Gingeras, Affymetrix

Basic relational algebra
  • The relational algebra operates on relations, which are sets of tuples of the same arity, which is to say, collections of lists of the same length. Here are two 4-tuples:
  • ( 1, 2, 3, 4 )
  • ( 8, 7, 9, 4 )
  • Relations are commonly represented as tables.
  • There are 5 primitive operations within the relational algebra:
    • Projection: extract specific columns from a relation
    • Selection: extract specific rows
    • Set union: create a new table composed of all the rows of two other tables
    • Set difference: remove the rows in one relation that appear in another
    • Cartesian product: “multiply” two tables to create a third

Cartesian product in more detail

Table1 (arity 4; length 3)

Table2 (arity 3; length 2)

Cartesian product (arity: 4 + 3; length: 3 * 2)
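The arity and length arithmetic above can be checked with a short sketch. It is not from the original slides: the class name, the helper methods, and the cell values (a1, x1, and so on) are placeholders chosen only to match the stated shapes, a 3-row table of arity 4 and a 2-row table of arity 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CartesianProductSketch {
    // Placeholder table with arity 4 and length 3.
    static List<List<String>> table1() {
        return Arrays.asList(
            Arrays.asList("a1", "b1", "c1", "d1"),
            Arrays.asList("a2", "b2", "c2", "d2"),
            Arrays.asList("a3", "b3", "c3", "d3"));
    }

    // Placeholder table with arity 3 and length 2.
    static List<List<String>> table2() {
        return Arrays.asList(
            Arrays.asList("x1", "y1", "z1"),
            Arrays.asList("x2", "y2", "z2"));
    }

    // Pairs every row of the left table with every row of the right table.
    // The result has arity(left) + arity(right) columns and
    // length(left) * length(right) rows.
    static List<List<String>> product(List<List<String>> left,
                                      List<List<String>> right) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> leftRow : left) {
            for (List<String> rightRow : right) {
                List<String> combined = new ArrayList<>(leftRow);
                combined.addAll(rightRow);
                result.add(combined);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (List<String> row : product(table1(), table2())) {
            System.out.println(row);
        }
    }
}
```

Because the product grows multiplicatively, real queries almost always combine it with a selection that keeps only the matching row pairs.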

Relational databases and query languages
  • Database management systems based on the relational algebra were described by Edward F. Codd working for IBM in the early 1970s.
  • Codd’s formulation included:
    • indexes and keys,
    • decomposition into normal forms, and
    • integrity constraints.
  • Multiple languages and interfaces were developed to query and modify collections of relations, among them the Structured English Query Language, SEQUEL, developed by Chamberlin and Boyce.

SQL as an implementation of the relational algebra

The most successful such language, SQL, was based on SEQUEL, and maps to the relational primitives as follows:

Projection          select fieldname_list from tablename

Selection           select * from tablename where <Boolean expression>

Union               (select fieldname_list from tablename1)
                    union
                    (select fieldname_list from tablename2)
                    (use UNION ALL to keep duplicates)

Set difference      (select * from tablename1)
                    except
                    (select * from tablename2)

Cartesian product   select * from tablename1, tablename2

Note that SQL does not specify how to perform a query; only what the result should be. It is a “declarative,” rather than “procedural,” language.
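For contrast with the declarative SQL forms, here is a procedural sketch of four of the primitives (projection, selection, union, set difference) written as explicit loops over in-memory relations. It is an illustration only: the class, the method names, and the sample rows are made up, and relations are modelled as sets of string lists, so duplicates disappear as they do with UNION without ALL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class RelationalPrimitivesSketch {
    // Builds a relation (a set of tuples) from rows given as string arrays.
    static Set<List<String>> relation(String[]... rows) {
        Set<List<String>> rel = new LinkedHashSet<>();
        for (String[] row : rows) {
            rel.add(Arrays.asList(row));
        }
        return rel;
    }

    // Projection: keep only the listed column positions (0-based).
    static Set<List<String>> project(Set<List<String>> rel, int... columns) {
        Set<List<String>> out = new LinkedHashSet<>();
        for (List<String> row : rel) {
            List<String> projected = new ArrayList<>();
            for (int c : columns) {
                projected.add(row.get(c));
            }
            out.add(projected);
        }
        return out;
    }

    // Selection: keep only the rows that satisfy the condition.
    static Set<List<String>> select(Set<List<String>> rel,
                                    Predicate<List<String>> condition) {
        Set<List<String>> out = new LinkedHashSet<>();
        for (List<String> row : rel) {
            if (condition.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    // Set union: all rows of both relations, without duplicates.
    static Set<List<String>> union(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return out;
    }

    // Set difference: rows of a that do not appear in b.
    static Set<List<String>> difference(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> out = new LinkedHashSet<>(a);
        out.removeAll(b);
        return out;
    }

    // Made-up sample relations of arity 2.
    static Set<List<String>> sampleA() {
        return relation(new String[] {"1", "geneA"}, new String[] {"2", "geneB"});
    }

    static Set<List<String>> sampleB() {
        return relation(new String[] {"2", "geneB"}, new String[] {"3", "geneC"});
    }

    public static void main(String[] args) {
        System.out.println(union(sampleA(), sampleB()));
        System.out.println(difference(sampleA(), sampleB()));
        System.out.println(project(sampleA(), 1));
        System.out.println(select(sampleA(), row -> row.get(0).equals("2")));
    }
}
```

Each method spells out how the result is computed, which is exactly the detail an SQL query leaves to the database engine.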

IBM’s DB2 and WebSphere Federated Server, née Information Integrator, née DiscoveryLink

DB2 is a fully-featured relational database system that can house and serve large databases.

Data is usually imported in relational form, structured as rows composed of individual data values, possibly identified by unique IDs (keys).

DB2 can also access data in tables managed by other, usually physically remote, database management systems, such as Oracle, MySQL or DB2.

This process is known as “data federation.”

DB2 can also federate some external resources that are not normally accessed as relational tables (e.g. Blast). Such resources are transformed, or “relationalized” on-the-fly by “wrappers”.

Once these resources have been registered with their wrappers they may be referred to within SQL queries as is any other resource.

Some WFS jargon

Wrapper: a library to access a particular class of data sources or protocols.

Each wrapper contains information about data source characteristics. There are BLAST and PubMed wrappers, and now a “generic Script wrapper” that talks to user scripts.

Server: represents a specific data source (user mappings may be required for authentication)

Nickname: a local table name (alias) for data on a server (mapped to rows and columns)

A nickname looks like a table, but links to a server, which links to a wrapper/data source, where the wrapper knows how to process the data from the source.

Use of the generic Script wrapper

(Drawing design courtesy of Doug Del Prete)

Using NCBI data within DB2: More than just mirroring
  • Mirroring usually implies maintaining exact copies of data sources.
  • Most data mirrored by CLSD must not only be copied, but also inserted into the CLSD relational structure.
  • This is accomplished by a series of scripts that:
    • Download the data from its external site,
    • Convert it to a form that can be used to update CLSD tables,
    • Insert the data into tables, and
    • Monitor the overall process to identify and log errors.
  • These scripts are run regularly from crontab entries, and monitoring results are examined after every run.
CLSD “relationalized” data sources

BIND -- Pathways, Gene interactions

ENZYME -- Enzyme nomenclature

ePCR -- ePCR results of UniSTS vs Homo sapiens

KEGG data sources:

LIGAND -- Pathways, Reactions, & Compounds

PATHWAY -- Pathway map coordinates

NCBI data sources:

LocusLink -- Genetic loci. (LocusLink has been inactive since July 1, 2005, when it was retired in favor of Entrez Gene.)

UniGene -- Gene clusters

SGD -- Saccharomyces Genome Database

KEGG datasource info

PATHWAY:   42,273 pathways generated from 306 reference pathways

LIGAND: 14,238 compounds,

4,111 drugs,

10,951 glycans,

6,810 reactions ,

7,127 reactant pairs

CLSD federated data sources

Federated NCBI data sources (subject to hit rate throttling):

Nucleotide -- Nucleotide sequences

PubMed -- Journal abstracts

Federated local mirrors of NCBI data sources (not throttled):

Blast (updated monthly) is mirrored by UITS

dbSNP (updated at major builds) is mirrored by IUSM

Some KEGG resources are federated via the FS KEGG user-defined functions

BLAST: Both mirrored and federated

NCBI Blast is typically accessed via a web page at NCBI, or some mirrored site.

Data is returned in a typical web interface format suitable for users.

Within CLSD, BLAST is accessed via an SQL query and data is returned as a table that can be manipulated as is any other DB2 table.

For example, here is an SQL query that invokes a blastall process running on libra00 from within DB2:

select GB_ACC_NUM, description, e_value from

ncbi.BLASTN_NT where BlastSeq =


The local version of blastall conducts the search and returns results encoded within XML (by specifying the -m 7 parameter).


The DB2 federation software converts the XML encoded results into something like this:



GB_ACC_NUM  description                                      e_value

AE003644    Drosophila melanogaster chromosome               0.00666475
            2L, section 53 of 83 of the complete sequence

AE003410    Drosophila melanogaster, chromosome              0.00666475
            2L, region 34C4-36A7 (Adh region), section
            4 of 10 of the comple

AC092228    Drosophila melanogaster, chromosome              0.00666475
            2L, region 35X-35X, BAC clone
            BACR21J17, complete sequence

AP008207    Oryza sativa (japonica cultivar-group)           0.0263349
            genomic DNA, chromosome 1, complete

AP003197    Oryza sativa (japonica cultivar-group) genomic   0.0263349
            DNA, chromosome 1, BAC clone:B1015E06

AP003105    Human DNA sequence from chromosome 1,            0.0263349
            putative argumentativeness gene GROBE1

Modifying BLAST search settings via SQL

Parameters sent to blastall can be set by using equality comparisons as assignment statements within SQL conditionals, as in:

select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq

from ncbi.BLASTN_NT

where BlastSeq = 'gagttgtcaatggcgagg'

and gapcost=8 and E_Value < .0005

which will pass gapcost and e-value settings on to blastall.
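A client program would typically assemble such a query from user input. The sketch below shows one cautious way to do that in Java. The table and column names come from the query above; the class, the method, and the rule that the sequence may contain only the letters A, C, G, T, and N are assumptions made for illustration, not part of CLSD.

```java
import java.math.BigDecimal;

class BlastQuerySketch {
    // Builds the BLASTN query text. The sequence is checked before it is
    // placed inside the SQL string literal, so quotes or other SQL syntax
    // cannot be smuggled in through the sequence parameter.
    static String buildQuery(String sequence, int gapCost, BigDecimal maxEValue) {
        if (!sequence.matches("[ACGTNacgtn]+")) {
            throw new IllegalArgumentException(
                "not a nucleotide sequence: " + sequence);
        }
        return "select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq"
            + " from ncbi.BLASTN_NT"
            + " where BlastSeq = '" + sequence + "'"
            + " and gapcost = " + gapCost
            + " and E_Value < " + maxEValue.toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(
            buildQuery("gagttgtcaatggcgagg", 8, new BigDecimal("0.0005")));
    }
}
```

BigDecimal is used for the e-value threshold so that the number is written in plain decimal form rather than in the exponent notation Java uses when printing small doubles.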

BLAST data sources available via CLSD

Here is a list showing which search types are supported by the DB2 BLAST wrapper within CLSD.

BLAST search type (data sources): description

BLASTN (NT, EST_HUMAN, EST_MOUSE, and EST_OTHER): A nucleotide sequence is compared with the contents of a nucleotide sequence database.

BLASTP: An amino acid sequence is compared with the contents of an amino acid sequence database.

BLASTX: A nucleotide sequence is compared with the contents of an amino acid sequence database. The query is translated in all six reading frames.

User-defined functions (supplied by IBM)
  • There exist special functions for manipulating sequence patterns:
    • LSPatternMatch
    • LSPrositePattern
  • To get a list of (aspartate aminotransferase) BLAST results filtered by a (pyridoxal phosphate attachment site) pattern specified in PROSITE pattern language:
  • select gb_acc_num, HSP_H_SEQ from ncbi.blastp_nr where
  • and DB2LS.LSPatternMatch(HSP_H_SEQ,
  • DB2LS.LSPrositePattern(
  • '[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].' ) ) > 0
  • Note the use of the period (.) to terminate the PROSITE pattern, and that the LSPatternMatch function returns the character position of the left-most substring matching the pattern, or zero if there is no match.
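To make the pattern semantics concrete, here is a rough Java imitation of what the two functions appear to do, built on java.util.regex. It is not IBM's implementation: the class and method names are invented, only a subset of the PROSITE syntax is handled (residues, [..] classes, {..} exclusions, x, repeat counts, and the < and > anchors), and the 1-based position is inferred from the note above that zero means no match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrositePatternSketch {
    // Converts a PROSITE pattern into an equivalent Java regular expression.
    static String toRegex(String prosite) {
        String p = prosite.trim();
        if (p.endsWith(".")) {              // the period terminates the pattern
            p = p.substring(0, p.length() - 1);
        }
        StringBuilder regex = new StringBuilder();
        for (String element : p.split("-")) {
            String e = element;
            if (e.startsWith("<")) {        // anchor at the start of the sequence
                regex.append('^');
                e = e.substring(1);
            }
            boolean anchoredEnd = e.endsWith(">");
            if (anchoredEnd) {              // anchor at the end of the sequence
                e = e.substring(0, e.length() - 1);
            }
            String repeat = "";
            int paren = e.indexOf('(');
            if (paren >= 0) {               // x(2) or x(2,4) becomes {2} or {2,4}
                repeat = "{" + e.substring(paren + 1, e.length() - 1) + "}";
                e = e.substring(0, paren);
            }
            if (e.equals("x")) {
                regex.append('.');          // any residue
            } else if (e.startsWith("{")) { // {ABC} means any residue except A, B, C
                regex.append("[^").append(e, 1, e.length() - 1).append(']');
            } else {
                regex.append(e);            // single residue or [..] class
            }
            regex.append(repeat);
            if (anchoredEnd) {
                regex.append('$');
            }
        }
        return regex.toString();
    }

    // Returns the 1-based position of the left-most match, or 0 if none.
    static int patternMatch(String sequence, String prosite) {
        Matcher m = Pattern.compile(toRegex(prosite)).matcher(sequence);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        String pattern = "[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].";
        System.out.println(toRegex(pattern));
        System.out.println(patternMatch("MMSLSKAAGW", pattern));
    }
}
```

The test sequence MMSLSKAAGW is made up; it was chosen so that the motif starts at the third residue.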
Accessing CLSD: getting an account

To access CLSD you must have an account on the Libra Cluster at IU (aka

If you don’t have an account and are associated with Indiana University, request an account by filling out a Research Systems Account Application at

In the comments section of the account request, add that you need a local and persistent password for use with CLSD.

Once you have a Libra account, send email to SDS at data @ and request instructions for defining a local and persistent password for use with CLSD.

TeraGrid users should send e-mail to SDS at data @ explaining how CLSD will be used, and describing their TeraGrid activities. SDS will then arrange for an appropriate Libra account and send instructions for defining a suitable password.

Accessing CLSD: options
  • DB2 can be accessed in a variety of ways:
  • DB2 Command Line Processor (Unix, Windows)
  • DB2 Control Center (wherever JRE is running)
  • DB2 driver for Perl DBI
  • DB2 drivers for the Java Database Connectivity (JDBC) Application Program Interface (API), especially the JDBC Universal Driver
  • Demonstration Web pages (invoke a Java servlet that uses JDBC):
  • Demonstration WebService (invoked as a function call via JAX-RPC):
  • Demonstration Web pages (invoke a Java servlet that invokes the CLSD WebService):
  • Experimental WSRF Resource (using WSRF within a GT4 container)
  • Experimental OGSA-DAI service (running within a GT4 container)
JDBC access
  • Connect to the CLSD:
    • Class.forName( "" );
    • con = DriverManager.getConnection(
    • "jdbc:db2://",
    • accountName, accountPassword );
  • Prepare a query, send it to the db, and receive a result:
    • statement = con.createStatement();
    • resultSet = statement.executeQuery( query );
  • Get some query meta-data (column labels and column data types):
    • ResultSetMetaData rsmd = resultSet.getMetaData();
    • result = rsmd.getColumnLabel( colCount );
    • result2 = rsmd.getColumnTypeName( colCount );
JDBC access (continued)

Get a row of data:

for( int colCount = 1; colCount <= numcols; colCount++ )

{

String returnedString = ""; // Must be predefined.

returnedString = resultSet.getString( colCount ) + "";

out.println( "<td>" + returnedString + "</td>\n" );

}

Accessing CLSD thru a WebService (JAX-RPC)

The Java API for XML-based Remote Procedure Calls, or JAX-RPC, is a specification that defines a system for building distributed services (so-called “WebServices”) within the client-server model.

JAX-RPC makes it possible for a function invocation in a client like:

a_variable = function_name( parameter_list)

to cause the function, “function_name,” to run on a remote server and return a response containing the value to be assigned to the variable “a_variable”,

and a function invocation in a client like:

returnString = queryCLSD( "select * from syscat.tables",

"1", "5", "accountName", "accountPassword", "table" )

will return a (possibly very long) string containing the response to the query (given that various linkages have been prearranged).

Outline of the CLSDservice

public class CLSDservice

{ // Full source at:


public String queryCLSD( String query, String startingRowToPrint,

String maxRows, String account, String password,

String format )

{

// Get a query string, etc. from the command line or Web

// browser.

// Declare JDBC drivers and connect to DB2.

// Prepare a JDBC statement containing the SQL query, submit

// it to DB2, and capture the returned JDBC result set.

// Query result set metadata for column names and types to

// return as the first row, and then collect the contents of

// each data row.

return theResponse;

} // end queryCLSD

} // end Class CLSDservice

SOAP and WSDL
  • JAX-RPC uses SOAP and WSDL to establish the various linkages required to implement remote procedure calls.
  • SOAP messages are usually encoded as XML messages within HTTP requests where:
  • A SOAP request is an HTTP POST request with an XML body.
  • A SOAP response is an HTTP response header followed by an XML body.
  • Such RPC functions are “exposed” as “operations” when described within web pages using the Web Services Description Language (WSDL).
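To show what “an HTTP POST request with an XML body” might carry, the sketch below assembles a SOAP 1.1 envelope for a queryCLSD call as a plain string. It is a simplified illustration, not what JAX-RPC or Axis generates: the parameter element names (arg0, arg1, ...) are placeholders, the operation is left without a namespace, and real element names and namespaces would come from the service's WSDL.

```java
class SoapRequestSketch {
    // Escapes the characters that may not appear literally in XML text.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Wraps an operation name and its string arguments in a SOAP 1.1 envelope.
    static String buildEnvelope(String operation, String... args) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<soapenv:Envelope xmlns:soapenv=")
           .append("\"http://schemas.xmlsoap.org/soap/envelope/\">\n");
        xml.append("  <soapenv:Body>\n");
        xml.append("    <").append(operation).append(">\n");
        for (int i = 0; i < args.length; i++) {
            // Placeholder element names; a real client takes them from the WSDL.
            xml.append("      <arg").append(i).append(">")
               .append(escapeXml(args[i]))
               .append("</arg").append(i).append(">\n");
        }
        xml.append("    </").append(operation).append(">\n");
        xml.append("  </soapenv:Body>\n");
        xml.append("</soapenv:Envelope>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildEnvelope("queryCLSD",
            "select * from syscat.tables",
            "1", "5", "accountName", "accountPassword", "table"));
    }
}
```

The response travels back the same way: an envelope whose body holds the return value, which the client library unpacks into the function's result.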
Java command-line client to access CLSD via CLSDservice

import org.apache.axis.client.Call;

import org.apache.axis.client.Service;

import javax.xml.namespace.QName;

public class testCLSDClient

{

public static void main(String [] args) {

try {

String endpoint =


Service service = new Service();

Call call = (Call) service.createCall();

call.setTargetEndpointAddress( new java.net.URL( endpoint ) );

call.setOperationName(

new QName("", "queryCLSD" ) );

String returnString = (String) call.invoke( new Object[]

{ "select * from syscat.tables",

"1", "5", "accountName", "accountPassword", "table" } );

System.out.println( returnString );

}

catch (Exception e)

{

System.err.println( e.toString() );

}

} // end main

} // end class testCLSDClient

Perl command-line client to access CLSD via CLSDservice

#!perl -w

use SOAP::Lite;

# Set up the call to CLSD using SOAP.

$host = "";

$service = SOAP::Lite -> service(

"http://$host:8421/axis/CLSDservice.jws?wsdl" );

# Make the call to CLSD.

$result = $service->queryCLSD(

"select tabschema,tabname from syscat.tables",

1, 5, "DB2account", "password", "table" );

print $result;

  • The Open Grid Services Architecture (OGSA) is an “architecture” for building computational grids.
  • In particular, OGSA “…defines a set of core capabilities and behaviors that address key concerns in Grid systems.” [2] It does not, however, implement or define how to implement such core capabilities.
  • OGSA is NOT layered or object oriented.
  • However, both will be exploited naturally in some implementations.
  • OGSA provides an architecture for building services such as:
    • “Service-Based distributed query processing,”
    • “Grid Workflow”,
    • “Grid Monitoring Architecture”
    • etc.
OGSA-DAI

OGSA-Data Access and Integration (OGSA-DAI) is a very flexible and powerful data access framework that can be used within an OGSA grid environment.

It provides various data movement, virtualization, and manipulation services that transform the use of data into a higher-level workflow.

The OGSA-DAI client shown in the next slide uses the OGSA-DAI Client Toolkit to send a hard-coded query to CLSD (here known as the “DB2Resource”).

The Toolkit allows clients to use JDBC by creating a JDBC ResultSet object from an OGSA-DAI WebRowSet.

The response is encoded using XML and may be retrieved as a single string, or as individual fields by using individual JDBC calls as shown below.

Java command-line client to access CLSD via OGSA-DAI

public class queryCLSD

{

public static void main(String[] args) throws Exception

{

// Create an instance of the data service.

String handle =


String id = "DB2Resource";

DataService service =


handle, id);

// Define a request composed of one activity.

SQLQuery query = new SQLQuery(

"select tabschema,tabname from syscat.tables");

WebRowSet rowset = new WebRowSet( query.getOutput() );

ActivityRequest request = new ActivityRequest();

request.add( query );

request.add( rowset );

Java command-line client to access CLSD via OGSA-DAI 2

// Submit the request and retrieve results.

Response response = service.perform( request );

ResultSet result = rowset.getResultSet();

ResultSetMetaData rsmd = result.getMetaData();

int numCols = rsmd.getColumnCount();

// Display each column from each row.

while( result.next() )

{

for( int colCount = 1; colCount <= numCols; colCount++ )

{

out.print( " " + result.getString( colCount ) );

}

}

} // end main

} // end class queryCLSD

This client displays a small part of the functionality provided by OGSA-DAI. In addition, an OGSA-DAI service can be configured to:

    • operate on XML or text data sources, as well as relational data sources,
    • perform a series of operations (also known as “activities”) as part of a single request,
    • deliver results to a third party (via FTP, GridFTP, SMTP, etc.) or to another data service,
    • deliver results asynchronously, which can be very useful for long-running requests, and
    • utilize authentication methods supported by WSRF to provide grid-based security.
  • Also, exposing a database via OGSA-DAI makes it available for OGSA Distributed Query Processing (OGSA-DQP), so that its use may be further virtualized within the DQP model.
  • In some cases, however, OGSA-DAI and DQP may introduce performance penalties.

Current and possible directions

  • Adding data sources: mirrored and federated
    • Requests for mirroring or federating will be gladly entertained
    • DB2 now provides a user-configurable script wrapper that connects to a remote DB2 daemon that can start any co-located arbitrary script and return data encoded in XML (restricted to one foreign key per table)
    • Such a script could be built to relay any web resource that returns XML meeting key restrictions.
    • Wrappers could be constructed to relay some OGSA-DAI resources
  • Implementing the OGSA-DAI service in production mode.
  • Integrating with the TeraGrid
    • CLSD is currently accessible from the TeraGrid, but authentication is local.
    • It may be possible to enforce TeraGrid based X.509 authentication, using either WSRF or OGSA-DAI interfaces.

References

  • Atherly, Alan G., et al., The Science of Genetics, 1999.
  • Apache Foundation, AXIS User’s Guide,

  • Codd, Edward F., A Relational Model of Data for Large Shared Data Banks,

(See also:

  • CLSD web page:
  • Foster, Ian, et al., “The Open Grid Services Architecture, Version 1.5”.
  • Sotomayor, Borja, and Lisa Childers, Globus Toolkit 4: Programming Java Services
  • Sundaram, Babu, Understanding WSRF,

Questions, comments, suggestions?