The Centralized Life Sciences Data (CLSD) service Michael Grobe Scientific Data Services

The Centralized Life Sciences Data (CLSD) service Michael Grobe Scientific Data Services Research Computing University Information Technology Services Indiana University at Indianapolis January 2007

Outline
• Basic genome science processes and vocabulary
• Basic relational algebra
• Simple SQL as an expression of the relational algebra
• DB2 and the Federated Server
• CLSD data sources: “relationalized”, mirrored, and federated
• Accessing CLSD
• Directions for possible future work:
  • Adding data sources
  • Integrating more completely with the TeraGrid
  • Integrating with other Grids
• Questions, suggestions

The chemistry
• A “polymer” is a chemical composed of many similar units, e.g. polyvinyl chloride, starches, etc.
• DNA is a (usually double-stranded) polymer composed of nucleotides carrying the bases Thymine, Adenine, Cytosine, and Guanine.
• DNA carries genetic information. Individual units of genetic information are stored in individual (possibly quite long) segments of DNA.
• RNA is a (usually single-stranded) polymer composed of nucleotides carrying the bases Uracil, Adenine, Cytosine, and Guanine.
• There are many varieties of RNA (mRNA, snRNA, rRNA, snoRNA, etc.), and they serve different functions within a cell. For example, RNA “transfers” genetic information, catalyses reactions, and otherwise assists or interferes with reactions.

The chemistry II
• Polymers are synthesized by catalysts called “polymerases” in a process called “polymerization.”
• Proteins are polymers composed of (over 20 different kinds of) amino acids, such as Methionine (M), Isoleucine (I), Cysteine (C), Histidine (H), Alanine (A), Glutamic acid (E), Leucine (L), etc.
• Proteins:
  • provide structure:
    • microfilaments (polymers of actin),
    • microtubules (polymers of tubulins),
    • channels through the cell membrane, etc.
  • catalyse and co-catalyse reactions, as “enzymes,”
  • bind with DNA to enhance or inhibit “transcription” and “translation”,
  • are sometimes marked for transport or degradation.
• Protein primary, secondary and tertiary structures are important.
• Proteins are degraded within proteasomes.

Genetic material: 2 meters of DNA packaged into less than 1.4 microns (from Atherly, et al., 1999)

The central model of molecular genetics
• DNA can be reliably replicated during the process of cell division, by DNA-dependent DNA polymerases.
• DNA can be “transcribed” to messenger RNA (mRNA) by DNA-dependent RNA polymerases. Transcription takes place in the nucleus (or equivalent).
• mRNA is transported to the cytoplasm where it is used as a template for creating proteins by “ribosomes” in a process called “translation.”
• The translation process encodes 1 amino acid for each 3 bases (a “triplet”, or codon) in the mRNA sequence.
• The function mapping each of the 64 possible triplets to an amino acid or a stop signal is the “genetic code.”
• Ribosomes are complexes of RNA and protein.

The central model within the cell Diagram from: http://www.ncbi.nih.gov/About/primer/images/proteinsynth4.GIF (Don’t forget about degradation and recycling of AAs.)

The central model in more detail (Graphics of DNA and RNA from Atherly, et al. 1999)

The central model in even more detail (Graphics of DNA and RNA from Atherly, et al. 1999)

Mutations and polymorphisms

               Nucleotide sequence   Translated AA sequence
Wildtype:      ACTGAACTGATT          Thr-Glu-Leu-Ile
Substitution:  ACTGACCTGATT          Thr-Asp-Leu-Ile
Deletion:      ACTCTGATT             Thr-Leu-Ile
Insertion:     ACTGAACCTGAACTGATT    Thr-Glu-Pro-Glu-Leu-Ile

• If mutations like these occur in genetic material within germ cells (e.g. oocytes), they may be transmitted to offspring, and define “polymorphic” gene variations.
• A Single Nucleotide Polymorphism (SNP) is a variation where one base is changed and passed on to offspring (and occurs with sufficient frequency).
• A Deletion/Insertion Polymorphism (DIP) is a variation where multiple bases have been removed from or inserted into a sequence.
• dbSNP is a database of SNPs and DIPs containing millions of entries, and over 120K unique sequences that are inserted or deleted.
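The effect of each mutation on the translated sequence can be checked mechanically by reading the nucleotide string three bases at a time. The following is a minimal sketch, not part of the original slides: the class name CodonDemo is hypothetical, and the codon table is deliberately partial, holding only the six triplets that occur in the example sequences (a full table has 64 entries).

```java
import java.util.HashMap;
import java.util.Map;

class CodonDemo {
    // Partial codon table (assumption: only the triplets used in the slide's examples).
    private static final Map<String, String> CODE = new HashMap<>();
    static {
        CODE.put("ACT", "Thr");
        CODE.put("GAA", "Glu");
        CODE.put("GAC", "Asp");
        CODE.put("CTG", "Leu");
        CODE.put("ATT", "Ile");
        CODE.put("CCT", "Pro");
    }

    // Translate a DNA coding sequence one triplet at a time.
    // Triplets missing from the partial table are rendered as "???".
    static String translate(String dna) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(CODE.getOrDefault(dna.substring(i, i + 3), "???"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ACTGAACTGATT"));        // Thr-Glu-Leu-Ile
        System.out.println(translate("ACTGACCTGATT"));        // Thr-Asp-Leu-Ile
        System.out.println(translate("ACTCTGATT"));           // Thr-Leu-Ile
        System.out.println(translate("ACTGAACCTGAACTGATT"));  // Thr-Glu-Pro-Glu-Leu-Ile
    }
}
```

Because the deletion and insertion shown here remove or add whole triplets, the reading frame downstream is preserved; an insertion or deletion whose length is not a multiple of three would shift every following codon.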

Exons, introns and isoforms in eucaryotes

Exons, introns and isoforms II Alternative splicing products (isoforms) can be derived from the same gene, so that one gene can code for multiple proteins. Both protein-coding and non-protein-coding genes may be embedded within introns, and may be “co-expressed.” The spliceosome is composed of a collection of protein and small nuclear RNA molecules (snRNA). Almost every human gene is thought to have at least 2 isoforms. The set of all isoforms is sometimes called the “transcriptome.”

Scale of human genome data
• Total number of bases: 3.2Gbp (DNA from one half of one chromosome (chromatid) from each of 24 chromosomes: 22 autosomal chromosome pairs plus the sex chromosomes.)
• Percentage of genome consisting of protein coding genes: < 2%
• Average gene length: ~3Kbp (but up to 2.4Mbp)
• Average exon length: 200bp
• Average protein length: 500-600AA
• Percentage of “junk” DNA: often said to be ~50%
• Percentage of “junk” DNA now suspected to be transcribed (the “dark matter” of the genome): ~50 to 100%
• Some of that junk is miRNA (microRNA) that negatively regulates translation.

The “promoter” region: A landing site for a procaryotic RNA Polymerase

Transcription factors, activators, enhancers: What is a “gene”? Such sites may be several thousand base pairs upstream of a start site, and even downstream of a start site. Some are even in introns. Control of cell processes occurs at every step in the protein lifecycle: transcription, translation, transport, degradation.

“We can no longer think of a gene as a simple region of DNA that transcribes RNA for the sole purpose of making proteins. The reality is that a single gene may be a large region of DNA from which a whole cast of RNA molecules are transcribed, all of which are expressed in a coordinated fashion to provide a biological function.” Tom Gingeras, Affymetrix

Process control: cancer-related reaction pathways from Hanahan, et al.

Basic relational algebra
• The relational algebra operates on relations, which are sets of tuples of the same arity, which is to say, collections of lists of the same length. Here are two 4-tuples:
  • ( 1, 2, 3, 4 )
  • ( 8, 7, 9, 4 )
• Relations are commonly represented as tables.
• There are 5 primitive operations within the relational algebra:
  • Projection: extract specific columns from a relation
  • Selection: extract specific rows
  • Set union: create a new table composed of all the rows of two other tables
  • Set difference: remove the rows in one relation that appear in another
  • Cartesian product: “multiply” two tables to create a third
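The first four primitives can be tried out on small in-memory relations. This is an illustrative sketch only, not how DB2 evaluates queries: the class name RelAlgebraDemo and its helper methods are hypothetical, a relation is modeled as a set of integer tuples, and the second relation is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class RelAlgebraDemo {
    // Build a relation (a set of tuples) from rows of ints.
    static Set<List<Integer>> relation(int[][] rows) {
        Set<List<Integer>> r = new LinkedHashSet<>();
        for (int[] row : rows) {
            List<Integer> tuple = new ArrayList<>();
            for (int v : row) {
                tuple.add(v);
            }
            r.add(tuple);
        }
        return r;
    }

    // Projection: keep only the listed column positions (0-based).
    static Set<List<Integer>> project(Set<List<Integer>> r, int... cols) {
        Set<List<Integer>> out = new LinkedHashSet<>();
        for (List<Integer> tuple : r) {
            List<Integer> p = new ArrayList<>();
            for (int c : cols) {
                p.add(tuple.get(c));
            }
            out.add(p);
        }
        return out;
    }

    // Selection: keep only the tuples that satisfy the condition.
    static Set<List<Integer>> select(Set<List<Integer>> r, Predicate<List<Integer>> cond) {
        Set<List<Integer>> out = new LinkedHashSet<>();
        for (List<Integer> tuple : r) {
            if (cond.test(tuple)) {
                out.add(tuple);
            }
        }
        return out;
    }

    // Set union: all tuples of both relations, duplicates collapsed.
    static Set<List<Integer>> union(Set<List<Integer>> a, Set<List<Integer>> b) {
        Set<List<Integer>> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return out;
    }

    // Set difference: tuples of a that do not appear in b.
    static Set<List<Integer>> difference(Set<List<Integer>> a, Set<List<Integer>> b) {
        Set<List<Integer>> out = new LinkedHashSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        Set<List<Integer>> r = relation(new int[][] {{1, 2, 3, 4}, {8, 7, 9, 4}});
        Set<List<Integer>> s = relation(new int[][] {{8, 7, 9, 4}, {5, 5, 5, 5}});
        System.out.println(project(r, 0, 3));              // [[1, 4], [8, 4]]
        System.out.println(select(r, t -> t.get(0) > 5));  // [[8, 7, 9, 4]]
        System.out.println(union(r, s));                   // [[1, 2, 3, 4], [8, 7, 9, 4], [5, 5, 5, 5]]
        System.out.println(difference(r, s));              // [[1, 2, 3, 4]]
    }
}
```

Using a set rather than a list for each relation makes duplicate elimination automatic, which matches the set-based definition of a relation given above.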

Cartesian product in more detail Table1 (arity 4; length 3) Table2 (arity 3; length 2) Cartesian product (arity: 4 + 3; length: 3 * 2)
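The arity and length arithmetic for the Cartesian product can be confirmed with a small sketch. The table contents below are placeholders invented for illustration (the slide's own tables were shown as an image), and the class name CartesianProductDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CartesianProductDemo {
    // Pair every row of a with every row of b, concatenating their columns.
    static List<List<String>> product(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> rowA : a) {
            for (List<String> rowB : b) {
                List<String> row = new ArrayList<>(rowA);
                row.addAll(rowB);
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Table1: arity 4, length 3 (placeholder values).
        List<List<String>> table1 = Arrays.asList(
            Arrays.asList("a1", "a2", "a3", "a4"),
            Arrays.asList("b1", "b2", "b3", "b4"),
            Arrays.asList("c1", "c2", "c3", "c4"));
        // Table2: arity 3, length 2 (placeholder values).
        List<List<String>> table2 = Arrays.asList(
            Arrays.asList("x1", "x2", "x3"),
            Arrays.asList("y1", "y2", "y3"));

        List<List<String>> result = product(table1, table2);
        System.out.println("length: " + result.size());        // length: 6
        System.out.println("arity: " + result.get(0).size());  // arity: 7
        for (List<String> row : result) {
            System.out.println(row);
        }
    }
}
```

A join can be viewed as this product followed by a selection that keeps only the rows whose key columns agree, which is why an unconstrained product over large tables grows so quickly.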

Relational databases and query languages
• Database management systems based on the relational algebra were described by Edgar F. Codd working for IBM in the early 1970s.
• Codd’s formulation included:
  • indexes and keys,
  • decomposition into normal forms, and
  • integrity constraints.
• Multiple languages and interfaces were developed to query and modify collections of relations, among them the Structured English Query Language, SEQUEL, developed by Chamberlin and Boyce.

SQL as an implementation of the relational algebra
The most successful such language, SQL, was based on SEQUEL, and maps to the relational primitives as follows:
• Projection:
    select fieldname_list from tablename
• Selection:
    select * from tablename where <Boolean expression>
• Union (use ALL to keep duplicates):
    (select fieldname_list from tablename1) union (select fieldname_list from tablename2)
• Set difference:
    (select * from tablename1) except (select * from tablename2)
• Cartesian product:
    select * from tablename1, tablename2
Note that SQL does not specify how to perform a query; only what the result should be. It is a “declarative,” rather than “procedural,” language.

IBM’s DB2 and WebSphere Federated Server, nee Information Integrator, nee DiscoveryLink
• DB2 is a fully-featured relational database system that can house and serve large databases.
• Data is usually imported in relational form, structured as rows composed of individual data values, possibly identified by unique IDs (keys).
• DB2 can also access data in tables managed by other, usually physically remote, database management systems, such as Oracle, MySQL or DB2. This process is known as “data federation.”
• DB2 can also federate some external resources that are not normally accessed as relational tables (e.g. Blast). Such resources are transformed, or “relationalized,” on-the-fly by “wrappers”.
• Once these resources have been registered with their wrappers they may be referred to within SQL queries as is any other resource.

WFS diagram from Del Prete

Some WFS jargon
• Wrapper: a library to access a particular class of data sources or protocols. Each wrapper contains information about data source characteristics. There are BLAST and PubMed wrappers, and now a “generic Script wrapper” that talks to user scripts.
• Server: represents a specific data source (user mappings may be required for authentication)
• Nickname: a local table name (alias) for data on a server (mapped to rows and columns)
A nickname looks like a table, but links to a server, which links to a wrapper/data source, where the wrapper knows how to process the data from the source.

Use of the generic Script wrapper (Drawing design courtesy of Doug Del Prete)

Using NCBI data within DB2: More than just mirroring • Mirroring usually implies maintaining exact copies of data sources. • Most data mirrored by CLSD must not only be copied, but also inserted into the CLSD relational structure. • This is accomplished by a series of scripts that: • Download the data from its external site, • Convert it to a form that can be used to update CLSD tables, • Insert the data into tables, and • Monitor the overall process to identify and log errors. • These scripts are run regularly from crontab entries, and monitoring results are examined after every run.

CLSD “relationalized” data sources
• BIND -- Pathways, Gene interactions
• ENZYME -- Enzyme nomenclature
• ePCR -- ePCR results of UniSTS vs Homo sapiens
• KEGG data sources:
  • LIGAND -- Pathways, Reactions, & Compounds
  • PATHWAY -- Pathway map coordinates
• NCBI data sources:
  • LocusLink -- Genetic Loci. (LocusLink has been inactive since July 1, 2005 when it was retired in favor of Entrez Gene.)
  • UniGene -- Gene clusters
• SGD -- Saccharomyces Genome Database

KEGG datasource info
• PATHWAY: 42,273 pathways generated from 306 reference pathways
• LIGAND: 14,238 compounds, 4,111 drugs, 10,951 glycans, 6,810 reactions, 7,127 reactant pairs

CLSD federated data sources
• Federated NCBI data sources (subject to hit rate throttling):
  • Nucleotide -- Nucleotide sequences
  • PubMed -- Journal abstracts
• Federated local mirrors of NCBI data sources (not throttled):
  • Blast (updated monthly) is mirrored by UITS
  • dbSNP (updated at major builds) is mirrored by IUSM
• Some KEGG resources are federated via the FS KEGG user-defined functions

BLAST: Both mirrored and federated
NCBI Blast is typically accessed via a web page at NCBI, or some mirrored site. Data is returned in a typical web interface format suitable for users.
Within CLSD, BLAST is accessed via an SQL query and data is returned as a table that can be manipulated as is any other DB2 table.
For example, here is an SQL query that invokes a blastall process running on libra00 from within DB2:
    select GB_ACC_NUM, description, e_value
    from ncbi.BLASTN_NT
    where BlastSeq = 'AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC'
The local version of blastall conducts the search and returns results encoded within XML (by specifying the -m 7 parameter).

The DB2 federation software converts the XML encoded results into something like this:

GB_ACC_NUM (VARCHAR) | DESCRIPTION (VARCHAR) | E_VALUE (DOUBLE)
AE003644 | Drosophila melanogaster chromosome 2L, section 53 of 83 of the complete sequence | 0.00666475
AE003410 | Drosophila melanogaster, chromosome 2L, region 34C4-36A7 (Adh region), section 4 of 10 of the comple | 0.00666475
AC092228 | Drosophila melanogaster, chromosome 2L, region 35X-35X, BAC clone BACR21J17, complete sequence | 0.00666475
AP008207 | Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1, complete sequence | 0.0263349
AP003197 | Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1, BAC clone:B1015E06 | 0.0263349
AP003105 | Human DNA sequence from chromosome 1, putative argumentativeness gene GROBE1 | 0.0263349

Modifying BLAST search settings via SQL
Parameters sent to blastall can be set by using equality comparisons as assignment statements within SQL conditionals, as in:
    select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq
    from ncbi.BLASTN_NT
    where BlastSeq = 'gagttgtcaatggcgagg'
      and gapcost = 8
      and E_Value < .0005
which will pass gapcost and e-value settings on to blastall.

BLAST data sources available via CLSD
Here is a list showing which search types are supported by the DB2 BLAST wrapper within CLSD.
BLAST search type: Data sources
• BLASTN: NT, EST_HUMAN, EST_MOUSE, and EST_OTHER
  A nucleotide sequence is compared with the contents of a nucleotide sequence database.
• BLASTP: NR, SP
  An amino acid sequence is compared with the contents of an amino acid database.
• BLASTX: NR, SP
  A nucleotide sequence is compared with the contents of an amino acid sequence database. Query is translated in all six reading frames.

User-defined functions (supplied by IBM)
• There exist special functions for manipulating sequence patterns:
  • LSPatternMatch
  • LSPrositePattern
• To get a list of (aspartate aminotransferase) BLAST results filtered by a (pyridoxal phosphate attachment site) pattern specified in PROSITE pattern language:
    select gb_acc_num, HSP_H_SEQ
    from ncbi.blastp_nr
    where blastseq='MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTE\
    AFKKDTNPKKINLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKEYATIIGI\
    PEFYNKAIELALGKGSKRLAAKHNVTAQSISGTGALRIGAAFLAKFWQGNREI\
    YIPSPSWGNHVAIFEHAGLPVNRYRYYDKDT'
      and DB2LS.LSPatternMatch(HSP_H_SEQ,
            DB2LS.LSPrositePattern(
              '[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].' ) ) > 0
• Note the use of the period (.) to terminate the PROSITE pattern, and that the LSPatternMatch function returns the character position of the left-most substring matching the pattern, or zero if there is no match.
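To make the pattern semantics concrete, here is a rough stand-alone sketch that converts a simple PROSITE pattern into a java.util.regex expression and returns the 1-based position of the left-most match, or zero if there is none, mirroring the behaviour described above. It is not IBM's implementation of LSPatternMatch or LSPrositePattern: the class and method names are hypothetical, only a subset of the PROSITE syntax is handled (residues, [..] classes, {..} exclusions, x, repeat counts, and < > anchors), and the test sequence is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrositeDemo {
    // Convert a simple PROSITE pattern into an equivalent regular expression.
    static String toRegex(String prosite) {
        String p = prosite.trim();
        if (p.endsWith(".")) {
            p = p.substring(0, p.length() - 1);  // drop the terminating period
        }
        StringBuilder sb = new StringBuilder();
        for (String element : p.split("-")) {
            String e = element;
            if (e.startsWith("<")) {             // anchored at the N-terminus
                sb.append('^');
                e = e.substring(1);
            }
            boolean anchoredEnd = e.endsWith(">");
            if (anchoredEnd) {                   // anchored at the C-terminus
                e = e.substring(0, e.length() - 1);
            }
            String count = "";
            int paren = e.indexOf('(');
            if (paren >= 0) {                    // x(2) -> .{2}, x(2,4) -> .{2,4}
                count = "{" + e.substring(paren + 1, e.length() - 1) + "}";
                e = e.substring(0, paren);
            }
            if (e.equals("x")) {
                sb.append('.');                  // any residue
            } else if (e.startsWith("{")) {
                sb.append("[^").append(e, 1, e.length() - 1).append(']');  // exclusion
            } else {
                sb.append(e);                    // single residue or [..] class
            }
            sb.append(count);
            if (anchoredEnd) {
                sb.append('$');
            }
        }
        return sb.toString();
    }

    // 1-based position of the left-most match, or 0 if the pattern does not occur.
    static int match(String sequence, String prosite) {
        Matcher m = Pattern.compile(toRegex(prosite)).matcher(sequence);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        String pattern = "[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].";
        System.out.println(toRegex(pattern));              // [GS][LIVMFYTAC][GSTA]K.{2}[GSALVN]
        System.out.println(match("AAGLGKAAGA", pattern));  // 3
        System.out.println(match("AAAAAAAA", pattern));    // 0
    }
}
```

Returning zero for "no match" is what lets the SQL condition "> 0" act as a filter on the BLAST hit sequences.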

Accessing CLSD: getting an account To access CLSD you must have an account on the Libra Cluster at IU (aka libra00.uits.iu.edu). If you don’t have an account and are associated with Indiana University, request an account by filling out a Research Systems Account Application at http://rac.uits.iu.edu/rats/forms/application.php. In the comments section of the account request, add that you need a local and persistent password for use with CLSD. Once you have a Libra account, send email to SDS at data @ indiana.edu and request instructions for defining a local and persistent password for use with CLSD. TeraGrid users should send e-mail to SDS at data @ indiana.edu explaining how CLSD will be used, and describing their TeraGrid activities. SDS will then arrange for an appropriate Libra account and send instructions for defining a suitable password.

Accessing CLSD: options • DB2 can be accessed in a variety of ways: • DB2 Command Line Processor (Unix, Windows) • DB2 Control Center (wherever JRE is running) • DB2 driver for Perl DBI • DB2 drivers for the Java Database Connectivity (JDBC) Application Program Interface (API), especially the JDBC Universal Driver • Demonstration Web pages (invoke a Java servlet that uses JDBC): • http://discover.uits.indiana.edu:8421/access/ • Demonstration WebService (invoked as a function call via JAX-RPC): • http://discover.uits.indiana.edu:8421/axis/CLSDservice.jws?wsdl • Demonstration Web pages (invoke a Java servlet that invokes the CLSD • WebService): • http://discover.uits.indiana.edu:8421/access/index-for-service.html • Experimental WSRF Resource (using WSRF within a GT4 container) • Experimental OGSA-DAI service (running within a GT4 container)

JDBC access
• Connect to the CLSD:
    Class.forName( "com.ibm.db2.jcc.DB2Driver" );
    con = DriverManager.getConnection(
        "jdbc:db2://libra00.uits.iu.edu:50000/clsd2",
        accountName, accountPassword );
• Prepare a query, send it to the db, and receive a result:
    statement = con.createStatement();
    resultSet = statement.executeQuery( query );
• Get some query meta-data (column labels and column data types):
    ResultSetMetaData rsmd = resultSet.getMetaData();
    result = rsmd.getColumnLabel( colCount );
    result2 = rsmd.getColumnTypeName( colCount );

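JDBC access (continued)
Get a row of data:
    // numcols is the column count, e.g. from resultSet.getMetaData().getColumnCount().
    // JDBC column indexes start at 1, not 0.
    for( int colCount = 1; colCount <= numcols; colCount++ ) {
        String returnedString = ""; // Must be predefined.
        // Appending "" turns a null column value into the string "null".
        returnedString = resultSet.getString( colCount ) + "";
        out.println( "<td>" + returnedString + "</td>\n" );
    }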

Accessing CLSD thru a WebService (JAX-RPC)
The Java API for XML-based Remote Procedure Calls, or JAX-RPC, is a specification that defines a system for building distributed services (so-called “WebServices”) within the client-server model.
JAX-RPC makes it possible for a function invocation in a client like:
    a_variable = function_name( parameter_list )
to cause the function, “function_name,” to run on a remote server and return a response containing the value to be assigned to the variable “a_variable”, and a function invocation in a client like:
    returnString = queryCLSD( "select * from syscat.tables", "1", "5",
        "accountName", "accountPassword", "table" )
will return a (possibly very long) string containing the response to the query (given that various linkages have been prearranged).

Outline of the CLSDservice
    public class CLSDservice {
        // Full source at:
        // http://scidata.iu.edu/CLSD/examples/CLSDservice.jws.txt
        public String queryCLSD( String query, String startingRowToPrint,
                                 String maxRows, String account,
                                 String password, String format ) {
            // Get a query string, etc. from the command line or Web
            // browser.
            // Declare JDBC drivers and connect to DB2.
            // Prepare a JDBC statement containing the SQL query, submit
            // it to DB2, and capture the returned JDBC result set.
            // Query result set metadata for column names and types to
            // return as the first row, and then collect the contents of
            // each data row.
            return theResponse;
        } // end queryCLSD
    } // end Class CLSDservice

SOAP and WSDL • JAX-RPC uses SOAP and WSDL to establish the various linkages required to implement remote procedure calls. • SOAP messages are usually encoded as XML messages within HTTP requests where: • A SOAP request is an HTTP POST request with an XML body. • A SOAP response is an HTTP response header followed by an XML body. • Such RPC functions are “exposed” as “operations” when described within web pages using the Web Services Description Language (WSDL).
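As a rough picture of what travels in the body of such an HTTP POST, the sketch below assembles a simplified SOAP 1.1 envelope for a queryCLSD call. It is an illustration only: the class name, the arg0..argN element names, and the omission of type attributes and operation namespaces are assumptions, so the messages actually exchanged with CLSDservice will differ in detail. Only the envelope namespace is taken from the SOAP 1.1 specification.

```java
class SoapEnvelopeDemo {
    // Escape the characters that are not allowed verbatim in XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build a simplified SOAP 1.1 request body for one operation call.
    // Parameter element names (arg0, arg1, ...) are hypothetical.
    static String buildRequest(String operation, String... args) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<soapenv:Envelope xmlns:soapenv=")
          .append("\"http://schemas.xmlsoap.org/soap/envelope/\">\n");
        sb.append("  <soapenv:Body>\n");
        sb.append("    <").append(operation).append(">\n");
        for (int i = 0; i < args.length; i++) {
            sb.append("      <arg").append(i).append('>')
              .append(escapeXml(args[i]))
              .append("</arg").append(i).append(">\n");
        }
        sb.append("    </").append(operation).append(">\n");
        sb.append("  </soapenv:Body>\n");
        sb.append("</soapenv:Envelope>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The account name and password are placeholders.
        System.out.print(buildRequest("queryCLSD",
            "select tabname from syscat.tables where colcount < 5",
            "1", "5", "accountName", "accountPassword", "table"));
    }
}
```

In practice a toolkit such as Axis generates and parses these envelopes, so client code rarely builds them by hand; the point here is only that the remote call reduces to an XML document describing the operation and its arguments.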

Java command-line client to access CLSD via CLSDservice
    import javax.xml.namespace.QName;
    import org.apache.axis.client.Call;
    import org.apache.axis.client.Service;

    public class testCLSDClient {
        public static void main(String [] args) {
            try {
                String endpoint =
                    "http://discover.uits.indiana.edu:8421/axis/CLSDservice.jws";
                // Create an Axis call object bound to the service endpoint.
                Service service = new Service();
                Call call = (Call) service.createCall();
                call.setTargetEndpointAddress( new java.net.URL( endpoint ) );
                call.setOperationName(
                    new QName("http://soapinterop.org/", "queryCLSD" ) );
                // Invoke the remote operation and print the returned string.
                String returnString = (String) call.invoke(
                    new Object[] { "select * from syscat.tables", "1", "5",
                                   "accountName", "accountPassword", "table" } );
                System.out.println( returnString );
            } catch (Exception e) {
                System.err.println(e.toString());
            }
        }
    }

Perl command-line client to access CLSD via CLSDservice
    #!perl -w
    use SOAP::Lite;

    # Set up the call to CLSD using SOAP.
    $host = "discover.uits.indiana.edu";
    $service = SOAP::Lite
        -> service( "http://$host:8421/axis/CLSDservice.jws?wsdl" );

    # Make the call to CLSD.
    $result = $service->queryCLSD(
        "select tabschema,tabname from syscat.tables",
        1, 5, "DB2account", "password", "table" );
    print $result;

OGSA • The Open Grid Services Architecture (OGSA) is an “architecture” for building computational grids. • In particular, OGSA “…defines a set of core capabilities and behaviors that address key concerns in Grid systems.” [2] It does not, however, implement or define how to implement such core capabilities. • OGSA is NOT layered or object oriented. • However, both will be exploited naturally in some implementations. • OGSA provides an architecture for building services such as: • “Service-Based distributed query processing,” • “Grid Workflow”, • “Grid Monitoring Architecture” • etc.

OGSA-DAI
OGSA-Data Access and Integration (OGSA-DAI) is a very flexible and powerful data access framework that can be used within an OGSA grid environment. It provides various data movement, virtualization, and manipulation services that transform the use of data into a higher-level workflow.
The OGSA-DAI client shown in the next slide uses the OGSA-DAI Client Toolkit to send a hard-coded query to CLSD (here known as the “DB2Resource”).
The Toolkit allows clients to use JDBC by creating a JDBC ResultSet object from an OGSA-DAI WebRowSet. The response is encoded using XML and may be retrieved as a single string, or as individual fields by using individual JDBC calls as shown below.

Java command-line client to access CLSD via OGSA-DAI
    public class queryCLSD {
        public static void main(String[] args) throws Exception {
            // Create an instance of the data service.
            String handle =
                "http://localhost:8080/wsrf/services/ogsadai/DataService";
            String id = "DB2Resource";
            DataService service =
                GenericServiceFetcher.getInstance().getDataService( handle, id );
            // Define a request composed of one activity.
            SQLQuery query = new SQLQuery(
                "select tabschema,tabname from syscat.tables" );
            WebRowSet rowset = new WebRowSet( query.getOutput() );
            ActivityRequest request = new ActivityRequest();
            request.add( query );
            request.add( rowset );

Java command-line client to access CLSD via OGSA-DAI 2
            // Submit the request and retrieve results.
            Response response = service.perform( request );
            ResultSet result = rowset.getResultSet();
            ResultSetMetaData rsmd = result.getMetaData();
            int numCols = rsmd.getColumnCount();
            // Display each column from each row.
            while( result.next() ) {
                for( int colCount = 1; colCount <= numCols; colCount++ ) {
                    System.out.print( " " + result.getString( colCount ) );
                }
                System.out.println();
            }
        }
    }

This client displays a small part of the functionality provided by OGSA-DAI. In addition, an OGSA-DAI service can be configured to: • operate on XML or text data sources, as well as relational data sources, • perform a series of operations (also known as “activities”) as part of a single request, • deliver results to a third party (via FTP, GridFTP, SMTP, etc.) or to another data service, • deliver results asynchronously, which can be very useful for long-running requests, and • utilize authentication methods supported by WSRF to provide grid-based security. • Also, exposing a database via OGSA-DAI makes it available for OGSA Distributed Query Processing (OGSA-DQP), so that its use may be further virtualized within the DQP model. • In some cases, however, OGSA-DAI and DQP may introduce performance penalties.

Current and possible directions
• Adding data sources: mirrored and federated
  • Requests for mirroring or federating will be gladly entertained.
  • DB2 now provides a user-configurable script wrapper that connects to a remote DB2 daemon that can start any co-located arbitrary script and return data encoded in XML (restricted to one foreign key per table).
  • Such a script could be built to relay any web resource that returns XML meeting key restrictions.
  • Wrappers could be constructed to relay some OGSA-DAI resources.
• Implementing the OGSA-DAI service in production mode.
• Integrating with the TeraGrid
  • CLSD is currently accessible from the TeraGrid, but authentication is local.
  • It may be possible to enforce TeraGrid based X.509 authentication, using either WSRF or OGSA-DAI interfaces.

References:
• Atherly, Alan G., et al., The Science of Genetics, 1999.
• Apache Foundation, AXIS User’s Guide, http://ws.apache.org/axis/java/user-guide.html
• Codd, Edgar F., A Relational Model of Data for Large Shared Data Banks, http://www.acm.org/classics/nov95/toc.html (See also: http://en.wikipedia.org/wiki/Edgar_F._Codd)
• CLSD web page: http://rac.uits.iu.edu/clsd/
• Foster, Ian, et al., “The Open Grid Services Architecture, Version 1.5”.
• Sotomayor, Borja and Lisa Childers, Globus Toolkit 4: Programming Java Services
• Sundaram, Babu, Understanding WSRF, http://www-128.ibm.com/developerworks/edu/gr-dw-gr-wsrf1-i.html

Questions, comments, suggestions?
