Integration of Biological XML data

Integration of Biological XML data Ph. D. Lecture Bioinformatics & Software Systems Lab. Woo-Hyuk Jang

Where are we? Client-Side Info. Management Business related Issues Internationalization and Privacy WWW Concepts & Web-based Info. Management Web Services HTML, JavaScript, Plug-in, Applet… Semantic Web App. Of Web-based tech. XML & XML Processing S.S. Info. Management Concept CGI, Java Servlets This Lecture ? JDBC, MySQL Server-Side Info. Management

Can you remember? • Problems in Integrating Heterogeneous Information • Heterogeneity of formats, data types, units, or semantics. • Information Mediation Fig 1. Mediator in Lecture 7.

This Lecture Contains… • Information Integration in Bioinformatics • Bioinformatics Overview • Is there any relationship between Web and Bioinformatics? • Difficulties to handle Biological XML data • What is it? Why? • Cultures : Schema-driven, Data-driven • Models : Federation, Warehousing, Mediation • Integration of XML format data • Problems • Issues • Summary

Bioinformatics • A narrow sense • The application of information technology to life science research • Modeling (abstraction) • Analysis and collection • Data integration and information retrieval • Enables the discovery and analysis of biomolecules and their properties (Structure, function, interactions) • A wide sense • The use of computers to collect, analyze, and interpret biological information at the molecular level

Web and Bioinformatics Use Biological Data Bio Applications Make, Publish Use Experiment,Publish Biologist Computer Scientist

Difficulties to Handle Biological XML data • Lack of standard • Different data model and schemas • Different handling methods are needed • Different formats • Monstrous volume of data • It is growing exponentially • Data are updated very frequently • Newly introduced data, error fixed data

Why Integration? • In the post-human genome sequencing era, many analyses on the genome scale are possible • Majority of human diseases are the product of multi-step pathophysiological processes • The biggest challenge in interpreting the results of these analyses lies in the data integration problem

Two Cultures of Integration • Database Integration • Schema level view • Focus on outside of data • Data Integration • Data level view • Focus on inside of data Schema 4 Schema 3 Schema 1 Schema 2 Data 1 Data 2 Data 3 Data 4

Two Cultures of Integration • Schema-driven (computer scientists) • Much smaller than data, (hopefully) well-defined elements • Resolve redundancy and heterogeneity at the schema level • High degree of automation once system is set-up • Focus on methods - you rarely publish a “data paper” • Data-driven (biologists) • Value is in the data, abstraction is a result of analysis • Don‘t bother with schemas • Abstraction is volatile and depends on experimental technique • Manual integration at data level, constant high effort • You rarely publish a (database) “method paper”

Models of Integration • Federation (Multi-database) • Warehousing (Materialized in house) • Mediation (Virtual integration)

Models of Integration • Federation (Multi-database) • K2/BioKleisli, Entrez

Models of Integration • Warehousing (Materialized in house) • GUS (Genome Unified Schema), SRS (Sequence Retrieval System) Local Operational Network Internet - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Warehouse Decision Support & Mining Integration & Storage R2 R3

Models of Integration • Mediation (Virtual integration) • TAMBIS (Transparent Access to Multiple Bioinformatics Information Source) Network Internet Query Translation Mediator

Models of Integration • Federation represents a more “static” approach – using agreed couplings to allow view creation. • Warehousing and Mediation addresses integration in a more “dynamic” way – using extraction, transformation and integration processes.

Warehousing vs. Mediation • Warehouse • Update-driven: i.e. in warehouse repository • Heterogeneous data is integrated in advance and stored in-house for direct query and analysis. • Mediation • Wrapper and Mediator layeron top of source DBs. • Query-driven: Query to mediated schema then translated into queries appropriate to sources. • Results integrated into a global answer set.

Now let’s study the… • Information Integration in Bioinformatics • Bioinformatics Overview • Is there any relationship between Web and Bioinformatics? • Difficulties to handle Biological XML data • Why Integration? • Cultures : Schema-driven, Data-driven • Models : Federation, Warehousing, Mediation • Integration of XML format data • Problems • Issues • Discussion about Reading Question #6

Integration of XML format data • Why XML? • Biology is a complex discipline • Wide variety of data resources and repositories • No standard protocol exists to interrogate biological data stores • No standard data format exists to exchange biological data. • No standard data model exists. • Difficulties in using and exchanging data • There exist various tools that can support XML handling

Integration of XML format data • Problems • We focus on schema-driven integration • Warehousing model is efficient • Have to analyze data • Performance • To implement perfect mediation model is extremely difficult • XML data should be converted into RDB • We want to make our own DB schema accommodating the data from XML files • We need to make the DB schema regarding efficiency and our own purpose • Heterogeneity and Large scale

XML XML XML XML Sequence Structure Function Domain ٠ ٠ ٠ Integration of XML format data • PreSPI (Prediction System for Protein Interaction) Warehouse Local DB1 Local DB2 Local DB3 Integration Rule General XML Wrapper (SAX) Local Web

Issues of Using XML Biological data • Structure • Semi-structured: Can be expressed as trees, graphs • Theoretically, it is ideal to map them into DB regarding structural feature • Method for storing XML • File system • Has overhead for query • Text file, invert list, compression file • Specific storing method • Use XML’s own structure • DB system • Especially, mapping into RDB has been researched a lot • Has overhead for converting into the appropriate model

Issues of Using XML Biological data • Object view of the XML • use DOM • A Class can be mapped into a Table, PCDATA or ATTRIBUTE can be column • XML Objects Tables • ============= ============ ============== • Table A • <A> object A { ---------------------- • <B>bbb</B> B = "bbb" B C D • <C>ccc</C> <=> C = "ccc" <=> -- --- --- • <D>ddd</D> D = "ddd" ... ... ... • </A> } bbb ccc ddd • ... ... ... • XML-view • CREATE XMLVIEW xview_1( id char(20), email char (30) ) • AS ( ‘select p.personnel.person@id, p.personnel.person@email • from “file:/home/user1/personal.xml”, p; ‘); • “A generic load/extract utility for data transfer between XML documents and relational databases” Bourret, R.; Bornhovd, C.; Buchmann, A.;Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.

Issues of Using XML Biological data • Direct method Output & execute XML Document input XML Saver Insert Statement input Mapping Rule  “A direct method of data exchange between XML and relational database” Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui;Information Technology Interfaces, 2004. 26th International Conference on 2004 Page(s):127 - 132 Vol.1

Issues of Using XML Biological data • Direct Method (cont’d)

Current methods force DB to follow XML schema Complex structured XML Share the same element name even thought they should be different columns in DB (DIP, InterPro…) Large size of file; we cannot use DOM XML updated frequently; the process should be easy Issues of Using XML Biological data Rather than <protein id=“ID_A" name=“PROTEIN_A“> <ref db=“B" id=“ID_B" /> <ref db=“C" id=“ID_C" /> ………..

Issues of Using XML Biological data • Direct Method cannot cover following XML type • Cannot integrate two more files ; Needs constraint <node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein"> <xref db="DIP" id="232N" type="src"/> <feature name="swp_ref" class="cref"> <src>SwissProt</src> <val>SWP:Q07812</val> <xref db="SWP" id="Q07812" type="src"/> </feature> <feature name="pir_ref" class="cref"> <src>PIR</src> <val>PIR:A47538</val> <xref db="PIR" id="A47538" type="src"/> </feature> <feature name="gi_ref" class="cref"> <src>NCBI</src> <val>GI:539664</val> <xref db="gi" id="539664" type="src"/> </feature> <att name="descr"> <val>bcl-2-associated protein x, alpha splice form</val> </att> <att name="organism"> <val>Homo sapiens</val> <xref db="TXID" id="9606" type="ont"/> </att> </node> We want But,

Issues of Using XML Biological data • Make a data set for a tuple, which ignore sub document treenodes • Define SQL like syntax • Where condition of each column for constraints • Multiple files can be populated into one table by manipulation CREATE TABLE PROTEIN_IDs(ID_A CHAR(20), NAME CHAR(20), B CHAR(20), C CHAR(20), D CHAR(20) , E CHAR(20) ) AS ( SELECT ( FILE.protein@id, FILE.protein@name, [FILE.protein.ref]@id WHERE @db = B, [FILE.protein.ref]@id WHERE @db = C, [FILE.protein.ref]@id WHERE @db = D, [FILE.protein.ref]@id WHERE @db = E, [FILE_2.ELEMENT]@value WHERE @id=ID_A ) FROM “file/protein.xml” AS FILE, “file/file.xml” AS FILE_2);

Summary • Integration of biological data is a kind of Web based information management • Integration in bioinformatics is a very important work because we can find out more valuable biological information via comprehensive view • Biological XML data have some properties that disturb integration, so schema-driven and warehousing model are usually used for integration Thank you~~~

Integration of Biological XML data

Integration of Biological XML data

Presentation Transcript

The Integration of Biological Data Using Semantic Web Technologies

XML Data

XML Data

Biological Data Integration

Towards Seamless Integration and Querying of Biological Data

XML Schema Integration

Secure of historical biological data Assembly of current biological data

The Role of XML in Cloud Data Integration

XML Integration Testing

XML Data

Indexing of XML Data

STRING Modeling of biological systems through cross-species data integration

BioGrid: Integration of Biological Data Grid and Computing Grid

A probabilistic XML approach to data integration

Data integration via XML

Data Integration and Extraction over Molecular Biological Data

Integration of Biological Data (LifeDB)

An Advanced Strategy for Integration of Biological Measurement Data

XML API Integration

XML Data

STRING Modeling of biological systems through cross-species data integration