Genome Data and Tool Interoperation over the “Semantic” Web

By Kei-Hoi Cheung, Ph.D., Assistant Professor, Yale Center for Medical Informatics. MB&B 452b/752b, April 20, 2005, Yale University.


  1. Genome Data and Tool Interoperation over the “Semantic” Web By Kei-Hoi Cheung, Ph.D. Assistant Professor Yale Center for Medical Informatics MB&B 452b/752b, April 20, 2005, Yale University

  2. Outline • Introduction • Semantic Web • Resource Description Framework (RDF) • Life Sciences Identifiers (LSID) • YeastHub: yeast genome data interoperation • Web Services for tool interoperation • Collaborative projects • Biosphere • Taverna • Semantic Web Services • Conclusion • Future directions

  3. Eras of Computing • Mainframe computing (many people share one computer) • Personal computing (one person uses one computer) • Ubiquitous computing (one person is served by many computers over the network) • Client/server computing, grid computing, peer-to-peer computing, distributed/parallel computing, component-based computing, etc • World Wide Web (WWW) is one of the main driving forces • It provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics

  4. The World Wide Web • On the order of 10^8 users • Used in every country on Earth • On the order of 10^10 indexed web resources (text) in Google, etc. • Essentially infinite if one includes “dynamic” web pages • Massively distributed and open

  5. It is difficult to keep track of these resources

  6. Data Heterogeneity • Data are exposed in different ways • Programmatic interfaces • Web forms or pages • FTP directory structures • Data are presented in different ways • Structured text • Tab delimited format, XML format, etc • Free text • Binary • Images • Naming conflicts (e.g., synonyms and homonyms)

  7. Tool heterogeneity • Server applications • Web server applications • Application programming interfaces (API) • Client applications (downloadable software) • Different programming languages • Different operating systems

  8. From Web to Semantic Web • Human processing → Machine processing • Free text description → Ontological description • HTML → XML → RDF or its extensions • Metadata!

  9. HTML Example • Readme (column descriptions): Col 1: pedigree id • Col 2: person id • Col 3: father id • Col 4: mother id • Col 5: sex • Col 6: status • HTML source: <html> <body> … <a href="http://ycmi.med.yale.edu/ped_readme.html">Readme</a> <table> <tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> … </tr> … </table> … </body> </html> • Rendered table (one row per person): 1 1 0 0 1 1 | 1 2 0 0 2 0 | 1 3 1 2 2 0 | 1 4 1 2 1 0 | 1 5 1 2 1 1 | 1 6 1 2 1 0

  10. XML Example

  11. Other Advantages of Using XML • It is simple, hierarchical, self-describing, and computer-readable • It can be validated using a DTD or XML Schema • It is a W3C standard • It has a large base of software support (both commercial and public domain software tools) • Editing tools, DOM, SAX, XSL, etc.

  12. Proliferation of Bio-XML Formats • Domains: sequence, microarray/gene expression, pathway • Formats: BSML, MAML, BIND, SBML, PSI-MI, AGAVE, GEML, MAGE-ML • RDF (e.g., BioPax) • Semantically rich ontologies • Reasoning (machine intelligence)

  13. Definition of an Ontology • Conceptualization of a domain of interest • Concepts, relations, attributes, constraints, objects, values • An ontology is a specification of a conceptualization • Formal notation • Documentation • A variety of forms, but includes: • A vocabulary of terms • Some specification of the meaning of the terms • Ontologies are defined for reuse

  14. Roles of Ontologies in Bioinformatics • Success of many biological DBs depends on • High fidelity ontologies • Clearly communicating their ontologies • Prevent errors on data entry and interpretation • Common framework for multidatabase queries • Controlled vocabularies for genome annotation • GO • EC numbers • Information-extraction applications • Reuse is a core aspect of ontologies • Reuse of existing ontologies is faster than designing new ones • Reuse decreases semantic heterogeneity of DBs • Schema-driven software • Knowledge-acquisition tools • Query tools

  15. Example Bio-ontologies • Gene Ontologies • http://www.geneontology.org/ • MGED Ontologies • http://mged.sourceforge.net/ • Open Biomedical Ontologies (OBO) • http://obo.sourceforge.net/

  16. Are current bio-ontologies adequate?

  17. Ontology desiderata • Precision • Formal, unambiguous • High fidelity • Explicitness • Clarity • Commitment • Reuse • Systematic • Quality • Clarity • Flexibility • Expressivity • Evolution • Machine computable

  18. Semantic Web • It provides a common framework that allows semantic interoperability among multiple resources through the use of ontologies • It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners • It is based on the Resource Description Framework (RDF)

  19. Resource Description Framework (RDF) • It is a standard data model (a directed, labeled graph) for representing information (metadata) about resources on the World Wide Web • In general, it can be used to represent information about “things” that can be identified (using URIs) on the Web • It is intended to provide a simple way to make statements (descriptions) about Web resources

  20. RDF Statement • An RDF statement consists of: • Subject: a resource identified by a URI • Predicate: a property (as defined in a namespace identified by a URI) • Object: a property value (literal) or another resource • For example, the “dbSNP Website” is a subject, “creator” is a predicate, and “NCBI” is an object • A resource can be described by multiple statements
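
The subject/predicate/object structure of a statement can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the dbSNP URI, the use of Dublin Core property URIs, and the second ("language") statement are assumptions made so that the example has more than one statement about the same resource.

```python
from typing import Dict, List, NamedTuple


class Statement(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property
    obj: str        # a literal value or the URI of another resource


# Hypothetical URIs chosen for illustration only.
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP/"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_LANGUAGE = "http://purl.org/dc/elements/1.1/language"

statements = [
    Statement(DBSNP, DC_CREATOR, "NCBI"),   # the example from the slide
    Statement(DBSNP, DC_LANGUAGE, "en"),    # assumed second statement
]


def describe(resource: str, stmts: List[Statement]) -> Dict[str, str]:
    """Collect every property/value pair stated about one resource."""
    return {s.predicate: s.obj for s in stmts if s.subject == resource}


description = describe(DBSNP, statements)
print(description)
```

Keeping each statement as an independent triple is what lets several statements about the same resource be added or removed without touching the others.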

  21. Graphical Representation

  22. RDF/XML Representation • <?xml version="1.0"?> • <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" • xmlns:dc="http://purl.org/dc/elements/1.1/" • xmlns:ex="http://www.example.org/terms/"> • <rdf:Description rdf:about="http://www.example.org/index.html"> • <dc:creator rdf:resource="http://www.example.org/staffid/85740"/> • <dc:language>en</dc:language> • <ex:creation-date>August 16, 1999</ex:creation-date> • </rdf:Description> • </rdf:RDF>
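
To show how RDF/XML maps back to statements, the following sketch reads a small RDF/XML document with Python's standard `xml.etree.ElementTree` and lists its triples. It is a minimal illustration under stated assumptions: it handles only plain `rdf:Description` elements with `rdf:about` and simple property children (a real RDF parser covers far more of the syntax), and the subject URI `http://www.example.org/index.html` is assumed from the W3C RDF Primer example that this slide appears to follow.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Well-formed version of the example; the subject URI is an assumption.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
    <dc:language>en</dc:language>
    <ex:creation-date>August 16, 1999</ex:creation-date>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Return (subject, predicate, object) tuples for a simple RDF/XML subset."""
    root = ET.fromstring(rdf_xml)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local".
            ns, local = prop.tag[1:].split("}")
            predicate = ns + local
            # The object is either a resource URI or the literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            out.append((subject, predicate, obj))
    return out


result = triples(RDF_XML)
for triple in result:
    print(triple)
```

Note that each predicate URI is simply the namespace URI concatenated with the local element name, which is why the namespace URIs need their trailing slash.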

  23. Data Integration Using RDF • GenBank: humanhemoglobin → derives from → atagccgtacctgcgagtctagaagct • Gene Ontology: humanhemoglobin → is a → oxygentransportprotein • Protein Data Bank: humanhemoglobin → has 3D structure → (3D structure) • Unified view: the three graphs merge on the shared node humanhemoglobin, which then carries all three properties (derives from, is a, has 3D structure)
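
The merge shown on this slide can be sketched with plain Python sets of triples. This is an illustration only: real RDF would use URIs rather than the short names from the slide, and the string standing in for the 3D structure is a placeholder, since the slide shows that object graphically.

```python
# Each source contributes a small graph, written as a set of
# (subject, predicate, object) triples using the slide's short names.
genbank = {("humanhemoglobin", "derives from", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {("humanhemoglobin", "is a", "oxygentransportprotein")}
# Placeholder object: the slide shows the structure as a picture.
protein_data_bank = {("humanhemoglobin", "has 3D structure", "structure-placeholder")}

# Because all three graphs name the subject identically, merging them
# is simply a set union; no schema mapping is needed.
unified = genbank | gene_ontology | protein_data_bank


def properties_of(subject, graph):
    """Return every predicate/object pair attached to one subject."""
    return {p: o for s, p, o in graph if s == subject}


view = properties_of("humanhemoglobin", unified)
print(view)
```

The union only works because the sources agree on the name of the shared node, which is the problem that common identifiers are meant to solve.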

  24. Reification • Making statements about statements • For example, GenBank provides the following statement: “human hemoglobin derives from atagccgtacctgcgagtctagaagct” • Example: <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:s="http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=29436"> <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/Genbank"> <s:derive_from rdf:ID="statement1"> atag… </s:derive_from> </rdf:Description> <rdf:Description rdf:about="#statement1"> <s:providedBy>GenBank</s:providedBy> </rdf:Description> </rdf:RDF>
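
In the RDF vocabulary, a reified statement is itself a resource of type `rdf:Statement` with `rdf:subject`, `rdf:predicate`, and `rdf:object` properties, and provenance is attached to that resource. The sketch below spells this out as plain triples. The `ex:` names and the `#statement1` identifier are hypothetical stand-ins for full URIs, used only to keep the example short.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id, subject, predicate, obj):
    """Describe one statement as a resource, using the RDF reification vocabulary."""
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


# Hypothetical short names standing in for full URIs.
graph = reify(
    "#statement1",
    "ex:humanhemoglobin",
    "ex:derive_from",
    "atagccgtacctgcgagtctagaagct",
)

# The statement about the statement: who provided it.
graph.append(("#statement1", "ex:providedBy", "GenBank"))

for triple in graph:
    print(triple)
```

Once the statement has its own identifier, any number of further statements (source, date, confidence) can be made about it in the same way.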

  25. Other RDF-Based Ontology Languages • RDFS • DAML+OIL • OWL

  26. Life Science Identifiers (“LSID”) Addresses Data Access Problems • LSID is a naming standard for distributed data, specifically: • Scientifically significant data • Geographically distributed • Files, database records, and data objects managed by N-tier applications • Public and/or private networks • And owned, managed, by different organizations

  27. LSID Syntax • 5-part format, plus an optional revision: URN:LSID:Authority:Namespace:Object[:Revision-ID] • URN:LSID: is a mandatory prefix • Authority is the Internet domain of the organization that assigns an LSID to a resource • Namespace constrains the scope of the object • Object is an alphanumeric identifier for the object • Revision-ID is an optional version of the object • Examples • URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1 • URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434
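
The syntax rules above can be checked with a small parser. This is a minimal sketch, not a reference implementation of the LSID specification: it only splits on colons and checks the prefix and the number of parts, and the `LSID` class and `parse_lsid` function names are made up for this example.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # Internet domain of the assigning organization
    namespace: str           # scope of the object within that authority
    object_id: str           # identifier of the object itself
    revision: Optional[str]  # optional version; None when absent


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing URN:LSID: prefix: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty authority, namespace, or object: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


with_revision = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1")
without_revision = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434")
print(with_revision)
print(without_revision)
```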

  28. LSID: a single naming scheme • One standard naming scheme • Named data is unique • Data integrity is maintained • Breaking down of “data silos” • Names are no longer useful only in a specific proprietary context • Integrate any data source using a standard naming scheme • A single LSID protocol replaces proprietary, source-specific programs • Access to more data • Integrate data across discovery and development cycles • Metadata features • Standard access to specific data allows them to be related semantically with ease; these semantic links can lead to new insights

  29. LSID-Enabled Applications • LaunchPad • BioHaystack

  30. LaunchPad • Takes an LSID • Resolves it • Attempts to match the resolved data to the local applications one uses to process/view it

  31. YeastHub (a semantic web approach to yeast data integration) (Collaboration between YCMI and Gerstein Lab: Kevin Yip, Andrew Smith, Andy Masiar, Remko deKnikker) (Accepted for publication and presentation at ISMB 2005)

  32. Yeast Genome Data • The budding yeast Saccharomyces cerevisiae was the first eukaryote to have its genome fully sequenced. • It is easy to manipulate genetically, and many of its genes are strikingly similar to human genes. • It has been studied extensively through a wide range of biological experiments (e.g., microarray experiments). • A large variety of yeast genome data (e.g., gene expression data) have been made available through many resources (e.g., SGD, MIPS, YPD, TRIPLES, Yeast World, etc.) • Integration of such a variety of yeast data can facilitate whole genome analysis

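  33. Data Conversion and Integration • Sources: Resource 1, Resource 2, …, Resource n (e.g., <xml> … </xml>) • Conversion tools: DOM/SAX, DB-specific tool, XSLT • Converted output: RDF 1, RDF 2, …, RDF n • Storage: RDF/DB (Sesame) • Query language: RDQL • Consumers: Users/Agents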
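
The pipeline on this slide (convert each source to RDF, load it into a store, query it) can be imitated in miniature with the Python standard library. This is a hedged sketch, not how YeastHub itself is implemented: the tab-delimited data, column names, and base URI are invented placeholders, a Python list stands in for the Sesame store, and the `match` function is only a loose analogue of a triple-pattern query in a language such as RDQL.

```python
import csv
import io

# Placeholder tab-delimited data; values are illustrative, not real annotations.
TSV = "orf\tgene\tessential\norf-1\tgene-a\tyes\norf-2\tgene-b\tno\n"
BASE = "http://example.org/yeast/"  # hypothetical base URI


def rows_to_triples(tsv_text, base_uri, key):
    """Turn each table row into triples: row key -> subject, column -> predicate."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        subject = base_uri + row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, base_uri + column, value))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


store = rows_to_triples(TSV, BASE, key="orf")
essential = match(store, p=BASE + "essential", o="yes")
print(essential)
```

Treating each row as a subject and each column as a predicate is one simple way to model tabular data in RDF; sources converted this way can be loaded into the same store and queried together.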

  34. Two Levels of RDF Description • Resource description • Data description

  35. Resource Description (Use of Dublin Core Metadata)

  36. Metadata Example

  37. RDF Modeling of Tabular Data

  38. Data Conversion

  39. RDF Example

  40. Query Form

  41. RQL Syntax and Query Results

  42. Semantic Web Technologies Employed in YeastHub • RDF Site Summary (RSS) • D2RQ (mapping from relational databases to RDF) • Semantic Web Database (Sesame) • RDF Query Languages (e.g., RQL and SeRQL)

  43. Tool Interoperation

  44. An Example Scenario • Comparative genomics

  45. Manual Interoperation

  46. A Better Way of Interoperation

  47. A Better Way of Interoperation (cont’d)

  48. Web Services: “Creating a Bioinformatics Nation” (Lincoln Stein)

  49. Web Services • UDDI • WSDL • SOAP