The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam.

Presentation Transcript


  1. The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam

  2. What’s the problem? (data-mess in bio-inf)

  3. The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure

  4. … and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization

  5. Understanding Mechanisms of Disease • Metabolic and regulatory pathway induction

  6. Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feeds back into ongoing Genomics research and discovery.

  7. Sample Problem: Hyperprolactinemia • Overproduction of prolactin • prolactin stimulates mammary gland development and milk production • Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of menstrual cycle • can lead to conception difficulty

  8. Understanding transcription factors for prolactin production • EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE: “Show me all genes that are homologous to known transcription factors” • LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined: “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.” (Q1 ∩ Q2 ∩ Q3)
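Read as set operations, the combined question on this slide is the intersection of three per-source answers. A minimal Python sketch, in which the gene symbols and the three result sets are made-up placeholders rather than real findings:

```python
# Hypothetical result sets, one per data source. The gene symbols are
# placeholders for illustration only.
q_literature = {"GENE_A", "GENE_B", "GENE_C"}  # linked to hyperprolactinemia in the literature
q_expression = {"GENE_B", "GENE_C", "GENE_D"}  # more than 3-fold expression differential
q_homology = {"GENE_B", "GENE_C", "GENE_E"}    # homologous to known transcription factors

# The combined question is the intersection of the three answers.
candidates = q_literature & q_expression & q_homology
print(sorted(candidates))
```

The intersection itself is trivial. The hard part, as the next slides argue, is that the three sources rarely agree on gene identifiers, so the sets cannot be compared directly.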

  9. The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources

  10. Andy Law’s First Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.” ESTC Sept, 2008

  11. Andy Law’s Second Law “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

  12. What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …

  13. Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

  14. Why would Semantic Web technology help?

  15. Semantic Web Approach • Convert all data sources to RDF representation (local or distributed) • Optional: Collect the data in a scalable semantic repository • Apply light-weight reasoning to specify formal interpretations of the data, e.g.: • remove redundancy, • establish equalities, etc. • Derive new implicit knowledge

  16. Machine-accessible meaning (what it’s like to be a machine) • META-DATA [Figure: content marked up with the tags <name>, <symptoms>, <disease>, <drug>, <drugadministration> and <treatment>, connected by the relations “alleviates” and “IS-A”]

  17. What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption [Figure labels: name, symptoms, disease, drug administration]

  18. Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust

  19. What are ontologies & what are they used for? • No shared understanding • Conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines [Figure: diagram relating world, concept and language]

  20. Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines

  21. Real life examples • handcrafted • music: CDnow (2410/5), MusicMoz (1073/7) • biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology • ranging from lightweight (Yahoo, UNSPSC, Open Directory (400k)) to heavyweight (Cyc (300k)) • ranging from small (METAR) to large (UNSPSC)

  22. Biomedical ontologies (a few..) • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCBI Cancer Ontology • 17,000 classes (about 1M definitions)

  23. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  24. Stack of languages

  25. Bluffer’s guide to RDF (1) • Object --Attribute-> Value triples • objects are web-resources • Value is again an Object: • triples can be linked • data-model = graph [Figure: graph with nodes pers05, ISBN... and MIT, linked by Author-of and Publ-by edges]
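The "triples can be linked, data-model = graph" point can be sketched in a few lines of plain Python. This is an illustration only: the node name `book1` is a hypothetical stand-in for the ISBN resource in the figure, and the reading of the figure as pers05 → book → MIT is an assumption.

```python
from collections import defaultdict

# Triples as (object, attribute, value). "book1" is a hypothetical placeholder
# for the ISBN-identified resource on the slide.
triples = [
    ("pers05", "Author-of", "book1"),
    ("book1", "Publ-by", "MIT"),
]

# Index the triples by their first element: the result is a labelled graph.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))


def follow(start, *attributes):
    """Follow a chain of attributes from `start`; return the set of nodes reached."""
    frontier = {start}
    for attr in attributes:
        frontier = {
            o
            for node in frontier
            for (p, o) in graph.get(node, [])
            if p == attr
        }
    return frontier


# Because the value of one triple is the object of another, paths can be walked.
print(follow("pers05", "Author-of", "Publ-by"))
```

A real deployment would use URIs for all three positions and an RDF library or triple store rather than a dictionary; the sketch only shows why linking triples yields a graph.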

  26. What does RDF Schema add? • Defines vocabulary for RDF • Organizes this vocabulary in a typed hierarchy • Class, subClassOf, type • Property, subPropertyOf • domain, range [Figure: classes Person, Teacher and Student linked by subClassOf; property supervises with a domain and a range; instances Frank and Marta linked by type and supervises]

  27. RDF Triples in Life Sciences

  28. OWL: things RDF Schema can’t do • equality • enumeration • number restrictions • Single-valued/multi-valued • Optional/required values • inverse, symmetric, transitive • boolean algebra • Union, complement • …

  29. Web of Data: anybody can say anything about anything • All identifiers are URLs (= on the Web) • Allows total decoupling of • data • vocabulary • meta-data [Figure: the statement [<x> IsOfType <T>], with x, T and the example <prince> held by different owners & locations]

  30. RDF(S) have a (very small) formal semantics • Defines what other statements are implied by a given set of RDF(S) statements • Ensures mutual agreement on minimal content between parties without further contact • In the form of “entailment rules” • Very simple to compute (and not explosive in practice)

  31. RDF(S) semantics: examples • Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug • aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom

  32. RDF(S) semantics: examples • Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug • aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom

  33. RDF(S) semantics • X R Y + R domain T ⇒ X IsOfType T • X R Y + R range T ⇒ Y IsOfType T • T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3 • X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2
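The four entailment rules are simple enough to run as a naive forward-chaining loop. The sketch below is an illustration in plain Python, not a conformant RDFS reasoner: it covers only these four rules, uses the slide's informal predicate names as plain strings, and takes the last rule to conclude X IsOfType T2 (the reading under which it derives something new).

```python
def rdfs_closure(triples):
    """Apply the four rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # X R Y + R domain T  =>  X IsOfType T
                if p2 == "domain" and s2 == p:
                    new.add((s, "IsOfType", o2))
                # X R Y + R range T  =>  Y IsOfType T
                if p2 == "range" and s2 == p:
                    new.add((o, "IsOfType", o2))
                # T1 SubClassOf T2 + T2 SubClassOf T3  =>  T1 SubClassOf T3
                if p == "SubClassOf" and p2 == "SubClassOf" and o == s2:
                    new.add((s, "SubClassOf", o2))
                # X IsOfType T1 + T1 SubClassOf T2  =>  X IsOfType T2
                if p == "IsOfType" and p2 == "SubClassOf" and o == s2:
                    new.add((s, "IsOfType", o2))
        if new <= facts:  # fixpoint reached: nothing new was derived
            return facts
        facts |= new


# The aspirin example from the preceding slides.
stated = [
    ("Aspirin", "IsOfType", "Painkiller"),
    ("Painkiller", "SubClassOf", "Drug"),
    ("Aspirin", "alleviates", "headache"),
    ("alleviates", "range", "symptom"),
]
closure = rdfs_closure(stated)
inferred = closure - set(stated)
print(sorted(inferred))
```

The pairwise loop is quadratic per pass, which is fine for a handful of triples. Repositories holding billions of statements rely on indexed rule engines instead.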

  34. OWL also has a formal semantics • Defines what other statements are implied by a given set of statements • Ensures mutual agreement on content (both minimal and maximal) between parties without further contact • Can be used for integrity/consistency checking • Hard to compute (and rarely/sometimes/always explosive in practice)

  35. OWL semantics: minimal • vanGogh isOfType Impressionist + Impressionist subClassOf Painter ⇒ vanGogh isOfType Painter • vanGogh painter-of sunflowers + painter-of domain painter ⇒ vanGogh isOfType painter

  36. OWL semantics: maximal • vanGogh isOfType Impressionist + Impressionist disjointFrom Cubist ⇒ NOT: vanGogh isOfType Cubist • painted-by has-cardinality 1 + sun-flowers painted-by vanGogh + Picasso different-individual-from vanGogh ⇒ NOT: sun-flowers painted-by Picasso
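The "maximal" side is what makes consistency checking possible: once two classes are declared disjoint, a data set that types one individual with both is detectably wrong. A minimal sketch of that single check in plain Python follows. The function name and data layout are assumptions for illustration, and a real OWL reasoner does far more than this.

```python
def find_disjointness_violations(types, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual that is
    typed with both members of a declared-disjoint pair."""
    violations = []
    for individual, classes in types.items():
        for a, b in disjoint_pairs:
            if a in classes and b in classes:
                violations.append((individual, a, b))
    return violations


disjoint_pairs = [("Impressionist", "Cubist")]

# Consistent: vanGogh is only an Impressionist.
consistent = {"vanGogh": {"Impressionist"}}
# Inconsistent: another source also claims vanGogh is a Cubist.
inconsistent = {"vanGogh": {"Impressionist", "Cubist"}}

print(find_disjointness_violations(consistent, disjoint_pairs))
print(find_disjointness_violations(inconsistent, disjoint_pairs))
```

With RDF Schema alone the second data set would raise no flag, because RDF(S) can only add statements and never rule one out.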

  37. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  38. Question: who writes the ontologies? Professional bodies, scientific communities, companies, publishers, …. • See previous slide on Biomedical ontologies • Same developments in many other fields Good old fashioned Knowledge Engineering Convert from DB-schema, UML, etc.

  39. Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction • Example: Encyclopedia Britannica on “Amsterdam” [Figure: extracted concepts, including trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town]

  40. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  41. How to handle multiple ontologies: ontology matching • Linguistics & structure • Shared vocabulary • Instance-based matching • Shared background knowledge

  42. Matching through shared vocabulary

  43. Matching through shared instances
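One common way to match through shared instances is to compare the sets of items annotated with each concept and pair up concepts whose sets overlap strongly. The slide does not name a similarity measure, so the Jaccard overlap, the threshold, the concept names and the document identifiers below are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(onto1, onto2, threshold=0.5):
    """Pair concepts from two ontologies whose instance sets overlap
    by at least `threshold`."""
    matches = []
    for c1, instances1 in onto1.items():
        for c2, instances2 in onto2.items():
            score = jaccard(instances1, instances2)
            if score >= threshold:
                matches.append((c1, c2, score))
    return matches


# Hypothetical annotations: which documents each concept has been attached to.
onto1 = {"Analgesic": {"doc1", "doc2", "doc3"}, "Antibiotic": {"doc7"}}
onto2 = {"Painkiller": {"doc2", "doc3", "doc4"}, "Vaccine": {"doc9"}}

print(match_concepts(onto1, onto2))
```

The approach needs no linguistic similarity between the labels, but it only works when both ontologies have been used to annotate a shared collection.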

  44. Matching using shared background knowledge [Figure: ontology 1 and ontology 2, both linked to shared background knowledge]

  45. Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/

  46. Linked Life Data Overview • LinkedLifeData - statistics: • Number of statements: 1,159,857,602 • Number of explicit statements: 403,361,589 • Number of entities: 128,948,564 • Platform to automate the process: • Infrastructure to store the data and run inference • Transform the structured data sources to RDF • Provide web interface to access the data • Currently operates over OWLIM semantic repository • Publicly available at: http://www.linkedlifedata.com

  47. Light Weight Reasoning in Linked Life Data • Use relationships to derive new implicit knowledge • Resolve the syntactic differences in the identifiers [Figure: RDF graph over the resources urn:uniprot:Protein, urn:biogrid:Interaction, urn:intact:Interaction, urn:uniprot:FBgn0068575, urn:biogrid:FBgn0068575, urn:uniprot:FBgn00134235, urn:biogrid:FBgn00134235, urn:uniprot:Q709356, urn:uniprot:P104172, urn:biogrid:15904, urn:pubmed:15904 and urn:intact:1007, connected by rdf:type, sameAs, rdf:seeAlso, hasParticipant and interactsWith edges. These are only example resource names.]
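Resolving identifier differences through sameAs links amounts to grouping identifiers into equivalence classes and rewriting every triple to use one representative per class. The sketch below does this with a small union-find in plain Python. The `urn:srcA:` and `urn:srcB:` identifiers are hypothetical, and nothing here describes how Linked Life Data or OWLIM implement it.

```python
def canonical_ids(same_as):
    """Map each identifier to one representative of its sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller id as representative
            # (an arbitrary but stable choice).
            parent[max(ra, rb)] = min(ra, rb)
    return {x: find(x) for x in list(parent)}


def rewrite(triples, canon):
    """Replace every identifier by its representative, if it has one."""
    return {(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples}


# Hypothetical identifiers: two sources name the same two proteins differently.
same_as = [
    ("urn:srcA:P1", "urn:srcB:P1"),
    ("urn:srcA:P2", "urn:srcB:P2"),
]
triples = [("urn:srcA:P1", "interactsWith", "urn:srcB:P2")]

canon = canonical_ids(same_as)
merged = rewrite(triples, canon)
print(merged)
```

After rewriting, a query phrased with either source's identifier for a protein finds the same interaction, provided the query is canonicalised through the same `canon` mapping.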

  49. Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/

  50. The Data • Document repositories: • ScienceDirect: approx. 500,000 full-text articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms
