
The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

The Semantic Web: New-style data-integration (and how it works for life-scientists too!). Frank van Harmelen AI Department Vrije Universiteit Amsterdam. What’s the problem? (data-mess in bio-inf). Kenneth Griffiths and Richard Resnick Tut. At Intell. Systems for Molec. Biol., 2003.


Presentation Transcript


  1. The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam

  2. What’s the problem? (data-mess in bio-inf)

  3. Life Science Data (Kenneth Griffiths and Richard Resnick, Tut. at Intell. Systems for Molec. Biol., 2003). Recent focus on genetic data: “genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.” (The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html) • Study of genes and their function • Understanding molecular mechanisms of disease • Development of drugs, vaccines, and diagnostics

  4. The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure

  5. … and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization

  6. Understanding Mechanisms of Disease • Metabolic and regulatory pathway induction

  7. Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feeds back into ongoing Genomics research and discovery.

  8. The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources

  9. What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …

  10. Sample Problem: Hyperprolactinemia. Overproduction of prolactin • prolactin stimulates mammary gland development and milk production. Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of menstrual cycle • can lead to conception difficulty

  11. Understanding transcription factors for prolactin production • EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE: “Show me all genes that are homologous to known transcription factors” • LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
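The combined question on slide 11 amounts to intersecting three per-source answers. A minimal sketch, assuming each source has already returned a set of gene identifiers that share one naming scheme; the gene symbols and the `combined_query` helper are invented for illustration and are not results from any real database.

```python
from __future__ import annotations

# Hypothetical per-source answers; the gene symbols are placeholders only.
q_literature = {"GENE_A", "GENE_B", "GENE_C"}  # putatively related in the literature
q_expression = {"GENE_B", "GENE_C", "GENE_D"}  # more than 3-fold expression differential
q_sequence = {"GENE_B", "GENE_C", "GENE_E"}    # homologous to known transcription factors


def combined_query(*answers: set[str]) -> set[str]:
    """Return the genes that appear in every per-source answer."""
    if not answers:
        return set()
    return set.intersection(*answers)


candidates = combined_query(q_literature, q_expression, q_sequence)
print(sorted(candidates))
```

The intersection is only meaningful when all three sources name the same gene the same way, which is the integration problem the preceding slides describe.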

  12. The Complexity of Biological Data

  13. Pharmaceutical Productivity Source: PhRMA & FDA 2003

  14. Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

  15. The Medical Tower of Babel • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCI Cancer Ontology • 17,000 classes (about 1M definitions)

  16. Problem with the Current WWW

  17. Why would Semantic Web technology help?

  18. Machine-accessible meaning (what it’s like to be a machine). META-DATA (diagram: tags <name>, <symptoms>, <disease>, <treatment>, <drug>, <drugadministration>, linked by IS-A and alleviates relations)

  19. What is meta-data? (diagram labels: name, symptoms, disease, drug administration) • it's just data • it's data describing other data • it's meant for machine consumption

  20. Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust • is this page really about Pamela Anderson?

  21. What are ontologies & what are they used for? (diagram: world, concept, language) • No shared understanding: conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines

  22. Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines

  23. Shared content-vocabularies: Ontologies. Formal, explicit specification of a shared conceptualisation • machine processable • concepts, properties, relations, functions • consensual knowledge • abstract model of some domain

  24. Real life examples • handcrafted • music: CDnow (2410/5), MusicMoz (1073/7) • biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology • ranging from lightweight (Yahoo, UNSPSC, Open Directory (400k)) to heavyweight (Cyc (300k)) • ranging from small (METAR) to large (UNSPSC)

  25. Biomedical ontologies (a few..) • MeSH • Medical Subject Headings, National Library of Medicine • 22,000 descriptions • EMTREE • Commercial (Elsevier), drugs and diseases • 45,000 terms, 190,000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200,000 concepts, College of American Pathologists • Gene Ontology • 15,000 terms in molecular biology • NCI Cancer Ontology • 17,000 classes (about 1M definitions)

  26. Increasing semantic “weight” What’s inside an ontology? • terms + specialisation hierarchy • classes + class-hierarchy • instances • slots/values • inheritance (multiple? defaults?) • restrictions on slots (type, cardinality) • properties of slots (symm., trans., …) • relations between classes (disjoint, covers) • reasoning tasks: classification, subsumption

  27. NB: we’re not doing philosophy • Ontologies are not definitive descriptions of what exists in the world (= philosophy) • Ontologies are models of the world constructed to facilitate communication • Yes, ontologies exist (because we build them)

  28. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  29. Stack of languages

  30. Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Datamodel for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language

  31. RDF Triples in Life Sciences

  32. Bluffer’s guide to RDF (1) • Object --Attribute-> Value triples • objects are web-resources • Value is again an Object: • triples can be linked • data-model = graph (diagram: pers05 --Author-of-> ISBN... --Publ-by-> MIT)

  33. Bluffer’s guide to RDF (2) • Every identifier is a URL = world-wide unique naming! • Has XML syntax: <rdf:Description rdf:about="#pers05"> <authorOf>ISBN...</authorOf> </rdf:Description> • Any statement can be an object • graphs can be nested (diagram: NYT --claims-> [pers05 --Author-of-> ISBN...])
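The two "Bluffer's guide" slides can be sketched with plain tuples: each triple is an edge, shared identifiers link triples into a graph, and a whole triple can itself be the object of another triple. This is only an illustration in standard-library Python, not an RDF toolkit; `publishedBy`, `#MIT`, `#NYT` and the helper functions are assumed names, and `ISBN...` is left elided as on the slides.

```python
from collections import defaultdict

# The statement that NYT makes a claim about (a triple used as an object).
claim = ("#pers05", "authorOf", "ISBN...")

triples = [
    ("#pers05", "authorOf", "ISBN..."),
    ("ISBN...", "publishedBy", "#MIT"),   # "Publ-by" on the slide; property name assumed
    ("#NYT", "claims", claim),            # nested statement
]


def build_graph(triples):
    """Index triples by subject: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def follow(graph, start, *predicates):
    """Follow a chain of predicates from start and return the nodes reached."""
    frontier = {start}
    for predicate in predicates:
        frontier = {
            obj
            for node in frontier
            for (p, obj) in graph.get(node, [])
            if p == predicate
        }
    return frontier


graph = build_graph(triples)
publishers = follow(graph, "#pers05", "authorOf", "publishedBy")
claimed = follow(graph, "#NYT", "claims")
```

Because the object of the first triple is the subject of the second, a two-step walk from `#pers05` reaches the publisher without any join logic written for that specific pair of sources.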

  34. What does RDF Schema add? • Defines vocabulary for RDF • Organizes this vocabulary in a typed hierarchy • Class, subClassOf, type • Property, subPropertyOf • domain, range (diagram: classes Person, Teacher and Student linked by subClassOf; property supervises with a domain and a range; instances Frank and Marta linked by type and supervises)
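What the RDF Schema vocabulary buys you is inference: from a few schema statements and one fact, types can be derived rather than stated. The sketch below applies four RDFS-style rules by brute force until nothing new appears. It assumes, for illustration, that `supervises` has domain Teacher and range Student, and it deliberately leaves out the explicit `type` edges so that they have to be inferred; `rdfs_closure` is a hypothetical helper, not a real reasoner.

```python
SUBCLASS, TYPE, DOMAIN, RANGE = "subClassOf", "type", "domain", "range"

facts = {
    ("Teacher", SUBCLASS, "Person"),
    ("Student", SUBCLASS, "Person"),
    ("supervises", DOMAIN, "Teacher"),   # assumed direction
    ("supervises", RANGE, "Student"),    # assumed direction
    ("Frank", "supervises", "Marta"),
}


def rdfs_closure(triples):
    """Apply domain, range and subClassOf rules until a fixed point is reached."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # (p domain C) and (s p o)  =>  (s type C)
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))
                # (p range C) and (s p o)  =>  (o type C)
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))
                # (s type C) and (C subClassOf D)  =>  (s type D)
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))
                # subClassOf is transitive
                if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(facts)
```

The nested loop is quadratic per pass, which is fine for a handful of triples; real triple stores index the data instead.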

  35. Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Datamodel for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language

  36. OWL: things RDF Schema can’t do • equality • enumeration • number restrictions • Single-valued/multi-valued • Optional/required values • inverse, symmetric, transitive • boolean algebra • Union, complement • …
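Three of the property characteristics listed on this slide (inverse, symmetric, transitive) can be illustrated as closure rules over triples. A rough sketch only: the property names (`partOf`, `interactsWith`, `supervisedBy`), the facts and the `owl_property_closure` helper are made up for the example, and nothing here covers the rest of OWL (cardinality, equality, boolean class constructors).

```python
def owl_property_closure(facts, transitive=(), symmetric=(), inverse_of=None):
    """Close a set of triples under transitive, symmetric and inverse properties."""
    inverse_of = dict(inverse_of or {})
    # An inverse declaration holds in both directions.
    inverse_of.update({v: k for k, v in list(inverse_of.items())})
    closed = set(facts)
    while True:
        new = set()
        for s, p, o in closed:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverse_of:
                new.add((o, inverse_of[p], s))
            if p in transitive:
                # (s p o) and (o p o2)  =>  (s p o2)
                new.update(
                    (s, p, o2) for s2, p2, o2 in closed if p2 == p and s2 == o
                )
        if new <= closed:
            return closed
        closed |= new


facts = {
    ("nucleus", "partOf", "cell"),
    ("cell", "partOf", "tissue"),
    ("proteinA", "interactsWith", "proteinB"),
    ("Frank", "supervises", "Marta"),
}

closed = owl_property_closure(
    facts,
    transitive={"partOf"},
    symmetric={"interactsWith"},
    inverse_of={"supervises": "supervisedBy"},
)
```

RDF Schema has no way to declare these characteristics, so with RDFS alone each derived triple would have to be stored explicitly.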

  37. OWL: more expressivity (layers, from least to most expressive: RDF Schema, Lite, DL, Full) • OWL Lite • (sub)classes, individuals • (sub)properties, domain, range • conjunction • (in)equality • cardinality 0/1 • datatypes • inverse, transitive, symmetric • hasValue • someValuesFrom • allValuesFrom • OWL DL • Negation • Disjunction • Full Cardinality • Enumerated types • OWL Full • Allow meta-classes etc

  38. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  39. Question: who writes the ontologies? Professional bodies, scientific communities, companies, publishers, …. • See previous slide on Biomedical ontologies • Same developments in many other fields Good old fashioned Knowledge Engineering Convert from DB-schema, UML, etc.

  40. Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction. Example: Encyclopedia Britannica on “Amsterdam” (extracted concepts: trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town)
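The slide does not say how the concepts were extracted. As one very shallow stand-in, the sketch below counts non-stopword terms and keeps the most frequent ones. The sample text is written for this example (it is not the Britannica article), and the stopword list and `extract_concepts` helper are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "in", "is", "was", "as", "for", "its", "to", "with"}


def extract_concepts(text, top_n=5):
    """Return the top_n most frequent non-stopword terms with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


sample = (
    "Amsterdam is the capital city of the Netherlands. "
    "The city grew as a center of trade, and Amsterdam merchants traded "
    "with Antwerp and the rest of Europe. Trade made the city rich."
)

concepts = extract_concepts(sample, top_n=3)
print(concepts)
```

A frequency count treats "merchants" and "traded" as unrelated to "merchant" and "trade"; mapping such variants onto thesaurus terms is where a vocabulary like those discussed earlier would come in.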

  41. Question: Who writes the meta-data? • exploit existing legacy-data • Amazon • Lab equipment? • side-effect from user interaction • MIT Lab photo-annotator • NOT from manual effort • Web 2.0 community/social interaction

  42. Remember “required are” • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  43. Some working examples? • DOPE • HCLS (http://www.w3.org/2001/sw/hcls/)

  44. DOPE: Background • Vertical Information Provision • Buy a topic instead of a Journal ! • Web provides new opportunities • Business driver: drug development • Rich, information-hungry market • Good thesaurus (EMTREE)

  45. The Data • Document repositories: • ScienceDirect: approx. 500,000 fulltext articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms

  46. Architecture: (diagram: a query interface on top of EMTREE in RDF Schema and Datasource 1 …. Datasource n in RDF)

  47. Architecture: (diagram labels: GUI: Spectacle (Aduna); http requests; Mediator: Sesame (Aduna); SeRQL; Document Model (RDFS); EMTREE Thesaurus (RDFS); Source Model (RDF); SOAP; Metadata Server (Collexis); Java Client; Additional Source of Data: Gene Thesaurus (RDFS), Source Model (RDF))
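The mediator idea in the two architecture slides can be sketched as follows: normalise the user's term through a thesaurus, then ask every source for documents indexed under the preferred term. This is a toy model in plain Python and does not use Sesame, SeRQL or the Collexis server; the thesaurus entries, source names, document ids and both helper functions are hypothetical.

```python
# Hypothetical thesaurus fragment: synonym -> preferred term.
THESAURUS = {
    "aspirin": "acetylsalicylic acid",
    "asa": "acetylsalicylic acid",
}

# Hypothetical sources, each indexing document ids by preferred term.
SOURCES = {
    "source_1": {"acetylsalicylic acid": ["doc-1", "doc-2"]},
    "source_n": {"acetylsalicylic acid": ["doc-9"], "prolactin": ["doc-7"]},
}


def normalise(term):
    """Map a user term to its preferred thesaurus term (or keep it as typed)."""
    key = term.strip().lower()
    return THESAURUS.get(key, key)


def mediate(term):
    """Query every source with the preferred term and collect hits per source."""
    preferred = normalise(term)
    return {
        name: index[preferred]
        for name, index in SOURCES.items()
        if preferred in index
    }


hits = mediate("Aspirin")
```

The design point is that the sources never need to agree with the user's wording, only with the shared thesaurus; adding a source means indexing it against the same preferred terms.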
