
The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

Presentation Transcript


  1. The Semantic Web: New-style data-integration (and how it works for life-scientists too!) Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam

  2. What’s the problem? (data-mess in bio-inf)

  3. Pharmaceutical Productivity Source: PhRMA & FDA 2003

  4. The Industry’s Problem: Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources (Source: Kenneth Griffiths and Richard Resnick, Tutorial at Intelligent Systems for Molecular Biology, 2003)

  5. What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …

  6. Sample Problem: Hyperprolactinemia. Overproduction of prolactin • prolactin stimulates mammary gland development and milk production. Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of the menstrual cycle • can lead to conception difficulty

  7. Understanding transcription factors for prolactin production • EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE: “Show me all genes that are homologous to known transcription factors” • LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
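
As a minimal sketch of what the combined query amounts to once each source can be queried separately: the answer is the intersection of the three result sets. The gene identifiers and result sets below are placeholders invented for illustration, not real findings.

```python
# Hypothetical result sets, one per data source (placeholder identifiers).
expression_hits = {"GENE_B", "GENE_C", "GENE_D"}  # >3-fold expression differential
sequence_hits = {"GENE_B", "GENE_C", "GENE_E"}    # homologous to transcription factors
literature_hits = {"GENE_A", "GENE_B", "GENE_C"}  # mentioned with hyperprolactinemia


def combined_query(*result_sets):
    """Return the genes that appear in every result set."""
    if not result_sets:
        return set()
    return set.intersection(*(set(r) for r in result_sets))


candidates = combined_query(expression_hits, sequence_hits, literature_hits)
print(sorted(candidates))
```

The intersection is only meaningful if all three sources name the same gene the same way, which is exactly the "no standard naming convention" problem from slide 4.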

  8. The Medical tower of Babel • MeSH: Medical Subject Headings, National Library of Medicine; 22,000 descriptions • EMTREE: commercial (Elsevier), drugs and diseases; 45,000 terms, 190,000 synonyms • UMLS: integrates 100 different vocabularies • SNOMED: 200,000 concepts, College of American Pathologists • Gene Ontology: 15,000 terms in molecular biology • NCI Cancer Ontology: 17,000 classes (about 1M definitions)

  9. Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

  10. Why would Semantic technology help?

  11. Machine-accessible meaning (what it’s like to be a machine) • META-DATA tags in the diagram: <name>, <symptoms>, <disease>, <treatment>, <drug>, <drugadministration> • relations in the diagram: IS-A, alleviates

  12. What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption (Diagram labels: name, symptoms, disease, drug, administration)
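
One way to picture "data describing other data" is as subject-predicate-object statements that a program can filter. The sketch below is an assumption-laden illustration: the identifiers `drug_x`, `symptom_y` and `disease_z` are invented, and only the relation names `is-a` and `alleviates` come from the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical meta-data statements (placeholder identifiers).
metadata = [
    Triple("drug_x", "is-a", "drug"),
    Triple("symptom_y", "is-a", "symptom"),
    Triple("disease_z", "has-symptom", "symptom_y"),
    Triple("drug_x", "alleviates", "symptom_y"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects(metadata, "drug_x", "alleviates"))
```

A machine cannot read the word "alleviates" the way a person does, but it can follow the statement once the predicate is an agreed, shared identifier.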

  13. Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust (is this page really about Pamela Anderson?)

  14. What are ontologies & what are they used for? • No shared understanding: conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines (Diagram labels: world, concept, language)

  15. Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines

  16. Biomedical ontologies (a few...) • MeSH: Medical Subject Headings, National Library of Medicine; 22,000 descriptions • EMTREE: commercial (Elsevier), drugs and diseases; 45,000 terms, 190,000 synonyms • UMLS: integrates 100 different vocabularies • SNOMED: 200,000 concepts, College of American Pathologists • Gene Ontology: 15,000 terms in molecular biology • NCI Cancer Ontology: 17,000 classes (about 1M definitions)

  17. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  18. Stack of languages

  19. Stack of languages • XML: • Surface syntax, no semantics • XML Schema: • Describes structure of XML documents • RDF: • Datamodel for “relations” between “things” • RDF Schema: • RDF Vocabulary Definition Language • OWL: • A more expressive Vocabulary Definition Language
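
To make the step from RDF (plain relations) to RDF Schema (vocabulary with meaning) concrete, here is a small sketch that applies two standard RDFS inferences to a set of triples: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of its superclasses. The `ex:` names and the class hierarchy are assumptions made up for the example, and this hand-rolled loop is a toy, not how a real RDF store is implemented.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Repeat the two rules until no new triples are produced."""
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closure:
                # s subClassOf o, o subClassOf o2  =>  s subClassOf o2
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # s2 type s, s subClassOf o  =>  s2 type o
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        if new <= closure:
            return closure
        closure |= new


# Hypothetical vocabulary and one instance.
stated = {
    ("ex:Hyperprolactinemia", SUBCLASS, "ex:EndocrineDisorder"),
    ("ex:EndocrineDisorder", SUBCLASS, "ex:Disease"),
    ("ex:case1", TYPE, "ex:Hyperprolactinemia"),
}

entailed = infer(stated)
print(("ex:case1", TYPE, "ex:Disease") in entailed)
```

A query for "all diseases" now finds `ex:case1` even though nobody stated that fact directly; that is the kind of work the vocabulary languages add on top of the surface syntax.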

  20. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  21. Question: who writes the ontologies? • Professional bodies, scientific communities, companies, publishers, … • See previous slide on Biomedical ontologies • Same developments in many other fields • Good old-fashioned Knowledge Engineering • Convert from DB-schema, UML, etc.

  22. Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction. Example: Encyclopedia Britannica on “Amsterdam” (extracted concepts shown: trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town)
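
A very rough sketch of what "shallow natural language analysis" can mean in practice: tokenize, drop function words, and keep the most frequent remaining words as candidate concepts. The stop-word list and the sample text are assumptions written for this example (the text is not the Britannica entry), and real concept extraction would map words onto thesaurus terms rather than raw tokens.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "as", "in", "is", "it", "of", "the", "to", "was"}


def extract_concepts(text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


sample = (
    "Amsterdam is a city in the Netherlands. The city was a center of trade. "
    "Merchants in the city traded across Europe."
)

print(extract_concepts(sample, top_n=3))
```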

  23. Question: Who writes the meta-data? • exploit existing legacy-data • Databases • Lab equipment • (Amazon) • side-effect from user interaction • email keyword extraction • NOT from manual effort
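
Exploiting legacy data often comes down to a mechanical mapping: each database row becomes a subject, each column a predicate, each cell a value. The sketch below assumes a hypothetical row layout and an invented `ex:` prefix; it shows the idea, not any particular conversion tool.

```python
def row_to_triples(row, key_column, base="ex:"):
    """Turn one record (a dict) into subject-predicate-object triples.

    The key column names the subject; every other non-empty column
    becomes one triple.
    """
    subject = f"{base}{row[key_column]}"
    return [
        (subject, f"{base}{column}", str(value))
        for column, value in row.items()
        if column != key_column and value is not None
    ]


# Hypothetical record from a lab database.
record = {"sample_id": "S1", "tissue": "pituitary", "fold_change": 3.2, "note": None}

for triple in row_to_triples(record, key_column="sample_id"):
    print(triple)
```

No manual annotation is involved: the meta-data is a by-product of data that already exists.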

  24. Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax, • so meta-data can be recognised as such • lots of resources with meta-data attached

  25. Some working examples? • DOPE

  26. DOPE: Background • Vertical Information Provision • Buy a topic instead of a Journal! • Web provides new opportunities • Business driver: drug development • Rich, information-hungry market • Good thesaurus (EMTREE)

  27. The Data • Document repositories: • ScienceDirect: approx. 500,000 fulltext articles • MEDLINE: approx. 10,000,000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60,000 preferred terms, 200,000 synonyms

  28. Architecture (diagram labels): Query interface • RDF Schema: EMTREE • RDF: Datasource 1 … RDF: Datasource n

  29. Ontology disambiguates query

  30. Ontology groups results

  31. Ontology clusters results

  32. Ontology refines query
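
The four slides above show screenshots only. As a hedged illustration of the general idea behind thesaurus-driven query refinement, the sketch below expands a search term with its synonyms and narrower terms. The mini-thesaurus is invented for the example and is not EMTREE content, and nothing here claims to reproduce how DOPE does it.

```python
# Hypothetical mini-thesaurus: each term lists synonyms and narrower terms.
THESAURUS = {
    "analgesic": {"synonyms": ["painkiller"], "narrower": ["aspirin"]},
    "aspirin": {"synonyms": ["acetylsalicylic acid"], "narrower": []},
}


def expand_query(term, thesaurus):
    """Return the term plus all synonyms and narrower terms, transitively."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        entry = thesaurus.get(current, {})
        stack.extend(entry.get("synonyms", []))
        stack.extend(entry.get("narrower", []))
    return seen


print(expand_query("analgesic", THESAURUS))
```

The same structure read in the other direction (offering the narrower terms as choices instead of adding them all) gives the "refine" and "group" behaviours named in the slide titles.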

  33. Some working examples? • DOPE • HCLS (http://www.w3.org/2001/sw/hcls/)

  34. Architecture (diagram labels): Query interface • RDF Schema: EMTREE … RDF Schema: Gene Ontology … • RDF: Datasource 1 … RDF: Datasource n
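
The architecture can be read as: every data source is exposed as RDF triples, the triples are pooled, and one query interface sits on top. A minimal sketch under that reading follows; the source contents, the `ex:` names, and the wildcard-matching function are all assumptions for illustration, not the actual system.

```python
def merge(*sources):
    """Pool the triples of all sources into one set."""
    merged = set()
    for source in sources:
        merged.update(source)
    return merged


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Two hypothetical sources that describe the same gene identifier.
literature_source = {("ex:gene1", "ex:mentionedWith", "ex:Hyperprolactinemia")}
expression_source = {("ex:gene1", "ex:foldChange", "3.5"),
                     ("ex:gene2", "ex:foldChange", "1.1")}

store = merge(literature_source, expression_source)
print(match(store, s="ex:gene1"))
```

Because both sources use the same identifier for the gene, a single pattern retrieves facts that originally lived in separate systems; the shared schemas (EMTREE, Gene Ontology) are what make such shared identifiers possible.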

  35. Summarising… • Data integration on the Web: • machine-processable data besides human-processable data • Syntax for meta-data • Representation • Inference • Vocabularies for meta-data • Lots of them in bio-inf. • Actual meta-data: • Lots in bio-inf. • Will enable: • Better search engines (recall, precision, concepts) • Combining information across pages (inference) • …

  36. Things to do for you • Practical: Use existing software to construct new use scenarios • Conceptual: Create an ontology for some area of bio-medical expertise • from scratch • as a refinement of an existing ontology • Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
