The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

Frank van Harmelen, AI Department, Vrije Universiteit Amsterdam

What’s the problem? (data-mess in bio-inf)

The Study of Genes... • Chromosomal location • Sequence • Sequence Variation • Splicing • Protein Sequence • Protein Structure

… and Their Function • Homology • Motifs • Publications • Expression • HTS • In Vivo/Vitro Functional Characterization

Understanding Mechanisms of Disease • Metabolic and regulatory pathway induction

Development of Drugs, Vaccines, Diagnostics • Differing types of Drugs, Vaccines, and Diagnostics • Small molecules • Protein therapeutics • Gene therapy • In vitro, In vivo diagnostics • Development requires • Preclinical research • Clinical trials • Long-term clinical research • All of which often feeds back into ongoing Genomics research and discovery.

Sample Problem: Hyperprolactinemia • Overproduction of prolactin • prolactin stimulates mammary gland development and milk production • Hyperprolactinemia is characterized by: • inappropriate milk production • disruption of menstrual cycle • can lead to conception difficulty

Understanding transcription factors for prolactin production • EXPRESSION: “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells” • SEQUENCE: “Show me all genes that are homologous to known transcription factors” • LITERATURE: “Show me all genes in the public literature that are putatively related to hyperprolactinemia” • Combined (Q1 ∩ Q2 ∩ Q3): “Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
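Once each source can be queried separately, the combined question is an intersection of the three answers. A minimal sketch in Python, assuming each source has already returned a set of gene identifiers; the gene names and the function name `combined_query` are made up for illustration and do not come from any real dataset:

```python
# Hypothetical result sets, one per data source (gene names are invented).
expression_hits = {"geneA", "geneB", "geneC"}  # >3-fold expression differential
homology_hits = {"geneB", "geneC", "geneD"}    # homologous to known transcription factors
literature_hits = {"geneB", "geneC", "geneE"}  # putatively related in the literature


def combined_query(*result_sets):
    """Return the genes that appear in every result set (set intersection)."""
    if not result_sets:
        return set()
    return set.intersection(*(set(r) for r in result_sets))


candidates = combined_query(expression_hits, homology_hits, literature_hits)
print(sorted(candidates))
```

The intersection itself is trivial. The hard part, which the following slides address, is getting the three sources to use identifiers that can be compared at all.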

The Industry’s Problem Too much unintegrated data: • from a variety of incompatible sources • no standard naming convention • each with a custom browsing and querying mechanism (no common interface) • and poor interaction with other data sources

Andy Law’s First Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.” ESTC Sept, 2008

Andy Law’s Second Law “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

What are the Data Sources? • Flat Files • URLs • Proprietary Databases • Public Databases • Data Marts • Spreadsheets • Emails • …

Stitching this all together by hand? Source: Stephens et al. J Web Semantics 2006

Why would Semantic Web technology help?

Semantic Web Approach • Convert all data sources to an RDF representation (local or distributed) • Optional: collect the data in a scalable semantic repository • Apply light-weight reasoning to specify formal interpretations of the data, e.g.: • remove redundancy • establish equalities, etc. • Derive new implicit knowledge
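The first step of this approach, converting a source to triples, can be sketched for a flat file as follows. This is only an illustration: the column names, the gene values and the `urn:example:` prefix are assumptions, and a real conversion would use an agreed vocabulary and proper RDF tooling.

```python
import csv
import io

# Hypothetical flat file; column names and values are invented.
FLAT_FILE = """gene_id,symbol,chromosome
g1,geneA,6
g2,geneB,3
"""


def rows_to_triples(text, key, base="urn:example:"):
    """Turn each row into (subject, attribute, value) triples.

    The key column names the subject; every other non-empty
    column becomes one triple about that subject.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row[key]
        for column, value in row.items():
            if column != key and value:
                triples.append((subject, base + column, value))
    return triples


triples = rows_to_triples(FLAT_FILE, key="gene_id")
for triple in triples:
    print(triple)
```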

Machine-accessible meaning (What it’s like to be a machine) [Figure: META-DATA tags <name>, <symptoms>, <disease>, <drug>, <drug administration> and <treatment>, with the relations "alleviates" and "IS-A"]

What is meta-data? • it's just data • it's data describing other data • it's meant for machine consumption [Figure labels: name symptoms disease drug administration]

Required are: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached • mechanisms for attribution and trust

What are ontologies & what are they used for? • No shared understanding: conceptual and terminological confusion • Agree on a conceptualization • Make it explicit in some language • Actors: both humans and machines [Figure: diagram relating world, concept and language]

Standard vocabularies (“Ontologies”) • Identify the key concepts in a domain • Identify a vocabulary for these concepts • Identify relations between these concepts • Make these precise enough so that they can be shared between • humans and humans • humans and machines • machines and machines

Real life examples • handcrafted • music: CDnow (2410/5), MusicMoz (1073/7) • biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology • ranging from lightweight (Yahoo, UNSPSC, Open directory (400k)) to heavyweight (Cyc (300k)) • ranging from small (METAR) to large (UNSPSC)

Biomedical ontologies (a few..) • MeSH • Medical Subject Headings, National Library of Medicine • 22.000 descriptions • EMTREE • Commercial Elsevier, Drugs and diseases • 45.000 terms, 190.000 synonyms • UMLS • Integrates 100 different vocabularies • SNOMED • 200.000 concepts, College of American Pathologists • Gene Ontology • 15.000 terms in molecular biology • NCBI Cancer Ontology • 17.000 classes (about 1M definitions)

Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached

Stack of languages

Bluffer’s guide to RDF (1) • Object --Attribute-> Value triples • objects are web-resources • Value is again an Object: • triples can be linked • data-model = graph [Figure: pers05 --Author-of-> ISBN... --Publ-by-> MIT]
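Because the value of one triple can be the object of another, a set of triples forms a graph that can be walked. A minimal sketch using plain Python tuples; the two triples mirror the slide's figure, while the helper names `build_graph` and `follow` are invented for this illustration:

```python
from collections import defaultdict

# The two linked triples from the figure, as (object, attribute, value).
triples = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publ-by", "MIT"),
]


def build_graph(triples):
    """Index triples by their object so outgoing edges can be looked up."""
    graph = defaultdict(list)
    for obj, attribute, value in triples:
        graph[obj].append((attribute, value))
    return graph


def follow(graph, start, *attributes):
    """Follow a chain of attributes from start and return the values reached."""
    frontier = {start}
    for attribute in attributes:
        frontier = {
            value
            for node in frontier
            for (attr, value) in graph.get(node, [])
            if attr == attribute
        }
    return frontier


graph = build_graph(triples)
publishers = follow(graph, "pers05", "Author-of", "Publ-by")
print(publishers)
```

Walking from pers05 over Author-of and then Publ-by answers "who publishes what this person wrote" without either triple having to mention the other.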

What does RDF Schema add? • Defines vocabulary for RDF • Organizes this vocabulary in a typed hierarchy • Class, subClassOf, type • Property, subPropertyOf • domain, range [Figure: Teacher subClassOf Person; Student subClassOf Person; supervises domain Teacher; supervises range Student; Frank type Teacher; Marta type Student; Frank supervises Marta]

RDF Triples in Life Sciences

OWL: things RDF Schema can’t do • equality • enumeration • number restrictions • Single-valued/multi-valued • Optional/required values • inverse, symmetric, transitive • boolean algebra • Union, complement • …

Web of Data: anybody can say anything about anything • All identifiers are URLs (= on the Web) • Allows total decoupling of • data • vocabulary • meta-data [Figure: the statement [<x> IsOfType <T>], with x, T and <prince> held by different owners & locations]

RDF(S) has a (very small) formal semantics • Defines what other statements are implied by a given set of RDF(S) statements • Ensures mutual agreement on minimal content between parties without further contact • In the form of “entailment rules” • Very simple to compute (and not explosive in practice)

RDF(S) semantics: examples • Aspirin isOfType Painkiller + Painkiller subClassOf Drug ⇒ Aspirin isOfType Drug • aspirin alleviates headache + alleviates range symptom ⇒ headache isOfType symptom

RDF(S) semantics • X R Y + R domain T ⇒ X IsOfType T • X R Y + R range T ⇒ Y IsOfType T • T1 SubClassOf T2 + T2 SubClassOf T3 ⇒ T1 SubClassOf T3 • X IsOfType T1 + T1 SubClassOf T2 ⇒ X IsOfType T2
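The domain, range, subclass-transitivity and type-inheritance rules can be applied repeatedly until nothing new is derived. The following is a naive sketch of that idea over plain tuples, not a real RDF(S) reasoner; the function name `rdfs_closure` is invented, and the input facts are the aspirin example from the slides:

```python
def rdfs_closure(triples):
    """Apply the four entailment rules until a fixpoint is reached."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                # X R Y + R domain T  =>  X isOfType T
                if p2 == "domain" and s2 == p:
                    new.add((s, "isOfType", o2))
                # X R Y + R range T  =>  Y isOfType T
                if p2 == "range" and s2 == p:
                    new.add((o, "isOfType", o2))
                # T1 subClassOf T2 + T2 subClassOf T3  =>  T1 subClassOf T3
                if p == "subClassOf" and p2 == "subClassOf" and o == s2:
                    new.add((s, "subClassOf", o2))
                # X isOfType T1 + T1 subClassOf T2  =>  X isOfType T2
                if p == "isOfType" and p2 == "subClassOf" and o == s2:
                    new.add((s, "isOfType", o2))
        if new <= facts:  # nothing new was derived
            return facts
        facts |= new


stated = [
    ("Aspirin", "isOfType", "Painkiller"),
    ("Painkiller", "subClassOf", "Drug"),
    ("Aspirin", "alleviates", "headache"),
    ("alleviates", "range", "symptom"),
]
closure = rdfs_closure(stated)
for fact in sorted(closure - set(stated)):
    print("derived:", fact)
```

The nested loop is quadratic per pass, which is acceptable for a handful of facts; large repositories rely on indexed rule engines instead.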

OWL also has a formal semantics • Defines what other statements are implied by a given set of statements • Ensures mutual agreement on content (both minimal and maximal) between parties without further contact • Can be used for integrity/consistency checking • Hard to compute (and rarely/sometimes/always explosive in practice)

OWL semantics: minimal • vanGogh isOfType Impressionist + Impressionist subClassOf Painter ⇒ vanGogh isOfType Painter • vanGogh painter-of sunflowers + painter-of domain painter ⇒ vanGogh isOfType painter

OWL semantics: maximal • vanGogh isOfType Impressionist + Impressionist disjointFrom Cubist ⇒ NOT: vanGogh isOfType Cubist • painted-by has-cardinality 1 + sun-flowers painted-by vanGogh + Picasso different-individual-from vanGogh ⇒ NOT: sun-flowers painted-by Picasso
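The "maximal" side of the semantics is what makes integrity checking possible: some combinations of statements are ruled out. The sketch below only checks explicitly stated facts against disjointness and cardinality-1 constraints; it is a toy illustration with invented names (`find_violations`), not an OWL reasoner, and it does not perform any inference first.

```python
from collections import defaultdict


def find_violations(triples):
    """Return a sorted list of constraint violations among the stated facts."""
    facts = set(triples)
    violations = set()

    # Disjointness: no individual may belong to two disjoint classes.
    types = defaultdict(set)
    for s, p, o in facts:
        if p == "isOfType":
            types[s].add(o)
    for a, p, b in facts:
        if p == "disjointFrom":
            for individual, classes in types.items():
                if a in classes and b in classes:
                    violations.add(f"{individual} is both {a} and {b}")

    # Individuals explicitly declared different (treated as symmetric).
    different = set()
    for a, p, b in facts:
        if p == "different-individual-from":
            different.add((a, b))
            different.add((b, a))

    # Cardinality 1: a subject may not have two different values.
    single_valued = {s for s, p, o in facts if p == "has-cardinality" and o == 1}
    values = defaultdict(set)
    for s, p, o in facts:
        if p in single_valued:
            values[(s, p)].add(o)
    for (s, p), objs in values.items():
        for x in objs:
            for y in objs:
                if x < y and (x, y) in different:
                    violations.add(f"{s} has two different {p} values: {x}, {y}")
    return sorted(violations)


kb = [
    ("vanGogh", "isOfType", "Impressionist"),
    ("Impressionist", "disjointFrom", "Cubist"),
    ("painted-by", "has-cardinality", 1),
    ("sun-flowers", "painted-by", "vanGogh"),
    ("Picasso", "different-individual-from", "vanGogh"),
]
print(find_violations(kb))
print(find_violations(kb + [("vanGogh", "isOfType", "Cubist")]))
print(find_violations(kb + [("sun-flowers", "painted-by", "Picasso")]))
```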

Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached

Question: Who writes the ontologies? • Professional bodies, scientific communities, companies, publishers, … • See previous slide on Biomedical ontologies • Same developments in many other fields • Good old fashioned Knowledge Engineering • Convert from DB-schema, UML, etc.

Question: Who writes the meta-data? • Automated learning • shallow natural language analysis • Concept extraction • Example: Encyclopedia Britannica on “Amsterdam” [Figure: extracted concepts: trade, antwerp, europe, amsterdam, netherlands, merchant, center, city, town]

Remember “required are”: • one or more standard vocabularies • so search engines, producers and consumers all speak the same language • a standard syntax • so meta-data can be recognised as such • lots of resources with meta-data attached

How to handle multiple ontologies: ontology matching • Linguistics & structure • Shared vocabulary • Instance-based matching • Shared background knowledge

Matching through shared vocabulary

Matching through shared instances

Matching using shared background knowledge [Figure: ontology 1 and ontology 2 connected through shared background knowledge]

Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/

Linked Life Data Overview • LinkedLifeData statistics: • Number of statements: 1,159,857,602 • Number of explicit statements: 403,361,589 • Number of entities: 128,948,564 • Platform to automate the process: • Infrastructure for storage and inference • Transform the structured data sources to RDF • Provide web interface to access the data • Currently operates over OWLIM semantic repository • Publicly available at: http://www.linkedlifedata.com

Light Weight Reasoning in Linked Life Data • Use relationships to derive new implicit knowledge • Resolve the syntactic differences in the identifiers [Figure: graph over the nodes urn:biogrid:Interaction, urn:uniprot:Protein, urn:uniprot:FBgn0068575, urn:biogrid:FBgn0068575, urn:pubmed:15904, urn:intact:Interaction, urn:uniprot:Q709356, urn:biogrid:15904, urn:uniprot:P104172, urn:intact:1007, urn:biogrid:FBgn00134235 and urn:uniprot:FBgn00134235, with edges labelled rdf:type, sameAs, rdf:seeAlso, hasParticipant and interactsWith] These are only example resource names.
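Resolving identifier differences with sameAs can be sketched as grouping equivalent names and rewriting every statement to one canonical name per group. The resource names below are taken from the slide's figure, but which of them are connected by which relation is an assumption made for this illustration, as are the function names:

```python
def canonical_names(triples):
    """Map every name in a sameAs statement to one representative (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the lexicographically smaller name as representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return {name: find(name) for name in list(parent)}


def merge_same_as(triples):
    """Rewrite all non-sameAs triples to canonical names, dropping duplicates."""
    names = canonical_names(triples)
    return {
        (names.get(s, s), p, names.get(o, o))
        for s, p, o in triples
        if p != "sameAs"
    }


# Assumed edges between the example resources (not read off the figure).
data = [
    ("urn:uniprot:FBgn0068575", "sameAs", "urn:biogrid:FBgn0068575"),
    ("urn:biogrid:FBgn00134235", "sameAs", "urn:uniprot:FBgn00134235"),
    ("urn:uniprot:FBgn0068575", "rdf:type", "urn:uniprot:Protein"),
    ("urn:biogrid:FBgn0068575", "interactsWith", "urn:uniprot:FBgn00134235"),
]
merged = merge_same_as(data)
for triple in sorted(merged):
    print(triple)
```

After merging, the type statement from one source and the interaction statement from the other are about the same node, which is the new implicit knowledge the slide refers to.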

Some working examples? • Linked Life Data http://www.linkedlifedata.com • DOPE • HCLS http://www.w3.org/2001/sw/hcls/

The Data • Document repositories: • ScienceDirect: approx. 500.000 full-text articles • MEDLINE: approx. 10.000.000 abstracts • Extracted Metadata • The Collexis Metadata Server: concept-extraction ("semantic fingerprinting") • Thesauri and Ontologies • EMTREE: 60.000 preferred terms, 200.000 synonyms
