
Tactical Formalization of Linked Open Data

Michel Dumontier, Ph.D., Associate Professor of Medicine (Biomedical Informatics), Stanford University. Linked Open Data provides an incredibly dynamic, rapidly growing set of interlinked resources.



Presentation Transcript

  1. Tactical Formalization of Linked Open Data • Michel Dumontier, Ph.D. • Associate Professor of Medicine (Biomedical Informatics) • Stanford University @micheldumontier::RDA Workshop:23-02-2015

  2. Linked Open Data provides an incredibly dynamic, rapidly growing set of interlinked resources

  3. Bio2RDF: Linked Data for the Life Sciences. Bio2RDF is an open source project to unify the representation and interlinking of biological data. • Chemicals/drugs/formulations • Genomes/genes/proteins, domains • Interactions, complexes & pathways • Animal models and phenotypes • Disease, genetic markers, treatments • Terminologies & publications • 11B triples from 35 datasets • Each dataset is described using its own schema.

  4. Query data, ontologies, and services simultaneously using federated queries over independent SPARQL DBs. Example (Gene Ontology): get all protein catabolic processes (and more specific) in biomodels. Query against <http://bioportal.bio2rdf.org/sparql>:
       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
       # Count the BioModels reactions linked to each matching GO term.
       SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
       WHERE {
         ?go rdfs:label ?label .
         ?go rdfs:subClassOf ?tgo .
         ?tgo rdfs:label ?tlabel .
         FILTER regex(?tlabel, "^protein catabolic process")
         # This block is evaluated by the remote BioModels endpoint.
         SERVICE <http://biomodels.bio2rdf.org/sparql> {
           ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
           ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
         }
       }
       GROUP BY ?go ?label
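
The federated query on this slide can be submitted over HTTP using the SPARQL Protocol, which passes the query text in a `query` parameter. The following is a minimal sketch, not the speaker's tooling: the function names are hypothetical, the endpoint URLs are the ones shown on the slide (they may no longer resolve), and the label prefix is inserted without escaping, so it is assumed to be trusted input.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint URLs as shown on the slide; availability is not guaranteed.
PRIMARY_ENDPOINT = "http://bioportal.bio2rdf.org/sparql"
REMOTE_ENDPOINT = "http://biomodels.bio2rdf.org/sparql"


def build_federated_query(label_prefix, remote_endpoint):
    """Return a SPARQL 1.1 query whose SERVICE block runs on a second endpoint."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {{
  ?go rdfs:label ?label .
  ?go rdfs:subClassOf ?tgo .
  ?tgo rdfs:label ?tlabel .
  FILTER regex(?tlabel, "^{label_prefix}")
  SERVICE <{remote_endpoint}> {{
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }}
}}
GROUP BY ?go ?label
""".strip()


def build_request_url(endpoint, query):
    """Encode the query as a SPARQL Protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = build_federated_query("protein catabolic process", REMOTE_ENDPOINT)
url = build_request_url(PRIMARY_ENDPOINT, query)
```

The sketch only builds the request URL; fetching it (for example with `urllib.request.urlopen`) is left out because it depends on the endpoints being online.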

  5. Despite all the data, it’s still hard to find answers to questions, because there are many ways to represent the same data and each dataset represents it differently

  6. Multiple models for the same kind of data do emerge, each with its own merit

  7. This lack of coordination makes Linked Open Data somewhat chaotic and unwieldy

  8. Massive Proliferation of Ontologies / Vocabularies could be harnessed to bring order out of chaos

  9. The Semanticscience Integrated Ontology (SIO) • Is used to ground Bio2RDF and SADI semantic web services • 1300+ classes • 201 object properties (inc. inverses) • 1 datatype property

  10. Ontology Design Patterns: objects, processes and their attributes. @micheldumontier::NIDM:Jan 26,2015

  11. Multi-Stakeholder Efforts to Standardize Representations are Reasonable, Long Term Strategies for Data Integration

  12. Tactical formalization: turn linked data into whatever you need • Standards: W3C, community standards, application standards, visualization standards • Application specific: drug repurposing, verifying annotations in biomodels, discovering aberrant pathways

  13. With Deborah McGuinness and Jim McCusker @ RPI

  14. ReDrugS uses Nanopublications • A nanopublication is a structured digital object that associates a statement, composed of one or more triples, with its evidence/provenance and with digital object metadata. • http://nanopub.org
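
To make the structure concrete, here is a minimal sketch of a nanopublication written out as four named graphs (head, assertion, provenance, publication info) in TriG syntax. It uses the nanopub schema namespace `http://www.nanopub.org/nschema#`; the `example.org` IRIs, the `treats` predicate, and the helper names are hypothetical and only for illustration. This is not how ReDrugS itself builds nanopublications.

```python
NP = "http://www.nanopub.org/nschema#"


def graph_block(name, triples):
    """Serialize one named graph in TriG; terms are pre-formatted strings."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return f"{name} {{\n{body}\n}}"


def nanopub_trig(base, assertion, provenance, pubinfo):
    """Wrap assertion, provenance and metadata triples in a nanopublication."""
    np_iri = f"<{base}>"
    a_iri, p_iri, i_iri = (
        f"<{base}#{part}>" for part in ("assertion", "provenance", "pubinfo")
    )
    # The head graph ties the three content graphs to the nanopublication.
    head = [
        (np_iri, "a", f"<{NP}Nanopublication>"),
        (np_iri, f"<{NP}hasAssertion>", a_iri),
        (np_iri, f"<{NP}hasProvenance>", p_iri),
        (np_iri, f"<{NP}hasPublicationInfo>", i_iri),
    ]
    graphs = [
        (f"<{base}#head>", head),
        (a_iri, assertion),
        (p_iri, provenance),
        (i_iri, pubinfo),
    ]
    return "\n\n".join(graph_block(name, triples) for name, triples in graphs)


EX = "http://example.org/"  # hypothetical namespace
trig = nanopub_trig(
    EX + "np1",
    assertion=[(f"<{EX}drugA>", f"<{EX}treats>", f"<{EX}diseaseB>")],
    provenance=[(
        f"<{EX}np1#assertion>",
        "<http://www.w3.org/ns/prov#wasDerivedFrom>",
        f"<{EX}sourceDataset>",
    )],
    pubinfo=[(
        f"<{EX}np1>",
        "<http://purl.org/dc/terms/created>",
        '"2015-02-23"^^<http://www.w3.org/2001/XMLSchema#date>',
    )],
)
```

Keeping the statement, its provenance, and the publication metadata in separate graphs is what lets each be queried or trusted independently.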

  15. Nanopublications are dynamically generated from a SPARQL query • Fit for purpose to a target schema • Here, we make an ontological commitment

  16. We can do more

  17. Have you heard of OWL?

  18. SBML-based biomodels place semantic annotations in an annotation element:
       <species metaid="_525530" id="GLCi" compartment="cyto">
         <annotation>
           <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
             <rdf:Description rdf:about="#_525530">
               <bqbiol:is>
                 <rdf:Bag>
                   <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
                   <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
                 </rdf:Bag>
               </bqbiol:is>
             </rdf:Description>
           </rdf:RDF>
         </annotation>
       </species>
     The annotation reads as subject (rdf:about), predicate or qualifier (bqbiol:is), and object (rdf:resource). The intent is to express that the species represents a substance composed of glucose molecules. We also know from the SBML model that this substance is located in the cytosol, with an (initial) concentration of 0.09765 M.
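
The subject / qualifier / object reading of the annotation can be extracted with the Python standard library. This is a minimal sketch over the fragment shown on the slide; the function name is hypothetical, and a full SBML file would also carry the SBML default namespace on `species`, which this fragment omits.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SPECIES_XML = """
<species metaid="_525530" id="GLCi" compartment="cyto">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def annotation_triples(species_xml):
    """Return (subject, qualifier IRI, resource IRI) for each annotation."""
    root = ET.fromstring(species_xml.strip())
    triples = []
    for description in root.iter(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for qualifier in description:
            # ElementTree tags look like "{namespace}local"; join them into an IRI.
            predicate = qualifier.tag[1:].replace("}", "")
            for item in qualifier.iter(f"{{{RDF}}}li"):
                triples.append((subject, predicate, item.get(f"{{{RDF}}}resource")))
    return triples


triples = annotation_triples(SPECIES_XML)
```

Each `rdf:li` in the bag becomes its own triple, so the species above yields one statement for the ChEBI term and one for the KEGG compound.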

  19. By converting models into formal representations of knowledge we get to: • capture the semantics of models and the biological systems they represent • leverage knowledge explicit in linked terminologies • validate the accuracy of the annotations / models • discover biological implications inherent in the models • query the results of simulations in the context of the biological knowledge

  20. Model verification. After reasoning, we found 27 models to be inconsistent. Reasons: • Our representation: functions are sometimes found in the place of physical entities (e.g. entities that secrete insulin); it is better to constrain these with appropriate relations • SBML abused: e.g. species used as a measure of time • Incorrect annotations: constraints in the ontologies themselves mean that the annotation is simply not possible. Reference: Integrating systems biology models and biomedical ontologies. BMC Syst Biol. 2011 Aug 11;5:124. doi: 10.1186/1752-0509-5-124.
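
The kind of inconsistency described here can be illustrated with a toy check: an entity is flagged when its asserted types, together with their superclasses, fall under two classes declared disjoint. This is a simplified sketch, not an OWL reasoner and not the method used in the cited paper; the class names, the single-parent hierarchy, and the disjointness axiom are all assumed for illustration.

```python
# Hypothetical single-parent, acyclic class hierarchy.
SUBCLASS_OF = {
    "glucose": "physical_entity",
    "insulin_secretion": "function",
}

# Hypothetical disjointness axiom: nothing is both a physical entity and a function.
DISJOINT = [frozenset({"physical_entity", "function"})]


def ancestors(cls):
    """Return the class followed by all of its superclasses."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def find_clashes(asserted_types):
    """List (entity, disjoint pair) for every entity that violates an axiom."""
    clashes = []
    for entity, types in asserted_types.items():
        inferred = {a for t in types for a in ancestors(t)}
        for pair in DISJOINT:
            if pair <= inferred:
                clashes.append((entity, tuple(sorted(pair))))
    return clashes


# An SBML species is taken to be a physical entity; annotating it with a
# function term makes the second entry inconsistent.
model = {
    "GLCi": ["glucose"],
    "species_7": ["physical_entity", "insulin_secretion"],
}
clashes = find_clashes(model)
```

A real reasoner derives such clashes from the full set of axioms in the linked ontologies rather than from a hand-written table.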

  21. Identification of drug and disease enriched pathways. Reference: Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. Approach: • Integrated 3 datasets (DrugBank, PharmGKB and CTD) & 7 terminologies (MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO) • Formalized into an OWL-EL ontology (4-class top-level ontology, logical class axioms, 650,000+ classes, 3.2M subClassOf axioms) • Identified significant associations using enrichment analysis over the fully inferred knowledge base

  22. Benefit 1: Enhanced Query Capability • Use any mapped terminology to query a target resource. • Use knowledge in target ontologies to formulate more precise questions, e.g. ask for drugs that are associated with diseases of the joint: ‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124). • Learn relationships that are inferred by automated reasoning, e.g. alcohol (CHEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236)
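
The alcohol/alcoholism example can be mimicked with a small sketch that propagates a direct association up a class hierarchy. The identifiers are the ones on the slide; the single direct subclass link from ethanol to alcohol is a simplification, and the propagation rule itself is an assumption about how the inference plays out, not the OWL axioms used in the actual knowledge base.

```python
# Simplified hierarchy: ethanol (CHEBI:16236) is treated as a direct
# subclass of alcohol (CHEBI:30879).
SUBCLASS_OF = {"CHEBI:16236": {"CHEBI:30879"}}

# Directly asserted association: ethanol with alcoholism (PA443309).
DIRECT = {("CHEBI:16236", "PA443309")}


def superclasses(cls):
    """Return every class reachable by following subclass links upward."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_associations(direct):
    """Add an association for each superclass of an associated chemical."""
    result = set(direct)
    for chemical, disease in direct:
        result |= {(parent, disease) for parent in superclasses(chemical)}
    return result


associations = inferred_associations(DIRECT)
```

With these inputs, a query for alcohol now also returns alcoholism, even though only ethanol was annotated.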

  23. Benefit 2: Knowledge Discovery through Enrichment Analysis • OntoFunc: Tool to discover significant associations between sets of objects and ontology categories. • We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease. • Mood disorder (do:3324) associated with Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is used to treat HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia
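
Enrichment analysis asks whether a set of objects overlaps an ontology category more than chance would predict. The slide does not say which statistic OntoFunc uses; the sketch below assumes a one-sided hypergeometric test, a common choice for this task, and the counts in the usage lines are invented for illustration.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` items are drawn without replacement
    from `population` items, of which `annotated` belong to the category."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented counts: 10 genes overall, 5 linked to a disease, 5 in a pathway.
p_all_shared = hypergeom_pvalue(10, 5, 5, 5)
p_any_overlap = hypergeom_pvalue(10, 5, 5, 0)
```

In practice the p-values from many category tests would also need a multiple-testing correction before calling an association significant.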

  24. http://tiny.cc/hcls-datadesc-ed

  25. 61 metadata elements • Core: identifiers, title, description, attribution, homepage, license, language, keywords, concepts and vocabularies used, standards, publication • Provenance and change: version number, source, provenance (retrieved from, derived from, created with), frequency of change • Availability: format, download URL, landing page, SPARQL endpoint • 13 content statistics, with SPARQL queries • Editor & validator underway
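
Content statistics of this kind are simple aggregate counts over a dataset's triples. The slide does not list the 13 statistics, so the five computed below are assumed examples, and the sketch works on an in-memory list of triples rather than issuing SPARQL queries against an endpoint.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def content_statistics(triples):
    """Compute a few illustrative dataset-level counts from (s, p, o) triples."""
    unique = set(triples)
    return {
        "triples": len(unique),
        "distinct_subjects": len({s for s, _, _ in unique}),
        "distinct_properties": len({p for _, p, _ in unique}),
        "distinct_objects": len({o for _, _, o in unique}),
        "distinct_classes": len({o for _, p, o in unique if p == RDF_TYPE}),
    }


# Hypothetical data for illustration.
sample = [
    ("ex:geneA", RDF_TYPE, "ex:Gene"),
    ("ex:geneA", "ex:label", '"gene A"'),
    ("ex:geneB", RDF_TYPE, "ex:Gene"),
]
stats = content_statistics(sample)
```

Publishing such counts alongside the descriptive metadata lets a consumer judge a dataset's size and shape before downloading it.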

  26. NIH Big Data to Knowledge (BD2K) Center of Excellence • Mark Musen (PI), Michel Dumontier (Co-I), Purvesh Khatri (Co-I), Olivier Gevaert (Co-I) • Goals: tools to facilitate template construction and semi-automated metadata annotation; enable dataset discovery & reuse • Partnership with ImmPort (immunology repository), BioSharing, and the Stanford Library

  27. michel.dumontier@stanford.edu • Website: http://dumontierlab.com • Presentations: http://slideshare.com/micheldumontier
