Tactical Formalization of Linked Open Data

Tactical Formalization of Linked Open Data • Michel Dumontier, Ph.D. • Associate Professor of Medicine (Biomedical Informatics) • Stanford University @micheldumontier::RDA Workshop:23-02-2015

Linked Open Data provides an incredibly dynamic, rapidly growing set of interlinked resources

Bio2RDF is an open source project to unify the representation and interlinking of biological data
Linked Data for the Life Sciences
• chemicals/drugs/formulations, genomes/genes/proteins, domains
• interactions, complexes & pathways
• animal models and phenotypes
• disease, genetic markers, treatments
• terminologies & publications
11B triples from 35 datasets. Each dataset is described using its own schema.

Query data, ontologies, and services simultaneously using federated queries over independent SPARQL databases
Gene Ontology: get all protein catabolic processes (and more specific) in biomodels. Query against <http://bioportal.bio2rdf.org/sparql>:
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
    WHERE {
      ?go rdfs:label ?label .
      ?go rdfs:subClassOf ?tgo .
      ?tgo rdfs:label ?tlabel .
      FILTER regex(?tlabel, "^protein catabolic process")
      SERVICE <http://biomodels.bio2rdf.org/sparql> {
        ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
        ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
      }
    }
    GROUP BY ?go ?label
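
A federated query of this kind is sent to one endpoint, which forwards the SERVICE block to the second. A minimal Python sketch of preparing such a request, assuming the endpoint accepts the standard SPARQL 1.1 Protocol over HTTP GET. The helper name build_sparql_request is hypothetical, and the endpoints from the slide may no longer be online, so nothing is sent here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint named on the slide; it may no longer be reachable.
ENDPOINT = "http://bioportal.bio2rdf.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  ?go rdfs:label ?label .
  ?go rdfs:subClassOf ?tgo .
  ?tgo rdfs:label ?tlabel .
  FILTER regex(?tlabel, "^protein catabolic process")
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request.

    Hypothetical helper: the query goes in the `query` parameter and
    JSON results are requested through the Accept header.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing the request to urllib.request.urlopen would execute it; the join between the two datasets happens on the server side, not in the client.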

Despite all the data, it’s still hard to find answers to questions, because there are many ways to represent the same data and each dataset represents it differently

Multiple models for the same kind of data do emerge, each with its own merit

This lack of coordination makes Linked Open Data somewhat chaotic and unwieldy

Massive Proliferation of Ontologies / Vocabularies could be harnessed to bring order out of chaos

The Semanticscience Integrated Ontology (SIO) is used to ground Bio2RDF and SADI semantic web services
1300+ classes, 201 object properties (incl. inverses), 1 datatype property

Ontology Design Patterns: objects, processes and their attributes. @micheldumontier::NIDM:Jan 26,2015

Multi-Stakeholder Efforts to Standardize Representations are Reasonable, Long Term Strategies for Data Integration

Tactical formalization: turn linked data into whatever you need
STANDARDS: W3C, community standards, application standards, visualization standards
APPLICATION SPECIFIC: drug repurposing, verifying annotations in biomodels, discovering aberrant pathways

With Deborah McGuinness and Jim McCusker @ RPI

ReDrugS uses Nanopublications
• A nanopublication is a structured digital object that associates a statement, composed of one or more triples, with its evidence/provenance and digital object metadata.
• http://nanopub.org

Nanopublications are dynamically generated from a SPARQL query and made fit for purpose against a target schema. Here, we make an ontological commitment.
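
A minimal Python sketch of that idea: wrapping one query result row in the assertion, provenance and publication-info graphs of a nanopublication, serialized as TriG. The function name, the example.org IRIs and the choice of prov:wasDerivedFrom and dct:created are illustrative assumptions, not the ReDrugS implementation.

```python
from datetime import date

NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"
DCT = "http://purl.org/dc/terms/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_to_nanopub(base: str, row: dict, source: str, created: date) -> str:
    """Wrap one (s, p, o) result row in a TriG nanopublication.

    Hypothetical helper: `row` holds full IRIs under the keys "s", "p", "o",
    `source` is the dataset the row was queried from.
    """
    s, p, o = row["s"], row["p"], row["o"]
    return "\n".join([
        f"@prefix np: <{NP}> .",
        f"@prefix prov: <{PROV}> .",
        f"@prefix dct: <{DCT}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        # Head graph: ties the three parts together.
        f"<{base}#head> {{",
        f"  <{base}> a np:Nanopublication ;",
        f"    np:hasAssertion <{base}#assertion> ;",
        f"    np:hasProvenance <{base}#provenance> ;",
        f"    np:hasPublicationInfo <{base}#pubinfo> .",
        "}",
        # Assertion graph: the statement itself.
        f"<{base}#assertion> {{ <{s}> <{p}> <{o}> . }}",
        # Provenance graph: where the assertion came from.
        f"<{base}#provenance> {{ <{base}#assertion> prov:wasDerivedFrom <{source}> . }}",
        # Publication info graph: metadata about the nanopublication.
        f'<{base}#pubinfo> {{ <{base}> dct:created "{created.isoformat()}"^^xsd:date . }}',
    ])


# Placeholder IRIs for illustration only.
example_row = {
    "s": "http://example.org/drug/D1",
    "p": "http://example.org/vocab/targets",
    "o": "http://example.org/protein/P1",
}
trig = row_to_nanopub("http://example.org/np/1", example_row,
                      "http://example.org/dataset/source", date(2015, 2, 23))
```

Choosing the predicate and classes used in the assertion graph is where the mapping to a target schema, and hence the ontological commitment, takes place.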

we can do more

Have you heard of OWL?

SBML-based biomodels place semantic annotations in an annotation element
    <species metaid="_525530" id="GLCi" compartment="cyto">
      <annotation>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
                 xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
          <rdf:Description rdf:about="#_525530">
            <bqbiol:is>
              <rdf:Bag>
                <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
                <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
              </rdf:Bag>
            </bqbiol:is>
          </rdf:Description>
        </rdf:RDF>
      </annotation>
    </species>
The annotation reads as subject, predicate (qualifier), object.
The intent is to express that the species represents a substance composed of glucose molecules.
We also know from the SBML model that this substance is located in the cytosol, with an (initial) concentration of 0.09765 M.
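
The subject, qualifier and object can be pulled out of such an annotation with a few lines of Python. This is a minimal sketch using only the standard library; the function name annotation_triples is hypothetical, and the snippet is the species element from the slide rather than a complete SBML model.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SBML_SNIPPET = """\
<species metaid="_525530" id="GLCi" compartment="cyto">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
             xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
      <rdf:Description rdf:about="#_525530">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="http://identifiers.org/obo.chebi/CHEBI:4167"/>
            <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00031"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def annotation_triples(species_xml: str) -> list:
    """Return (subject, predicate IRI, object IRI) for each annotation entry."""
    species = ET.fromstring(species_xml)
    triples = []
    for description in species.iter(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for qualifier in description:
            # ElementTree tags look like "{namespace}local"; join them into an IRI.
            namespace, local = qualifier.tag[1:].split("}")
            predicate = namespace + local
            for item in qualifier.iter(f"{{{RDF}}}li"):
                triples.append((subject, predicate, item.get(f"{{{RDF}}}resource")))
    return triples


triples = annotation_triples(SBML_SNIPPET)
```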

By converting models into formal representations of knowledge we get to:
• capture the semantics of models and the biological systems they represent
• leverage knowledge explicit in linked terminologies
• validate the accuracy of the annotations / models
• discover biological implications inherent in the models
• query the results of simulations in the context of the biological knowledge

Model verification
After reasoning, we found 27 models to be inconsistent. Reasons:
• Our representation: functions are sometimes found in the place of physical entities (e.g. entities that secrete insulin); better to constrain with appropriate relations
• SBML abused: e.g. species used as a measure of time
• Incorrect annotations: constraints in the ontologies themselves mean that the annotation is simply not possible
Integrating systems biology models and biomedical ontologies. BMC Syst Biol. 2011 Aug 11;5:124. doi: 10.1186/1752-0509-5-124.
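
The kind of inconsistency described here can be illustrated with a toy check in Python: an entity is inconsistent when its inferred types include two classes declared disjoint. This is only a sketch; the class names are hypothetical, and the work cited above relied on an OWL reasoner rather than code like this.

```python
def ancestors(cls: str, parents: dict) -> set:
    """Return cls together with all of its superclasses."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_consistent(asserted_types: list, parents: dict, disjoint: list) -> bool:
    """False if the inferred types contain both members of a disjoint pair."""
    inferred = set()
    for cls in asserted_types:
        inferred |= ancestors(cls, parents)
    return not any(a in inferred and b in inferred for a, b in disjoint)


# Hypothetical mini-ontology: functions and physical entities are disjoint.
PARENTS = {"InsulinSecretion": ["Function"], "Glucose": ["PhysicalEntity"]}
DISJOINT = [("Function", "PhysicalEntity")]

# A species is modelled as a physical entity; annotating it with a function clashes.
ok_species = is_consistent(["PhysicalEntity", "Glucose"], PARENTS, DISJOINT)
bad_species = is_consistent(["PhysicalEntity", "InsulinSecretion"], PARENTS, DISJOINT)
```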

Identification of drug and disease enriched pathways
Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012.
Approach
• Integrated 3 datasets & 7 terminologies
  • DrugBank, PharmGKB and CTD
  • MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO
• Formalized into an OWL-EL ontology
  • 4 class top level ontology
  • logical class axioms
  • 650,000+ classes, 3.2M subClassOf axioms
• Identified significant associations using enrichment analysis over the fully inferred knowledge base

Benefit 1: Enhanced Query Capability
• Use any mapped terminology to query a target resource.
• Use knowledge in target ontologies to formulate more precise questions
  • ask for drugs that are associated with diseases of the joint: ‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124).
• Learn relationships that are inferred by automated reasoning.
  • alcohol (ChEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236)
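
The alcohol/alcoholism example can be mimicked with a small Python sketch that propagates a directly asserted association up a subclass hierarchy. It assumes that an association on a class also holds for its superclasses, and it collapses the path from ethanol to alcohol into a single edge for illustration; the actual inference was done by an OWL reasoner over the full ontology.

```python
# Simplified hierarchy (assumption): ethanol is treated as a direct subclass of alcohol.
SUBCLASS_OF = {"CHEBI:16236": ["CHEBI:30879"]}

# Directly asserted association: ethanol with alcoholism.
DIRECT = {("CHEBI:16236", "PA443309")}


def inferred_associations(direct: set, subclass_of: dict) -> set:
    """Propagate each (class, disease) association to all superclasses of the class."""
    result = set(direct)
    frontier = list(direct)
    while frontier:
        cls, disease = frontier.pop()
        for parent in subclass_of.get(cls, ()):
            if (parent, disease) not in result:
                result.add((parent, disease))
                frontier.append((parent, disease))
    return result


associations = inferred_associations(DIRECT, SUBCLASS_OF)
alcohol_linked = ("CHEBI:30879", "PA443309") in associations
```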

Benefit 2: Knowledge Discovery through Enrichment Analysis
• OntoFunc: tool to discover significant associations between sets of objects and ontology categories.
• We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease.
• Mood disorder (do:3324) is associated with the Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is used to treat HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia.
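
Enrichment analysis of this kind is commonly scored with a one-sided hypergeometric test. The slide does not say which statistic OntoFunc uses, so the Python sketch below is an assumption for illustration, and the parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, category: int, sample: int, overlap: int) -> float:
    """One-sided hypergeometric p-value.

    Probability of seeing at least `overlap` category members when drawing
    `sample` objects without replacement from `population` objects, of which
    `category` belong to the ontology category.
    """
    total = comb(population, sample)
    upper = min(sample, category)
    tail = sum(
        comb(category, i) * comb(population - category, sample - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 5 in the pathway, 5 linked to the disease,
# and all 5 disease genes fall inside the pathway.
p = enrichment_p_value(population=10, category=5, sample=5, overlap=5)
```

With many pathways and diseases tested at once, the resulting p-values would normally need a multiple-testing correction before any association is called significant.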

http://tiny.cc/hcls-datadesc-ed

61 metadata elements
Core
• Identifiers
• Title
• Description
• Attribution
• Homepage
• License
• Language
• Keywords
• Concepts and vocabularies used
• Standards
• Publication
Provenance and Change
• Version number
• Source
• Provenance: retrieved from, derived from, created with
• Frequency of change
Availability
• Format
• Download URL
• Landing page
• SPARQL endpoint
13 Content Statistics
• With SPARQL queries
Editor & Validator underway
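
Content statistics are simple counts over the dataset. A minimal Python sketch over an in-memory list of triples follows; which counts are among the 13 statistics is an assumption here, and in practice each count would come from a SPARQL query against the dataset's endpoint. The triples are placeholders.

```python
def content_statistics(triples: list) -> dict:
    """Compute a few basic counts over (subject, predicate, object) tuples."""
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_properties": len({p for _, p, _ in triples}),
        "distinct_objects": len({o for _, _, o in triples}),
    }


# Placeholder data for illustration only.
TRIPLES = [
    ("ex:drug1", "rdf:type", "ex:Drug"),
    ("ex:drug1", "ex:targets", "ex:protein1"),
    ("ex:drug2", "rdf:type", "ex:Drug"),
]

stats = content_statistics(TRIPLES)
```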

NIH Big Data to Knowledge (BD2K) Center of Excellence
Mark Musen (PI), Michel Dumontier (Co-I), Purvesh Khatri (Co-I), Olivier Gevaert (Co-I)
Goals:
• Tools to facilitate template construction and semi-automated metadata annotation
• Enable dataset discovery & reuse
Partnership with ImmPort (immunology repository), BioSharing, and the Stanford Library

dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
