Enhancing Genomic and Disease Data Using Linked Data

Integrating Semantics & Numerics: Case Study on Enhancing Genomic and Disease Data Using Linked Data Technologies (Semantic Technologies Meets Data Analysis). Deborah L. McGuinness, Tetherless World Senior Constellation Chair; Professor of Computer Science and Cognitive Science; Director, RPI Web Science Research Center; RPI Institute for Data Exploration and Applications Health Informatics Lead. Thanks to the extended RPI Tetherless World & SemNExT teams, in particular Kristin Bennett and Evan Patton, as well as the rest of the SemNExT team (Elisabeth Brown, Hannah De Los Santos, Spencer Norris, Matt Poegel) and the ReDrugS team (Jim McCusker, Michel Dumontier, Rui Yan).

Motivation • Data and data analysis tasks are growing rapidly; the tasks are often time-consuming, and the results are often difficult for non-experts to understand • Semantic representation languages and environments are available and enjoying increased usage • Structured, vetted, and maintained resources are increasingly available on the web, particularly in bioinformatics • Two groups (McGuinness – Semantic Technologies & Bennett – Data Analysis) had maturing processes that we believed could be improved if integrated • We believed tooling support could be produced to help identify and link experimental bioinformatics data and analyses with relevant semantic knowledge

SemNExT – Semantic Numeric Exploration Technology • Developing next-generation integrated semantic/numeric data exploration and analysis. • Joint analysis of experimental and semantic data streams from several sources. • Combination of multiple statistical and machine learning techniques with semantically encoded domain knowledge. • Advanced interactive visualizations of mathematical data analytics techniques, with full semantic markup, designed for RPI's unique platforms. Joint work between McGuinness’ Semantics group and Bennett’s Applied Math group, leveraging Experimental Multimedia Performing Arts Center (EMPAC) infrastructure. http://tw.rpi.edu/web/project/SemNExT

Background: Ontologies. An ontology specifies a rich description of the • Terminology, concepts, nomenclature • Relationships among concepts and individuals • Sentences distinguishing concepts, refining definitions and relationships (constraints, restrictions, regular expressions) relevant to a particular domain or area of interest. * Based on AAAI '99 Ontologies Panel: McGuinness, Welty, Uschold, Gruninger, Lehmann

SemNExT Workflow • Identify data sources and ontologies • Generate ontology and instance mappings between sources (e.g., identifiers) • Identify appropriate statistical analyses based on the types of data (e.g., nominal, ratio) • Identify appropriate visualization for statistical results • Capture and expose provenance for end users
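The workflow step that picks a statistical analysis from the type of data can be sketched as a lookup on measurement scale. This is a minimal illustration under stated assumptions, not the SemNExT implementation: the slide names only "nominal" and "ratio", so the other scale names, the `suggest_analysis` function, and the particular tests it returns are hypothetical choices made here.

```python
# Hypothetical sketch: suggest a candidate analysis from the measurement
# scales of two variables. The rules are illustrative, not SemNExT's own.
SCALES = ("nominal", "ordinal", "interval", "ratio")


def is_numeric(scale: str) -> bool:
    """Interval and ratio scales support arithmetic on the values."""
    return scale in ("interval", "ratio")


def suggest_analysis(scale_x: str, scale_y: str) -> str:
    for scale in (scale_x, scale_y):
        if scale not in SCALES:
            raise ValueError(f"unknown measurement scale: {scale}")
    pair = (scale_x, scale_y)
    if is_numeric(scale_x) and is_numeric(scale_y):
        return "Pearson correlation"
    if pair == ("nominal", "nominal"):
        return "chi-squared test of independence"
    if "nominal" in pair:
        # One grouping variable plus one measured variable.
        other = scale_y if scale_x == "nominal" else scale_x
        return "one-way ANOVA" if is_numeric(other) else "Kruskal-Wallis test"
    # At least one ordinal variable and no nominal variable.
    return "Spearman rank correlation"


print(suggest_analysis("nominal", "ratio"))
```

In a semantic setting the scale of each variable would come from ontology annotations on the data source rather than from hand-written strings.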

Ontologies as an Enabling Technology • Identify Data Source Vocabularies. Determine equivalency and subclass relationships between different data sources • Model inputs, outputs, assumptions, techniques of statistical models, simulations • Provide automated mappings between individuals using reasoning • Upper level ontology specialized with domain knowledge
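One simple form of automated mapping between individuals is grouping identifiers that are linked, directly or transitively, by equivalence statements such as owl:sameAs. The sketch below uses a union-find structure for that closure; the identifiers and the `same_as_groups` function are made up for illustration, and a full OWL reasoner would infer considerably more than this.

```python
# Hypothetical sketch: group identifiers connected by owl:sameAs links.
# All identifiers below are invented for illustration.
from collections import defaultdict


def same_as_groups(pairs):
    """Return sorted groups of identifiers that are transitively equivalent."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return sorted(sorted(group) for group in groups.values())


links = [
    ("srcA:gene1", "srcB:G-001"),
    ("srcB:G-001", "srcC:x9"),
    ("srcA:gene2", "srcC:y4"),
]
groups = same_as_groups(links)
print(groups)
```

Each resulting group can be treated as one individual when joining experimental data from different sources.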

Background: Understanding Human Cerebral Cortex Development and Disease [Figure labels: Neural Differentiation; Fertilization; Cortical Layer Formation; Human embryonic Stem Cells (hESCs)] Cortical Development Clock from Analysis of RNA-Seq from Day 0 to 77 captures genes' temporal roles in stages of corticogenesis • Analyze brain grown in a dish model • Create molecular signature of normal cortical development • Analyze mutated genes associated with disease to understand developmental origins • Compare to diseased patient stem cell lines to identify differences, e.g., autism signature. Joint work on CORTECON Data with Dr. Chris Fasano and Sally Temple at Neural Stem Cell Inst

Primary Components • Data Source Ingest • Generate ontology and instance mappings between sources (e.g., identifiers) • Identify appropriate statistical analyses based on the types of data (e.g., nominal, ratio) • Identify appropriate visualization for statistical results • Capture and expose provenance for end users

Semantic Numeric Exploration Technology Components Ontologies are used to integrate and map concepts between data sources. Also used to power smart search, browsing, and visualizations. Semantic technologies capture the provenance of mapping, analysis, and visualization.

How gene mutations alter stages of corticogenesis to cause disease.

SemNExT - Example

Knowledge Graph Ex: Associations: p ≥ 0.9 nanopub McGuinness 10/614
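The "p ≥ 0.9" filter on knowledge graph associations can be sketched as a threshold over edge scores. This assumes that p is a probability or confidence attached to each association; the slide does not define it. The `Association` type, the `confident` function, and the example edges are hypothetical.

```python
# Hypothetical sketch: keep only associations whose score p is >= 0.9.
# Entity names and scores are invented for illustration.
from typing import NamedTuple


class Association(NamedTuple):
    source: str
    target: str
    probability: float


def confident(associations, threshold=0.9):
    """Return the associations at or above the threshold, in input order."""
    return [a for a in associations if a.probability >= threshold]


edges = [
    Association("drugA", "proteinX", 0.95),
    Association("proteinX", "diseaseY", 0.90),
    Association("drugA", "proteinZ", 0.42),
]
kept = confident(edges)
for edge in kept:
    print(edge.source, "->", edge.target, edge.probability)
```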

ReDrugS (Repurposing of Drugs using Semantics) • Use semantic technologies to encode and process biological knowledge to generate hypotheses about new uses for existing drugs. • Leverage existing curated data sources; build reusable integrated content sources and infrastructure. References: McCusker, J., Solanki, K., Chang, C., Dumontier, M., Dordick, J., and McGuinness, D.L. 2014. A Nanopublication Framework for Systems Biology and Drug Repurposing. Proc. of CSHALS 2014, Boston, MA. McCusker, J., Yan, R., Solanki, K., Erickson, J.S., Chang, C., Dumontier, M., Dordick, J., and McGuinness, D.L. 2014. A Nanopublication Framework for Biological Networks using Cytoscape.js. In Proceedings of International Conference on Biomedical Ontologies (ICBO 2014) (October 6-9 2014, Houston, TX). http://tw.rpi.edu/web/doc/redrugsnanopub

Nanopublications Simple yet semantically-rich encodings allow algorithms to not just find correlation but to look for causality using reasoning NanoPub_501799_Supporting NanoPub_501799_Assertion NanoPub_501799_Attribution
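The three named parts shown on the slide (assertion, supporting, attribution) can be sketched as a small container with one triple list per part. This is an illustrative data structure only, not the ReDrugS encoding: the `Nanopub` class is hypothetical, and the triples placed in it are invented placeholders.

```python
# Hypothetical sketch of a nanopublication as three named groups of triples.
# The triple contents are placeholders; only the graph naming follows the slide.
from dataclasses import dataclass, field


@dataclass
class Nanopub:
    name: str
    assertion: list = field(default_factory=list)    # the claim itself
    supporting: list = field(default_factory=list)   # evidence for the claim
    attribution: list = field(default_factory=list)  # who asserted it

    def graph_names(self):
        """Names of the three graphs, in the pattern used on the slide."""
        return [f"{self.name}_{part}"
                for part in ("Assertion", "Supporting", "Attribution")]


pub = Nanopub("NanoPub_501799")
pub.assertion.append(("ex:proteinA", "ex:interactsWith", "ex:proteinB"))
pub.supporting.append(("ex:thisAssertion", "ex:detectedBy", "ex:someMethod"))
pub.attribution.append(("ex:thisAssertion", "ex:importedFrom", "ex:someSource"))
print(pub.graph_names())
```

Keeping the claim, its evidence, and its source in separate graphs lets a query weigh or filter assertions by how and by whom they were established.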

Experimental Method Coverage • 99.98% coverage of the ~936,000 nanopubs with evidence data from iRefIndex. • Top 10 methods (86% coverage):

Powering Interfaces: Querying the Knowledge Graph nanopub McGuinness 1/7/2015

Annotating the Chord Heat Map / Group Interactive Visualization Chord Heat Map is interactive and annotated. Also available on additional platforms: CAMPFIRE Proprietary Platform being developed at RPI

Discussion • The data analyst has much less manual work to find connections and potential semantic relationships • This helps move along the path from human expert to semi-automatic service, and from correlation to potential causation • This only scratches the surface of semantic-numeric integration, but it has potential now • Platform is ready for usage and collaborators • Contact us! dlm@cs.rpi.edu

Current Opportunities & Challenges: Vocabularies child health/exposure… Metadata characterizing studies & methods... Definitions... Studies... Evolving Ontology Data Science Domain Ontologies & Mappings... Use Cases... Examples: Relationships between: small for gestational age (SGA) and lifecycle outcome; preterm birth and neurocognitive faltering Policy The Open Biological and Biomedical Ontologies

Current Challenges & Opportunities: Annotation for Reuse • Data analysis is both science and art • Many decisions, such as how to handle missing values, are made and often not recorded • Some toolkits automatically do some “cleanup”, again often not recorded • Integrating results from multiple analyses often requires deep understanding of what was done • Ongoing work is addressing adequate markup • Motivating example: 2 CPP (Collaborative Perinatal Project) analyses done at RPI – what does it take to combine them…

Current Challenges and Opportunities: Context • Context of data is often missing, yet it is often critical • True in many settings, including current work on global health • E.g., access to food – weather impacts; SES metadata inconsistent, incomplete, highly variable, … • What is your favorite challenge / opportunity?

More Information • Questions? dlm@cs.rpi.edu http://tw.rpi.edu/web/project/SemNExT Also, SemStats paper at International Semantic Web Conference (ISWC) https://semstats.wordpress.com/

Legend • GO term ID located in slices • Selected genes highlighted in cluster color • 2) Cluster Match Percentile = (cluster percentile) / (sum of fuzzy-cmeans.m cluster percentiles); only clusters with > .16; scaled to 100% for comparison purposes • p-score printed, wedge height scaled; p-score determines enrichedness of cluster per term • Chords drawn between instances of repeated terms (similarities in class provenance); color of cluster w/ highest p-value for term chosen • 5) Pastel colored lines circling figure = clusters’ average GO term p-values [Figure labels: 22.8%, 42.62%, 22.80%]
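The Cluster Match Percentile in the legend can be sketched as a normalise, filter, rescale calculation. The legend does not spell out the order of operations, so this sketch assumes that the .16 cutoff is applied to each cluster's share of the total and that the surviving shares are then rescaled to sum to 100%. The function name and the input values are hypothetical.

```python
# Hypothetical sketch of the Cluster Match Percentile from the legend.
# Assumption: cutoff applies to the normalised share; survivors rescaled to 100%.
def cluster_match_percentiles(percentiles, cutoff=0.16):
    """Map cluster name -> match percentile (0-100) for clusters above cutoff."""
    total = sum(percentiles.values())
    if total <= 0:
        raise ValueError("cluster percentiles must sum to a positive value")
    shares = {c: p / total for c, p in percentiles.items()}
    kept = {c: s for c, s in shares.items() if s > cutoff}
    kept_total = sum(kept.values())
    return {c: 100.0 * s / kept_total for c, s in kept.items()}


# Invented membership percentiles for one gene across four clusters.
example = {"cluster1": 50, "cluster2": 30, "cluster3": 10, "cluster4": 10}
result = cluster_match_percentiles(example)
for cluster, value in result.items():
    print(f"{cluster}: {value:.1f}%")
```

With these invented inputs, clusters 3 and 4 fall below the cutoff and the remaining two clusters are rescaled so that their values sum to 100%.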

Semantic Results • Certain terms are strongly enriched in different clusters • A weaker link in one cluster suggests membership in others with higher p-val • Ex.: GO:0031105 owl:sameAs umls:C1423771; umls:C1423771 rdfs:label ‘SEPT6’; GO:0031105 rdfs:label ‘septin complex’ → SEPT6 is the same gene as septin complex • However, only Cluster 4 (red) is enriched for septin complex; this strengthens the case for membership in Cluster 4 • Same logic applies to other terms heavily enriched for specific clusters • Visualization techniques reveal multiple other such relationships between semantics and statistics • Semantics conflicts with statistical assessment of cluster assignments but also opens up a dynamic between the two [Figure labels: 22.8%, 42.62%, 22.80%]

Cortical Development Clock from Analysis of RNA-Seq from Day 0 to 77 captures genes' temporal roles in stages of corticogenesis
