
Scaling the walls of discovery: using semantic metadata for integrative problem solving

Greg Tucker-Kellogg, Ph.D., Chief Technology Officer, Senior Director, Systems Biology, Lilly Singapore Centre for Drug Discovery.


Presentation Transcript


  1. Scaling the walls of discovery: using semantic metadata for integrative problem solving Greg Tucker-Kellogg, Ph.D. Chief Technology Officer Senior Director, Systems Biology Lilly Singapore Centre for Drug Discovery

  2. Outline • The Challenge of Translational Discovery in Pharmaceutical Research • Integration of Metadata using Semantic Web Technologies: Why focus on metadata? How it helps • Examples

  3. Lilly Singapore Centre for Drug Discovery • Drug Discovery (drug candidates) • Systems Biology (biomarkers) • Integrative Computational Sciences (tools) • Experimental • Computational • Oncology and diabetes research towards tailored therapy to improve patient outcome • Clinical Candidates • Biomarkers • Informatics/Software • Wet lab biology

  4. Pharmaceutical R&D spends more to get less

  5. Lost in translation: "The limits of my language mean the limits of my world" (Ludwig Wittgenstein) → Translate → 我的语言限制的范围是我的 → Translate → "I limit the scope of the language I" (Ludwig Wittgenstein)

  6. Translational research in cancer: Connecting the dots of genetic aberrations • Disease • Patients • Pathways • Targets • Tailored Therapeutics: Improve individual patient outcomes and health outcome predictability through tailoring drug, dose, timing of treatment, and relevant information

  7. The “Web” of heterogeneous data Cell/Assay Technologies

  8. Integrating Scientific Data Sets • Uncontrollable diversity • Most of the valuable data is from outside our walls • Much of it is poorly structured • Ranging from large (1TB/day) to boutique

  9. Scientist’s View of Integrated Information

  10. Manual Data Integration • A repeated, tedious process: • Pull data from internal and public data sets • Normalize terms and values • Write and run analysis scripts • Compile into a single Excel file, detached from the data source (no drill-down) • Often this process can consume days with no guaranteed resolution
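
The "normalize terms and values" step above can be sketched as follows. This is a minimal illustration, not the process used at LSCDD: the synonym table, the record fields (`gene`, `disease`), and the function names are all assumptions made for the example. Unlike the Excel hand-off the slide describes, each merged entry keeps the names of the sources it came from.

```python
# Hypothetical synonym table; a real vocabulary would come from curated sources.
SYNONYMS = {
    "colon cancer": "Colorectal Neoplasms",
    "crc": "Colorectal Neoplasms",
    "colorectal neoplasms": "Colorectal Neoplasms",
}


def normalize_term(term):
    """Map a free-text term to a preferred label, or return it trimmed if unknown."""
    key = " ".join(term.strip().lower().split())
    return SYNONYMS.get(key, term.strip())


def merge_records(internal, public):
    """Merge records from two sources on (gene, normalized disease).

    Each value lists the sources that contributed, so provenance is kept.
    """
    merged = {}
    for source_name, records in (("internal", internal), ("public", public)):
        for rec in records:
            key = (rec["gene"].upper(), normalize_term(rec["disease"]))
            merged.setdefault(key, []).append(source_name)
    return merged


# Illustrative placeholder records (gene names are not real findings).
internal_rows = [{"gene": "g1", "disease": "Colon Cancer"}]
public_rows = [{"gene": "G1", "disease": "CRC"}, {"gene": "G2", "disease": "Melanoma"}]
merged = merge_records(internal_rows, public_rows)
```

Here both spellings of the disease collapse onto one key, so the two sources line up without manual editing.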

  11. Integration Approaches Considered • Data Warehouse • Difficult to maintain and integrate new data sets • Difficult to evolve as data changes • Schemas tightly coupled to applications • Federated queries • Query performance issues • Where to place the index? • Problematic to maintain • Translating user search syntax to all sources requires deep knowledge of data layer • Semantic Integration • Relatively unproven in enterprise systems but adaptive to change • Relationships between data can be more fully characterized

  12. Standard Semantic Integration Model • All data is mapped to domain ontology in both directions • If single system is down, incomplete results. • Performance is limited to slowest system in network • Massive mapping effort • Multiple implementations of this approach, including: • Biological and Chemical Integrated Information System (BACIIS) • Boeing Query Generator Results Presentation Query Planning Data Set Integration Domain Ontology & Mappings Query Submission Semantic Normalization Source Source Source Source

  13. Can we do better for our purposes? • Avoid a complex architecture and extended development effort • Realize benefits in the near-term • Preprocess metadata to improve efficiency. • Characterize the type of questions that ontology should answer • Identify stable semantic technologies, do not employ parsers. • Allow semantic and relational databases to work together

  14. What we need • Data Management and Availability • Capturing and filtering the global and growing avalanche of internal and external scientific data • Data Fusion • Systems to link, combine and navigate massive and heterogeneous data sets • Information Analysis and Mining • Algorithms and tools to help scientists seek correlations and find connections between pre-clinical and clinical knowledge to generate and test translational hypotheses.

  15. Data Architecture Integration Layer Experimental Metadata Repository Annotation Services (Genomics mapping + Gene functional info) Ontology Genomics mapping Experiment Context Functional Information Mapping & Annotation Centralized Experiment Context Common Vocabulary Proteome/GO Readout 30 million triples Derived Results 34 platforms Analysis and Mining Query Visualization Algorithms Workflow Domain/Platform Specific Data Expression (Affy, Agilent, Illumina) aCGH Screening Methylation SNP Mutation Tissue Microarray ChIP-Chip, miRNA Analysis Results

  16. LSCDD Data integration process in use Affy Expression Agilent Expression Illumina Expression aCGH Screening RNAi Database Mutation SNP TMA Analysis Results Query Visualization Experimental Metadata Repository Annotation Services (Genomics mapping + Gene Function)

  17. LSCDD Semantic Integration Approach • Use semantic technology on an appropriate problem • Create an ontology focused on solving LSCDD integration needs • Scientists and IT analysts work together to iteratively create a tailored vocabulary • Define competency questions to validate the ontology • Encourage the ontology to evolve; it is a different animal than RDBMS schemas • Create bridges to public and internal ontologies to realize the full capabilities of the vocabulary • Involve users in verifying the RDBMS-to-ontology mapping to increase confidence in the solution • SPARQL is hard. Design an intuitive query model or question templates for users to navigate the repository.
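
One way to read "question templates" is a small set of parameterized SPARQL queries that users fill in without writing SPARQL themselves. The sketch below assumes that reading. The `ex:` namespace, the template name, and the property names `ex:hasType` and `ex:hasContact` are placeholders for illustration; the actual LSCDD ontology terms are not given in the slides.

```python
from string import Template

# Hypothetical templates keyed by question name. "$name" marks a user-supplied
# value; SPARQL variables use "?" so the two do not collide.
QUESTION_TEMPLATES = {
    "experiments_by_type": Template(
        "PREFIX ex: <http://example.org/lscdd#>\n"
        "SELECT ?experiment ?contact WHERE {\n"
        '  ?experiment ex:hasType "$experiment_type" ;\n'
        "              ex:hasContact ?contact .\n"
        "}"
    ),
}


def escape_literal(value):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(question, **params):
    """Fill a named question template with escaped user-supplied values."""
    template = QUESTION_TEMPLATES[question]
    safe = {name: escape_literal(str(value)) for name, value in params.items()}
    return template.substitute(safe)


query = build_query("experiments_by_type", experiment_type="SiRNA Screen")
```

The resulting string could then be submitted to whatever SPARQL endpoint the repository exposes; that submission step is deliberately left out because the slides do not describe it.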

  18. LSCDD Semantic Integration Approach (Cont) • Used Agile philosophy throughout: application development, ontology development and mapping effort • Drive adoption by engaging users to understand their challenges and refine the solution. • Technologies • Protégé Ontology Editor • Oracle Semantic Technologies 11g • D2R Map (Database to RDF Mapping Language) • C# development in Visual Studio 2005

  19. Metadata RDF Repository • Aggregates experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation. • Scientists at LSCDD now have a single source of experiment information described with a common vocabulary. • Current data sources include: • Expression Data : Affymetrix, Illumina, Agilent • aCGH Data • RNAi Screening Data • Reagent Data • Gene Ontology (GO) • Medical Subject Headings (MeSH) • Many others Currently ~30 million triples
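
The aggregation described above, relational rows turned into triples under a common vocabulary, can be sketched in a few lines. This is not D2R Map or Oracle Semantic Technologies, only a plain-Python illustration of the idea. The `BASE` namespace, the `experiment` table, and the column-to-property mapping are assumptions for the example.

```python
import sqlite3

BASE = "http://example.org/lscdd/"  # hypothetical namespace


def rows_to_triples(conn, table, id_column, column_to_property):
    """Turn each row of a relational table into (subject, predicate, object) triples.

    `table` is interpolated into the SQL text, so it must come from trusted
    configuration, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(names, row))
        subject = f"{BASE}{table}/{record[id_column]}"
        for column, prop in column_to_property.items():
            value = record.get(column)
            if value is not None:  # skip NULLs rather than emit empty objects
                triples.append((subject, BASE + prop, value))
    return triples


# Toy source database standing in for one of the relational data sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id TEXT, contact TEXT, type TEXT)")
conn.execute("INSERT INTO experiment VALUES ('abc123', 'Bill Smith', 'SiRNA Screen')")
triples = rows_to_triples(
    conn, "experiment", "id", {"contact": "hasContact", "type": "hasType"}
)
```

Mapping several source tables onto the same property names is what gives the repository its common vocabulary; each new source only needs its own column-to-property mapping.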

  20. LSCDD Metadata Ontology

  21. Metadata Repository Application • Both browse and query views are provided for repository access. • The Query View allows the user to search the repository by setting constraints on attributes of the entities in the ontology. • Links to external data sets such as Gene Ontology and MeSH have been defined, queries may span multiple ontologies. • Results View displays details about each of the matches found and allows user to navigate across entities. • The application is created as a plugin to the Lilly Science Grid and can leverage Integrated Genomics Portal for Cancer Research (IGPCR) plugins to provide details about Genes in hit lists.

  22. Metadata Repository Application • Find all deacetylases involved in Colorectal Neoplasms • Add filter to Gene Ontology Label attribute • Add filter to MeSH Description Name attribute • Run Query… • Results View shows list of Genes • Navigate across data links
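
The example question combines two filters, one on a Gene Ontology label and one on a MeSH descriptor name, and intersects the matches. A minimal in-memory sketch of that logic follows. The predicate names `hasGOLabel` and `hasMeSHName`, and the gene identifiers `gene:G1` to `gene:G3`, are placeholders; the triples are toy data, not real annotations.

```python
# Toy triples with placeholder gene identifiers (not real biological facts).
TRIPLES = {
    ("gene:G1", "hasGOLabel", "histone deacetylase activity"),
    ("gene:G1", "hasMeSHName", "Colorectal Neoplasms"),
    ("gene:G2", "hasGOLabel", "DNA binding"),
    ("gene:G2", "hasMeSHName", "Colorectal Neoplasms"),
    ("gene:G3", "hasGOLabel", "histone deacetylase activity"),
    ("gene:G3", "hasMeSHName", "Breast Neoplasms"),
}


def subjects_where(triples, predicate, matches):
    """Return the subjects having `predicate` with an object accepted by `matches`."""
    return {s for s, p, o in triples if p == predicate and matches(o)}


def find_genes(triples, go_text, mesh_name):
    """Genes whose GO label contains `go_text` and whose MeSH name equals `mesh_name`."""
    by_function = subjects_where(
        triples, "hasGOLabel", lambda label: go_text.lower() in label.lower()
    )
    by_disease = subjects_where(triples, "hasMeSHName", lambda name: name == mesh_name)
    return sorted(by_function & by_disease)


hits = find_genes(TRIPLES, "deacetylase", "Colorectal Neoplasms")
```

Each filter corresponds to one "Add filter" step in the Query View, and the set intersection plays the role of the join across the two linked ontologies.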

  23. Experiment Data Annotation • While raw experiment results are not suitable for editing, metadata such as experiment descriptions and relations becomes more valuable when users augment and refine. • Experiment: hasId: abc123, hasContact: Bill Smith, hasType: SiRNA Screen, hasDescription: ____ … • Experiment: hasId: def456, hasContact: Jane Smith, hasType: SiRNA Screen, hasDescription: H460 screen … • H460 screen: run 789 hasConflictingResults
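
The distinction on this slide, raw results stay fixed while metadata can be augmented, suggests keeping user annotations in a separate layer on top of the imported triples. The class below is a sketch of that idea under stated assumptions: the class name, the `author` field, and the example annotation are hypothetical, and the slides do not say how the real application stores annotations.

```python
class AnnotatedMetadata:
    """Imported metadata is read-only; user annotations are layered on top."""

    def __init__(self, imported_triples):
        self._imported = frozenset(imported_triples)  # never modified
        self._annotations = []  # (subject, predicate, value, author)

    def annotate(self, subject, predicate, value, author):
        """Record a user-supplied statement about a known experiment."""
        if not any(s == subject for s, _, _ in self._imported):
            raise KeyError(f"unknown experiment: {subject}")
        self._annotations.append((subject, predicate, value, author))

    def describe(self, subject):
        """List (predicate, value, source): imported facts first, then annotations."""
        imported = [
            (p, o, "imported") for s, p, o in sorted(self._imported) if s == subject
        ]
        added = [(p, o, who) for s, p, o, who in self._annotations if s == subject]
        return imported + added


store = AnnotatedMetadata(
    {("abc123", "hasContact", "Bill Smith"), ("abc123", "hasType", "SiRNA Screen")}
)
# Hypothetical annotation filling in the blank description; "jsmith" is a made-up user.
store.annotate("abc123", "hasDescription", "H460 screen", author="jsmith")
description = store.describe("abc123")
```

Keeping the author with each annotation makes it possible to show who refined a description, and to drop the annotation layer without touching the imported metadata.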

  24. IGPCR: Integrated Genomics Portal for Cancer Research • An Integrated view for analysis results • Helps oncology researchers with: • Drug target identification and prioritization • Biomarker discovery • Combination therapy

  25. Backup

  26. Answering scientific questions • What is the status of the target of my interest across multiple tumor types? • Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples. • Are there any reagents available to conduct functional validation? • What are the right model systems to study the perturbation of my gene of interest?

  27. Cancer drug discovery

  28. Integration of high throughput datasets (Public / Private) • Tumor Samples • Cell lines • Mutations • CGH / SKY • Expression • Tissue Microarrays • Chemosensitivity • Patient Survival • RNAi

  29. Going Forward • Integration with additional external sources: NCBI, KEGG, Proteome, PubMed • Integration with National Cancer Institute Metathesaurus • Continued integration with new data types generated internally or from collaborators • Definition and support of additional ontologies

  30. Acknowledgements • LSCDD, Singapore • IT: Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning • Systems Biology: Amit Aggarwal and Mahesh Kumar Guzuva Desikan • ICS: Pat Hartman • HiSoft Technology, Dalian, China • Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang • Lilly, Indianapolis USA • Susie Stephens, Jacob Koehler

  31. Backup Slides

  32. Putting it all together… Objects Measure Map 1 Map 2 Compounds Fingerprint Literature MTS Genes Expression Binding Coding SNPs Linkage D Clinical DB Images Signature

  33. Silos Need to Be Broken Down • Levels: Project, Program, Product • Stage labels: Exploratory, Target To Hit, Hit To Lead, Lead To PgS, Lead Optimization, Pre-Clinical Development, Phase I, Phase 2, Phase 3, Registration, Launch, Global Launch • Milestones: Target, Hit, Lead, PgS, CS, FHD, FED, PD/RD, FS, FA, FL, GL • Each of the ten stage columns repeats the same stack: Generate/Test Hypothesis, Model & Understand, Analyze & Mine, Transform, Data

  34. BACIIS System Architecture

  35. Hybrid Architecture User Interface Knowledge-Space Navigation List Management Presentation Services Analytic Services Metadata Repositories Request Brokers Analysis Entities Navigation Service Layer Data Set Integration Services Presentation Entities Semantic Layer Persistence Entities Metadata Services Layer Query Preparation Service Semantic Normalization Service Personalization Entities Navigational Entities Adaptive Layer Federation Entities Query Submission Service Streams Management Service Physical Access Layer Data Access Service Layer Source Source Source Source Source

  36. Goals • Make knowledge emerge from repositories • Make data more valuable by adding context • Leverage intellectual assets • Decision support • Enhance productivity • Reduce IT integration efforts
