
Migrating to the Semantic Web: Bioinformatics as a case study.

Migrating to the Semantic Web: Bioinformatics as a case study. Phillip Lord, Dept of Computer Science, University of Manchester. What is the Semantic Web? We are here! OWL, RDF, XML. The talk: three (and a half) example case studies; two different technologies.


Presentation Transcript


  1. Migrating to the Semantic Web: Bioinformatics as a case study. Phillip Lord, Dept of Computer Science, University of Manchester

  2. What is the Semantic Web? We are here! OWL, RDF, XML

  3. The talk • Three (and a half) example case studies • Two different technologies. • Why we choose the different technologies.

  4. RDF in a nutshell; Tim Berners-Lee’s original vision… 1989

  5. OWL in a nutshell

  6. The Motivation “At the doctor’s office, Lucy instructed her semantic web agent. It promptly retrieved information about her Mom’s prescribed treatment, looked up a list of several providers within 20 miles of home, with a good trust rating.”

  7. Scientific American, May 2001: Beware of the Hype!

  8. The Motivating Example Lucy Doctor

  9. myGrid: UK e-Science Pilot Project. Oct 2001 – April 2005. £3.4 million. £0.4 million studentships. Sites: Newcastle, Sheffield, Manchester, Nottingham, Hinxton, Southampton

  10. Data(type)-intensive bioinformatics
      ID   MURA_BACSU STANDARD; PRT; 429 AA.
      DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
      DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
      DE   ENOLPYRUVYL TRANSFERASE) (EPT).
      GN   MURA OR MURZ.
      OS   BACILLUS SUBTILIS.
      OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
      OC   BACILLUS.
      KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
      FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
      FT   CONFLICT 374 374 S -> A (IN REF. 3).
      SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
           MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
           GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
           RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
           IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
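The record on this slide packs several kinds of data (identifier, description, organism, keywords, features, sequence) into one text entry, each line tagged with a two-letter code. As a rough illustration of what a tool has to do before it can use such a record, the sketch below groups lines by their code. It is a minimal sketch rather than a full parser for the format: the function name `parse_flat_file` is hypothetical, only a few lines of the slide's record are included, and untagged sequence lines are not handled.

```python
from collections import defaultdict

# A few lines of the record shown on the slide, one tagged line per row.
RECORD = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_file(text):
    """Group the lines of a flat-file entry by their two-letter line code.

    Hypothetical helper for illustration only; it assumes every line
    starts with a line code followed by whitespace.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    # Lines that repeat the same code are continuations; join them.
    return {code: " ".join(values) for code, values in fields.items()}


entry = parse_flat_file(RECORD)
entry_name = entry["ID"].split()[0]
keywords = [k.strip() for k in entry["KW"].rstrip(".").split(";")]

print(entry_name)
print(keywords)
```

Even this small example shows why typed, machine-readable descriptions help: the meaning of each field lives in the conventions of the format, not in the data itself.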

  11. Service Stack (diagram labels): Bioinformaticians; Tool Providers; Service Providers; Work bench; Taverna; Talisman; Web Portal; Applications; Gateway; Personalisation; Registries; Service and Workflow Discovery; Provenance; Event Notification; Ontologies; Ontology Mgt; Views; Metadata Mgt; Core services; myGrid Information Repository; OGSA-DQP Distributed Query Processor; FreeFluo Workflow Enactment Engine; Web Service (Grid Service) communication fabric; External services; SoapLab; GowLab; Native Web Services; AMBIT Text Extraction Service; Legacy apps; Legacy apps

  12. WBS Workflows: Query nucleotide sequence. Legend: • Pink: outputs/inputs of a service • Purple: tailor-made services • Green: Emboss soaplab services • Yellow: Manchester soaplab services • Grey: unknowns. Services: • prettyseq: translation/sequence file, good for records and publications • epestfind: identifies PEST seq • pscan: identifies FingerPRINTS • pepstats: MW, length, charge, pI, etc • pepcoil: predicts coiled-coil regions • tblastn: vs nr, est, est_mouse, est_human databases • Blastp: vs nr • Blastn: vs nr, est databases • GenScan: coding sequence • restrict: restriction enzyme map • SignalP, TargetP, PSORTII: predict cellular location • cpgreport: CpG island locations and % • InterPro, PFAM, Prosite, Smart: identify functional and structural domains/motifs • RepeatMasker: repetitive elements • Pepwindow? Octanol?: hydrophobic regions • Seqret, sixpack, transeq, ncbiBlastWrapper. Data labels: GenBank Accession No; URL inc GB identifier; GenBank Entry; Nucleotide seq (Fasta); Amino Acid translation; Sort for appropriate sequences only; 6 ORFs; ORFs

  13. Semantic discovery • Query-ontology – discovering workflows and services described in the registry by building a query in Taverna. • A common ontology is used to annotate and query. • Look for all workflows that accept an input of semantic type nucleotide sequence. • Aim to have semantic discovery over public view on the Web.
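The query on this slide ("all workflows that accept an input of semantic type nucleotide sequence") can be sketched as a lookup over a registry whose entries are annotated with terms from the same ontology used to build the query. The ontology terms, registry entries and function names below are all hypothetical, chosen only to illustrate the idea; the sketch assumes a query should also match entries annotated with a more specific term.

```python
# Hypothetical mini-ontology: each term maps to its parent (is-a).
IS_A = {
    "nucleotide_sequence": "biological_sequence",
    "protein_sequence": "biological_sequence",
    "dna_sequence": "nucleotide_sequence",
    "rna_sequence": "nucleotide_sequence",
}

# Hypothetical registry entries, annotated with terms from the same ontology.
REGISTRY = [
    {"name": "repeat_masking_workflow", "input": "dna_sequence"},
    {"name": "protein_stats_workflow", "input": "protein_sequence"},
    {"name": "blast_workflow", "input": "nucleotide_sequence"},
]


def is_subtype(term, ancestor):
    """Return True if `term` equals `ancestor` or lies below it in IS_A."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def discover(registry, input_type):
    """Names of entries whose annotated input type is `input_type` or a subtype."""
    return [e["name"] for e in registry if is_subtype(e["input"], input_type)]


matches = discover(REGISTRY, "nucleotide_sequence")
print(matches)
```

A keyword search for "nucleotide sequence" would miss the entry annotated as a DNA sequence; walking the is-a links is what makes the discovery semantic.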

  14. Service annotation • Adding structured metadata to a workflow registration to enable others to discover and reuse it more effectively. E.g. what semantic type of input does it accept.

  15. Semantic Discovery • Pedro data capture tool • View annotations on workflow • Drag a workflow entry into the explorer pane and the workflow loads. • Drag a service/workflow to the scavenger window for inclusion into the workflow

  16. Biologist Ontologist Service Providers

  17. Problems when doing in silico experiments. In silico experiments are performed repeatedly, at different sites, at different times, by different scientists or groups. A large repository of records about experiments! • verification of data; • “recipes” for experiment designs; • explanation for the impact of changes; • ownership; • performance of services; • data quality

  18. The Current State of the Art

  19. Tim Berners-Lee’s original vision… 1989

  20. A Semantic Web of Provenance: who, what, why, how/which/when/where. Linked resources: • Literature relevant to provenance study or data in this workflow • Provenance record of a workflow run • DAML+OIL ontologies linking provenance documents • Interlinking graph of the workflow that generates the provenance logs • Web pages of people who have related interests to the owner of the workflow • Experiment notes. Document formats shown: XML, PDF, HTML
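The slide describes provenance as a web of linked resources answering who, what, why and how. One way to picture this is as a set of subject-predicate-object statements, in the spirit of RDF. The sketch below uses plain Python tuples rather than an RDF library, and every identifier and predicate name in it is hypothetical; it also assumes the derivation links contain no cycles.

```python
# Hypothetical provenance statements as (subject, predicate, object) triples.
TRIPLES = [
    ("run:42", "executedWorkflow", "workflow:wbs"),
    ("run:42", "launchedBy", "person:alice"),
    ("run:42", "testsHypothesis", "note:experiment-7"),
    ("run:42", "produced", "data:blast-report-1"),
    ("data:blast-report-1", "derivedFrom", "data:genbank-entry-1"),
    ("workflow:wbs", "describedIn", "doc:paper-1"),
]


def objects(subject, predicate):
    """All objects linked from `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def lineage(item):
    """Follow derivedFrom links back towards the original inputs."""
    chain = []
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for source in objects(current, "derivedFrom"):
            chain.append(source)
            frontier.append(source)
    return chain


who = objects("run:42", "launchedBy")
why = objects("run:42", "testsHypothesis")
sources = lineage("data:blast-report-1")
print(who, why, sources)
```

Because every statement has the same shape, records written by different tools (workflow logs, notes, literature links) can be merged into one graph and queried together.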

  21. Population (diagram labels): Semantic Data; Web Services; Data Repository; FreeFluo; Taverna; Metadata Repository; LaunchPad; Haystack

  22. Haystack from IBM

  23. Biologist Biologist Database Biologist

  24. Gene Ontology Next Generation Project (GONG) • Demonstrate the utility of finer grained concept descriptions in DAML+OIL (OWL-DL) • Develop methodologies and tools to support the process

  25. Translating theory into practice • Gene Ontology provides a service to the model organism database community • Description logic (DL) is a technology born out of computer science research • OWL is a standard ontology interchange language underpinned by DL

  26. GONG - proof of concept • Maintaining an exhaustive is-a structure Parent Is-a relationship GO concept

  27. Example: heparin biosynthesis. Axis 1: Chemicals: [chemical] biosynthesis (GO:0009058) [i] carbohydrate biosynthesis (GO:0016051) [i] aminoglycan biosynthesis (GO:0006023) [i] heparin biosynthesis (GO:0030210)

  28. Example: heparin biosynthesis. Axis 1: Chemicals: [chemical] biosynthesis (GO:0009058) [i] carbohydrate biosynthesis (GO:0016051) [i] aminoglycan biosynthesis (GO:0006023) [i] heparin biosynthesis (GO:0030210). Axis 2: Process: [i] heparin metabolism (GO:0030202) [i] heparin biosynthesis (GO:0030210)

  29. Example: heparin biosynthesis. Axis 1: Chemicals: [chemical] biosynthesis (GO:0009058) [i] carbohydrate biosynthesis (GO:0016051) [i] aminoglycan biosynthesis (GO:0006023) [i] glycosaminoglycan biosynthesis (GO:0006024) [i] heparin biosynthesis (GO:0030210). Axis 2: Process: [i] heparin metabolism (GO:0030202) [i] heparin biosynthesis (GO:0030210)

  30. Is this important? • Missing is-a not noticed by users • BUT… improves fidelity of DB record retrieval. • Asking for gene products involved in ‘glycosaminoglycan biosynthesis’ will lead to an additional result: O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
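The effect of a missing is-a link on retrieval can be sketched with a small example. The term names and the gene product O94923 come from the slides; the exact parent links before and after the fix, and the assumption that O94923 is annotated directly to heparin biosynthesis, are simplifications made for illustration. The function names are hypothetical.

```python
# Parent links before the fix (assumed): heparin biosynthesis sits directly
# under aminoglycan biosynthesis, bypassing glycosaminoglycan biosynthesis.
BEFORE = {
    "heparin biosynthesis": ["aminoglycan biosynthesis", "heparin metabolism"],
    "glycosaminoglycan biosynthesis": ["aminoglycan biosynthesis"],
    "aminoglycan biosynthesis": ["carbohydrate biosynthesis"],
    "carbohydrate biosynthesis": ["biosynthesis"],
}

# After the fix: the missing is-a link is in place.
AFTER = dict(BEFORE)
AFTER["heparin biosynthesis"] = ["glycosaminoglycan biosynthesis", "heparin metabolism"]

# Assumed annotation of the gene product named on the slide.
ANNOTATIONS = {"O94923": "heparin biosynthesis"}


def ancestors(term, parents):
    """All terms reachable from `term` by following is-a links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def retrieve(query, annotations, parents):
    """Gene products annotated to `query` or to any of its descendants."""
    return sorted(
        product
        for product, term in annotations.items()
        if term == query or query in ancestors(term, parents)
    )


hits_before = retrieve("glycosaminoglycan biosynthesis", ANNOTATIONS, BEFORE)
hits_after = retrieve("glycosaminoglycan biosynthesis", ANNOTATIONS, AFTER)
print(hits_before, hits_after)
```

Nothing about the annotation itself changes between the two runs; only the hierarchy does, which is why a user browsing the ontology would not notice the gap while a database query silently loses a result.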

  31. Paraphrased reasoning process • heparin biosynthesis • class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin • glycosaminoglycan biosynthesis • class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan Is-a

  32. Inferring a new is-a link • heparin biosynthesis • class heparin biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass heparin • glycosaminoglycan biosynthesis • class glycosaminoglycan biosynthesis defined subClassOf biosynthesis restriction onProperty acts_on hasClass glycosaminoglycan Is-a Is-a
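The inference on these two slides can be paraphrased in code. Each process is defined as a biosynthesis that acts_on some chemical; if one chemical is a kind of the other, the corresponding processes must stand in the same is-a relationship. The sketch below is a deliberately simplified stand-in for what a description logic reasoner does: it handles only definitions of this one shape, the chemical hierarchy is assumed for illustration, and the function names are hypothetical.

```python
# Asserted is-a links between chemicals (assumed for illustration).
CHEMICAL_IS_A = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "aminoglycan",
}

# Defined classes from the slides: (parent process, filler of acts_on).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def is_a(term, ancestor, hierarchy):
    """Return True if `term` equals `ancestor` or lies below it."""
    while term is not None:
        if term == ancestor:
            return True
        term = hierarchy.get(term)
    return False


def subsumed_by(sub, sup):
    """Simplified subsumption test for two defined process classes.

    `sub` is-a `sup` when both share the same parent process and the
    chemical acted on by `sub` is a kind of the chemical acted on by `sup`.
    """
    sub_parent, sub_filler = DEFINITIONS[sub]
    sup_parent, sup_filler = DEFINITIONS[sup]
    return sub_parent == sup_parent and is_a(sub_filler, sup_filler, CHEMICAL_IS_A)


def inferred_links():
    """All (child, parent) pairs among the defined classes that follow from the definitions."""
    return sorted(
        (a, b)
        for a in DEFINITIONS
        for b in DEFINITIONS
        if a != b and subsumed_by(a, b)
    )


links = inferred_links()
print(links)
```

The new link is never asserted by hand; it falls out of the definitions plus the chemical hierarchy, which is the point of describing concepts at a finer grain.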

  33. Results • Carbohydrate metabolism: ~250 concepts • 22 additional is-a links, 17 of which are now in GO • Amino acid metabolism: ~250 concepts • A further 17 additional is-a links, now in GO • GO team will be reviewing results for metabolism as a whole once we have the tools to support the process • Useful results come from even a partial coverage

  34. Build a practical environment • Tools needed for: • Creating OWL definitions • Tracking changes • Reporting reasoning results • Viewing definitions

  35. Reporting tools

  36. OWL for GONG Biologist Ontologist

  37. Conclusions • Three problems, three different solutions, all making use of semantic web technologies. • A little semantics can go a long way. • The expressivity of the language has to be chosen at least in part based on the tasks to be performed, and the user base. • Tools, tools, tools.

  38. Acknowledgments • Chris Wroe, Robert Stevens, Carole Goble, University of Manchester, UK • Michael Ashburner, EBI, Hinxton, UK • Jane Lomax and Midori Harris of the GO editorial team for help and advice and responding to the suggested changes • UMLS and MeSH, which provided valuable resources for chemical information • Sean Bechhofer for development on OilEd • Project funded as a subcontract of the DARPA DAML programme

  39. Acknowledgements myGrid is an EPSRC funded UK eScience Program Pilot Project Particular thanks to the other members of the Taverna project, http://taverna.sf.net

  40. myGrid People. Core: • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe. Users: • Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK • Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK. Postgraduates: • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire. Industrial: • Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM) • Robin McEntire (GSK). Collaborators: • Keith Decker
