TM Web services: Whatizit, CiteXplore

EBI TM services: mapping targets to diseases
September 3rd, 2009
Dietrich Rebholz-Schuhmann, MD, PhD
Group Leader, Rebholz Group
EBI, WT Genome Campus
Hinxton, Cambridge, U.K.

TM at the EBI: current developments
• TM Web services: Whatizit, CiteXplore
  • One of EBI’s major services: 11,000 hits per day, 400 MB data transfer
  • Ongoing integration into public services (UKPMC)
  • Research around new developments and quality assurance
• Working towards a knowledge infrastructure from literature
  • Named entity recognition: most progress
  • Relation / event identification
  • Repository of inferred knowledge: functional annotation of genes, diseases, gene-disease associations, relation identification
  • Exploitation of semantic resources (ontologies)

The magic transformation from text to semantics
Concepts, Ideas, Facts, Relationships, Events => “Knowledge”?

How far can we go?
• Automatic + full integration with database resources? => Mainly entities + concepts
• Automatic generation of paper summaries? => Extraction of facts + events
• Extraction of new knowledge? => Generate hypothesis first
• Let the authors do it all? => Do not use papers anymore

Idealized R&D stages (overview)
[Figure: semantics support plotted against time (2006, 2008, 2009). Stages: named entity recognition / grounding; identification of relations; interoperability of literature and text mining. Items shown: genes/proteins, chemical entities, diseases, GO/MeSH terms, BioLexicon; gene regulation ontology, ternary relations, functions of proteins, gene-disease associations; Whatizit, IeXML, integration of literature into bioinformatics IT services.]

Document Entities Concepts Tokens Facts

“The function of OmpR appears to be the enhancement of a basal level of ompC expression”
• Tokens: … the, of, appear …
• Entities: OmpR, ompC
• Concepts: basal level of ompC expression
• Facts: OmpR increases ompC expression

Gene normalisation
[Figure: SwissProt and BioLexicon, human. Best performance => 100% precision, => 100% recall]
• Performance is state of the art
• Results are not tuned to the BioCreAtIvE II corpus
Pezik et al., Proc. LREC Workshop, 2008

Entities + concepts
[Figure: SwissProt and BioLexicon (human), chemical entities, disease NER, MeSH terms, GO terms (@ rank 1)]
• All solutions are state of the art
Jimeno et al., BMC Bioinformatics, 2008

Protein-protein interaction identification
[Figure: GREs with inference, GREs w/o inference, “Associate”, MI-PPI, NMI-PPI, All]
• Performance not adequate, improvements required
Rebholz-Schuhmann et al., SMBM 2008

How do we find knowledge?

Gene-disease associations: Motivation
• Some diseases have a monogenic cause:
  • For example, cystic fibrosis, sickle cell anemia, F8/F9 defects, deafness
• Other diseases have a polygenic cause:
  • Schizophrenia, stomach cancer, hypertension
• Question:
  • Can we find molecular functions that are shared between genes and diseases?

Gene-disease association pairs from the literature

Candidate genes: Approach
• Complete Medline analysis
  • Identify all genes/proteins (80% F-measure)
  • Identify all Gene Ontology terms (35% F-measure)
  • Identify all diseases (70% F-measure)
• Generation of concept profiles for genes and diseases
  • Each vector contains the TF-IDF value of all relevant GO concepts
  • A GO concept is relevant if found in the context of a gene or disease
• Pivoted cosine similarity
  • Selection of gene profiles that are most similar to disease profiles
  • Prioritization of gene-disease associations
• Evaluation
  • Alternative methods: MeSH annotations and tokens
  • “Gold standard” data resources: OMIM, GAD, GOA
  • Assessment by curators
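The profile-matching step above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses plain cosine similarity rather than the pivoted variant named on the slide, the gene, disease, and GO term data are invented toy values, and the function names (`tfidf_profiles`, `cosine`, `rank_genes`) are hypothetical, not part of any EBI service.

```python
import math
from collections import Counter

def tfidf_profiles(contexts):
    """Build one TF-IDF vector per entity from the GO terms found in its context."""
    n_entities = len(contexts)
    # Document frequency: number of entity contexts in which each GO term occurs.
    df = Counter(term for terms in contexts.values() for term in set(terms))
    return {
        entity: {term: count * math.log(n_entities / df[term])
                 for term, count in Counter(terms).items()}
        for entity, terms in contexts.items()
    }

def cosine(u, v):
    """Plain (non-pivoted) cosine similarity between two sparse dict vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_genes(disease, genes, profiles):
    """Sort candidate genes by similarity of their profile to the disease profile."""
    scored = [(gene, cosine(profiles[gene], profiles[disease])) for gene in genes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data (invented): GO terms observed near each gene or disease mention.
contexts = {
    "GENE_A": ["ion transport", "ion transport", "chloride channel activity"],
    "GENE_B": ["DNA repair", "cell cycle"],
    "GENE_C": ["ion transport", "cell cycle"],
    "DISEASE_X": ["chloride channel activity", "ion transport", "ion transport"],
}
profiles = tfidf_profiles(contexts)
ranking = rank_genes("DISEASE_X", ["GENE_A", "GENE_B", "GENE_C"], profiles)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In this toy set, GENE_A shares its whole GO profile with DISEASE_X and ranks first, while GENE_B shares no GO term and scores zero. A pivoted normalization would additionally correct for profile length, which matters when some genes are mentioned far more often than others.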

Candidate genes: Evaluation, OMIM/GAD
Limited performance due to:
- term variability
- not all G2D associations are relevant to OMIM

Candidate genes: Validation by curators
• Neither OMIM nor GAD is complete
• Curators are better able to verify putative novel knowledge
• Evaluation:
  • Random sample of 30 novel gene-disease association pairs
  • At least 2 out of 3 curators have to agree, use of literature resources
  • Verify the direct mention of the gene-disease association
  • Identify indirect evidence for the gene-disease pair
  • Verify the assignment of GO concepts
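The "at least 2 out of 3 curators" rule above amounts to a majority vote per gene-disease pair. A minimal sketch, assuming each curator gives a yes/no judgement; the pair names and votes are invented, and `confirmed_by_majority` and `confirmation_rate` are hypothetical helper names.

```python
def confirmed_by_majority(votes, min_agree=2):
    """True if at least `min_agree` curators confirmed the association."""
    return sum(bool(v) for v in votes) >= min_agree

def confirmation_rate(judgements, min_agree=2):
    """Return the fraction of confirmed pairs and the list of confirmed pairs."""
    confirmed = [pair for pair, votes in judgements.items()
                 if confirmed_by_majority(votes, min_agree)]
    return len(confirmed) / len(judgements), confirmed

# Toy judgements (invented): one tuple of three curator votes per pair.
judgements = {
    ("GENE_A", "DISEASE_X"): (True, True, False),
    ("GENE_B", "DISEASE_X"): (True, False, False),
    ("GENE_C", "DISEASE_Y"): (True, True, True),
    ("GENE_D", "DISEASE_Z"): (False, False, False),
}
rate, confirmed = confirmation_rate(judgements)
print(f"{rate:.0%} of sampled pairs confirmed")
```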

Candidate genes: curator assessment
• 63% of gene-disease associations can be confirmed by at least 2 curators
• 57% of GO assignments describe the disease and the gene

P-values of GDAPs (based on cosine scores)
[Figure: two regions of the p-value distribution, labelled “No clear confirmation of gene-disease associations” and “Clear confirmation of most GDAPs”]

Candidate genes: Outcome
• Identification of 1,154 putative novel gene-disease associations from the literature
  • 63% (in total 727) should be reliable => to be confirmed
  • 672 distinct candidate genes linked to the associations
  • 340 genes are also covered in GOA, linked to 545 gene-disease associations
• 57% of the assigned GO concepts are reliable
  • Interpretation of the gene-disease association
  • 10% of the GO concept annotations are shared with GOA

Gene-disease association pairs from the literature

Where do we move in the future?

How far can we go?
• Automatic generation of paper summaries? => Extraction of facts + events
• Let the authors do it all? => Do not use papers anymore

Research to drive standards
• Standardization of document formats:
  • IeXML
  • SciXML
• Standardization of content:
  • Genes
  • Chemical entities
  • Medical terms
  • MeSH, GO terms
PaperMaker: Support to authors
Performance assessment on a very large corpus (FP07, support action)
Bioinformatics user: Analytical pipelines

UKPMC: Prospect

The process
• Collaborative annotation of a large-scale biomedical corpus
• Five project partners annotated the first corpus (150,000 documents, different semantic types)
• Reconciliation, syntax + semantics => generate the pilot corpus
• Make part of the pilot corpus available => challenge: reproduce the annotations
• Close the challenge, harmonise the annotations again => next corpus
• Reopen the challenge with the second harmonised corpus
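The slides do not say how the partners' annotations are reconciled. One simple possibility, shown here purely as an assumption, is agreement voting: keep an annotation if enough partners proposed exactly the same span and semantic type. The partner names, offsets, and the `harmonise` function are hypothetical.

```python
from collections import Counter

def harmonise(partner_annotations, min_votes=2):
    """Keep annotations proposed by at least `min_votes` partners.

    Each annotation is a tuple (doc_id, start, end, semantic_type);
    only exact matches count as agreement in this sketch.
    """
    votes = Counter()
    for annotations in partner_annotations.values():
        # A partner votes at most once per annotation, even if it repeats it.
        votes.update(set(annotations))
    return sorted(ann for ann, n in votes.items() if n >= min_votes)

# Toy input (invented): annotations from three partners on one document.
partners = {
    "P1": [("doc1", 0, 4, "gene"), ("doc1", 10, 18, "disease")],
    "P2": [("doc1", 0, 4, "gene")],
    "P3": [("doc1", 0, 4, "gene"), ("doc1", 10, 18, "disease"),
           ("doc1", 20, 25, "chemical")],
}
pilot = harmonise(partners, min_votes=2)
print(pilot)
```

Exact matching is the strictest choice; a real reconciliation step would probably also have to handle overlapping spans and differing semantic type inventories.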

The challenge
• 150,000 documents or more ...
• Test set for all systems
• Assessment, benchmarking

Support to authors / readers
• FEBS Letters experiment
  • Authors contribute to the curation work
  • They identify the correct entity in the DBs (gene/protein)
  • Curators add the protein-protein interaction to the DB (MINT)
• BioCreative Meta-Server => BioCreative II.5
• BioLit (P. Bourne et al.)
  • Adding semantic data to the literature => keep it in a DB
  • Word plug-in to annotate ontological terms
• PaperMaker (Rebholz group)
  • Consistency analysis of manuscripts
• Reflect, OnTheFly (Schneider group)
  • Annotation of documents + interlinking with DBs
• Royal Society of Chemistry
  • Markup of text (Oscar + editors) => interlinked chemistry

PaperMaker
• PaperMaker - a tool to support authors writing biomedical papers:
  • Interactive feedback on the contents of papers (related work and concept annotations)
  • Formal consistency criteria checking (spelling, terminology, acronyms, references)

Consistency parameters
Domain-independent:
• General spelling and grammar
• General readability
• Appropriate use of references
• Finding and acknowledging related work
Domain-specific use of terminology:
• Should be consistent with domain-specific naming guidelines
• Should not be ambiguous
• Should conform to conventional usage (possible clashes between naming guidelines and common-sense convention)
• Useful to resolve terminology to reference databases (e.g. UniProt for protein names, ChEBI for chemical entities, etc.)
• The special case of acronyms
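As an illustration of the acronym case, the sketch below flags acronyms that are used before they are introduced in parentheses, or never introduced at all. It is not PaperMaker's actual method: the regular expressions, the all-capitals heuristic, and the function name `undefined_acronyms` are assumptions made for this example.

```python
import re

# Heuristic (assumption): an acronym is a run of two or more capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# Heuristic (assumption): an acronym is defined where it appears in parentheses.
DEFINED = re.compile(r"\(([A-Z]{2,})\)")

def undefined_acronyms(text):
    """Return acronyms whose first use comes before (or without) a definition."""
    first_use = {}
    for match in ACRONYM.finditer(text):
        first_use.setdefault(match.group(), match.start())
    defined_at = {}
    for match in DEFINED.finditer(text):
        defined_at.setdefault(match.group(1), match.start(1))
    return sorted(acronym for acronym, pos in first_use.items()
                  if acronym not in defined_at or defined_at[acronym] > pos)

manuscript = (
    "Named entity recognition (NER) is applied first. "
    "NER output feeds GO annotation. "
    "Gene Ontology (GO) terms are assigned later."
)
flagged = undefined_acronyms(manuscript)
print(flagged)
```

Here `GO` is flagged because it is used one sentence before its definition, while `NER` is introduced at first use and passes.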

Content feedback
• Resolving the contents to literature repositories
  • Finding related work (document retrieval)
  • Finding related ideas (passage retrieval)
• Resolving the contents to ontological reference databases
  • MeSH descriptors have been demonstrated to improve biomedical information retrieval. Can we suggest MeSH terms directly to the authors?
  • Gene Ontology (GO) terms are increasingly used in information extraction systems.

PaperMaker workflow
Input: original manuscript text
• Module 1: Spell Checker
• Module 2: Acronym Resolution
• Module 3: NER
• Module 4: GO Recognition
• Module 5: MeSH Annotation
• Module 6: Reference Check
• Module 7: Related Work
• Module 8: Summary
Output: modified manuscript text

PaperMaker, Conclusions
• PaperMaker can help the author conform to the formal requirements of paper writing with special emphasis on the domain
• It also provides feedback on the contents by relating it to reference resources and literature repositories
• It may improve the indexing of a paper in literature repositories (less ambiguous terminology)
• http://www.ebi.ac.uk/Rebholz-srv/PaperMaker
Work in progress

TM services at the EBI: Conclusions
• Standardised TM solutions available, free use
  • Quality assurance is ongoing work, integration with EBI’s data resources
  • About 500 to 2,500 users, 50 GB annual data transfer
• Knowledge infrastructure is work in progress
  • Annotations of genes, diseases
  • Extraction of different types of relations
• Collaborations between publishers and pharmaceutical industry (SESL project)

Editorial Board: Christopher Baker, Olivier Bodenreider, Philip Bourne, Anita Burgun-Parenthoine, Carol Friedman, Carole Goble, Udo Hahn, Lynette Hirschman, Jung-Jae Kim, Patrick Lambrix, Ulf Leser, Suzanna Lewis, Jong C. Park
Editorial Board (cont.): Alan Ruttenberg, Tapio Salakoski, Susanna-Assunta Sansone, Michael Schroeder, Stefan Schulz, Amnon Shabo, Barry Smith, Robert Stevens, Toshihisa Takagi, Alfonso Valencia, Mark Wilkinson, Limsoon Wong

Acknowledgements …
IeXML: G. Nenadic, Uo.Manchester
CALBC: J. v.d.Lei, Rotterdam; E. v.Mulligen, Rotterdam; O. Bodenreider, NLM
Other: M. Ashburner, Uo.Cambridge; U. Leser, HUo.Berlin; D. Trieschnigg, Uo.Twente; F. Couto, Uo.Lisbon; A. Waagmeester, Uo.Maastricht; S. Jaeger, HUo.Berlin; T. Grego, Uo.Lisbon; A. Baillif, Uo.Clermont-Ferrand
BootStrep: Udo Hahn, Uo.Jena; E. Beisswanger, Uo.Jena; K. Tomanek, Uo.Jena; K. Buyko, Uo.Jena; S. Ananiadou, Uo.Manchester; N. Calzolari, CNRS Pisa; A. Burgun, Uo.Rennes
EBI: P. Stoehr, E. Dimmer, E. Camron, M. Kapushevski, H. Hermjakob, N. Luscombe, D. Clark, P. Flicek
