
Marc Zimmermann, Martin Hofmann-Apitius

Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image. 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011. Marc Zimmermann, Martin Hofmann-Apitius


Presentation Transcript


  1. Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image • 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011 • Marc Zimmermann, Martin Hofmann-Apitius • Bonn-Aachen Center for Information Technology, University of Bonn

  2. Chemical Structure Reconstruction

  3. SCAIView: gene + protein index (Lucene + semantic entities) • Link out to biological reference databases • Entities handled by ProMiner • Semantic tagging

  4. Beautiful Artwork But Wrong Molecule

  5. Automatic Binning of Images • Database • Curation • Trash

  6. Challenge • Problem • Predict the quality of the reconstruction result without a reference molecule • Solution • Machine learning • Expected results • Quality of new reconstructions estimated by trained models
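
A minimal sketch of how a predicted quality score could drive the binning of reconstructed images into the three bins named on slide 5. The slides do not say how scores map to bins, so the function name `assign_bin`, the assumption that the score lies in [0, 1], and both threshold values are hypothetical and chosen only for illustration.

```python
def assign_bin(predicted_quality, accept=0.9, reject=0.5):
    """Assign a reconstruction to a bin from its predicted quality.

    predicted_quality: model estimate, assumed to lie in [0, 1].
    accept, reject: hypothetical thresholds, not taken from the slides.
    """
    if not 0.0 <= predicted_quality <= 1.0:
        raise ValueError("predicted_quality must lie in [0, 1]")
    if predicted_quality >= accept:
        return "database"   # good enough to store without review
    if predicted_quality >= reject:
        return "curation"   # needs a human curator
    return "trash"          # not worth correcting


# Example: three reconstructions with different predicted qualities.
bins = [assign_bin(score) for score in (0.95, 0.70, 0.20)]
```

In practice the thresholds would have to be tuned on reconstructions for which a reference molecule is available.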

  7. The Evaluation Concept of SCAI and InfoChem (workflow diagram; labels in extraction order) • 1 • Manual abstraction of chemical names • Comparison (quantitative) • 2 • NER • N2S • ICANNOTATOR • 5 • Pdf to Text • Automatic chemical verification • Database • 3 • Page segmentation • Chemical recognition • Image classifier • chemoCR (Fraunhofer) • Comparison (quantitative) • 4 • “Similarity MCD” • Manual abstraction of structures from images

  8. Chemistry to be defined: Examples from Patents (I)

  9. Chemistry to be defined: Examples from Patents (II)

  10. Quality Measure: Graph Matching (figure labels: OK, bad) • SimilarityMCD (Minimal Chemical Distance) • Module from InfoChem • Graph matching between the reconstruction result of chemoCR and the reference molecule • Results in a numerical value in [0,1]

  11. Chemical Error Classification Scheme • MISSED • BOND_MISSED • COMPLETE_BOND_MISSED • ORDER_BOND_MISSED • CHIRAL_BOND_MISSED • SYMBOL_MISSED • ATOM_SYMBOL_MISSED • ISOTOPE_SYMBOL_MISSED • CHARGE_SYMBOL_MISSED • RADICAL_SYMBOL_MISSED

  12. Mapping of Reaction Schemes with Spatial Constraints (figure labels: reconstruction, reference)

  13. Mining of Chemical Names • Chemical names should be found in the text • Synonyms and spelling variations in different databases • Several text mining techniques developed • Example: Sodium lauryl sulfate (DB00815, DrugBank): 230 brand names and 26 synonyms

  14. Compounds Sharing a Synonym: “Livesan” • Entry DB00436, Bendroflumethiazide, from DrugBank • Entry C07586, Procetofen, from KEGG Compound
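
The “Livesan” case can be detected mechanically by inverting the synonym lists. The sketch below is illustrative only: the function name `ambiguous_synonyms` and the lower-casing normalization are assumptions, and the two records contain only the names shown on the slide, not the full database entries.

```python
from collections import defaultdict


def ambiguous_synonyms(records):
    """Return synonyms that are listed by more than one identifier.

    records: dict mapping a database identifier to its list of names.
    Names are compared case-insensitively (an assumption of this sketch).
    """
    owners = defaultdict(set)
    for identifier, names in records.items():
        for name in names:
            owners[name.strip().lower()].add(identifier)
    # Keep only names claimed by at least two different identifiers.
    return {name: ids for name, ids in owners.items() if len(ids) > 1}


records = {
    "DB00436": ["Bendroflumethiazide", "Livesan"],  # DrugBank
    "C07586": ["Procetofen", "Livesan"],            # KEGG Compound
}
shared = ambiguous_synonyms(records)
```

A purely synonym-based merge would wrongly unify these two entries, which is why such collisions need to be flagged before identifiers are assigned.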

  15. Task: Generating a Dictionary • (-)-Epiafzelechin • epi-Afzelechin → UID1 • 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione • Heptabarbital • Use rather reliable data sources • Recognize different chemical names that refer to the same structure and map them to a unique identifier

  16. Different Mapping Approaches • C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid • DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid • Synonym based • Interlink based • Structure based

  17. Interlink-Based Approach • D02592 from KEGG Drug • DB01234 from DrugBank • Non-unified approach towards parametric isomers • Links structurally different compounds

  18. Problem: Merging Data Sources to UID Identity problem (“parametric isomers”): • Stereochemistry • Tautomerism • Charges • Isotopes • Mixtures • Polymers • Aromaticity • Markush Structures

  19. Workflow

  20. Importing SDF into SQL Schema • KEGG COMPOUND [2] • KEGG DRUG [2] • SDF files • Drugcard files • DrugBank [1] • [1] http://www.drugbank.ca/, last accessed August 2010 • [2] http://kegg.jp/, last accessed August 2010

  21. Dictionary Comparison • Entries are transformed into binary correspondences: all possible pairs between the compounds from one entry • Dictionary 1, Entry1: Compound1, Compound2, Compound3 • Dictionary 2, Entry1: Compound1, Compound2, Compound4 • Binary correspondences of Dictionary 1: Compound1, Compound2 - present; Compound1, Compound3 - absent; Compound2, Compound3 - absent • Binary correspondences of Dictionary 2: Compound1, Compound2 - present; Compound1, Compound4 - absent; Compound2, Compound4 - absent
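
The pairwise transformation described on this slide can be sketched as follows, using the slide's own example entries. The function name `binary_correspondences` and the representation of a dictionary as a Python `dict` of compound lists are assumptions made for illustration.

```python
from itertools import combinations


def binary_correspondences(dictionary):
    """Return all unordered compound pairs that share an entry.

    dictionary: dict mapping an entry name to its list of compounds.
    Each pair is stored as a sorted tuple so (a, b) equals (b, a).
    """
    pairs = set()
    for compounds in dictionary.values():
        pairs.update(combinations(sorted(set(compounds)), 2))
    return pairs


dict1 = {"Entry1": ["Compound1", "Compound2", "Compound3"]}
dict2 = {"Entry1": ["Compound1", "Compound2", "Compound4"]}

pairs1 = binary_correspondences(dict1)
pairs2 = binary_correspondences(dict2)

present_in_both = pairs1 & pairs2   # correspondences both dictionaries agree on
absent_from_2 = pairs1 - pairs2     # asserted only by Dictionary 1
absent_from_1 = pairs2 - pairs1     # asserted only by Dictionary 2
```

Working on pairs rather than whole entries lets two dictionaries be compared even when they group compounds into entries of different sizes; the set intersection is the overlap of binary correspondences.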

  22. Overlap of Binary Correspondences: DrugBank & KEGG

  23. The Open Pharmacological Concepts Triple Store • Develop a set of robust standards… • Implement the standards in a semantic integration hub (“Open Pharmacological Space”)… • Deliver services to support on-going drug discovery programs in pharma and public domain… • Prototype: www.openphacts.org

  24. Conclusions (http://trec.nist.gov/) • Chemical information extraction is an ongoing effort • Task is challenging • In need of critical assessments and gold standards for: structure reconstruction; database mapping; retrieval tasks • In need of strategies for: dealing with reconstruction errors; extended file formats & search algorithms; result visualizations
