
Grounding Gene Mentions with Respect to Gene Database Identifiers

This overview discusses the approaches, related work, and conclusions of BioCreative Task 1B, which focuses on grounding gene mentions with respect to gene database identifiers. It also covers entity normalisation, automatic GO code annotation, and analogous tasks, as well as the evaluation metrics and the top-performing systems.






  1. Grounding Gene Mentions with Respect to Gene Database Identifiers Overview of BioCreAtIvE Task 1B Ben Hachey BioNLP Reading Group 18.07.2005

  2. Outline • BioCreAtIvE Task 1B • Approaches to Task 1B • Related Work • Conclusions BioCreative Task 1B

  3. BioCreAtIvE • Critical Assessment of Information Extraction Systems in Biology • Task 1A: Named Entity Recognition • Given a single sentence from an abstract, identify all mentions of genes • “(or proteins where there is ambiguity)” • Task 1B: Entity Normalisation • Given an NER’d abstract, associate a list of unique gene identifiers • Task 2: Automatic GO code annotation

  4. Example Abstract from Fly Since Dpp and Gbb levels are not detectably higher in the early phases of cross vein development, other factors apparently account for this localized activity. Our evidence suggests that the product of the crossveinless 2 gene is a novel member of the BMP-like signaling pathway required to potentiate Gbb or Dpp signaling in the cross veins. crossveinless 2 is expressed at higher levels in the developing cross veins and is necessary for local BMP-like activity.

  5. Analogous Tasks • Can be seen as… • Grounding: Tying a textual mention of an entity to its identifier in a gene database/ontology • Information Extraction: Provides a list, without repetition, of the entities referred to in the sentence • Coreference: Identifying which textual mentions refer to the same entity • Lexical Entailment: Whether a term is substitutable in a given context

  6. Resources • Synonym database provided for each organism: • Fly (Drosophila melanogaster) • Yeast (Saccharomyces cerevisiae) • Mouse (Mus musculus) • These list a number of different textual realisations for each unique gene identifier
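The resource can be sketched as a mapping from unique gene identifiers to their textual realisations; inverting it gives the lookup table that matching systems use. The identifiers and name lists below are illustrative stand-ins, not actual FlyBase records.

```python
# Illustrative synonym database: gene ID -> known textual realisations.
# The IDs and names are invented for the sketch, not real FlyBase entries.
synonym_db = {
    "FBgn_dpp": ["dpp", "decapentaplegic", "DPP"],
    "FBgn_cv2": ["cv-2", "crossveinless 2", "crossveinless-2"],
}

# Invert for lookup; note that a single synonym can be ambiguous
# between several gene IDs, so the values are sets.
synonym_to_ids = {}
for gene_id, names in synonym_db.items():
    for name in names:
        synonym_to_ids.setdefault(name.lower(), set()).add(gene_id)
```

The inverted table is the natural shape for matching against text, and the set-valued entries make the ambiguity problem explicit from the start.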

  7. Fly Synonym DB Examples

  8. Data Preparation • (Start with documents whose full text has been manually curated) • Noisy Training Data • Automatically eliminate gene IDs not found in abstract • Fly: 0.83, Mouse: 0.71, Yeast: 0.92 (quality) • Testing Gold Standard • Hand check for over-zealous elimination • Add genes mentioned “in passing” (so task is same across organisms) • Fly: 0.93, Mouse: 0.87, Yeast: 0.96 (agreement) • 250 abstracts/organism
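The automatic elimination step can be sketched as follows; `synonym_db` (gene ID to synonyms) is a stand-in for an organism's synonym list, and the substring matching is deliberately naive, which is exactly what makes the resulting training data noisy.

```python
def filter_gene_ids(abstract, gene_ids, synonym_db):
    """Keep only the curated gene IDs for which at least one synonym
    actually occurs in the abstract text (naive substring matching)."""
    text = abstract.lower()
    return [gid for gid in gene_ids
            if any(syn.lower() in text for syn in synonym_db.get(gid, []))]
```

Gene IDs curated from the full text but never realised in the abstract are dropped, which is the source of the per-organism quality figures on the slide.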

  9. Evaluation • Precision, recall, and balanced F-score automatically calculated with respect to gold-standard gene ID lists • 8 teams total • Various numbers of submissions (0-3) per organism • Number of submissions • Fly: 11 • Mouse: 16 • Yeast: 15
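Scoring compares the predicted and gold-standard gene ID lists as sets; a minimal sketch of the precision/recall/balanced-F computation:

```python
def prf(gold, predicted):
    """Precision, recall, and balanced F-score over gene ID sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)               # correctly predicted IDs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Because the task output is a duplicate-free ID list per abstract, set comparison is the whole story: there is no partial credit for boundary or mention-level decisions.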

  10. Top Systems Performance (P/R/F)

  11. Outline • BioCreAtIvE Task 1B • Approaches to Task 1B • Related Work • Conclusions

  12. Approaches • Use synonyms for simple matching against text • Difficult to identify false positives • Especially mouse and fly, where synonyms include common words (with, at, yellow, …) • ID gene text, then ground • Leverage NER system from Task 1A • Limited by performance of NER • 78.8% precision, 73.5% recall, 76.1 balanced F • 37% of FPs and 39% of FNs due to boundary problems • Synonym lists not exhaustive
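The false-positive problem with direct matching is easy to reproduce: with mouse- or fly-style synonyms that are also common words, a naive word-level matcher fires on ordinary prose. The lexicon below is hypothetical (the gene IDs are invented), though 'at' and 'yellow' are genuine fly-synonym headaches mentioned above.

```python
import re

def naive_match(text, synonym_to_ids):
    """Return every gene ID whose synonym occurs as a whole word in the text."""
    found = set()
    for syn, ids in synonym_to_ids.items():
        if re.search(r"\b" + re.escape(syn) + r"\b", text, re.IGNORECASE):
            found |= ids
    return found

# Hypothetical fly-style lexicon; IDs are invented for the sketch.
lexicon = {"at": {"FBgn_at"}, "yellow": {"FBgn_y"}, "dpp": {"FBgn_dpp"}}
```

Running `naive_match("We looked at the yellow larvae.", lexicon)` flags two spurious genes and no real ones, which is why post-filtering or disambiguation is needed on top of matching.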

  13. Information Sources • Edit synonym list? • Add other specific and frequently used synonyms • Remove problematic synonyms • String similarity • Matching against synonym list • Fuzzy matching (spelling variations, abbreviations, …) • Coreference • Synonym in same text • Other contextual evidence • Gene co-occurrence in same text • Word context around entity… • Probabilistic/Statistical models • Pr(geneID), Pr(geneID|synonym)
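The probabilistic information source can be sketched as a relative-frequency estimate of Pr(geneID | synonym) from training abstracts, with an ambiguous synonym resolved to its most probable gene ID at test time. This illustrates the idea only; it is not any team's exact model.

```python
from collections import Counter, defaultdict

def train_disambiguator(examples):
    """examples: (synonym, gold_gene_id) pairs observed in training data.
    Estimates Pr(geneID | synonym) by relative frequency and returns a map
    from each synonym to its most probable gene ID."""
    counts = defaultdict(Counter)
    for syn, gid in examples:
        counts[syn][gid] += 1
    return {syn: c.most_common(1)[0][0] for syn, c in counts.items()}
```

Smoothing, priors Pr(geneID), and contextual features would all fit on top of this skeleton, but the core decision rule is just an argmax over observed counts.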

  14. Top Systems Overview

  15. Systems • Team: user24 (mouse, yeast) • Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis • Ludwig-Maximilians-Universität München • Approach: No NER, match synonyms to text • Rule-based generation and curation of synonym lists • Remove unspecific and inappropriate synonyms • Expanded to include additional, frequently used synonyms • Automatic rule-based edit system • Human curation to assure quality • Tuned using training data • Select all matches • Post-filter: Remove matches with non-gene context (e.g. ‘cells’, ‘domains’, ‘cell type’, ‘DNA binding site’) • Semi-automatic syn list curation, could be used for gazetteers!
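The post-filtering step can be sketched as a check on the words immediately following a matched span; the cue list echoes the examples on the slide, while the window size and tokenisation are assumptions of this sketch.

```python
# Cue words signalling non-gene context; drawn from the slide's examples,
# extended with the individual tokens of the multi-word cues.
NON_GENE_CUES = {"cells", "domains", "cell", "type", "dna", "binding", "site"}

def keep_match(tokens, end_index, window=2):
    """Reject a candidate gene match when a non-gene cue word appears
    within `window` tokens after the matched span (which ends at end_index)."""
    following = tokens[end_index:end_index + window]
    return not any(tok.lower().strip(".,") in NON_GENE_CUES for tok in following)
```

So a match of the fly synonym 'yellow' in "yellow cells were stained" is discarded, while "dpp signaling is localized" survives.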

  16. Systems • Team: user16 (fly, mouse, yeast) • Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck • Fraunhofer Institute & Ludwig-Maximilians-Universität München • Approach: No NER, match synonyms to text • Synonym list expanded (offline) • Automatic rule-based edit system w/ human curation to assure quality • Rule-based classification of synonyms • Class I: Case-insensitive near-synonyms • Class II: Case-sensitive near-synonyms • Class III: Questionable synonyms (high frequency, inexact match) • Select n highest scoring matches (Hanisch et al., 2003) • Focus on matching multi-word terms • Syn list curation, multi-word term matching!
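The effect of the synonym classes on matching can be sketched as follows: Class I synonyms match case-insensitively, Class II case-sensitively, and Class III only exactly and with a weaker score that further evidence must support. The numeric scores here are illustrative assumptions, not the system's actual weights.

```python
def class_match(mention, synonym, syn_class):
    """Score a mention against a synonym according to its class.
    Class I: case-insensitive; Class II: case-sensitive;
    Class III (questionable): exact match only, reported as weak evidence."""
    if syn_class == 1:
        return 1.0 if mention.lower() == synonym.lower() else 0.0
    if syn_class == 2:
        return 1.0 if mention == synonym else 0.0
    return 0.5 if mention == synonym else 0.0  # Class III: weak evidence only
```

A downstream step can then keep the n highest-scoring matches, so a high-frequency Class III synonym never outranks a solid Class I or II hit on its own.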

  17. Systems • Team: user8 (fly, mouse, yeast) • Jeremiah Crim, Ryan McDonald, and Fernando Pereira • University of Pennsylvania • Approach: No NER, match synonyms to text • Pattern Matching • Synonym list pruned by threshold on conditional probability of a gene ID (g) being a label for a document given that a synonym (s) matches • List of candidate gene IDs compiled by selecting 1000 training documents with highest token-level cosine similarity • Match Classification • Binary maximum entropy classifier trained to predict whether gene IDs selected by pattern matching should be kept • Fly: +7.7, Mouse: +1.5, Yeast: -0.4 • Prob models (pruning, disambiguation), no human curation!
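The pruning step can be sketched as thresholding a relative-frequency estimate of the probability that a gene ID is a correct document label given that one of its synonyms matched. The threshold value and the shape of the training observations are assumptions of this sketch, not the team's reported settings.

```python
from collections import Counter

def prune_synonym_ids(observations, threshold=0.5):
    """observations: (synonym, gene_id, was_gold_label) triples collected by
    matching synonyms against training documents. Keep a (synonym, gene ID)
    pair only when its estimated Pr(gene ID correct | synonym matched)
    clears the threshold."""
    matched, correct = Counter(), Counter()
    for syn, gid, is_gold in observations:
        matched[(syn, gid)] += 1
        if is_gold:
            correct[(syn, gid)] += 1
    return {pair for pair in matched
            if correct[pair] / matched[pair] >= threshold}
```

Common-word synonyms that almost never coincide with a gold label are pruned automatically, which is what lets this pipeline avoid hand curation entirely; the maximum entropy classifier then re-scores the surviving candidates.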

  18. Systems • Team: user5 (fly, mouse, yeast) • Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover • University of Edinburgh & Stanford • Approach: NER, match entities to synonyms • Build organism-specific named entity recognition • Noisy training data obtained from Task 1B materials • Match gene entities to synonym lists (fuzzy) • Incorporates various edit operations (e.g. case folding, optional dashes and other punctuation, Brit/Am spellings) • Tuned per-organism to select and order edit operations • Disambiguate each entity to a single gene ID • Various heuristic and statistical approaches (e.g. gene ID co-occurrence, IR query term weighting, repetition in synonym list) • Again, optimised per-organism • Bootstrapping NE data, IR term weighting, no human curation!
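The tuned sequence of edit operations can be sketched as cumulative normalisations applied to both mention and synonym until the two coincide. The particular operations and their order here are illustrative; per the slide, the selection and ordering were tuned per organism on training data.

```python
import string

# Illustrative edit operations, applied cumulatively in order.
EDIT_OPS = [
    str.lower,                                     # case folding
    lambda s: s.replace("-", " "),                 # optional dashes
    lambda s: s.translate(str.maketrans("", "", string.punctuation)),
]

def match_with_edits(mention, synonyms):
    """Return the first synonym that matches the mention under some prefix
    of the edit-operation sequence, or None if nothing matches."""
    m_forms = [mention]
    for op in EDIT_OPS:
        m_forms.append(op(m_forms[-1]))
    for syn in synonyms:
        s = syn
        if s in m_forms:
            return syn
        for op in EDIT_OPS:
            s = op(s)
            if s in m_forms:
                return syn
    return None
```

Ordering matters because each operation makes matching more permissive, so checking after every step lets the system prefer the most exact match available before falling back to looser ones.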

  19. Systems • Team: user6 (fly, mouse, yeast) • Javier Tamames • BioAlma SL • Approach: NER, match entities to synonyms • NER for various biological entities (e.g. genes, proteins, compounds) • Also bio-medical semantic tagging of words, e.g. core terms (receptor, kinase, …) and types (alpha, a1, …) • Match gene entities to synonym lists (fuzzy) • Use BioCreAtIvE lists and other relevant databases • Match and weighting based on semantic labels • Disambiguate each entity to a single gene ID • Use of keywords extracted from databases (e.g. HUGO, MGI, SGD) • Semantic tagging module, keyword context from organism DBs!

  20. Outline • BioCreAtIvE Task 1B • Approaches to Task 1B • Related Work • Conclusions

  21. Related Work • Ben Wellner (2005). Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In: Proceedings of BioLink-2005. • …

  22. Outline • BioCreAtIvE Task 1B • Approaches to Task 1B • Related Work • Conclusions

  23. Conclusions • Model that can be automatically tuned to e.g. domain, organism • Proper modelling of: • Abbreviations • Spelling variants • Coreference in abstracts • Textual context, key words • Entity co-occurrence • Entity and term distributions • Token semantic roles

  24. Thank you

  25. References
  Lynette Hirschman, Marc Colosimo, Alexander Morgan, Jeffrey Colombe, and Alexander Yeh (2004). Task 1B: Gene list task. In: Proceedings of the BioCreAtIvE Workshop.
  Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck (2004). ProMiner: Organism-specific protein name detection using approximate string matching. In: Proceedings of the BioCreAtIvE Workshop. [user16]
  Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis (2004). Exact versus approximate string matching for protein name identification. In: Proceedings of the BioCreAtIvE Workshop. [user24]
  Jeremiah Crim, Ryan McDonald, and Fernando Pereira (2004). Automatically annotating documents with normalized gene lists. In: Proceedings of the BioCreAtIvE Workshop. [user8]
  Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover (2004). Grounding gene mentions with respect to gene database identifiers. In: Proceedings of the BioCreAtIvE Workshop. [user5]
  Javier Tamames (2004). Text detective: BioAlma’s gene annotation tool. In: Proceedings of the BioCreAtIvE Workshop. [user6]
  Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, and Ralf Zimmer (2003). Playing biology’s name game: Identifying protein names in scientific text. In: Pacific Symposium on Biocomputing.


  27. The SEER Project Team • Edinburgh: Bea Alex, Shipra Dingare, Claire Grover, Ben Hachey, Ewan Klein, Yuval Krymolowski, Malvina Nissim • Stanford: Jenny Finkel, Chris Manning, Huy Nguyen

  28. Top Systems Performance (P/R/F)

  29. Top Systems Rank
