BioNLP related talks and demos at ACL and CONLL ’05

BioNLP related talks and demos at ACL and CONLL ’05. Presented by Beatrice Alex, BioNLP meeting, 11th of July 2005. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction (R. McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin, P. White).

Presentation Transcript


  1. BioNLP related talks and demos at ACL and CONLL ’05
     Presented by Beatrice Alex, BioNLP meeting, 11th of July 2005

  2. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction (R. McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin, P. White)
     • Complex relations:
       • “John Smith is the CEO at Inc. Corp.” → (John Smith, CEO, Inc. Corp.)
       • “John Smith goes to his office at Inc. Corp.” → (John Smith, ⊥, Inc. Corp.), where ⊥ marks an unfilled argument

  3. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • Complex Relation Extraction:
       • Recognition of pairs of entity mentions (named entities are nodes in a graph and binary relations are edges)
       • Classify the binary relations as positive (valid) or negative (invalid) using a standard maxent classifier (Berger et al. ’96, McCallum ’02)
       • Reconstruction of complex relations by making tuples from maximal cliques in the graph

  4. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • Complex relation reconstruction methods:
       • Maximal cliques (MC)
         • Consider all cliques in the graph consistent with the definition of the relation and add […]
         • For overlapping cliques, only return maximal cliques (those that are not a subset of other cliques)
         • Use a branch and bound algorithm to find all maximal cliques (Bron and Kerbosch ’73), which is very efficient
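
The maximal-clique reconstruction on slides 3 and 4 can be sketched as follows. This is a minimal illustration of the basic Bron-Kerbosch recursion without pivoting, not the authors' implementation. The function name `maximal_cliques`, the mention labels `G#1` and `G#2`, and the edge list are assumptions made up for the example, and the check that a clique is consistent with the relation definition is left out.

```python
from itertools import combinations


def maximal_cliques(nodes, edges):
    """Return all maximal cliques of an undirected graph (Bron-Kerbosch, no pivoting)."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # No candidate can extend the clique and no excluded node covers it:
        # the current clique is maximal.
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(nodes), set())
    return cliques


# Hypothetical positive binary relations for the codon example on slide 6.
# "G#1" and "G#2" are made-up labels for the two separate mentions of G.
tuple_a = ["point mutation", "codon 12", "G#1", "T"]
tuple_b = ["point mutation", "codon 16", "A", "G#2"]
edges = list(combinations(tuple_a, 2)) + list(combinations(tuple_b, 2))
nodes = sorted(set(tuple_a) | set(tuple_b))

cliques = maximal_cliques(nodes, edges)
for clique in cliques:
    print(sorted(clique))
```

Each returned clique corresponds to one candidate complex relation; the shared "point mutation" node shows how two overlapping tuples are kept apart because no edges connect their other arguments.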

  5. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • Probabilistic Cliques (PC)
       • Assign a weight to each binary relation (taken from the classifier)
       • The weight of a clique, w(C), is the mean weight of the edges in the clique
       • A clique is valid if w(C) ≥ 0.5
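
The clique weighting above can be illustrated with a short sketch. The slide only says "mean", so an arithmetic mean is assumed here; the original paper may define the mean differently. The names `clique_weight` and `is_valid`, and the edge weights, are hypothetical.

```python
from itertools import combinations
from statistics import mean


def clique_weight(clique, edge_weight):
    """Mean classifier weight over all edges of the clique (arithmetic mean assumed)."""
    pairs = combinations(sorted(clique), 2)
    return mean(edge_weight[frozenset(pair)] for pair in pairs)


def is_valid(clique, edge_weight, threshold=0.5):
    """A clique is kept when its weight reaches the threshold."""
    return clique_weight(clique, edge_weight) >= threshold


# Made-up classifier confidences for three entity mentions a, b and c.
edge_weight = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"a", "c"}): 0.6,
    frozenset({"b", "c"}): 0.3,
}

print(clique_weight({"a", "b", "c"}, edge_weight))
print(is_valid({"a", "b", "c"}, edge_weight))
```

Note that the clique is accepted even though one edge falls below 0.5 on its own: averaging lets strong edges compensate for a weak one, which a hard per-edge decision would not allow.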

  6. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • Extraction of genomic variation events from biomedical text: (variation type, location, initial state, altered state)
     • “At codons 12 and 16, the occurrence of point mutations from G/A to T/G were observed.”
       • (point mutation, codon 12, G, T)
       • (point mutation, codon 16, A, G)

  7. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • 447 Medline abstracts
     • 4691 sentences, 4773 entities, 1218 relations (38% not binary)
       • 760 2-ary relations
       • 283 3-ary relations
       • 175 4-ary relations
     • Gold standard named entities (56% of entity pairs not related)

  8. Simple Algorithms for Complex Relation Extraction with Applications to Biomedical Information Extraction
     • Results: MC and PC are significantly faster and more accurate than NE (naïve enumeration)

  9. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing (P. Nakov and M. Hearst)
     • Unsupervised method for noun compound bracketing: [[liver cell] antibody] vs. [liver [cell line]]
     • Use of bigram estimates with the χ² measure
     • Use of surface features for querying web search engines
     • Experiments with paraphrases
     • Evaluation on encyclopaedia and bioscience text
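
As a rough illustration of bracketing with a χ² association score, the sketch below compares the first and second bigram of a three-noun compound. The slide does not say how the two scores are compared, so the adjacency-style comparison (left bracketing when the first bigram is more strongly associated) is an assumption, as are the function names and all counts.

```python
def chi_squared(pair_count, first_count, second_count, total):
    """Chi-squared score of a bigram from a 2x2 contingency table of counts."""
    a = pair_count                  # first word followed by second word
    b = first_count - pair_count    # first word without the second
    c = second_count - pair_count   # second word without the first
    d = total - a - b - c           # neither word
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denominator


def bracket(w1, w2, w3, unigram, bigram, total):
    """Return 'left' for [[w1 w2] w3] or 'right' for [w1 [w2 w3]]; ties go right."""
    left = chi_squared(bigram[(w1, w2)], unigram[w1], unigram[w2], total)
    right = chi_squared(bigram[(w2, w3)], unigram[w2], unigram[w3], total)
    return "left" if left > right else "right"


# Made-up counts standing in for search engine hit estimates.
total = 1_000_000
unigram = {"liver": 2_000, "cell": 10_000, "antibody": 3_000}
bigram = {("liver", "cell"): 500, ("cell", "antibody"): 50}

print(bracket("liver", "cell", "antibody", unigram, bigram, total))
```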

  10. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
     • Web-driven surface features
       • Dash: cell-cycle analysis, donor T-cell
       • Possessive marker: brain’s stem cell, brain stem’s cells
       • Internal capitalisation: Plasmodium vivax Malaria, brain Stem cells
       • Embedded slashes: leukaemia/lymphoma cell
       • Brackets: growth factor (beta), (brain) stem cells
     • Surface features were collected using regular expressions over the summaries of documents returned for exact NC queries
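
Collecting such features with regular expressions might look like the sketch below. It covers only the dash and possessive patterns; the left/right reading attached to each pattern is inferred from the slide's examples rather than stated on it, and `surface_votes` and the snippets are hypothetical.

```python
import re
from collections import Counter


def surface_votes(w1, w2, w3, snippets):
    """Count surface patterns that hint at left or right bracketing of 'w1 w2 w3'."""
    a, b, c = (re.escape(word) for word in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}s?\b",        # cell-cycle analysis
            rf"\b{a}\s+{b}['’]s\s+{c}s?\b",  # brain stem's cells
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}s?\b",        # donor T-cell
            rf"\b{a}['’]s\s+{b}\s+{c}s?\b",  # brain's stem cells
        ],
    }
    votes = Counter()
    for text in snippets:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Made-up result summaries for the exact query "cell cycle analysis".
snippets = [
    "Cell-cycle analysis of yeast mutants",
    "a protocol for cell-cycle analysis",
    "results of cell cycle-analysis",
]
votes = surface_votes("cell", "cycle", "analysis", snippets)
print(votes["left"], votes["right"])
```

The vote counts can then be compared directly, with the larger side giving the predicted bracketing and an even split treated as no prediction.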

  11. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
     • Other features:
       • Abbreviations: “tumor necrosis factor (NF)”, “tumor necrosis (TN) factor”
       • Concatenation: “health care reform” → “healthcare”, “carereform”
       • Reordering
       • Internal inflection variability

  12. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
     • Paraphrases: “brain stem cells”, “stem cells in the brain”, “cells from the brain stem”
     • Queries built from a set of selected paraphrase patterns were used to see how often each paraphrase occurred, as evidence for bracketing prediction

  13. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
     • Evaluation
       • Lauer’s data set (Lauer ’95)
         • 244 three-noun NCs
       • Biomedical data set
         • Extracted 500 three-noun NCs from Medline abstracts
         • 430 unambiguous (361 with left, 69 with right bracketing)
         • Inter-annotator agreement: 88% and 82% (kappa: .606 and .442)

  14. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing
     • Results:
       • Surface features perform best
         • Enc.: P=85.51% with 87.70% coverage
         • Bio: P=88.84% with 100% coverage
       • Best overall scores by combining the most reliable models (majority vote)
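
A majority vote over several models, together with the precision and coverage figures quoted above, can be sketched as follows. The slide does not define either measure; here precision is assumed to be accuracy over the compounds that received a prediction and coverage the share of compounds that received one. All names and data are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions; None means a model abstained. Ties abstain."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def precision_and_coverage(predicted, gold):
    """Precision over answered items, and the fraction of items answered."""
    answered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return 0.0, coverage
    precision = sum(p == g for p, g in answered) / len(answered)
    return precision, coverage


# Made-up predictions from three models for four noun compounds.
per_compound = [
    ["left", "left", "right"],
    [None, None, None],
    ["right", None, "right"],
    ["left", "right", None],
]
combined = [majority_vote(p) for p in per_compound]
gold = ["left", "left", "left", "left"]
precision, coverage = precision_and_coverage(combined, gold)
print(combined, precision, coverage)
```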

  15. Search Engine Statistics Beyond the n-gram: Applications to Noun Compound Bracketing

  16. Dynamically Generating a Protein Entity Dictionary Using Online Resources (H. Liu, Z. Hu and C. Wu)
     • Available at: http://biocreative.ifsm.umbc.edu/biothesaurus
     • 4,046,733 terms and 1,640,082 entities

  17. Dynamically Generating a Protein Entity Dictionary Using Online Resources
     • Use of large biological databases, including:
       • 3 NCBI databases (GenPept, RefSeq, Entrez GENE)
       • PSD database from the Protein Information Resource (PIR)
       • UniProt
       • Model organism databases
       • Nomenclature databases

  18. Dynamically Generating a Protein Entity Dictionary Using Online Resources
     • Automatically gathered the fields containing annotation information for each iProClass record
     • Extracted terms associated with one or more UniProt unique identifiers => raw dictionary
     • Automated curation using UMLS to flag UMLS semantic types and remove high-frequency nonsensical terms
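
The two steps above (building a raw term-to-identifier dictionary, then filtering it) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names, the identifiers, and the `flagged_terms` set are all hypothetical, and the flagged set merely stands in for the UMLS-based curation, which is not reproduced here.

```python
from collections import defaultdict


def build_raw_dictionary(records, name_fields):
    """Map each term found in the name fields to the identifiers it occurs with."""
    raw = defaultdict(set)
    for record in records:
        for field in name_fields:
            for term in record.get(field, []):
                term = term.strip()
                if term:
                    raw[term].add(record["uniprot_id"])
    return dict(raw)


def curate(raw, flagged_terms):
    """Drop terms flagged as uninformative (stand-in for the UMLS-based curation)."""
    return {term: ids for term, ids in raw.items()
            if term.lower() not in flagged_terms}


# Made-up records; identifiers and field names are placeholders, not real entries.
records = [
    {"uniprot_id": "ID0001", "protein_name": ["Example kinase 1"], "synonyms": ["EK1", "protein"]},
    {"uniprot_id": "ID0002", "protein_name": ["Example kinase 1"], "synonyms": ["EK1-like", " "]},
]
raw = build_raw_dictionary(records, ["protein_name", "synonyms"])
dictionary = curate(raw, flagged_terms={"protein"})
print(sorted(dictionary))
```

Keeping a set of identifiers per term preserves the ambiguity the slide mentions, where one term may be associated with more than one UniProt entry.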