
Information Extraction from Literature


Presentation Transcript


  1. Information Extraction from Literature Yue Lu BeeSpace Seminar Oct 24, 2007

  2. Outline • Overview of BeeSpace v4 • Entity Recognition • Relation Extraction

  3. Overview • BeeSpace v4 • deeper semantic base than the current v3 system • entities and relations vs. mutual information • Four levels • Level 1: Entity Recognition • Level 2: Entity Association Mining • Level 3: Relation Extraction • Level 4: Inference and Hypothesis Generation

  4. Overview • Level 1: Entity Recognition (detailed later) • Level 2: Entity Association Mining • Suppose entities are properly tagged • Utilize the co-occurrence patterns of entities to extract semantics (see the sketch below) • e.g. a bee biologist may want to know which genes are important for foraging behavior • Similar to the TREC Genomics 2007 task
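
A minimal sketch of the co-occurrence idea behind Level 2, assuming entities are already tagged. The abstracts, gene names, and behavior term below are hypothetical stand-ins for BeeSpace documents, not actual corpus data.

```python
# Rank gene entities by how often they co-occur with a behavior term
# ("foraging") across tagged abstracts. All data here is illustrative.
from collections import Counter
import re

abstracts = [
    "The <GENE>for</GENE> gene influences foraging behavior in honey bees.",
    "Expression of <GENE>Amfor</GENE> rises as bees shift to foraging.",
    "<GENE>vg</GENE> is linked to vitellogenin levels, not to foraging onset.",
]

behavior_term = "foraging"
cooccurrence = Counter()
for doc in abstracts:
    genes = re.findall(r"<GENE>(.*?)</GENE>", doc)
    if behavior_term in doc.lower():
        cooccurrence.update(set(genes))   # count each gene once per document

# Genes most strongly associated with the behavior, by document co-occurrence
for gene, count in cooccurrence.most_common():
    print(gene, count)
```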

  5. TREC Genomics 2007 • e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?” • Currently handled with retrieval techniques only • Gene synonym expansion • Conjunctive query interpretation • User relevance feedback • Tagged entities would definitely help
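
A hedged sketch of two of the retrieval techniques named above: expanding a gene name with its synonyms and interpreting the query conjunctively. The synonym table, query, and documents are invented for the example.

```python
# Gene synonym expansion plus a conjunctive query interpretation:
# a document must mention the gene (under any synonym), a pathway, and ADPKD.
gene_synonyms = {"PKD1": {"PKD1", "polycystin-1", "polycystin 1"}}

def matches(doc, query_terms):
    """Conjunctive interpretation: every term set must be hit by some synonym."""
    text = doc.lower()
    return all(any(s.lower() in text for s in term_set) for term_set in query_terms)

query = [gene_synonyms["PKD1"], {"pathway"}, {"ADPKD"}]
docs = [
    "Polycystin-1 signalling pathway alterations are implicated in ADPKD.",
    "ADPKD progression correlates with cyst growth.",
]
print([d for d in docs if matches(d, query)])  # only the first document matches
```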

  6. Overview • Level 3: Relation Extraction • Goal is to extract the relations between entities • Generally requires entities to be properly tagged first • Detailed later • Level 4: Inference and Hypothesis Generation • Inference on knowledge base • Graph mining

  7. Outline • Overview of BeeSpace v4 • Entity Recognition • Relation Extraction

  8. Entity Recognition • Gene • Example: • Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar expression patterns, <GENE>pb</GENE> null embryos develop normally

  9. Entity Recognition • Anatomy • Example: • In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments, whereas ectopic expression is observed in some GOF variants.

  10. Entity Recognition • Biological process • Example: • Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for <BIOLOGICAL PROCESS>the formation of terminal structures in the embryo</BIOLOGICAL PROCESS>.

  11. Entity Recognition • Pathways • Example: • Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - <PATHWAY>the terminal class signal transduction pathway</PATHWAY> - to better understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in vertebrates including humans

  12. Entity Recognition • Protein family • Example: • While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a <PROTEIN FAMILY>zinc finger transcription factor</PROTEIN FAMILY> required for photoreceptor cell specification, differentiation, and survival.

  13. Entity Recognition • CRE (cis-regulatory elements) • Example: • A synthetic, 23-bp <CRE>ecdysterone regulatory element (EcRE)</CRE>, derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).

  14. Entity Recognition • Phenotype • Definition: • a set of observable physical characteristics of an individual organism • Example: • Fog, dumpy

  15. Entity Recognition • Class 1: Small Variation (Dictionary/Ontology) • Organism, Anatomy, Biological Process, Pathway, Protein Family • Class 2: Medium Variation • Gene, cis-Regulatory Element • Class 3: Large Variation • Phenotype, Behavior

  16. Entity Recognition • Generally can be defined as a classification problem • Boils down to feature definition (a feature sketch follows below) • Class 1: matching a word in the Dictionary/Ontology • Class 2: prefix/suffix of the word, POS tags, … • Class 3: ?
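
A rough sketch of what per-class features for a single token might look like. The anatomy dictionary and feature names are illustrative choices, not the system's actual feature set.

```python
# Per-token features: a Class 1 dictionary-match feature plus
# Class 2 orthographic/affix/POS features. The dictionary is made up.
ANATOMY_DICT = {"maxillary", "labial", "antenna", "mandible"}

def token_features(token, pos_tag):
    return {
        # Class 1 feature: exact match against a dictionary/ontology entry
        "in_anatomy_dict": token.lower() in ANATOMY_DICT,
        # Class 2 features: affixes, orthography, and POS tag
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "has_digit": any(c.isdigit() for c in token),
        "is_lower_short": token.islower() and len(token) <= 4,  # e.g. "mxp", "pb"
        "pos": pos_tag,
    }

print(token_features("maxillary", "JJ"))
print(token_features("mxp", "NN"))
```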

  17. Entity Recognition • First focus on Class 1 • Relatively simple • Class 2 and Class 3 need training examples • Useful for entity association mining • Useful for facilitating extraction of many interesting relations • Related work: Textpresso

  18. Textpresso • Input: full-text C. elegans literature • Output: tagged XML format • Defined a Textpresso ontology • First category is biological entities • Manually curated a lexicon of names • Implemented with Perl regular expressions • We could reuse some of the regular expressions (a Python re-expression is sketched below)
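
Textpresso's lexicon matching is done with Perl regular expressions; the sketch below re-expresses the same idea in Python for illustration. The lexicon entries are made up.

```python
# Lexicon-driven tagging: wrap every known name in an XML-style entity tag.
import re

lexicon = {"ANATOMY": ["maxillary", "labial segment"],
           "GENE": ["mxp", "pb"]}

def tag(text):
    for category, names in lexicon.items():
        # Longest names first so "labial segment" wins over "labial"
        for name in sorted(names, key=len, reverse=True):
            pattern = r"\b" + re.escape(name) + r"\b"
            text = re.sub(pattern, f"<{category}>{name}</{category}>", text)
    return text

print(tag("In normal embryos, mxp is expressed in the maxillary segment."))
```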

  19. Entity Recognition Resources:

  20. Outline • Overview of BeeSpace v4 • Entity Recognition • Relation Extraction

  21. Relation Extraction • Expression Location • the expression of a gene in some location (tissues, body parts) • Homology/Orthology • one gene is homologous to another gene

  22. Relation Extraction • Biological process • one gene has some role in a biological process • Genetic/Physical/Regulatory Interaction • one gene interacts with another gene in a certain fashion (3 types of relations) • a simple case: Protein-Protein Interaction (PPI)

  23. Relation Extraction • Generally can be defined as a classification problem, which requires training data • Domain adaptation? • An example: PPI

  24. PPI • Problem definition: • Gene/protein names are already tagged • A known list of interaction words • 133 words • Classify each tuple (p1, p2, interWord) within a single sentence (see the enumeration sketch below)
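
The candidate-tuple enumeration could look roughly like this sketch. The interaction-word list here holds only a few of the 133 words, and the sentence and protein names are a toy example.

```python
# Enumerate every (p1, p2, interaction_word) triple in one sentence,
# assuming protein names are already tagged.
from itertools import combinations

INTERACTION_WORDS = {"binds", "inhibits", "phosphorylates", "interacts"}

def candidate_tuples(proteins, tokens):
    """Yield every (p1, p2, interaction_word) triple found in the sentence."""
    inter = [w for w in tokens if w.lower() in INTERACTION_WORDS]
    for p1, p2 in combinations(proteins, 2):
        for iw in inter:
            yield (p1, p2, iw)

# Example sentence: "RasGAP binds Ras and inhibits Raf."
tokens = ["RasGAP", "binds", "Ras", "and", "inhibits", "Raf"]
tagged_proteins = ["RasGAP", "Ras", "Raf"]
for t in candidate_tuples(tagged_proteins, tokens):
    print(t)
```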

  25. PPI • Methods • Learning algorithm: Maximum Entropy • Context features • “Extracting protein-protein interactions using simple contextual features”, BioNLP Workshop at HLT-NAACL 2006 • e.g. lexical forms, POS tags, … • Less dependent on domain
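
A minimal sketch of the maximum-entropy step, here approximated with scikit-learn's logistic regression (equivalent to a maximum-entropy classifier in this binary setting) over a few simple contextual features. The features, examples, and labels are toy stand-ins for the BioCreative data, not the feature set of the cited paper.

```python
# Classify candidate (p1, p2, interWord) tuples with simple contextual features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def tuple_features(tokens, p1, p2, inter_word):
    i1, i2, iw = (tokens.index(x) for x in (p1, p2, inter_word))
    return {
        "inter_word": inter_word.lower(),                  # lexical form of the trigger
        "tokens_between": abs(i1 - i2) - 1,                # distance between the proteins
        "trigger_between": min(i1, i2) < iw < max(i1, i2),  # trigger sits between them?
        "order": "p1_first" if i1 < i2 else "p2_first",
    }

X_dicts = [
    tuple_features(["RasGAP", "binds", "Ras"], "RasGAP", "Ras", "binds"),
    tuple_features(["Both", "Ras", "and", "Raf", "bind", "GTP"], "Ras", "Raf", "bind"),
]
y = [1, 0]  # 1 = the tuple describes a real interaction, 0 = it does not

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)
test = tuple_features(["ProtA", "inhibits", "ProtB"], "ProtA", "ProtB", "inhibits")
print(clf.predict(vec.transform([test])))
```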

  26. PPI • Training/testing data: • BioCreative • 1000 hand-labeled sentences, 3964 tuples • 5-fold cross-validation • Performance • Average precision = 47.15% • Average recall = 43.97% • Average F1 = 45.36%
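
The 5-fold evaluation could be wired up roughly as below; X and y are random placeholders standing in for the 3964 vectorised BioCreative tuples, so the scores it prints are meaningless.

```python
# 5-fold cross-validation reporting precision, recall, and F1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.random((3964, 20))             # fake feature vectors
y = rng.integers(0, 2, size=3964)      # fake interaction labels

scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                        scoring=("precision", "recall", "f1"))
print({k: round(v.mean(), 4) for k, v in scores.items() if k.startswith("test_")})
```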

  27. PPI • Training data: • BioCreative • 1000 hand-labeled sentences, 3964 tuples • Testing data (different domain) • Bee collection • Performance (judged by Moushumi) • Total number of tuples extracted as PPI instances: 92 • Precision: 63%

  28. PPI Misclassification examples • Type 1: No interaction • Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 µg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A-23187. • False: (collagen, thrombin, induced) • True: relation between protein and platelet aggregation; no PPI

  29. PPI Misclassification examples • Type 2: Incorrect interaction word • Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold. • False: (IgG antibody, IgE antibody, binding) • True: (IgG antibody, IgE antibody, inhibit)

  30. PPI Misclassification examples • Type 3: Incorrect protein involved • Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity. • False: (AChE, AChE, exhibits) • True: (AChE, BuChE, exhibits)

  31. PPI • Possible improvements • Syntactic patterns: “Optimizing syntax-patterns for discovering protein-protein interactions”, in Proc. ACM Symposium on Applied Computing (SAC), Bioinformatics Track • Parse trees • Dependency parsing (a spaCy sketch follows below) • …
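
As one possible direction, the dependency-parsing idea can be sketched with spaCy. This is a substitute illustration, not the cited syntax-pattern system, and it assumes the en_core_web_sm model is installed.

```python
# Extract simple subject-verb-object paths as candidate PPI patterns.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("RasGAP binds Ras in the cytoplasm.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
        # A subject-verb-object path is one simple syntactic pattern for PPI
        for s in subjects:
            for o in objects:
                print((s, o, token.lemma_))
```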

  32. The End
