TM and NLP for Biology Research Issues in HPSG Parsing

TM and NLP for Biology Research Issues in HPSG Parsing. Junichi TSUJII. Department of Computer Science, School of Information Science and Technology, University of Tokyo, JAPAN. School of Computer Science, National Centre for Text Mining, University of Manchester, UK.

Presentation Transcript


  1. TM and NLP for Biology Research Issues in HPSG Parsing. Junichi TSUJII. Department of Computer Science, School of Information Science and Technology, University of Tokyo, JAPAN; School of Computer Science, National Centre for Text Mining, University of Manchester, UK

  2. Increase in Medline. [Figure: chart of yearly increments (axis 0 to 600,000) and accumulation (axis 0 to 14,000,000) of MEDLINE articles by year, 1964 to 2002.] MEDLINE alone: more than 0.5 million articles added per year, more than 1.3 thousand per day. Medline access: 0.163 M accesses/month in 1997; 82.027 M accesses/month in 2006. G-protein coupled receptor [D.L.Banville 2006]: 9 papers before 1988; 256 papers in 1992; 14,000 papers in 2005 (500 times more).

  3. NaCTeM (www.nactem.ac.uk) • First such centre in the world • Funding: JISC, BBSRC, EPSRC • Consortium investment • Chair in TM (Prof. J. Tsujii, Univ. Tokyo) • Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk funded by the Wellcome Trust • Initial focus: biomedical academic community • Extend services to industry • Extend focus to other domains (social sciences)

  4. Consortium • Universities of Manchester, Liverpool • Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing) • Self-funded partners • San Diego Supercomputing Center • University of California, Berkeley • University of Geneva • University of Tokyo • Strong industrial & academic support • IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …

  5. NLP and TM: Linking text with knowledge. Natural Language Processing: language as a complex system linking surface strings of characters with their meanings; text and words as structured objects. Text Mining: text as a bag of words; words as surface strings. NLP-based TM.

  6. From surface diversities and ambiguities to conceptual invariants. [Diagram: the Language Domain (linguistic expressions) is linked by non-trivial mappings (terminology, parsing, paraphrasing) to the Knowledge Domain (concepts and relationships among them, motivated independently of language).]

  7. Example

  8. [A] protein activates [B] (Pathway extraction). Example sentences: (1) "Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene." (2) "Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription." (3) "Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to ….." Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein]). Non-trivial mapping: same relations with different structures. [Diagram: Language Domain mapped to Knowledge Domain, which is motivated independently of language.]
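The containment query on this slide can be illustrated with a minimal sketch. It assumes that each bracketed name denotes a list of text regions and that ">" means "contains"; the region offsets and the helper name `containing` are made up for illustration and are not taken from the retrieval system itself.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open token span [start, end)


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Return the regions in `outer` that contain at least one region in `inner`."""
    return [o for o in outer
            if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]


# Hypothetical annotations over a two-sentence text (token offsets).
sentences = [(0, 10), (10, 20)]
arg1_activate = [(2, 5), (12, 14)]  # spans filling arg1 of "activate"
proteins = [(3, 5), (16, 18)]       # spans tagged as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the first sentence is returned: it is the only one whose arg1 of "activate" contains a protein region. The second sentence mentions a protein, but not inside the arg1 span.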

  9. Predicate-argument structure: parser based on probabilistic HPSG (Enju). Example sentence: "p53 has been shown to directly activate the Bcl-2 protein". [Figure: parse tree with nodes S, VP, NP and ADVP, and predicate-argument links labelled arg1, arg2 and arg3.]

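  10. Passive and Infinitival Clause. Output of the probabilistic HPSG parser (Enju) for predicate-argument structure. Semantic retrieval system using deep syntax: MEDIE. Passive example: The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP. [Figure: parse tree with nodes s, vp, np, pp and dt, and links labelled arg1, arg2 and mod.]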
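Why predicate-argument structure helps retrieval can be shown with a toy normalisation. This is a sketch only: the `PredArg` record and the `normalize` function are hypothetical, the active sentence "It activates the protein" is invented for comparison, and none of this reflects how Enju actually computes its output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PredArg:
    pred: str            # base form of the predicate
    arg1: Optional[str]  # logical subject
    arg2: Optional[str]  # logical object


def normalize(verb, voice, subject, obj=None, by_agent=None):
    """Map surface grammatical roles to logical arguments."""
    if voice == "active":
        return PredArg(verb, arg1=subject, arg2=obj)
    if voice == "passive":
        # The surface subject of a passive is the logical object.
        return PredArg(verb, arg1=by_agent, arg2=subject)
    raise ValueError(f"unknown voice: {voice}")


# "The protein is activated by it"
passive = normalize("activate", "passive", subject="the protein", by_agent="it")
# "It activates the protein" (invented active counterpart)
active = normalize("activate", "active", subject="it", obj="the protein")

print(passive)
print(passive == active)
```

Both surface forms reduce to the same record, so a query on arg1 and arg2 matches either sentence.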

  11. Demos • MEDIE • Info-PubMed

  12. Predicate-argument structure: parser based on probabilistic HPSG (Enju). Example sentence: "p53 has been shown to directly activate the Bcl-2 protein". [Figure: parse tree with nodes S, VP, NP and ADVP, and predicate-argument links labelled arg1, arg2 and arg3.]

  13. Performance of Semantic Parser

  14. Scalability of TM Tools. Target corpus: MEDLINE. Suppose, for example, that it takes one second to parse one sentence: 70 million seconds in total, that is, about 2 years.

  15. TM and GRID • Solution • The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs • Parallel processing was managed by the grid platform GXP [Taura2004] • Experiments • The entire MEDLINE was parsed in 8 days • Output • Syntactic parse trees and predicate-argument structures in XML format • The data sizes of the compressed/uncompressed output were 42.5GB/260GB.
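The scalability figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slides (70 million seconds, 340 CPUs); the ideal-case estimate assumes perfect parallel efficiency, which a real cluster run would not reach.

```python
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = 70_000_000  # one second per sentence, as supposed on the slide
cpus = 340

# Sequential processing time in years.
years_sequential = total_seconds / (SECONDS_PER_DAY * 365)

# Lower bound on wall-clock days with perfect parallel efficiency (an assumption).
ideal_days = total_seconds / cpus / SECONDS_PER_DAY

print(f"sequential: {years_sequential:.1f} years")
print(f"ideal on {cpus} CPUs: {ideal_days:.1f} days")
```

The sequential estimate agrees with the "about 2 years" on the slide. The reported 8 days is longer than the ideal lower bound, which is expected, since the one-second-per-sentence figure is only a supposition and scheduling overhead is ignored here.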

  16. Efficient Parsing for HPSG

  17. Background: HPSG • Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994] • Lexicalized and constraint-based grammar • A few rule schemata: general constraints on linguistic constructions • Constraints embedded in the lexicon: word-specific constraints • Constraints between phrase structures and semantic structures

  18. Parsing by HPSG. Example sentence: "I like it"

  19. Parsing by HPSG: Assignment of Lexical Entries. I: [HEAD noun, SUBJ < >, COMPS < >]; like: [HEAD verb, SUBJ <NP>, COMPS <NP>]; it: [HEAD noun, SUBJ < >, COMPS < >].

  20. Application of Rule Schema: Head-Complement. The head "like" [HEAD verb, SUBJ <[1]>, COMPS <[2]>] combines with its complement "it" ([2]: HEAD noun, SUBJ < >, COMPS < >) to give the mother [HEAD verb, SUBJ <[1]>, COMPS < >].

  21. Application of Rule Schema: Subject-Head. The subject "I" ([1]: HEAD noun, SUBJ < >, COMPS < >) combines with the head "like it" [HEAD verb, SUBJ <[1]>, COMPS < >] to give the mother [HEAD verb, SUBJ < >, COMPS < >].
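The two schema applications for "I like it" can be traced with a small sketch. It is a deliberate simplification: real HPSG parsing unifies typed feature structures, whereas here SUBJ and COMPS are plain tuples of category names and matching is string equality. The `Sign` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    head: str                # HEAD value, e.g. "noun" or "verb"
    subj: Tuple[str, ...]    # SUBJ list (categories still required)
    comps: Tuple[str, ...]   # COMPS list (categories still required)
    phon: str                # words covered by this sign


def category(sign: Sign) -> Optional[str]:
    """A saturated noun sign counts as an NP in this toy grammar."""
    if sign.head == "noun" and not sign.subj and not sign.comps:
        return "NP"
    return None


def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: consume the first element of COMPS."""
    if head.comps and category(comp) == head.comps[0]:
        return Sign(head.head, head.subj, head.comps[1:], f"{head.phon} {comp.phon}")
    return None


def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: consume SUBJ once COMPS is empty."""
    if not head.comps and head.subj and category(subj) == head.subj[0]:
        return Sign(head.head, head.subj[1:], (), f"{subj.phon} {head.phon}")
    return None


# Lexical entries as listed on the slides.
pron_i = Sign("noun", (), (), "I")
pron_it = Sign("noun", (), (), "it")
like = Sign("verb", ("NP",), ("NP",), "like")

vp = head_complement(like, pron_it)   # "like it": COMPS now empty
s = subject_head(pron_i, vp)          # "I like it": SUBJ now empty
print(vp)
print(s)
```

The final sign has empty SUBJ and COMPS lists, which is what makes it a complete sentence in this toy setting.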

  22. Inefficiency of HPSG Parsing • Complex DAGs: typed feature structures • Abstract machine for unification (LiLFeS) • Unification: expensive operation (⇔ CFG approximation: CFG filtering) • Assignment of lexical entries • High reduction of search space / supertagging

  23. Filtering with CFG (1/5) • Two-phase parsing • Approximate the HPSG with a CFG while keeping important constraints. • The obtained CFG might over-generate, but it can be used for filtering. • Rewriting in a CFG is far less expensive than applying rule schemata, principles and so on. [Diagram: the HPSG (feature structures) is compiled into a CFG; input sentences go through the built-in CFG parser and then LiLFeS unification parsing; the output is complete parse trees.]
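The two-phase idea can be sketched as a cheap CFG check that gates an expensive parser. Everything below is an illustrative assumption: the toy grammar, the CKY recognizer and the stand-in `expensive_unification_parse` are not the compiled CFG or the LiLFeS parser, and a real filter would work on chart edges rather than on whole sentences.

```python
from itertools import product

# Hypothetical CFG approximation in Chomsky normal form.
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}
LEXICON = {"I": {"NP"}, "it": {"NP"}, "like": {"V"}}


def cfg_accepts(words):
    """CKY recognition: True if the toy CFG derives S over the whole input."""
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        chart[i, i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            cell = set()
            for j in range(i + 1, k):
                for left, right in product(chart[i, j], chart[j, k]):
                    if (left, right) in BINARY:
                        cell.add(BINARY[left, right])
            chart[i, k] = cell
    return n > 0 and "S" in chart[0, n]


calls = []  # records which inputs reach the expensive phase


def expensive_unification_parse(words):
    """Stand-in for the costly unification-based parser."""
    calls.append(tuple(words))
    return "parse of " + " ".join(words)


def two_phase_parse(words):
    """Phase 1: cheap CFG filter. Phase 2: expensive parse, only if phase 1 passes."""
    if not cfg_accepts(words):
        return None
    return expensive_unification_parse(words)


print(two_phase_parse("I like it".split()))
print(two_phase_parse("like I it".split()))
print(calls)
```

Because the CFG may over-generate, passing the filter does not guarantee a full parse; the filter only removes inputs (or, in a real system, chart edges) that can never succeed.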

  24. System Overview. Input sentence: "I like it". Components: supertagger, CFG filtering, deterministic shift/reduce parser. [Figure: candidate sequences of lexical entries for "I like it" (I: HEAD noun, SUBJ < >, COMPS < >; like: HEAD verb, SUBJ <NP>, COMPS <NP>; it: HEAD noun, SUBJ < >, COMPS < >; ...), ordered from high probability P downwards.]

  25. Experiment Results. 6 times faster. 20 times faster than the initial model.

  26. Domain/Text Type Adaptation

  27. Adaptation with Reference Distribution. [Equation not preserved in the transcript; its annotated parts were: original model, feature function, feature weight. Feature groups: lexical assignment, syntactic preference.]
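The labels on this slide (original model, feature function, feature weight) fit the usual log-linear model with a reference distribution. The formula below is a sketch under the assumption that the slide showed this standard form; the symbols are chosen here and are not taken from the slide.

```latex
p(t \mid s) = \frac{1}{Z_s}\; p_0(t \mid s)\; \exp\Bigl(\sum_j \lambda_j f_j(t, s)\Bigr),
\qquad
Z_s = \sum_{t'} p_0(t' \mid s)\; \exp\Bigl(\sum_j \lambda_j f_j(t', s)\Bigr)
```

Here s is a sentence, t a candidate parse, p_0 the original model used as the reference distribution, f_j the feature functions (for lexical assignment and syntactic preference) and λ_j their weights. With all λ_j set to zero the adapted model reduces to the original one, so training on the new domain only has to learn corrections to it.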

  28. [Figure: F-score (axis 83 to 90) against the number of sentences of the GENIA training set (axis 0 to 8000). Curves: Baseline (PTB); Simple Retraining (GENIA); Retraining (GENIA+PTB); Structure with RefDist; Lexical with RefDist; Lexical/Structure with RefDist.]

  29. [Figure: F-score (axis 83 to 90) against training time in seconds (axis 0 to 30000). Curves: Retraining (GENIA); Structure with RefDist; Lexicon with RefDist; Lex/Str with RefDist.]

  30. Tool 1: POS Tagger • General-purpose POS taggers, trained on WSJ • Brill's tagger, TnT tagger, MXPOST, etc. • 97% • General-purpose POS taggers do not work well for MEDLINE abstracts. Example of tagger output: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
