1 / 30

GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain

GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain. Yuka Tateisi 1 , Yusuke Miyao 2 , Kenji Sagae 2 , Jun'ichi Tsujii 2,3 1 Department of Informatics, Kogakuin University 2 Graduate School of Information Science and Technology, University of Tokyo

lis
Download Presentation

GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain Yuka Tateisi1, Yusuke Miyao2, Kenji Sagae2, Jun'ichi Tsujii2,31Department of Informatics, Kogakuin University 2Graduate School of Information Science and Technology, University of Tokyo 3School of Computer Science, University of Manchester/National Centre for Text Mining

  2. Background Interleukin 1 beta inhibits insulin secretion Sentence IL-1beta is known to inhibit insulin secretion Insulin secretionis inhibited by IL-1 beta Predicate-Argument Structure Parsing inhibit ↓ [Event]=[Verb] [Agent]=[Subject] [Target]=[Object] [Verb] Inhibit [Subject] IL-1beta [Object] insulin secretion [Event] Inhibition [Agent] IL-1beta [Target] insulin secretion Rules IE

  3. Constituency vs Dependency • Dependency structure is becoming more common • Predicate-argument relation is easier to access from dependency than from constituency structure

  4. Parser Evaluation • Need to know the performance of parsers • Cross-formalism evaluation(PTB-based, TAG, HPSG, etc) • Cross-domain evaluation(Newswire text etc.) • Previous works used Stanford Dependency (Clegg et al. 2007, Pyysalo et al. 2007) • Automatic translation from treebank-based corpus • May contain error in conversion process • Good for comparison between treebank-based parsersbut may be unreliable for evaluating parsers based on other formalisms

  5. Our Approach • Manual Annotation of Dependency Structure • Grammatical Relations (GR:Carrol et al.1998) • Gold Standard Corpus exists for Wall Street Journal section of Penn Treebank • Has been used for cross-framework evaluation in newswire domain • Annotation on GENIA text • Other information (POS, tree, terms, event structure) available

  6. GENIA-GR Base Text • 50-abstract subset of GENIA • MeSH term: NF kappa B • Full text in PubMed Central • Never been used for training with GENIATreebank Annotation Scheme • Follows Briscoe’s scheme* • Not annotated inside technical terms *An introduction to tag sequence grammars and the RASP system parser (Technical report, University of Cambridge)

  7. Grammatical Relations • Shows dependency structure • Head-Dependent pairs with types of dependency I love you (ncsubj love I _) I is the subject of love (dobj love you)you is the object of love

  8. Technical Terms (Named Entities) • Technical Terms:= Terms tagged in GENIA term corpus • Multi-word terms are “declared” with • IDs taken from GENIA term corpus • Position in the sentence • Only IDs are referred to in the dependency structure

  9. Example sentence( id(96099434.1) named_entities( term( id(T1) span(1:3)(Protein kinase C-zeta)) term( id(T2) span(5:7)(NF-kappa B activation)) term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) sentence_form(Protein kinase C-zeta mediates NF-kappa B activation inhuman immunodeficiency virus-infected monocytes . ) rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8) (dobj mediates:4 *T2*) (dobj in:8 *T4*) ))

  10. Example sentence( id(96099434.1)Sentence ID named_entities( term( id(T1) span(1:3)(Protein kinase C-zeta)) term( id(T2) span(5:7)(NF-kappa B activation)) term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) sentence_form(Protein kinase C-zeta mediates NF-kappa B activation inhuman immunodeficiency virus-infected monocytes . ) rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8) (dobj mediates:4 *T2*) (dobj in:8 *T4*) ))

  11. Example sentence( id(96099434.1) named_entities(Multi-word Terms term( id(T1) span(1:3)(Protein kinase C-zeta)) term( id(T2) span(5:7)(NF-kappa B activation)) term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) sentence_form(Protein kinase C-zeta mediates NF-kappa B activation inhuman immunodeficiency virus-infected monocytes . ) rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8) (dobj mediates:4 *T2*) (dobj in:8 *T4*) ))

  12. Example sentence( id(96099434.1) named_entities( term( id(T1) span(1:3)(Protein kinase C-zeta)) term( id(T2) span(5:7)(NF-kappa B activation)) term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) sentence_form(Protein kinase C-zeta mediates NF-kappa B activation inhuman immunodeficiency virus-infected monocytes . ) rasp(Sentence Form (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8) (dobj mediates:4 *T2*) (dobj in:8 *T4*) ))

  13. Example sentence( id(96099434.1) named_entities( term( id(T1) span(1:3)(Protein kinase C-zeta)) term( id(T2) span(5:7)(NF-kappa B activation)) term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) sentence_form(Protein kinase C-zeta mediates NF-kappa B activation inhuman immunodeficiency virus-infected monocytes . ) rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8)Dependency Structure (dobj mediates:4 *T2*) (dobj in:8 *T4*) ))

  14. Annotation Procedure • Manual correction of Rasp parser output • Annotation by one annotator • Procedure • Initial annotation • Error correction, identification of problems • Refinement of scheme, further correction • Annotation by other annotator (s), Inter Annotator Agreement evaluation • Error correction, re-refinement of scheme, etc • Publication

  15. Annotation Results

  16. Distribution of Dependency Types Nx: occurrence in the corpus Fx : frequency per sentence GENIA: GENIA (492 sentences) PTB: PTB-WSJ (560 sentences)

  17. Typical Construction in Biomedical Abstract (?) A similar potentiation of TNF effects was observed in Jurkat T cells andHeLa cells treated with soluble Tat protein .

  18. Problems found in the initial annotation process • Domain-specific structures • Coordination • Prepositional Arguments/Modifiers • Inconsistency with other GENIA corpora

  19. Genes 296 residues ncmod(ta) ncmod 302 to dobj

  20. New York firm ncmod ncmod based Binding(participle as a prenominal modifier) Current annotation ncmod NF kappa B domain ncmod binding

  21. ncsubj domain binding NF kappa B dobj xmod Binding(participle as a prenominal modifier) But perhaps it’s better to do this way activity NF kappa B binding dobj ncmod(ta)

  22. References in the text Where is the head? J.Virol.72 ta ta ta and 5852-5861 1998 conj conj conj N.Israel F.Barre-Sinoussi M.Nugeyre

  23. Math expressions What’s the dependency type? 2 x NFKappaB arg or conj conj SlVmac239 ≈ xmod arg = > arg deltaNFkappaB

  24. References and Math expressions • Not frequent but problematic • May occur more frequently in full text • References may be treated like terms • Math expressions can work as S or VP Tn<Tn+1 if n>0.

  25. CD4(+) CD8(+) Coordination CD4(+) and CD8(+) T lymphocytes CD4(+) CD8(+)

  26. CD4(+) CD8(+) CD4(+) and CD8(+) T lymphocytes ncmod and T lymphocytes conj conj CD8(+) CD4(+)

  27. CD4(+) CD8(+) and conj conj ellip T lymphocytes ncmod ncmod CD8(+) CD4(+) CD4(+) and CD8(+) T lymphocytes ? no mechanism to coindex

  28. PP argument/modifier • iobj (argument) or ncmod(modifier)? Reference to external resources GENIA event corpus PASBio(Wattarujeekrit et al 2008) A similar potentiation of TNF effects was observed in Jurkat T cells andHeLa cells treated with soluble Tat protein .

  29. Inconsistencies with other GENIA corpora • Dependency Inside a Token renal cell carcinoma-derived gangliosides • Mostly involving hyphens • Many inside technical terms • Dependency on Part of a Technical Term more mature cell lines

  30. Conclusions • 50-abstract GENIA subset with dependency annotation • is in the phase of error correction • Works to be done • Refinement of the scheme • Evaluation • (Enhancement to full text)

More Related