
Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain

Paul Buitelaar, DFKI Language Technology, Saarbrücken, Germany. Overview: MuchMore Objectives; Semantic Annotation (Semantic Resources, Term/Relation Tagging); Corpus Annotation.


Presentation Transcript


  1. Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
     Paul Buitelaar, DFKI Language Technology, Saarbrücken, Germany

  2. Overview
     MuchMore Objectives
     Semantic Annotation: Semantic Resources, Term/Relation Tagging
     Corpus Annotation: Part-of-Speech, Morphology, Chunks, Grammatical Functions
     Annotation Format (DTD), Examples, Demo

  3. MuchMore Objectives
     Evaluation
       Systematic comparison of CLIR methods on a realistic scenario in the medical domain
       Establishing a baseline with corpus-based methods
       Comparison with concept-based methods
     Concept-Based CLIR
       Effective use of medical and general semantic resources by developing methods for tuning and extension

  4. Semantic Resources
     Medical Domain
       UMLS: Unified Medical Language System
       Medical MetaThesaurus (MeSH, ICD, …)
         English, German, Spanish, …
         730,000 concepts
         9 relations (Broader, Narrower, …)
       Semantic Network
         134 semantic types
         54 semantic relations
     General
       WordNet (EN), GermaNet (DE), EuroWordNet ("linked")

  5. UMLS
     C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
     C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
     C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
     C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
     C0019682|FIN|P|L1523437|PF|S1819346|HIV|3|
     C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
     C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
     C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
     C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|
     Concept Names (MRCON): 1,734,706 English; 1,462,202 German; 66,381 other languages
     • Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI)
     • Example: Clozapine (C0009079) has the semantic type Pharmacologic Substance (T121)
     • Semantic types are organized in a network through 54 relations
     • Example network entry: T121|T154|T047
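The pipe-delimited concept-name rows on this slide can be read with a few lines of Python. The following is a minimal sketch, not the MuchMore tooling: the field names (`language`, `term_status`, and so on) are my reading of the MRCON column layout and should be checked against the UMLS documentation, and `names_by_language` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class ConceptName(NamedTuple):
    # Field names are assumptions about the MRCON column order.
    cui: str          # Concept Unique Identifier
    language: str     # e.g. ENG, GER, FRE
    term_status: str  # P (preferred) or S (synonym) in the rows shown
    lui: str          # lexical (term) identifier
    string_type: str  # e.g. PF, VS, VWS
    sui: str          # string identifier
    string: str       # the concept name itself
    restriction: str  # last numeric column in the rows shown


def parse_mrcon_line(line: str) -> ConceptName:
    """Split one row; rows end with '|', so the final split element is empty."""
    fields = line.rstrip("\n").split("|")
    return ConceptName(*fields[:8])


def names_by_language(lines):
    """Build cui -> language -> list of names (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for line in lines:
        name = parse_mrcon_line(line)
        index[name.cui][name.language].append(name.string)
    return index


# A few rows copied from the slide.
ROWS = [
    "C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|",
    "C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|",
    "C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|",
    "C0019682|GER|P|L0413854|PF|S0538136|HIV|3|",
]

INDEX = names_by_language(ROWS)
```

Grouping names by CUI and language is what makes the resource usable for cross-lingual retrieval: a German and an English string that share a CUI can be treated as the same concept.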

  6. Term / Relation Tagging
     Annotate terms (of length 1-4 tokens) with preferred term, CUI and TUI:
       <term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C0021708" tui="T073"/>
     Annotate all possible semantic relations between identified terms within a sentence:
       <term id="2" tokenid="2" preferred="Heparinoid" cui="C0019142" tui="T121"/>
       <term id="5" tokenid="6" preferred="Thrombin" cui="C0040018" tui="T126"/>
       <semrel id="40" relterms="5, 2" reltype="interacts_with"/>
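The two tagging steps on this slide can be sketched as a lexicon lookup over token windows of length 1 to 4, followed by a lookup of every ordered term pair in a table of permitted type-to-type relations. This is only an illustration under stated assumptions: `TERM_LEXICON` and `TYPE_RELATIONS` are tiny hypothetical stand-ins for UMLS (their entries are copied from the slide's examples), the matching strategy is a guess, and the example sentence is invented.

```python
from itertools import permutations

MAX_TERM_LEN = 4  # the slide states terms of length 1-4 tokens

# Hypothetical lexicon: lowercased token tuple -> (preferred term, CUI, TUI).
TERM_LEXICON = {
    ("intensive", "care", "unit"): ("Intensive Care Unit", "C0021708", "T073"),
    ("heparinoid",): ("Heparinoid", "C0019142", "T121"),
    ("thrombin",): ("Thrombin", "C0040018", "T126"),
}

# Hypothetical excerpt of the semantic network: (TUI, TUI) -> relation names.
TYPE_RELATIONS = {("T126", "T121"): ["interacts_with"]}


def tag_terms(tokens):
    """Annotate every lexicon match of 1-4 tokens; token ids are 1-based."""
    lowered = [t.lower() for t in tokens]
    terms = []
    for start in range(len(lowered)):
        for length in range(1, MAX_TERM_LEN + 1):
            if start + length > len(lowered):
                break
            key = tuple(lowered[start:start + length])
            if key in TERM_LEXICON:
                preferred, cui, tui = TERM_LEXICON[key]
                terms.append({
                    "id": len(terms) + 1,
                    "tokenid": list(range(start + 1, start + length + 1)),
                    "preferred": preferred,
                    "cui": cui,
                    "tui": tui,
                })
    return terms


def tag_relations(terms):
    """Emit every relation the type table allows between two tagged terms."""
    relations = []
    for first, second in permutations(terms, 2):
        for reltype in TYPE_RELATIONS.get((first["tui"], second["tui"]), []):
            relations.append({
                "id": len(relations) + 1,
                "relterms": (first["id"], second["id"]),
                "reltype": reltype,
            })
    return relations


# Invented example sentence, not taken from the corpus.
TOKENS = "The heparinoid inhibits thrombin in the intensive care unit".split()
TERMS = tag_terms(TOKENS)
RELATIONS = tag_relations(TERMS)
```

Because relations are proposed purely from the semantic types of co-occurring terms, this approach over-generates, which matches the slide's wording of annotating "all possible" relations.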

  7. Corpus Annotation
     Morpho/Syntactic Processing
       TnT: tokenization, segmentation, PoS tagging
       Mmorph: lemmatization (German compound analysis)
       Chunkie: phrase recognition
       Under development: grammatical function tagging
     Parallel Corpus
       ~9000 English and German medical abstracts from 41 journals (obtained through the Springer LINK website)
       ~1 M tokens for each language
       Manual clean-up

  8. Tokenization, POS Tagging
     Tokenization
       Hyphenated compounds, e.g. side-effects, short-term, follow-up
       Abbreviations, e.g. aquos., emulsific., Ungt.
     TnT PoS-Tagger (Brants, 2000)
       Retrain on an annotated domain-specific corpus
       Update underlying lexicon
       Specialist Medical Lexicon: UMLS (English), ZInfo (German)
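The two tokenization issues named on this slide, hyphenated compounds and abbreviations ending in a period, can be illustrated with a small regex tokenizer. This is a sketch under my own assumptions and not the tokenizer used with TnT: the `ABBREVIATIONS` set holds only the three examples from the slide, and a real system would draw it from a domain lexicon.

```python
import re

# Only the abbreviations listed on the slide; a real list would be far longer.
ABBREVIATIONS = {"aquos.", "emulsific.", "Ungt."}

TOKEN_RE = re.compile(
    r"""
    \w+(?:-\w+)*\.?   # a word, optionally hyphenated, with an optional final period
  | [^\w\s]           # otherwise any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Keep hyphenated compounds whole; split a final period unless the
    token is a known abbreviation."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens
```

For example, `tokenize("Ungt. emulsific. aquos. 50 g")` keeps the three abbreviations intact, while a sentence-final word such as `side-effects.` is split into the compound and a separate period.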

  9. Morphology, Phrase Recognition
     Mmorph
       Dumped full-form lexicon (domain independent)
       Decomposition: problematic for German, e.g. Schleimhautoedem > Schleimhaut+Oe+Dem
       German Medical Specialist Lexicon
     Chunkie
       HMM-based partial parser (Skut and Brants, 2000)
       Recognition of internal structure of simple as well as complex NPs, PPs and APs
       Retraining needed on annotated medical corpora
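The decomposition problem on this slide can be reproduced with a toy lexicon-driven splitter. The sketch below is not how Mmorph works; it is a generic "fewest parts" segmentation, and both lexicons are hypothetical. It shows why a domain-independent lexicon that lacks "Oedem" produces Schleimhaut+Oe+Dem, and why adding a medical lexicon entry fixes the split.

```python
from functools import lru_cache


def split_compound(word, lexicon):
    """Return the segmentation into lexicon entries with the fewest parts,
    or None if the word cannot be fully covered."""
    word = word.lower()
    entries = frozenset(entry.lower() for entry in lexicon)

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in entries:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


# Hypothetical lexicons for illustration only.
GENERAL_LEXICON = {"Schleim", "Haut", "Schleimhaut", "Oe", "Dem"}
MEDICAL_LEXICON = GENERAL_LEXICON | {"Oedem"}

GENERAL_SPLIT = split_compound("Schleimhautoedem", GENERAL_LEXICON)
MEDICAL_SPLIT = split_compound("Schleimhautoedem", MEDICAL_LEXICON)
```

With the general lexicon the splitter can only cover the tail of the word with the short entries "oe" and "dem"; once "Oedem" is available, the two-part analysis wins because it uses fewer parts.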

  10. Grammatical Function Tagging
      Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
      (English: "Thirty patients who had to undergo elective aortocoronary bypass surgery were examined.")
      Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.
      "Untersucht" PAS.SUBJ:SUBJ "Patienten"
      "unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"
      "unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"
      "unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"

  11. XML Annotation Format (DTD)
      [Diagram: element hierarchy of the annotation format. Elements shown: document; title, sentence, keywords, keyword; gramrels, chunks, terms, ewnterms, semrels, text; gramrel, chunk, term, ewnterm, semrel, token.]

  12. XML Annotation (Example)
      <?xml version="1.0" encoding="ISO-8859-1" ?>
      <document id="DerHautarzt.80490581.eng" type="abstract" lang="eng">
        <sentence id="s1" corresp="s1">
          <terms>
            <term id="s1.t1" tokenid="s1.w5" preferred="Women" cui="C0043209" tui="T098" />
            <term id="s1.t2" tokenid="s1.w7" preferred="Fevers" cui="C0015967" tui="T184" />
            <term id="s1.t3" tokenid="s1.w9 s1.w10" preferred="Weight Loss" cui="C0043096" tui="T184" />
          </terms>
          <semrels>
          </semrels>
          <gramrels>
            <gramrel id="s1.g1" tokenid="s1.w6 s1.w6" gramtype="ACT" prob="0.750" />
            <gramrel id="s1.g2" tokenid="s1.w5 s1.w6" gramtype="SUBJ" prob="0.017" />
            <gramrel id="s1.g3" tokenid="s1.w7 s1.w6" gramtype="OBJ" prob="0.056" />
            <gramrel id="s1.g4" tokenid="s1.w10 s1.w6" gramtype="OBJ" prob="0.106" />
          </gramrels>
          <chunks>
            <chunk id="s1.c1" from="s1.w1" to="s1.w5" type="NP" />
            <chunk id="s1.c2" from="s1.w9" to="s1.w10" type="NP" />
            <chunk id="s1.c3" from="s1.w11" to="s1.w13" type="PP" />
          </chunks>
          <text>
            <token id="s1.w1" pos="DT" lemma="a">A</token>
            <token id="s1.w2" pos="JJ">34-year-old</token>
            <token id="s1.w3" pos="VBN" lemma1="HIV" lemma2="infect">HIV-infected</token>
            <token id="s1.w4" pos="JJ" lemma="african">African</token>
            <token id="s1.w5" pos="NN" lemma="woman">woman</token>
            <token id="s1.w6" pos="VBN" lemma="develop">developed</token>
            <token id="s1.w7" pos="NN" lemma="fever">fever</token>
            <token id="s1.w8" pos="CC" lemma="and">and</token>
            <token id="s1.w9" pos="NN" lemma="weight">weight</token>
            <token id="s1.w10" pos="NN" lemma="loss">loss</token>
            <token id="s1.w11" pos="IN" lemma="on">on</token>
            <token id="s1.w12" pos="PRP" lemma="her">her</token>
            <token id="s1.w13" pos="NN" lemma="trunk">trunk</token>
            <token id="s1.w14" pos="CC" lemma="and">and</token>
            <token id="s1.w15" pos="NN" lemma="arm">arms</token>
            <token id="s1.w16" pos="punct">.</token>
          </text>
        </sentence>
      </document>
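Because terms point at tokens through `tokenid` rather than containing text, a consumer of this format has to join the two layers. The following sketch shows one way to do that with Python's standard `xml.etree.ElementTree`. `SAMPLE` is a shortened, well-formed excerpt of the slide's example, and `terms_with_surface` is a hypothetical helper name; neither is part of the MuchMore tools.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the slide's example (well-formed, declaration omitted).
SAMPLE = """
<document id="DerHautarzt.80490581.eng" type="abstract" lang="eng">
  <sentence id="s1" corresp="s1">
    <terms>
      <term id="s1.t1" tokenid="s1.w5" preferred="Women" cui="C0043209" tui="T098" />
      <term id="s1.t3" tokenid="s1.w9 s1.w10" preferred="Weight Loss" cui="C0043096" tui="T184" />
    </terms>
    <text>
      <token id="s1.w5" pos="NN" lemma="woman">woman</token>
      <token id="s1.w9" pos="NN" lemma="weight">weight</token>
      <token id="s1.w10" pos="NN" lemma="loss">loss</token>
    </text>
  </sentence>
</document>
""".strip()


def terms_with_surface(root):
    """Yield (CUI, preferred term, surface text) for every term annotation,
    resolving each term's tokenid references within its own sentence."""
    for sentence in root.iter("sentence"):
        tokens = {tok.get("id"): tok.text for tok in sentence.iter("token")}
        for term in sentence.iter("term"):
            token_ids = term.get("tokenid").split()
            surface = " ".join(tokens[token_id] for token_id in token_ids)
            yield term.get("cui"), term.get("preferred"), surface


ROOT = ET.fromstring(SAMPLE)
TERM_ROWS = list(terms_with_surface(ROOT))
```

For concept-based retrieval, the CUI column is the useful one: indexing documents by CUI instead of by surface word lets an English abstract and its German counterpart match on the same identifiers.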

  13. Demo...
