MUCHMORE: Multilingual Concept Hierarchies for Medical Information Organization and Retrieval
Project Overview

Application

  • Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval

Research & Development

  • Developing Novel, Hybrid (Corpus-/Concept-Based) Methods for Handling this Scenario
  • Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods

User Perspective (ZInfo)

Vision: BAIK Model

  • MuchMore → Provide Relevant Medical Information
  • … for a Specific Patient Problem
  • … Automatically, from the Web
  • … Independent of Language
User Perspective (ZInfo)

User Requirements

  • Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient
  • Retrieval and Relevance Ranking of Evidence Based Medical Literature, Language Independent
  • Summarization and Filtering of Results According to a User Profile
User Perspective (ZInfo)

User Evaluation

Evaluate Usefulness

  • Query Generation
  • Relevance for Decisions in Diagnostics and Treatment

Use for Medical Cases

  • Part of Postgraduate Course in Medical Informatics

Problematic Issues

  • Different medical profiles, schools, experience, speciality
  • What is relevant for one user may mean little or nothing to another
  • Evidence based medicine criteria exist only for a small fraction of medicine

MuchMore Prototype
  • Overview of Prototype Functionality
  • Relation between Functionality and User Requirements
  • → Issues Addressed by Research and Development within MuchMore
R&D in MuchMore

Semantic Annotation Based CLIR

  • Corpus Annotation (DFKI, ZInfo)
    • PoS, Morphology, Phrases, Grammatical Functions
    • Term and Relation Tagging
  • Term Extraction (XRCE, EIT, CMU, CSLI)
    • Bilingual Lexicon Extraction, Extension of Semantic Resources
  • Sense Disambiguation (CSLI, DFKI)
    • Tuning and Extension of Semantic Resources
    • Combining Sense Disambiguation Methods
  • Relation Extraction (DFKI, CSLI)
    • Grammatical Function Tagging
    • Extracting Semantic Relation Indicators
    • Extracting Novel Semantic Relations
  • Semantic Indexing/Retrieval (EIT, DFKI)
R&D in MuchMore

Additional Approaches in CLIR

  • Corpus Based CLIR
    • Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI)
    • Pseudo Relevance Feedback: PRF (CMU)
    • Generalized Vector Space Model: GVSM (CMU)
  • Text Classification Based CLIR (CMU)
    • Hierarchical/Flat kNN with MeSH
  • Summarization (CMU)
    • Query, Genre Specific

Corpus Annotation

Annotation Evaluation


~9,000 English and German Medical Abstracts from 41 Journals (Springer LINK Website), ~1M Tokens for each Language

  • Lexicon Update, Remaining Error Rate ~1.5% (EN)

Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate. pos=VB > pos_correct=NN

Incorrect, e.g.: Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie

  • Term and Relation Tagging
    • Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query
Term Extraction

XRCE (Aims and Resources)

Bilingual Lexicon Extraction

From Comparable Corpora at Word Level; From Parallel Corpora at Word and Term (Multi-Word) Level

Bilingual Extension of Semantic Resource (MeSH)

  • Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH)
  • Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)
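
The slides do not say how the corpus-specific decompounding works. As a rough illustration only, the sketch below splits a compound against a lexicon and prefers the split with the fewest parts, which avoids over-segmentations such as Chor + Ion + Zotte + Biopsie. The lexicon, the list of linking elements, and the function name `decompound` are all assumptions made for this example.

```python
from functools import lru_cache

# Hypothetical mini-lexicon; a real system would derive it from the corpus.
LEXICON = {"chorion", "chor", "ion", "zotte", "biopsie"}
# Assumed German linking elements that may follow a compound part.
LINKERS = ("", "n", "s", "en", "es")


def decompound(word, lexicon=LEXICON):
    """Return the lexicon-based split with the fewest parts, or None."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # An empty tuple means the rest of the word is fully covered.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part not in lexicon:
                continue
            for linker in LINKERS:
                if word.startswith(linker, end):
                    rest = best(end + len(linker))
                    if rest is not None:
                        candidates.append((part,) + rest)
        # Fewest parts wins; None if no complete split exists.
        return min(candidates, key=len) if candidates else None

    return best(0)


print(decompound("Chorionzottenbiopsie"))
```

With this toy lexicon the four-part split is still reachable, but the three-part split Chorion + Zotte + Biopsie is preferred because it is shorter.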
Term Extraction

XRCE (Results of Best Method)

Optimal Combination of Resources

Retaining only the 10 Best Translations for each Candidate

1. word-to-word, comparable corpora: F1 = 0.84

2.a word-to-word, parallel corpora: F1 = 0.98

2.b term-to-term, parallel corpora: F1 = 0.85

Evaluating Separately with Individual Resources (F1)

Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56

3. MeSH Extension: 1453 new multi-word terms added (synonyms or new term entries) extracted from the Springer corpus

Term Extraction

EIT (Similarity Thesauri)

  • Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus
  • Single Word Terms (Springer Abstracts)
    • German-English: 104,904 / English-German: 49,454
  • Multiword Terms (Phrase Lexicon Generated from ICD10)
    • German Phrases: 354 / English Phrases: 665
    • Bilingual Phrasal Entries Generated: German-English: 225 / English-German: 246
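
The frequency comparison between a general and a medical corpus can be sketched as a smoothed relative-frequency ratio. The slides do not give the scoring actually used by EIT, so the ratio, the add-one smoothing, the threshold `min_ratio`, and the function name `domain_terms` are assumptions made for illustration.

```python
from collections import Counter


def domain_terms(medical_tokens, general_tokens, min_ratio=2.0, top_n=10):
    """Rank single-word terms that are relatively more frequent in the
    medical corpus than in the general corpus (assumed scoring)."""
    med = Counter(t.lower() for t in medical_tokens)
    gen = Counter(t.lower() for t in general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    vocab = len(set(med) | set(gen))

    scored = []
    for term, count in med.items():
        # Add-one smoothing so unseen general-corpus terms do not divide by zero.
        p_med = (count + 1) / (med_total + vocab)
        p_gen = (gen[term] + 1) / (gen_total + vocab)
        ratio = p_med / p_gen
        if ratio >= min_ratio:
            scored.append((term, ratio))

    # Highest ratio first, alphabetical order as tie-breaker.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in scored[:top_n]]


medical = "the patient had a tumor the tumor was removed".split()
general = "the game was played in the city and the crowd was loud".split()
print(domain_terms(medical, general))
```

Function words such as "the" occur in both corpora at similar rates and fall below the threshold, while "tumor" ranks first in this toy input.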

Term Extraction

CMU (EBT Bilingual Lexicon)

  • For each word in one language, accumulate counts of the number of times the translations of the sentences containing that word include each word of the other language. These co-occurrence counts may be restricted using word-alignment techniques.
  • Apply a variable threshold to filter out uncommon co-occurrences which are unlikely to be translations. The result is a lexicon listing candidate translations and their relative frequencies.
  • ~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts); Estimated Error Rate: < 10%
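
The two steps above can be sketched roughly as follows. This is a simplified illustration, not the CMU implementation: it uses a fixed relative-frequency cutoff (`min_rel_freq`, an assumed parameter) instead of a variable threshold, and it applies no word-alignment restriction. The function name `build_lexicon` is hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(sentence_pairs, min_rel_freq=0.3):
    """Build a candidate translation lexicon from aligned sentence pairs."""
    cooc = defaultdict(Counter)
    for source, target in sentence_pairs:
        target_words = set(target.lower().split())
        # Count each target word once per sentence pair containing the source word.
        for word in set(source.lower().split()):
            cooc[word].update(target_words)

    lexicon = {}
    for word, counts in cooc.items():
        total = sum(counts.values())
        # Keep candidates whose relative frequency reaches the cutoff.
        kept = {t: c / total for t, c in counts.items() if c / total >= min_rel_freq}
        if kept:
            lexicon[word] = dict(sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])))
    return lexicon


pairs = [
    ("das herz", "the heart"),
    ("das herz schlägt", "the heart beats"),
    ("die leber", "the liver"),
]
print(build_lexicon(pairs)["herz"])
```

On this toy input "herz" keeps both "heart" and "the" as candidates, which shows why the slides mention word alignment and threshold tuning as additional filters.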


Term Extraction

CSLI (Infomap System)

Represent English and German Words as Vectors that are Produced by Recording the Number of Co-Occurrences of the Word in Question with each of a Set of Content-Bearing Words. Use (Cosine) Similarity Measure on these Rows to Find “Nearest Neighbours”.
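
A minimal sketch of this vector model, assuming sentence-level co-occurrence and a tiny hand-picked set of content-bearing words (the slide mentions 1,000 English ones). It is not the Infomap code; the function names and the toy sentences are made up for illustration.

```python
import math
from collections import Counter


def build_vectors(sentences, content_words):
    """Map each word to counts of content-bearing words in the same sentence."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            vec = vectors.setdefault(word, Counter())
            for j, other in enumerate(tokens):
                if i != j and other in content_words:
                    vec[other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scores = [(w, cosine(target, vec)) for w, vec in vectors.items() if w != word]
    return sorted(scores, key=lambda item: (-item[1], item[0]))[:k]


sentences = ["ligament knee injury", "kneejoint knee injury", "liver blood enzyme"]
content = {"knee", "injury", "blood", "enzyme"}
print(nearest_neighbours("ligament", build_vectors(sentences, content), k=2))
```

For the bilingual case described on the slide, one plausible setup is to count co-occurrences over aligned sentence pairs so that English and German words share the same vector dimensions; that detail is an assumption here.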

[Figure: word vectors over 1,000 (English) content-bearing words; labels: ligament, kneejoint]

WSD: Terms, Senses

Semantic Resource Extension and Tuning

  • Extension (DFKI)
    • Morphological Analysis (Decomposition)
      • Entzündungsgewebe (inflammation tissue) HYPONYM Gewebe, Körpergewebe (body tissue)
      • Gewebe, Stoff, Textilstoff (textile)
    • Semantic Similarity (Co-Occurrence Patterns)
      • Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, ....
  • Tuning (CSLI, DFKI)
    • Aligning Clusters with Senses


C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|

WSD: Algorithm

Combination of Methods (Task, Domain, General)

Bilingual Sense Selection (CSLI)

  • 1 Sense in L1 vs. >1 Sense in L2
      • English blood vessel (C0005847) vs. vessel (polysaccharide) (C0148346)
      • German Blutgefaesse = blood vessel (C0005847)

Collocations and Senses (CSLI)

  • For an ambiguous single word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term.

single word term: abortion

  1) a natural process C0000786 (T047)
  2) a medical procedure C0000811 (T061)

multiword terms:

  recurrent abortion C0000809 (T047) => sense 1
  induced abortion C0000811 (T061) => sense 2
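
The collocation heuristic can be sketched as below. The link from a multiword term to a sense of the single word is assumed to go through the shared semantic type (T047, T061), as the example suggests. The corpus counts are invented for illustration, and `select_sense` is a hypothetical name.

```python
# Senses of the ambiguous single-word term, keyed by semantic type (from the slide).
SENSES = {"abortion": {"T047": "C0000786", "T061": "C0000811"}}
# Semantic types of unambiguous multiword terms containing it (from the slide).
MULTIWORD_TYPES = {"recurrent abortion": "T047", "induced abortion": "T061"}


def select_sense(word, senses, multiword_types, corpus_counts):
    """Choose the sense whose semantic type matches the most frequent
    multiword term that contains `word`."""
    candidates = [term for term in multiword_types if word in term.split()]
    if not candidates:
        return None
    # Term name is used only as a deterministic tie-breaker.
    most_frequent = max(candidates, key=lambda t: (corpus_counts.get(t, 0), t))
    return senses[word].get(multiword_types[most_frequent])


# Invented frequencies, not taken from the MuchMore corpus.
counts = {"recurrent abortion": 12, "induced abortion": 30}
print(select_sense("abortion", SENSES, MULTIWORD_TYPES, counts))
```

With these made-up counts "induced abortion" is the more frequent collocation, so the medical-procedure sense is selected.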

WSD: Algorithm

Combination of Methods (Task, Domain, General)

Domain Specific Senses (DFKI)

  • Concept Relevance in Domain Corpus
    • Mineral
      • 0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium
      • 4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon

Instance-Based Learning (DFKI)

  • Unsupervised Context Models (n-grams)
      • Training (Learn Class Models): He drank <milk LIQUID>; He drank <coffee LIQUID>; He drank <tea LIQUID>; He drank <chocolate FOOD, LIQUID>
      • Application (Apply Class Models): He drank <chocolate FOOD, LIQUID>; He drank <Java GEOGRAPHICAL, LIQUID>
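
The training and application steps can be illustrated with a very small context model. The slides do not specify how ambiguous training instances are counted; this sketch assumes each candidate class receives an equal fractional count, and it represents the n-gram context as a plain string. All names are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn per-context class counts from (context, candidate_classes) pairs."""
    model = defaultdict(Counter)
    for context, classes in examples:
        # Assumption: an ambiguous instance spreads its count over its classes.
        for cls in classes:
            model[context][cls] += 1 / len(classes)
    return model


def disambiguate(model, context, candidate_classes):
    """Pick the candidate class seen most often in this context, if any."""
    counts = model.get(context, Counter())
    best = max(candidate_classes, key=lambda cls: counts[cls])
    return best if counts[best] > 0 else None


training = [
    ("he drank", ["LIQUID"]),            # milk
    ("he drank", ["LIQUID"]),            # coffee
    ("he drank", ["LIQUID"]),            # tea
    ("he drank", ["FOOD", "LIQUID"]),    # chocolate
]
model = train(training)
print(disambiguate(model, "he drank", ["FOOD", "LIQUID"]))          # chocolate
print(disambiguate(model, "he drank", ["GEOGRAPHICAL", "LIQUID"]))  # Java
```

Both "chocolate" and "Java" resolve to LIQUID in the "he drank" context, because that class dominates the counts learned for it.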

WSD: Evaluation

Lexical Sample Evaluation Corpora (Medical)

  • Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8) DE: 4059 (1.5)
  • Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)

Relation Extraction

Grammatical Function Tagging (DFKI)

  • Robust, Shallow Grammatical Function Tagger
  • EM Model (Trained on Frankfurter Rundschau: 35M Tokens, Adaptation on Medical Corpora Under Development)

1.5M ‘Types’: Verb, Voice, Function, Nom-Head-Argument

abarbeiten ACT SUBJ Politiker

  • Use of PoS Information, Use of Chunk Information Planned
  • German Available, English under Development
  • Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten. (English: "30 patients who had to undergo elective aortocoronary bypass surgery were examined.")

Relation Extraction

Semantic Relation Indicators (DFKI, CSLI)

Novel Semantic Relations (DFKI, CSLI)

[Figure: clusters of semantic type pairs with candidate relations]

Cluster 1

T047/T060 (Diagnoses)

T060/T101 (Affects)

Cluster 3

T047/T121 (Treats, Causes)

T061/T121 (Uses)

T121/T184 (Treats)

Cluster 2

Semantic types:

  • T047: Disease
  • T048: Mental Dysfunction
  • T060: Diagnostic Procedure
  • T101: Patient
  • T121: Pharm. Substance
  • T169: Funct. Concept (Syndrome)
  • T184: Sign or Symptom

Maximal Marginal Relevance (MMR)

  • Find passages most relevant to query
  • Maximize information novelty (minimize passage redundancy)

Assemble extracted passages for summary

$$\operatorname*{argmax}^{k}_{d_i \in C} \left[ \lambda\, S(Q, d_i) - (1 - \lambda) \max_{d_j \in R} S(d_i, d_j) \right]$$

Q = query, d = document, S = similarity function

λ = tradeoff factor between relevance & novelty

k = number of passages to include in summary
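
A minimal greedy sketch of MMR selection follows. It assumes C is the set of candidate passages and R the set already selected, which the slide does not spell out, and it uses token-set Jaccard overlap as a stand-in for the similarity function S. The function names and the toy passages are made up.

```python
def jaccard(a, b):
    """Stand-in similarity S: overlap of lower-cased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr_select(query, candidates, k, lam=0.5, similarity=jaccard):
    """Greedily pick k passages, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(passage):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(passage, s) for s in selected), default=0.0)
            return lam * similarity(query, passage) - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


passages = [
    "knee ligament injury treatment",
    "knee ligament injury treatment options",
    "liver enzyme levels",
    "ligament surgery outcome",
]
print(mmr_select("knee ligament injury", passages, k=2, lam=0.5))
```

With λ = 0.5 the near-duplicate second passage is skipped in favour of a less redundant one; with λ = 1.0 the selection reduces to plain relevance ranking.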

Summarization (CMU)

Extractive Summarization

  • Re-ranking retrieved documents from IR Engine
  • Ranking passages from a document for inclusion in summaries
  • Ranking passages from topically-related document cluster for cluster summary


Summarization (CMU)

MuchMore Application

  • MMR applies to English and German
    • Genre-based specialization (e.g. include conclusions for scientific articles)
    • Linguistic specialization possible
  • Summarization should apply when retrieving FULL articles → query-driven summaries instead of generic abstracts


Technical Evaluation

Test Data

  • Test Collection: Springer Abstracts (German and English)
  • Query Set: 25 of 126 Selected by ZInfo
  • Relevance Assessments
    • Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant
    • Pool Size: 500 Documents Based on 18 Runs
    • Done by CMU, CSLI and EIT
    • German (ZInfo): 959 Relevant Documents
    • English (CMU): 500 Relevant Documents (1 judge); 964 Relevant Documents (3 judges)


Technical Evaluation

Methods Evaluated

  • Corpus Based
    • Similarity Thesaurus (EIT)
    • Example-based Translation (CMU)
    • Pseudo Relevance Feedback (CMU)
    • Generalized Vector Space Model (CMU)
  • Hybrid
    • Classification (CMU)
      • Hierarchical: kNN, Rocchio
      • Flat: kNN, Rocchio-style Classifier
    • Semantic Annotation + Extraction (DFKI, XRCE)
      • UMLS / XRCE Terms & Semantic Relations; EuroWordNet Terms
    • Semantic Annotation + Similarity Thesaurus

Technical Evaluation

TREC-Style Performance Measurements

  • Overall Performance
    • 11-Point Average Precision (Interpolated)
  • Performance in the High-Precision Area
    • Assumption: User Wants to Get the Most Relevant Documents Top-Ranked within the Result List
    • Average Interpolated Precision at Recall of 0.1
    • Exact Precision after 10 Retrieved Documents
  • Applied to Experiments Evaluating Semantic Annotations
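
The measures listed above can be computed for a single query as sketched below, using the usual TREC-style definition of interpolated precision (the highest precision at any recall at or above the given level). The function names are illustrative and the toy ranking is made up. For a query set, each measure would be averaged over the queries.

```python
def precision_at_k(ranking, relevant, k=10):
    """Exact precision after k retrieved documents."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def interpolated_precision(ranking, relevant, recall_level):
    """Highest precision at any rank where recall >= recall_level."""
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best


def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranking, relevant, r) for r in levels) / 11


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(eleven_point_average_precision(ranking, relevant))
print(interpolated_precision(ranking, relevant, 0.1))
print(precision_at_k(ranking, relevant, k=2))
```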

Technical Evaluation

Results: Corpus Based Methods

Data Sets

  • EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English and 9640 Documents for German
  • CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each Language


Technical Evaluation

Results: Hybrid Methods

Categorization (Preliminary Results)

Reuters-21578: 10,000+ documents, 90 categories

Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories

Reuters Koller & Sahami subsets (ICML’98): 138 to 939 documents, 6-11 categories in a set

OHSUMED: 233,445 documents, 14,321 categories


Technical Evaluation

Results: Hybrid Methods

Semantic Annotation + Extraction

Data Set: Full Springer Corpus

Weighting Scheme: Coordination Level Matching (CLM)

  1. Pass: Documents Containing Matching Terms or Semantic Relations are Preferred
  2. Pass: All Features Using lnu.ltn

Rel. Assessments: German


Technical Evaluation

Results: Hybrid Methods

Semantic Annotation + Similarity Thesaurus

Data Set: Full Springer Corpus

Weighting Scheme: Coordination Level Matching (CLM)

Rel. Assessments: German


Technical Evaluation

Summary of the Results

  • Assumption: CLIR Achieves up to 75% of Monolingual Baseline (11pt Average Precision)
  • Corpus-based Methods (Compared to Monolingual PRF)
    • German – English: PRF 81%, EBT 77%, EIT 66%
    • English – German: PRF 113%, EBT 106%, EIT 60%
  • Hybrid Methods (Compared to Monolingual EIT)
    • German – English: 73% (UMLS Terms & SemRels)
    • English – German: 50% (UMLS Terms & SemRels)
    • English – German: 80% (UMLS Terms & SemRels & XRCE Terms)
    • German – English: 74% (SimThes & UMLS Terms & SemRels)
    • English – German: 80% (SimThes & UMLS Terms & SemRels)
    • English – German: 92% (SimThes & UMLS Terms & SemRels & XRCE Terms)


Deviations from the Work Plan

Corpus Collection

  • Comparable Medical Document Corpora are Very Difficult to Obtain, Anonymization Must be Validated by Hospital CIO
  • Work with “Shuffled” Parallel Corpus
  • Radiology Reports (~600,000) Available in German, to be Obtained for English

Corpus Annotation

  • More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon)

Relation Extraction

  • More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction


Future Prospects and Activities

R&D Topics

  • Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)
  • Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé)

Related Projects and Workshops

  • Project Proposal IKAR/OS on KM & Visualization in Life Sciences
  • OntoWeb SIG on LT in Ontology Development and Use
  • MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002)
  • ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)