
MUCHMORE

Multilingual Concept Hierarchies for Medical Information Organization and Retrieval


Project Overview

Application

 Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval

Research & Development

 Developing Novel, Hybrid (Corpus-/Concept- Based) Methods for Handling this Scenario

Evaluation

 Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods


User Perspective (ZInfo)

Vision: BAIK Model

  • MuchMore

  •  Provide Relevant Medical Information

  • … for a Specific Patient Problem

  • … Automatically, from the Web

  • … Independent of Language


User Perspective (ZInfo)

User Requirements

  • Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient

  • Retrieval and Relevance Ranking of Evidence Based Medical Literature, Language Independent

  • Summarization and Filtering of Results According to a User Profile


User Perspective (ZInfo)

User Evaluation

Evaluate Usefulness

 Query Generation

 Relevance for Decisions in Diagnostics and Treatment

Use for Medical Cases

 Part of Postgraduate Course in Medical Informatics

Problematic Issues

 Different medical profiles, schools, experience, speciality

 Relevant for one user may mean less or nothing to another

 Evidence based medicine criteria exist only for a small fraction of medicine


MuchMore Prototype

  • Overview of Prototype Functionality

  • Relation between Functionality and User Requirements

  •  Issues Addressed by Research and Development within MuchMore


R&D in MuchMore

Semantic Annotation Based CLIR

  • Corpus Annotation (DFKI, ZInfo)

    • PoS, Morphology, Phrases, Grammatical Functions

    • Term and Relation Tagging

  • Term Extraction (XRCE, EIT, CMU, CSLI)

    • Bilingual Lexicon Extraction, Extension of Semantic Resources

  • Sense Disambiguation (CSLI, DFKI)

    • Tuning and Extension of Semantic Resources

    • Combining Sense Disambiguation Methods

  • Relation Extraction (DFKI, CSLI)

    • Grammatical Function Tagging

    • Extracting Semantic Relation Indicators

    • Extracting Novel Semantic Relations

  • Semantic Indexing/Retrieval (EIT, DFKI)


R&D in MuchMore

Additional Approaches in CLIR

  • Corpus Based CLIR

    • Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI)

    • Pseudo Relevance Feedback: PRF (CMU)

    • Generalized Vector Space Model: GVSM (CMU)

  • Text Classification Based CLIR (CMU)

    • Hierarchical/Flat kNN with MeSH

  • Summarization (CMU)

    • Query, Genre Specific


Corpus Annotation

Annotation Evaluation

Corpus

~ 9000 English and German Medical Abstracts from 41 Journals, Springer LINK WebSite, ~ 1 M Tokens for each Language

PoS

  • Lexicon Update, Remaining Error Rate ~ 1.5% (EN)

    Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate. pos=VB > pos_correct=NN

Morphology

Incorrect, e.g.: Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie

  • Term and Relation Tagging

    •  Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query


Term Extraction

XRCE (Aims and Resources)

Aim

  • Bilingual Lexicon Extraction: From Comparable Corpora at Word Level; From Parallel Corpora at Word and Term (Multi-Word) Level

  • Bilingual Extension of Semantic Resource (MeSH)

Resources

  • Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH)

  • Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)


Term Extraction

XRCE (Results of Best Method)

Optimal Combination of Resources

Retaining only 10 best Translations for each Candidate

1. word-to-word, comparable corpora: F1 = 0.84

2.a word-to-word, parallel corpora: F1 = 0.98

2.b term-to-term, parallel corpora: F1 = 0.85

Evaluating Separately with Individual Resources (F1)

Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56

3. MeSH Extension: 1453 new multi-word terms (synonyms or new term entries) extracted from the Springer corpus and added


Term Extraction

EIT (Similarity Thesauri)

Method

 Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus
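
The frequency comparison described above can be illustrated with a minimal sketch. It is not EIT's implementation: the function name `domain_terms`, the ratio test, the add-one smoothing and both thresholds are assumptions made for illustration only.

```python
from collections import Counter


def domain_terms(medical_tokens, general_tokens, min_ratio=5.0, min_count=2):
    """Return (word, ratio) pairs for words over-represented in the medical corpus.

    The ratio compares relative frequencies in the two corpora; the
    thresholds are illustrative assumptions, not values from the slides.
    """
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    scored = []
    for word, count in med.items():
        if count < min_count:
            continue  # skip words too rare to judge
        med_rel = count / med_total
        # Add-one smoothing so words missing from the general corpus
        # do not cause a division by zero.
        gen_rel = (gen[word] + 1) / (gen_total + 1)
        ratio = med_rel / gen_rel
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A word such as "the" is frequent in both corpora and gets a ratio near 1, while a word that is frequent only in the medical corpus gets a high ratio and is kept as a term candidate.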

Results

 Single Word Terms (Springer Abstracts)

German-English: 104,904 / English-German: 49,454

 Multiword Terms (Phrase Lexicon Generated from ICD10)

German Phrases: 354 / English Phrases: 665

Bilingual Phrasal Entries Generated:

German - English: 225 / English - German: 246


Term Extraction

CMU (EBT Bilingual Lexicon)

Method

 For each word in one language, accumulate counts of the number of times the translations of the sentences containing that word include each word of the other language. These co-occurrence counts may be restricted using word-alignment techniques.

 Apply a variable threshold to filter out uncommon co-occurrences which are unlikely to be translations. The result is a lexicon listing candidate translations and their relative frequencies.
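
The two steps above can be sketched roughly as follows. This is a simplified illustration, not the CMU system: it uses a single fixed relative-frequency threshold instead of a variable one, it omits the optional word-alignment restriction, and the name `build_lexicon` is hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(sentence_pairs, threshold=0.5):
    """Build a candidate translation lexicon from aligned sentence pairs.

    For each source word, count how often each target word appears in the
    translations of the sentences containing it, then keep only targets whose
    relative frequency reaches `threshold` (a fixed cut-off, assumed here).
    """
    cooc = defaultdict(Counter)
    source_freq = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        for s in source_words:
            source_freq[s] += 1
            for t in target_words:
                cooc[s][t] += 1
    lexicon = {}
    for s, counts in cooc.items():
        kept = {t: c / source_freq[s] for t, c in counts.items()
                if c / source_freq[s] >= threshold}
        # Candidates sorted by relative frequency, highest first.
        lexicon[s] = dict(sorted(kept.items(), key=lambda p: p[1], reverse=True))
    return lexicon
```

Each entry maps a source word to its candidate translations with their relative frequencies, which matches the shape of the lexicon described on the slide.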

Results

 ~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts)

(Estimated Error Rate: < 10%)


Term Extraction

CSLI (Infomap System)

Represent English and German Words as Vectors that are Produced by Recording the Number of Co-Occurrences of the Word in Question with each of a Set of Content-Bearing Words. Use (Cosine) Similarity Measure on these Rows to Find “Nearest Neighbours”.

[Figure: co-occurrence vectors for English words (e.g. ligament) and German words (e.g. Kreuzband, Kniegelenk) over 1,000 English content-bearing words (e.g. ligament, knee joint)]
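
As a rough illustration of the nearest-neighbour lookup described above, the following sketch applies cosine similarity to toy co-occurrence vectors. The counts, the three-column vocabulary (ligament, knee joint, liver) and the function names are invented for the example and are not taken from the Infomap system.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy co-occurrence counts (invented) with the content-bearing words
# "ligament", "knee joint" and "liver" as columns.
vectors = {
    "ligament": [8, 5, 0],
    "Kreuzband": [7, 6, 0],
    "Kniegelenk": [3, 9, 1],
    "Leber": [0, 0, 6],
}
```

Because English and German words share the same content-bearing columns, a German word can be the nearest neighbour of an English one, which is what makes the vectors usable for bilingual lexicon extraction.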


WSD: Terms, Senses

Semantic Resource Extension and Tuning

  • Extension (DFKI)

    • Morphological Analysis (Decomposition)

      Entzündungsgewebe (inflammatory tissue) HYPONYM Gewebe, Körpergewebe (body tissue)

      Gewebe, Stoff, Textilstoff (textile)

    • Semantic Similarity (Co-Occurrence Patterns)

      Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, ....

  • Tuning (CSLI, DFKI)

    • Aligning Clusters with Senses

      C0043210|GER|P|L1254343|PF|S1496289|Frauen|3|

      C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|


WSD: Algorithm

Combination of Methods (Task, Domain, General)

Bilingual Sense Selection (CSLI)

  • 1 Sense in L1 vs. >1 Sense in L2

    • English: blood vessel (C0005847) vs. vessel (polysaccharide) (C0148346)

    • German: Blutgefaesse = blood vessel (C0005847)

Collocations and Senses (CSLI)

  • For an ambiguous single word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term.

    single word term: abortion

      1) a natural process C0000786 (T047)

      2) a medical procedure C0000811 (T061)

    multiword terms:

      recurrent abortion C0000809 (T047) => sense 1

      induced abortion C0000811 (T061) => sense 2
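
The collocation heuristic can be sketched as below. Linking a multiword term to a sense through its shared semantic type (T047, T061) is one reading of the slide's example and is an assumption, as are the function name `select_sense` and the plain substring counting.

```python
from collections import Counter


def select_sense(word_senses, multiword_terms, corpus_text):
    """Pick the sense of an ambiguous word from its most frequent multiword term.

    word_senses: semantic type -> concept id of the single-word sense.
    multiword_terms: unambiguous multiword term -> its semantic type.
    Returns None if no multiword term occurs in the corpus.
    """
    if not multiword_terms:
        return None
    text = corpus_text.lower()
    # Simple substring counts; a real system would count tokenized terms.
    counts = Counter({term: text.count(term.lower()) for term in multiword_terms})
    best_term, best_count = counts.most_common(1)[0]
    if best_count == 0:
        return None
    return word_senses.get(multiword_terms[best_term])
```

With the slide's example, a corpus in which "recurrent abortion" outnumbers "induced abortion" would select C0000786, the natural-process sense.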


WSD: Algorithm

Combination of Methods (Task, Domain, General)

Domain Specific Senses (DFKI)

  • Concept Relevance in Domain Corpus

    • Mineral

      0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium

      4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon

Instance-Based Learning (DFKI)

  • Unsupervised Context Models (n-grams)

    • Training (Learn Class Models)

      He drank <milk LIQUID>

      He drank <coffee LIQUID>

      He drank <tea LIQUID>

      He drank <chocolate FOOD, LIQUID>

    • Application (Apply Class Models)

      He drank <chocolate FOOD, LIQUID>

      He drank <Java GEOGRAPHICAL, LIQUID>
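
A minimal sketch of the training and application steps in the "He drank" example follows. It is not DFKI's implementation: splitting the count of an ambiguous training instance evenly across its candidate classes is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_class_models(examples):
    """Learn per-context class counts from (context, candidate classes) pairs.

    An instance with several candidate classes contributes a fractional
    count to each of them (an assumption made for this sketch).
    """
    models = defaultdict(Counter)
    for context, classes in examples:
        for cls in classes:
            models[context][cls] += 1.0 / len(classes)
    return models


def disambiguate(models, context, candidate_classes):
    """Choose the candidate class seen most often in this context."""
    scores = models.get(context, {})
    return max(candidate_classes, key=lambda cls: scores.get(cls, 0.0))


training = [
    ("He drank", ["LIQUID"]),          # milk
    ("He drank", ["LIQUID"]),          # coffee
    ("He drank", ["LIQUID"]),          # tea
    ("He drank", ["FOOD", "LIQUID"]),  # chocolate
]
models = train_class_models(training)
```

In this toy model the context "He drank" is dominated by LIQUID, so both "chocolate" and "Java" resolve to their LIQUID reading when they appear in that context.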


WSD: Evaluation

Lexical Sample Evaluation Corpora (Medical)

  • Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8) DE: 4059 (1.5)

  • Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)


Relation Extraction

Grammatical Function Tagging (DFKI)

  • Robust, Shallow Grammatical Function Tagger

  • EM Model (Trained on Frankfurter Rundschau: 35M Tokens, Adaptation on Medical Corpora Under Development)

    1.5M ‘Types’: Verb, Voice, Function, Nom-Head-Argument

    abarbeiten ACT SUBJ Politiker

     Use of PoS Information, Use of Chunk Information Planned

     Tags for SUBJ, OBJ, IOBJ, ACT/PAS

     German Available, English under Development

  • Untersucht <PRED1:PAS> wurden 30 Patienten <PRED1:SUBJ> <PRED2:SUBJ>, die sich <PRED2:SUBJ> einer elektiven aortokoronaren Bypassoperation <PRED2:IOBJ> unterziehen <PRED2:ACT> mussten.


Relation Extraction

Semantic Relation Indicators (DFKI, CSLI)

Novel Semantic Relations (DFKI, CSLI)

Cluster 1

  • Verbs: differentiate, conclude, discriminate, diagnose, illustrate

  • Semantic type pairs: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ...

Cluster 2

  • Verbs: suffer, demonstrate, progress, develop, die

  • Semantic type pairs: T101/T169, T101/T184, T101/T048, ...

Cluster 3

  • Verbs: reduce, treat, follow, diagnose, cure

  • Semantic type pairs: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ...

Semantic types

  • T047: Disease

  • T048: Mental Dysfunction

  • T060: Diagnostic Procedure

  • T101: Patient

  • T121: Pharm. Substance

  • T169: Funct. Concept (Syndrome)

  • T184: Sign or Symptom


Summarization (CMU)

Extractive Summarization

Maximal Marginal Relevance (MMR)

 Find passages most relevant to query

 Maximize information novelty (minimize passage redundancy)

Assemble extracted passages for summary

$$\operatorname*{arg\,max}^{k}_{d_i \in C} \left[ \lambda\, S(Q, d_i) - (1 - \lambda) \max_{d_j \in R} S(d_i, d_j) \right]$$

Q = query, d = document, S = similarity function

λ = tradeoff factor between relevance & novelty

k = number of passages to include in summary
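
The MMR selection rule can be illustrated with a short greedy sketch. The slide does not define C and R; the code assumes C is the set of candidate passages and R the set of passages already selected. The Jaccard word overlap used for S is a stand-in chosen for illustration, not the similarity function used at CMU.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (illustrative stand-in for S)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mmr_select(query, candidates, similarity, k, lam=0.7):
    """Greedily pick k passages that balance query relevance and novelty."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(passage):
            # Redundancy is the highest similarity to any passage chosen so far.
            redundancy = max((similarity(passage, s) for s in selected), default=0.0)
            return lam * similarity(query, passage) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With λ = 1 the ranking reduces to pure relevance; lowering λ pushes near-duplicate passages down in favour of passages that add new information.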

Applications

 Re-ranking retrieved documents from IR Engine

 Ranking passages from a document for inclusion in summaries

 Ranking passages from topically-related document cluster for cluster summary


Summarization (CMU)

MuchMore Application

 INDICATIVE and QUERY-RELEVANT

 MMR applies to English and German

  • Genre-based specialization (e.g. include conclusions for scientific articles)

  • Linguistic specialization possible

     Summarization should apply when retrieving FULL articles: query-driven summaries instead of generic abstracts


Technical Evaluation

Test Data

 Test Collection: Springer Abstracts (German and English)

 Query Set: 25 of 126 Selected by ZInfo

 Relevance Assessments

Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant

Pool Size: 500 Documents Based on 18 Runs

Done by CMU, CSLI and EIT

German (ZInfo): 959 Relevant Documents

English (CMU): 500 Relevant Documents (1 judge)

964 Relevant Documents (3 judges)


Technical Evaluation

Methods Evaluated

  • Corpus Based

    • Similarity Thesaurus (EIT)

    • Example-based Translation (CMU)

    • Pseudo Relevance Feedback (CMU)

    • Generalized Vector Space Model (CMU)

  • Hybrid

    • Classification (CMU)

      • Hierarchical: kNN, Rocchio

      • Flat: kNN, Rocchio-style Classifier

    • Semantic Annotation + Extraction (DFKI, XRCE)

      • UMLS / XRCE Terms & Semantic Relations, EuroWordNet Terms

    • Semantic Annotation + Similarity Thesaurus


Technical Evaluation

TREC-Style Performance Measurements

  • Overall Performance

    • 11-point Average Precision (Interpolated)

  • Performance in the High-Precision Area

    • Assumption: User Wants to Get Most Relevant Documents Top-ranked within the Result List

      • Average Interpolated Precision at Recall of 0.1

      • Exact Precision after 10 Retrieved Documents

  • Applied to Experiments Evaluating Semantic Annotations
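
For reference, the measures named above can be computed for a single query as in the sketch below. It uses the standard definitions (interpolated precision at recall r is the highest precision at any point with recall of at least r); the function names are illustrative, and averaging over queries is left out.

```python
def precision_at_k(ranked, relevant, k=10):
    """Exact precision after k retrieved documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def interpolated_precision(ranked, relevant, recall_level):
    """Highest precision at any rank whose recall is at least `recall_level`."""
    if not relevant:
        return 0.0
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Precision only peaks at ranks holding a relevant document.
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranked, relevant, r) for r in levels) / 11
```

The high-precision measures correspond to `interpolated_precision(ranked, relevant, 0.1)` and `precision_at_k(ranked, relevant, 10)`.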


Technical Evaluation

Results: Corpus Based Methods

Data Sets

  • EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English and 9640 Documents for German

  • CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each Language


Technical Evaluation

Results: Hybrid Methods

Categorization (Preliminary Results)

  • Reuters-21578: 10,000+ documents, 90 categories

  • Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories

  • Reuters Koller & Sahami subsets (ICML’98): 138 to 939 documents, 6-11 categories in a set

  • OHSUMED: 233,445 documents, 14,321 categories


Technical Evaluation

Results: Hybrid Methods

Semantic Annotation + Extraction

Data Set: Full Springer Corpus

Weighting Scheme: Coordination Level Matching (CLM)

  1. Pass: Documents Preferred Containing Matching Terms or Semantic Relations

  2. Pass: All Features Using lnu.ltn

Rel. Assessments: German
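
One way to read the two-pass weighting scheme is as a two-level sort key, sketched below. This is an interpretation rather than the project's code: the function name `clm_rank` is hypothetical, and the second-pass lnu.ltn score is passed in as a precomputed number instead of being derived here.

```python
def clm_rank(query_semantic_features, docs):
    """Rank documents by coordination level first, weighted score second.

    query_semantic_features: set of terms or semantic relations in the query.
    docs: doc id -> (set of semantic features, precomputed second-pass score).
    The second-pass score stands in for an lnu.ltn score computed elsewhere.
    """
    def sort_key(item):
        _, (features, weighted_score) = item
        # Pass 1: number of matching semantic features.
        matches = len(query_semantic_features & features)
        # Pass 2: the weighted score breaks ties within a coordination level.
        return (matches, weighted_score)

    ordered = sorted(docs.items(), key=sort_key, reverse=True)
    return [doc_id for doc_id, _ in ordered]
```

Under this reading, a document matching more query terms or relations always outranks one matching fewer, regardless of its weighted score.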


Technical Evaluation

Results: Hybrid Methods

Semantic Annotation + Similarity Thesaurus

Data Set: Full Springer Corpus

Weighting Scheme: Coordination Level Matching (CLM)

Rel. Assessments: German


Technical Evaluation

Summary of the Results

  • Assumption: CLIR achieves up to 75 % of Monolingual Baseline (11pt Average Precision)

  • Corpus-based Methods (Compared to Monolingual PRF)

    • German – English: PRF 81 %, EBT 77 %, EIT 66 %

    • English – German: PRF 113 %, EBT 106 %, EIT 60 %

  • Hybrid Methods (Compared to Monolingual EIT)

    • German – English: 73 % (UMLS Terms & SemRels)

    • English – German: 50 % (UMLS Terms & SemRels)

    • English – German: 80 % (UMLS Terms & SemRels & XRCE Terms)

    • German – English: 74 % (SimThes & UMLS Terms & SemRels)

    • English – German: 80 % (SimThes & UMLS Terms & SemRels)

    • English – German: 92 % (SimThes & UMLS Terms & SemRels & XRCE Terms)


Management

Deviations from the Work Plan

Corpus Collection

  • Comparable Medical Document Corpora are Very Difficult to Obtain, Anonymization Must be Validated by Hospital CIO

  • Work with "Shuffled" Parallel Corpus

  • Radiology Reports (~600,000) Available in German, to be Obtained for English

Corpus Annotation

  • More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon)

Relation Extraction

  • More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction


Management

Future Prospects and Activities

R&D Topics

  • Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)

  • Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé)

Related Projects and Workshops

  • Project Proposal IKAR/OS on KM & Visualization in Life Sciences

  • OntoWeb SIG on LT in Ontology Development and Use

  • MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002)

  • ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)

