Automatic Document Indexing in Large Medical Collections

Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis

Technical University of Crete, Chania, Greece

Evangelos E. Milios

Dalhousie University, Halifax, Canada

AMTEx

Overview
  • The need for automatic assignment of index terms in large medical collections
  • MMTx (by the US NLM)
  • The AMTEx approach to medical document indexing
  • AMTEx resources: MeSH & C/NC value
  • Experiments & evaluation
  • Discussion and future research

Motivation and Objectives
  • MeSH is a taxonomy of medical terms
    • Subset of UMLS Metathesaurus
  • MEDLINE is indexed by MeSH terms (assigned by experts)
  • Other medical texts need to be associated with MEDLINE, e.g. consumer medical literature
  • Need for automatic assignment of MeSH terms to any medical text

MMTx (MetaMap Transfer)

Maps arbitrary text to UMLS Metathesaurus concepts:

  • Parsing to extract noun phrases (syntactic analysis, linguistic filter)
  • Variant Generation (uses the SPECIALIST Lexicon)
  • Candidate Retrieval (mapping process to Metathesaurus concepts)
  • Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)

MMTx Example
  • Parsing
    • Shallow syntactic analysis of the input text
    • Linguistic filtering: isolates noun phrases
  • Variant Generation
    • e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
  • Candidate Retrieval
    • Candidate Metathesaurus concepts for the variant “osa”:
      • osa [osa antigen]
      • osa [osa gene product]
      • osa [osa protein]
      • osa [obstructive sleep apnea]
  • Candidate Evaluation (candidate: score)
    • Obstructive Sleep Apnea: 1000
    • Sleep Apnea: 901
    • Apnea: 827
    • …
    • Sleeping: 793
    • Sleepy: 755

MMTx limitations
  • MMTx focuses on UMLS rather than MeSH
    • But MEDLINE indexing is based on MeSH
  • Exhaustive variant generation: the initial phrase is iteratively expanded into all possible UMLS variants
    • term overgeneration
    • term concept diffusion
    • unrelated terms added to the final candidate list

The AMTEx method
  • New method for automatic indexing of medical documents
  • Main idea:
    • Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
    • Extracts general single and multi-word terms
    • Extracted terms are validated against MeSH

AMTEx Outline

  • INPUT: Document Collection
  • C/NC value: Multi-word Term Extraction & Term Ranking
  • MeSH Term Validation
  • Single-word Term Extraction: non-MeSH multi-word terms are broken down & validated against MeSH
  • Variant Generation
  • Term Expansion (MeSH)
  • OUTPUT: MeSH Term Lists
  • Resource: MeSH Thesaurus

MeSH: Medical Subject Headings

The NLM medical & biological terms thesaurus:

  • Organized in IS-A hierarchies
      • more than 15 taxonomies & more than 22,000 terms
      • a term may appear in multiple taxonomies
  • No PART-OF relationships
  • Terms organized into synonym sets called entry terms, including stemmed term forms

Fragment of the MeSH IS-A Hierarchy

[Figure: fragment of the MeSH IS-A hierarchy. Nodes shown: Root; Nervous system diseases; Cranial nerve diseases; Neurologic manifestations; Pain; Facial neuralgia; Headache; Neuralgia.]

The C/NC value method
  • Hybrid (linguistic / statistical) term extraction method
  • Domain independent
  • Specifically designed for the identification of multi-word and nested terms:
    • compound & multi-word terms very common in biomedical domain
    • multi-word terms often used in indexing

C-value
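  • C-value: a phrase may be a term if it often appears alone or within other candidate terms

\[
\text{C-value}(\alpha) =
\begin{cases}
\log_2 |\alpha| \cdot f(\alpha), & \text{if } \alpha \text{ is not nested} \\
\log_2 |\alpha| \left( f(\alpha) - \dfrac{1}{P(T_\alpha)} \sum_{b \in T_\alpha} f(b) \right), & \text{otherwise}
\end{cases}
\]

  • α: candidate term
  • |α|: length of α in words
  • f(α): frequency of α
  • Tα: set of candidate terms containing α
  • P(Tα): number of such terms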
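
The C-value definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly published form of the measure, in which the score is weighted by log2 of the term length in words; the function names `c_values` and `_contains` and the toy frequencies are hypothetical and not taken from the presentation.

```python
import math
from collections import defaultdict


def _contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_values(freq):
    """freq maps a candidate term (tuple of words) to its frequency f(a).

    Returns a dict mapping each candidate term to its C-value.
    """
    terms = list(freq)
    # T_a: the longer candidate terms that contain a
    containers = defaultdict(list)
    for a in terms:
        for b in terms:
            if len(b) > len(a) and _contains(b, a):
                containers[a].append(b)

    scores = {}
    for a in terms:
        length_weight = math.log2(len(a))  # 0 for single words (assumed form)
        nested_in = containers[a]
        if not nested_in:
            scores[a] = length_weight * freq[a]
        else:
            # subtract the mean frequency of the terms that contain a
            mean_nested = sum(freq[b] for b in nested_in) / len(nested_in)
            scores[a] = length_weight * (freq[a] - mean_nested)
    return scores


# Toy frequencies, invented for illustration only
toy_freq = {
    ("obstructive", "sleep", "apnea"): 4,
    ("sleep", "apnea"): 10,
}
toy_scores = c_values(toy_freq)
```

With these toy counts, "sleep apnea" is penalised for the 4 occurrences in which it appears nested inside "obstructive sleep apnea", so its score is log2(2) · (10 − 4) = 6.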

NC-value
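  • NC-value: a phrase is more likely to be a term if it often appears in a specific word context

\[
\text{NC-value}(\alpha) = 0.8 \cdot \text{C-value}(\alpha) + 0.2 \sum_{w \in C_\alpha} f_\alpha(w) \, \frac{t(w)}{n}
\]

  • w: context word
  • Cα: set of context words of α
  • t(w): number of terms w appears with
  • n: number of all terms
  • fα(w): frequency of w as context word of α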
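
A minimal sketch of how NC-value could be computed from the quantities defined above. It assumes the commonly published combination of 0.8 · C-value plus 0.2 · the weighted context sum, with context weight t(w)/n; the function name `nc_values`, the input layout, and the toy numbers are hypothetical.

```python
from collections import Counter


def nc_values(c_value, contexts):
    """c_value: term -> C-value.
    contexts: term -> {context word w: f_a(w)}.

    Returns a dict term -> NC-value (assumed 0.8 / 0.2 weighting).
    """
    n = len(c_value)  # number of all terms
    # t(w): number of terms the context word w appears with
    t = Counter()
    for term in c_value:
        for w in contexts.get(term, {}):
            t[w] += 1
    weight = {w: t[w] / n for w in t}

    result = {}
    for term, cv in c_value.items():
        context_score = sum(f * weight[w] for w, f in contexts.get(term, {}).items())
        result[term] = 0.8 * cv + 0.2 * context_score
    return result


# Toy inputs, invented for illustration only
toy_c = {"sleep apnea": 6.0, "knee pain": 2.0}
toy_contexts = {
    "sleep apnea": {"severe": 3, "patient": 1},
    "knee pain": {"severe": 2},
}
toy_nc = nc_values(toy_c, toy_contexts)
```

Here "severe" co-occurs with both terms (weight 1.0) and "patient" with one (weight 0.5), so the context contribution for "sleep apnea" is 3 · 1.0 + 1 · 0.5 = 3.5.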

AMTEx step 1: C/NC value Multi-word Term Extraction & Ranking
  • Part-of-Speech Tagging
  • Linguistic filtering (N: noun, A: adjective, P: preposition):
        • N+ N
        • (A|N)+ N
        • ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N
  • Candidate term ranking based on C/NC-value

  • Keep terms with NC-value > T1
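
The linguistic filters above can be applied as regular expressions over a string of one-letter part-of-speech tags. The sketch below assumes the text has already been tagged and that tags are reduced to N, A, P (anything else becomes O); the names `FILTERS` and `candidate_terms` are hypothetical, and a real system would plug in an actual tagger.

```python
import re

# The three filters from the slide, written over one-letter tags
FILTERS = {
    "noun": r"N+N",
    "adj_noun": r"[AN]+N",
    # As written, this filter also admits a single noun. Python's re takes
    # the first alternative that matches, not necessarily the longest one.
    "adj_noun_prep": r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N",
}


def candidate_terms(tagged, filter_name="adj_noun"):
    """tagged: list of (word, tag) pairs. Returns the matching word sequences."""
    tags = "".join(t if t in {"N", "A", "P"} else "O" for _, t in tagged)
    pattern = re.compile(FILTERS[filter_name])
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in pattern.finditer(tags)
    ]


# Hand-tagged toy sentence, invented for illustration only
toy_sentence = [
    ("the", "O"), ("severe", "A"), ("knee", "N"),
    ("pain", "N"), ("in", "P"), ("patients", "N"),
]
adj_noun_terms = candidate_terms(toy_sentence, "adj_noun")
noun_terms = candidate_terms(toy_sentence, "noun")
```

The stricter "noun" filter yields only "knee pain", while "adj_noun" yields "severe knee pain"; the extracted candidates would then be ranked by C/NC-value and cut at the threshold T1.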

AMTEx step 2: MeSH Term Validation
  • Candidate terms are validated against the MeSH Thesaurus (simple string matching)
  • Only candidate terms matching MeSH are kept
  • Multi-word candidates not matching MeSH may still contain (shorter) MeSH terms
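
A minimal sketch of the validation step, assuming "simple string matching" means case-insensitive exact comparison against the set of MeSH terms. The function name `validate_against_mesh` and the tiny term list are hypothetical; a real run would load the full thesaurus.

```python
def validate_against_mesh(candidates, mesh_terms):
    """Split candidate terms into those matching MeSH and those that do not."""
    mesh = {term.lower() for term in mesh_terms}
    matched, unmatched = [], []
    for term in candidates:
        if term.lower() in mesh:
            matched.append(term)
        else:
            unmatched.append(term)
    return matched, unmatched


# Toy stand-in for the MeSH thesaurus, for illustration only
toy_mesh = ["Data Collection", "Knee", "Pain", "Health Surveys"]
toy_candidates = ["data collection", "chronic knee pain", "health surveys"]
kept, rejected = validate_against_mesh(toy_candidates, toy_mesh)
```

The rejected multi-word candidates are not discarded outright, since they may still contain shorter MeSH terms.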

AMTEx step 3: Single-word Term Extraction

For multi-word terms not matching MeSH:

  • Multi-word terms are split into single-word terms
  • Single-word terms matched against MeSH
  • Matched MeSH terms added to term list
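
The splitting step can be sketched as follows, assuming whitespace tokenisation and case-insensitive matching. The function name `single_word_terms` and the toy inputs are hypothetical.

```python
def single_word_terms(unmatched_multiword, mesh_terms):
    """Split unmatched multi-word terms and keep the single words found in MeSH."""
    mesh = {term.lower() for term in mesh_terms}
    found = []
    for term in unmatched_multiword:
        for word in term.lower().split():
            # keep each matching word once, in order of first appearance
            if word in mesh and word not in found:
                found.append(word)
    return found


# Toy inputs, invented for illustration only
toy_mesh = ["Knee", "Pain", "Osteoarthritis, Knee"]
recovered = single_word_terms(["chronic knee pain", "knee survey"], toy_mesh)
```

In this toy case "chronic" and "survey" are dropped, while "knee" and "pain" are added to the term list.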

AMTEx step 4: Term Variant Generation

Variants are added to the list of terms:

  • Inflectional variants of the extracted terms identified during term extraction (C/NC-value)

  • Stemmed term-forms available in MeSH

AMTEx step 5: Term Expansion
  • Each term in the list is expanded with neighbouring terms in MeSH hierarchy
  • The expansion may include terms more than one level higher or lower than the original term, depending on similarity threshold T
  • Semantic similarity metric by Li et al.: Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowledge and Data Engineering, 15(4):871–882, July/Aug. 2003.
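
A minimal sketch of threshold-based expansion. It assumes the path-length and depth form of the Li et al. measure, sim = exp(−α·l) · tanh(β·h), where l is the shortest path between two terms and h is the depth of their lowest common subsumer, with α = 0.2 and β = 0.6 taken as assumed parameter values. The single-parent toy hierarchy, `PARENT`, `li_similarity`, and `expand` are hypothetical; real MeSH terms may appear in several taxonomies.

```python
import math

# Toy single-parent IS-A hierarchy (child -> parent), for illustration only
PARENT = {
    "nervous system diseases": "root",
    "neurologic manifestations": "nervous system diseases",
    "pain": "neurologic manifestations",
    "headache": "pain",
    "neuralgia": "pain",
}
ALL_TERMS = set(PARENT) | set(PARENT.values())


def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def li_similarity(a, b, alpha=0.2, beta=0.6):
    """exp(-alpha * l) * tanh(beta * h) over the toy hierarchy."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(x for x in pa if x in pb)    # lowest common subsumer
    l = pa.index(common) + pb.index(common)    # shortest path length
    h = len(path_to_root(common)) - 1          # depth of the subsumer
    return math.exp(-alpha * l) * math.tanh(beta * h)


def expand(term, threshold):
    """Neighbouring terms whose similarity to `term` reaches the threshold T."""
    return {
        other for other in ALL_TERMS
        if other != term and li_similarity(term, other) >= threshold
    }


narrow = expand("pain", 0.7)
wide = expand("pain", 0.6)
```

Lowering the threshold from 0.7 to 0.6 lets the expansion of "pain" reach one level up to "neurologic manifestations" in addition to its two children, which matches the idea that a lower T means further expansion.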

Example

Input: Full text article

MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”

MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”

AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

Evaluation

Precision and Recall measures

  • Dataset:
      • 61 full MEDLINE documents (not abstracts), from PMC database of NCBI Pubmed
      • MEDLINE documents are paired to respective MeSH index terms, manually assigned by experts
  • Ground Truth:
      • the set of MeSH document index terms
  • Benchmark method:
      • MMTx against our AMTEx
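
For a single document, precision and recall against the expert-assigned index terms can be computed as below. This is a minimal sketch assuming case-insensitive exact matching of term strings; the function name `precision_recall` and the toy term sets are hypothetical and do not reproduce the reported results.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall of predicted index terms against the ground truth."""
    predicted = {term.lower() for term in predicted}
    truth = {term.lower() for term in ground_truth}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Toy term sets, invented for illustration only
toy_truth = ["Aged", "Data Collection", "Humans", "Knee", "Pain"]
toy_predicted = ["pain", "data collection", "knee", "science"]
toy_precision, toy_recall = precision_recall(toy_predicted, toy_truth)
```

Three of the four predicted terms are correct (precision 0.75) and three of the five expert terms are found (recall 0.6); averaging such per-document values over the collection is one way to compare two indexing methods.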

Multi-Word Terms only

T: term expansion threshold, lower T means further expansion

Conclusions: AMTEx
  • Designed for indexing and retrieval of MEDLINE documents
  • Focuses on multi-word term extraction using valid linguistic & statistical criteria
  • Based on MeSH -- similarly to human indexing
  • Selectively expands into term variants, synonyms
  • Outperforms the current benchmark MMTx method, in both precision & recall

Future Work
  • Better ranking of terms, using semantic similarity
  • Learning of thresholds T1, T
  • Word sense disambiguation to detect the correct sense for expansion rather than the most common sense
  • Handling shorter documents
