James J. Cimino Department of Medical Informatics Columbia University

Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity in the 2001 UMLS Metathesaurus. James J. Cimino Department of Medical Informatics Columbia University. 2001 Metathesaurus. 99 sources (92 in 2000) 1,734,707 strings (1,598,176 in 2000) 797,360 concepts (730,155 in 2000).


Presentation Transcript


  1. Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity in the 2001 UMLS Metathesaurus James J. Cimino Department of Medical Informatics Columbia University

  2. 2001 Metathesaurus • 99 sources (92 in 2000) • 1,734,707 strings (1,598,176 in 2000) • 797,360 concepts (730,155 in 2000)

  3. Lumping vs. Splitting • Cold (temperature), COLD (temperature), Cold (infection), COLD (COPD): Redundancy! • Cold (temperature), COLD (temperature), Cold (infection), COLD (COPD): Ambiguity!

  4. Three Auditing Methods • Ambiguity through multiple semantic types • Redundancy through semantic string matching • Inconsistency in parent-child semantic types

  5. Previous Results: 1995* • Possible ambiguity: 1,817 • Possible redundancy: 5,031 • Actually redundant: 3,274 • Parent-Child problems: 544 * Cimino JJ. Auditing the Unified Medical Language System with semantic methods. Journal of the American Medical Informatics Association. 1998;5:41-51.

  6. Tools and Rules • Simple Metathesaurus data model • Normalized word index • “Mutually exclusive semantic types” • “Mutual concept subsumption”

  7. Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease • L0486186: S0837575: “Chronic Obstructive Airway Disease” • L0486186: S0837576: “Chronic Obstructive Lung Disease” • L0009264: S0829315: “COLD <3>”, S0474508: “COLD” • Semantic type: T04: Disease or Syndrome

  8. Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease S0837575: “Chronic Obstructive Airway Disease” S0837576: “Chronic Obstructive Lung Disease” S0829315: “COLD <3>” S0474508: “COLD” Semantic type: T04: Disease or Syndrome

  9. Simple Metathesaurus Data Model C0024117: Chronic Obstructive Airway Disease “Chronic Obstructive Airway Disease” “Chronic Obstructive Lung Disease” “COLD <3>” “COLD” Semantic type: T04: Disease or Syndrome

  10. Simple Metathesaurus Data Model C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD <3> COLD Semantic type: T04: Disease or Syndrome

  11. UMLS Semantic Types Physical Object Organism Substance Animal Plant Invertebrate Food Alga

  12. Mutually Inclusive Semantic Types Physical Object Organism Substance Animal Plant Invertebrate Food Alga

  13. Mutually Exclusive Semantic Types Physical Object Organism Substance Animal Plant Food Invertebrate Alga

  14. Rules for Multiple Semantic Types 3. Concepts can have two Substance types, except: a) Element, Ion, or Isotope and Chemical Viewed Structurally b) Inorganic Chemical and Organic Chemical 5. Concepts can have two Conceptual Entity types, except: a) Molecular Sequence and Geographic Area b) Molecular Sequence and Body Location or Region c) Geographic Area and Body Location or Region 7. Concepts can have two Event types, except: a) Diagnostic Procedure and Laboratory Procedure 8. Concepts can have two types that are ancestors/descendants

  15. Detection of Ambiguity by Mutually Exclusive Semantic Types If a concept has multiple semantic types And if any pair of the types are mutually exclusive Then the concept may have multiple meanings (ambiguity) Or the semantic type assignment is incorrect

  16. Ambiguity Examples C0015155: Euglena gracilis Alga and Invertebrate C0223537: Fourth lumbar vertebra Body Part, Organ, or Organ Component and Disease or Syndrome C0035510: Toxicodendron Plant and Disease or Syndrome C0242789: Crown-Rump Length Organism Attribute and Diagnostic Procedure C0007608: Cell Movement Cell Function and Biomedical Occupation or Discipline C0030756: Lice Infestations Invertebrate and Disease or Syndrome C0008715: Chronically Ill Disease or Syndrome and Patient or Disabled Group
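The ambiguity rule on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the `PARENT` map is an assumed reading of the Physical Object fragment shown on the semantic type slides, the concept `hypothetical-1` is invented, and "mutually exclusive" is simplified to "neither type is an ancestor of the other" (the numbered exception rules for Substance, Conceptual Entity, and Event types are not modeled).

```python
from itertools import combinations

# Assumed fragment of the Semantic Net (child type -> parent type).
PARENT = {
    "Organism": "Physical Object",
    "Substance": "Physical Object",
    "Plant": "Organism",
    "Animal": "Organism",
    "Alga": "Plant",
    "Invertebrate": "Animal",
    "Food": "Substance",
}

def ancestors(semantic_type):
    """Return every type above semantic_type in the assumed hierarchy."""
    found = set()
    while semantic_type in PARENT:
        semantic_type = PARENT[semantic_type]
        found.add(semantic_type)
    return found

def mutually_exclusive(a, b):
    """Simplified rule: exclusive unless one type is an ancestor of the other."""
    return a != b and a not in ancestors(b) and b not in ancestors(a)

def exclusive_pairs(types):
    """All pairs of a concept's semantic types that are mutually exclusive."""
    return [(a, b) for a, b in combinations(sorted(types), 2)
            if mutually_exclusive(a, b)]

concepts = {
    "C0015155": {"Alga", "Invertebrate"},   # Euglena gracilis (from the slide)
    "hypothetical-1": {"Plant", "Alga"},    # invented: ancestor/descendant pair
}

# Concepts with at least one exclusive pair are candidates for manual review.
flagged = {cui: exclusive_pairs(types)
           for cui, types in concepts.items() if exclusive_pairs(types)}
print(flagged)
```

A flagged concept is only a candidate: as the rule states, either the concept is ambiguous or its semantic type assignment is wrong, and a reviewer has to decide which.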

  17. Normalized Word Index • UMLS Normalized Word Index • e.g., “lungs” → “lung” • 293,004 words • Keyword synonyms • e.g., “lung” → “pulmonary” • 9,650 mappings • Translated strings • Built word index

  18. Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD <3> COLD Semantic type: T04: Disease or Syndrome

  19. Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disease chronic obstructive lung disease cold 3 cold Semantic type: T04: Disease or Syndrome

  20. Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disease chronic obstructive pulmonary disease cold 3 cold Semantic type: T04: Disease or Syndrome

  21. Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold 3 cold Semantic type: T04: Disease or Syndrome

  22. Word Normalization C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold Semantic type: T04: Disease or Syndrome

  23. Word Index C0035242: Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a) C0024117: Chronic Obstructive Airway Disease • Strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold • Word index: airway, chronic, cold, disorder, obstructive, pulmonary, three • Semantic type: T04: Disease or Syndrome
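The normalization steps walked through on the slides above (lowercase and strip punctuation, map to the normalized word, apply keyword synonyms, then collect the distinct words) can be sketched as follows. The two small dictionaries are stand-ins holding only the mappings visible on the slides; the real UMLS normalized word index and the keyword synonym table are far larger, and the function names are hypothetical.

```python
import re

# Stand-in for the UMLS normalized word index (only the slide's example).
NORMALIZED = {"lungs": "lung"}

# Stand-in for the keyword synonym mappings, taken from the slide sequence.
SYNONYMS = {"lung": "pulmonary", "disease": "disorder", "3": "three"}

def normalize(string):
    """Lowercase, drop punctuation, then apply normalization and synonyms."""
    words = re.findall(r"[a-z0-9]+", string.lower())
    words = [NORMALIZED.get(w, w) for w in words]
    return [SYNONYMS.get(w, w) for w in words]

def word_index(strings):
    """Sorted list of the distinct normalized words across a concept's strings."""
    return sorted({w for s in strings for w in normalize(s)})

# Strings of C0024117 as shown on the data model slides.
strings = [
    "Chronic Obstructive Airway Disease",
    "Chronic Obstructive Lung Disease",
    "COLD <3>",
    "COLD",
]

index = word_index(strings)
print(index)
```

With these stand-in tables the result reproduces the word list shown for C0024117 on the Word Index slide.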

  24. Mutual String Subsumption 1) If Concept A has String A1 And all words in A1 are in Concept B’s word list Then B subsumes A1 2) If B subsumes any string in A And A subsumes any string in B Then A and B are mutually subsumptive
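The two rules above translate almost directly into set operations. The sketch below assumes strings have already been normalized and uses the normalized strings of the three "cold" concepts from the slides that follow; the function names and the final lung example are hypothetical.

```python
def word_list(strings):
    """All distinct words across a concept's normalized strings."""
    return {w for s in strings for w in s.split()}

def subsumes(b_strings, a_string):
    """Rule 1: B subsumes string A1 if every word of A1 is in B's word list."""
    return set(a_string.split()) <= word_list(b_strings)

def mutually_subsumptive(a_strings, b_strings):
    """Rule 2: each concept subsumes at least one string of the other."""
    return (any(subsumes(b_strings, s) for s in a_strings)
            and any(subsumes(a_strings, s) for s in b_strings))

common_cold = ["common cold", "cold two", "cold"]             # C0009443
cold_temperature = ["cold temperature", "cold one", "cold"]   # C0009264
coad = [                                                      # C0024117
    "chronic obstructive airway disorder",
    "chronic obstructive pulmonary disorder",
    "cold three",
    "cold",
]

# Every pair shares the one-word string "cold", so every pair is mutual.
print(mutually_subsumptive(common_cold, cold_temperature))
print(mutually_subsumptive(common_cold, coad))

# Hypothetical one-directional case: "lung" is covered by the second
# concept's words, but "lung disorder" is not covered by the first's.
print(mutually_subsumptive(["lung"], ["lung disorder"]))
```

Because a single shared short string is enough to make two concepts mutually subsumptive, the test is deliberately permissive, which is why the semantic type check is applied afterwards.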

  25. Mutual String Subsumption • C0009443: Common Cold: strings: common cold; cold two; cold. Words: cold, common, two. T04: Disease or Syndrome • C0009264: cold temperature: strings: cold temperature; cold one; cold. Words: cold, one, temperature. T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold. Words: airway, chronic, cold, disorder, obstructive, pulmonary, three. T04: Disease or Syndrome

  26. Mutual String Subsumption • C0009443: Common Cold: strings: common cold; cold two; cold. Words: cold, common, two. T04: Disease or Syndrome • C0009264: cold temperature: strings: cold temperature; cold one; cold. Words: cold, one, temperature. T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold. Words: airway, chronic, cold, disorder, obstructive, pulmonary, three. T04: Disease or Syndrome

  27. Mutual String Subsumption • C0009443: Common Cold: strings: common cold; cold two; cold. Words: cold, common, two. T04: Disease or Syndrome • C0009264: cold temperature: strings: cold temperature; cold one; cold. Words: cold, one, temperature. T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold. Words: airway, chronic, cold, disorder, obstructive, pulmonary, three. T04: Disease or Syndrome

  28. Mutual String Subsumption • C0009443: Common Cold: strings: common cold; cold two; cold. Words: cold, common, two. T04: Disease or Syndrome • C0009264: cold temperature: strings: cold temperature; cold one; cold. Words: cold, one, temperature. T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold. Words: airway, chronic, cold, disorder, obstructive, pulmonary, three. T04: Disease or Syndrome

  29. Detection of Redundancy by String Subsumption If A and B are mutually subsumptive And semantic types of A and B are mutually inclusive Then A and B may be redundant

  30. Detection of Redundancy by String Subsumption • C0009443: Common Cold: strings: common cold; cold two; cold. Words: cold, common, two. T04: Disease or Syndrome • C0009264: cold temperature: strings: cold temperature; cold one; cold. Words: cold, one, temperature. T070: Natural Phenomenon or Process • C0024117: Chronic Obstructive Airway Disease: strings: chronic obstructive airway disorder; chronic obstructive pulmonary disorder; cold three; cold. Words: airway, chronic, cold, disorder, obstructive, pulmonary, three. T04: Disease or Syndrome

  31. Redundancy Examples C0673603: NPS-R-467 (Organic Chemical) C0673604: NPS R-467 (Organic Chemical) C0673769: des-Arg(10)-(Leu(9))kallidin (Amino Acid, Peptide or Protein) C0673771: kallidin, des-Arg(10)-(Leu(9))-) (Amino Acid, Peptide or Protein) C0266133: Congenital diverticulum of esophagus (Congenital Abnormality) C0555218: Congenital esophageal pouch (Congenital Abnormality)
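Combining mutual subsumption with the semantic type check gives the redundancy screen. The sketch below is illustrative only: `types_inclusive` is reduced to "the two concepts share a semantic type" (the ancestor/descendant and exception rules are left out), the tokenizer is a plain regular expression rather than the UMLS normalization, and the two `hypothetical-` concepts are invented to show a pair that matches on strings but is rejected on types.

```python
import re
from itertools import combinations

def words(string):
    """Lowercased alphanumeric tokens of one string."""
    return set(re.findall(r"[a-z0-9]+", string.lower()))

def mutually_subsumptive(a_strings, b_strings):
    """Each concept's word list covers at least one string of the other."""
    a_words = set().union(*(words(s) for s in a_strings))
    b_words = set().union(*(words(s) for s in b_strings))
    return (any(words(s) <= b_words for s in a_strings)
            and any(words(s) <= a_words for s in b_strings))

def types_inclusive(a_types, b_types):
    """Simplifying assumption: inclusive only when a type is shared."""
    return bool(a_types & b_types)

concepts = {
    "C0673603": (["NPS-R-467"], {"Organic Chemical"}),   # from the slide
    "C0673604": (["NPS R-467"], {"Organic Chemical"}),   # from the slide
    "hypothetical-A": (["kelp"], {"Alga"}),              # invented
    "hypothetical-B": (["kelp"], {"Food"}),              # invented
}

# Pairs that pass both tests are candidates for manual review.
candidates = [
    (a, b)
    for a, b in combinations(sorted(concepts), 2)
    if mutually_subsumptive(concepts[a][0], concepts[b][0])
    and types_inclusive(concepts[a][1], concepts[b][1])
]
print(candidates)
```

Comparing every pair of concepts this way is quadratic in the number of concepts; for a vocabulary the size of the Metathesaurus an inverted word index would be needed to narrow the pairs first, which is presumably the role of the word index built earlier.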

  32. Redundancy False Positives • Partial names as synonyms: C0687720: Central Diabetes Insipidus has “Diabetes Insipidus” as synonym so it is mutually subsumptive with C0011848: Diabetes Insipidus • Incorrect synonymy (MeSH translations): C0013005: Dolphins has synonyms “ORCA” (Span.) and “FALSA BALEIA ASSASSINA” (Port.) so it is mutually subsumptive with C0325138: Whale, False Killer which has synonym “FALSA ORCA” (Span.)

  33. Detecting Semantic Type Problems through Parent-Child Relations If Concept A is Parent of Concept B And Concept A has semantic type X And Concept B has semantic type Y And if X and Y are different And X is not an ancestor of Y (in Semantic Net) Then one (or both) semantic types are wrong Or the parent-child relation is wrong

  34. Detecting Semantic Type Problems through Parent-Child Relations • Parent: Cartilaginous Fish (vertebrate) • Child: Shark (vertebrate): OK • Child: Skate (manufactured object): Wrong Type or Wrong Concept • Child: Stingray (animal): Nonspecific Semantic Type • Child: Dogfish (fish): OK
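The parent-child rule can be applied to the Cartilaginous Fish example as a short check. This is a sketch under assumptions: the `PARENT_TYPE` map is an assumed slice of the Semantic Net, and splitting the failures into "nonspecific semantic type" (the child's type is an ancestor of the parent's) and "wrong type or wrong concept" (the types are unrelated) is an interpretation of the labels on the slide.

```python
# Assumed slice of the Semantic Net (child type -> parent type).
PARENT_TYPE = {
    "Fish": "Vertebrate",
    "Vertebrate": "Animal",
    "Animal": "Organism",
    "Organism": "Physical Object",
    "Manufactured Object": "Physical Object",
}

def is_ancestor(x, y):
    """True if type x lies above type y in the assumed hierarchy."""
    while y in PARENT_TYPE:
        y = PARENT_TYPE[y]
        if y == x:
            return True
    return False

def check(parent_type, child_type):
    """Classify one parent-child link by the two concepts' semantic types."""
    if parent_type == child_type or is_ancestor(parent_type, child_type):
        return "OK"
    if is_ancestor(child_type, parent_type):
        return "nonspecific semantic type"
    return "wrong type or wrong concept"

# Children of Cartilaginous Fish (type Vertebrate), as listed on the slide.
children = {
    "Shark": "Vertebrate",
    "Skate": "Manufactured Object",
    "Stingray": "Animal",
    "Dogfish": "Fish",
}

report = {name: check("Vertebrate", t) for name, t in children.items()}
for name, verdict in report.items():
    print(f"{name}: {verdict}")
```

Both non-OK verdicts are violations of the same rule; the distinction only suggests where a reviewer might look first.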

  35. Parent-Child Examples C0013769: Elbow has type Body Location or Region, which is in the Conceptual Entity hierarchy. Is parent of: C0230353: Right elbow has type Body Part, Organ, or Organ Component, which is in the Physical Object hierarchy

  36. Results: 1995 vs. 2001 • Possible ambiguity: 1,817 vs. 8,082 • Possible redundancy: 5,031 vs. 38,140 • Actually redundant: 3,274 vs. not done • Parent-Child problems: 544 vs. 2,868 • Number of concepts: 222,927 vs. 797,359 (3.6x) • Parent-Child relations: 100,586 vs. 607,043 (6.0x)

  37. Results: 1995 vs. 2001 • Possible ambiguity: 1,817 (0.82%) vs. 8,082 (1.01%) • Possible redundancy: 5,031 (2.26%) vs. 38,140 (4.78%) • Actually redundant: 3,274 (1.47%) vs. not done • Parent-Child problems: 544 (0.54%) vs. 2,868 (0.47%) • Number of concepts: 222,927 vs. 797,359 (3.6x) • Parent-Child relations: 100,586 vs. 607,043 (6.0x)
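The percentages on this slide can be reproduced from the raw counts. The short check below makes the denominators explicit: ambiguity and redundancy are rates per concept, while parent-child problems are rates per parent-child relation. The variable names are hypothetical; all numbers come from the slide.

```python
concepts = {1995: 222_927, 2001: 797_359}
relations = {1995: 100_586, 2001: 607_043}

ambiguity = {1995: 1_817, 2001: 8_082}
redundancy = {1995: 5_031, 2001: 38_140}
parent_child = {1995: 544, 2001: 2_868}

def pct(count, total):
    """Percentage rounded to two decimals, as reported on the slide."""
    return round(100 * count / total, 2)

for year in (1995, 2001):
    print(year,
          pct(ambiguity[year], concepts[year]),       # per concept
          pct(redundancy[year], concepts[year]),      # per concept
          pct(parent_child[year], relations[year]))   # per relation

# Growth factors between the two releases.
concept_growth = round(concepts[2001] / concepts[1995], 1)
relation_growth = round(relations[2001] / relations[1995], 1)
print(concept_growth, relation_growth)
```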

  38. Discussion: Ambiguity Detection • Small number (1.01%) is a good sign • Allows focusing manual review • Semantic type definitions need to be clarified • Semantic type assignment rules need to be clarified

  39. Discussion: Redundancy Detection • Specificity is worse, without improved sensitivity • Normalized string index is part of the reason • “Incomplete” names are a bigger part of the reason • Manual review will be relatively inefficient • Incorrect mappings detected, especially foreign language

  40. Discussion: Parent-Child Relations • Mostly detects errors in semantic type assignment • Strict hierarchy in Semantic Net causes problems

  41. Conclusions • Specific “answers” not possible • Domain expertise needed for assessment of chemical names • Assessments are necessarily subjective • NLM gets to make the rules • NLM hasn’t finished making the rules • Methods provide focus for manual review • Methods highlight where clearer definitions are needed • The results show the UMLS is doing well at a difficult task

  42. Acknowledgments • NLM: Bill Hole, Alexa McCray and Betsy Humphreys • Home: Rachel and Rebecca Cimino
