
Automatic String Matching for Reduction of Drug Name Confusion





Presentation Transcript


  1. Automatic String Matching for Reduction of Drug Name Confusion
  Bonnie J. Dorr, University of Maryland
  Greg Kondrak, University of Alberta
  December 4, 2003

  2. Study Method for Testing Drug Name Similarity
  • Overview: phonological string matching for ranking similarity between drug names
  • Validation of study method: precision and recall against a gold standard
  • Optimal design of study: an interface for assessing the appropriateness of a newly proposed drug name
  • Strengths and weaknesses: each algorithm retrieves, and misses, correct answers that the others do not

  3. Overview: Drugname Matching
  • String matching to rank similarity between drug names
  • Two classes of string matching:
    • orthographic: compare strings in terms of spelling, without reference to sound
    • phonological: compare strings on the basis of a phonetic representation
  • Two methods of matching:
    • distance: how far apart are two strings?
    • similarity: how close are two strings?

  4. Orthographic and Phonological Distance/Similarity
  • Orthographic: Levenshtein/string-edit (distance); LCSR, DICE (similarity)
  • Phonological: Soundex (distance); ALINE (similarity)
  • Distance vs. similarity: dist(w1, w2) is comparable to 1 − sim(w1, w2)

  5. Orthographic Distance: Levenshtein (String Edit)
  • Levenshtein/string-edit distance: count the number of steps it takes to transform one string into the other
  • Examples:
    • Distance between zantac and contac is 2.
    • Distance between zantac and xanax is 3.
  • For a "global distance", divide by the length of the longer string:
    • 2/max(6, 6) = 2/6 = .33
    • 3/max(5, 6) = 3/6 = .5
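The distance on this slide can be sketched in a few lines of Python. This is a generic textbook implementation for illustration, not the authors' code; `normalized_distance` is a name introduced here for the slide's "global distance".

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a
    # and b[:j]; each step costs 1 (insert, delete, or substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    # the slide's "global distance": divide by the longer string's length
    return levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("zantac", "contac"))          # 2
print(levenshtein("zantac", "xanax"))           # 3
print(normalized_distance("zantac", "contac"))  # 0.333...
```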

  6. Orthographic Similarity: LCSR, DICE
  • LCSR: double the length of the longest common subsequence and divide by the total number of characters in the two strings
    • zantac and contac: (2 ∙ 4)/12 = 8/12 = .67
    • zantac and xanax: (2 ∙ 3)/11 = 6/11 = .55
  • DICE: double the number of shared bigrams and divide by the total number of bigrams in the two strings
    • {za,an,nt,ta,ac} and {co,on,nt,ta,ac}: (2 ∙ 3)/(5+5) = 6/10 = .6
    • {za,an,nt,ta,ac} and {xa,an,na,ax}: (2 ∙ 1)/(5+4) = 2/9 = .22
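Both similarity measures are simple to state in code. The sketch below uses standard formulations of LCS and the Dice coefficient over bigram sets and is meant only to reproduce the slide's numbers, not to represent the study's implementation.

```python
def lcs_length(a, b):
    # classic dynamic program for longest-common-subsequence length
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    # double the LCS length, divide by the total number of characters
    return 2 * lcs_length(a, b) / (len(a) + len(b))

def bigrams(s):
    # set of adjacent character pairs, e.g. "zantac" -> {za,an,nt,ta,ac}
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    # double the number of shared bigrams, divide by the total bigram count
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

print(lcsr("zantac", "contac"))  # 0.666...
print(dice("zantac", "xanax"))   # 0.222...
```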

  7. Phonological Distance: Soundex
  • Soundex: transform all but the first consonant to numeric codes, delete 0s, and truncate the resulting string to 4 characters
  • Character conversion: 0 = (a,e,h,i,o,u,w,y); 1 = (b,f,p,v); 2 = (c,g,j,k,q,s,x,z); 3 = (d,t); 4 = (l); 5 = (m,n); 6 = (r)
  • Examples:
    • Match: king and khyngge (k52, k52)
    • Mismatch: knight and night (k523, n23)
    • Match: pulpit and phlebotomy (p413, p413)
    • Mismatch: zantac and contac (z532, c532)
    • Mismatch: zantac and xanax (z532, x52)
  • Alternative: compare syllable count, initial/final sounds, stress locations. Misses sefotan (3 syllables) vs. seftin (2), and gelpad/hypergel.
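A minimal sketch of the Soundex variant the slide describes (keep the first letter, code the rest, collapse repeated codes, drop 0s, truncate to 4 characters). It reproduces the slide's examples; the standard archival Soundex differs in details such as zero-padding, so treat this as an illustration of the slide's rules only.

```python
# code table from the slide: index in the list is the numeric code
CODES = {c: d for d, letters in enumerate(
    ["aehiouwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(word):
    word = word.lower()
    # code every letter after the first, collapse runs of the same code,
    # then delete the 0s (vowels and h/w/y)
    codes = [str(CODES[c]) for c in word[1:]]
    collapsed = [d for i, d in enumerate(codes) if i == 0 or codes[i - 1] != d]
    digits = "".join(d for d in collapsed if d != "0")
    # keep the first letter and truncate the result to 4 characters
    return (word[0] + digits)[:4]

print(soundex("king"), soundex("khyngge"))   # k52 k52   (match)
print(soundex("knight"), soundex("night"))   # k523 n23  (mismatch)
print(soundex("zantac"), soundex("xanax"))   # z532 x52  (mismatch)
```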

  8. Phonological Similarity: ALINE
  • Uses phonological features to compare two words by their sounds (Kondrak, 2000)
    • x# → k(s): +consonantal, +velar, +stop, −voice
    • #x → z: +consonantal, +alveolar, +fricative, +voice
  • Uses the entire string, vowels, and decomposable features
  • Originally developed for identifying cognates in the vocabularies of related languages (colour vs. couleur)
  • Feature weights can be tuned for a specific application
  • Phonological similarity of two words: the optimal match between their phonological features
  • Example pair: zantac, xanax

  9. ALINE Example: Osmitrol and Esmolol (S = 58)
  • Identifies identical pronunciation of different letters.
  • Identifies non-identical but similar sounds.

  10. The vocal tract

  11. Places of Articulation: Numerical Values
  bilabial: 1.0; labiodental: 0.95; dental: 0.9; alveolar: 0.85; retroflex: 0.8; palato-alveolar: 0.75; palatal: 0.7; velar: 0.6; uvular: 0.5

  12. ALINE Features: Weights and Values

  13. Validation: Comparison of Outputs
  • EDIT: 0.667 zantac contac; 0.500 zantac xanax; 0.333 xanax contac
  • DICE: 0.600 zantac contac; 0.222 zantac xanax; 0.000 xanax contac
  • LCSR: 0.667 zantac contac; 0.545 zantac xanax; 0.364 xanax contac
  • ALINE: 0.792 zantac xanax; 0.639 zantac contac; 0.486 xanax contac

  14. Validation: Precision and Recall
  • Precision and recall against an online gold standard: USP Quality Review, March 2001
  • 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced)
  • Example ranking (using DICE; + marks a true confusion pair, - a pair not in the gold standard):
    + 0.889 atgam ratgam
    + 0.875 herceptin perceptin
    - 0.870 zolmitriptan zolomitriptan
    + 0.857 quinidine quinine
    - 0.857 cytosar cytosar-u
    + 0.842 amantadine rimantadine
    …
    - 0.800 erythrocin erythromycin
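Precision and recall at a score threshold can be computed as below. This is a generic sketch, not the study's evaluation code; the toy scores mix values from this slide with a made-up low score for illustration.

```python
def precision_recall(scored_pairs, true_pairs, threshold):
    # pairs scoring at or above the threshold are predicted "confusable"
    predicted = {pair for pair, score in scored_pairs if score >= threshold}
    tp = len(predicted & true_pairs)                  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_pairs)
    return precision, recall

# toy illustration: atgam/ratgam and quinidine/quinine are true confusion
# pairs on this slide; the 0.4 score is invented to show a missed pair
scored = [(("atgam", "ratgam"), 0.889),
          (("cytosar", "cytosar-u"), 0.857),
          (("quinidine", "quinine"), 0.4)]
gold = {("atgam", "ratgam"), ("quinidine", "quinine")}

print(precision_recall(scored, gold, 0.5))  # (0.5, 0.5)
```

Sweeping the threshold from high to low traces out the precision-at-recall curves compared on the next slides.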

  15. Validation: Comparison of Precision at Different Recall Values

  16. Validation: Precision of Techniques with Phonetic Transcription

  17. Optimal Design of Study
  • Develop and use a web-based interface that allows applicants to enter newly proposed names
  • The interface displays a set of scores produced by each approach individually, as well as combined scores based on the union of all the approaches
  • The applicant compares the score to a pre-determined threshold to assess appropriateness
  • In advance, run experiments with different algorithms and their combinations against the gold standard to:
    • determine the "appropriateness" threshold
    • fine-tune: calculate weights for drugname matching

  18. Optimal Design of Study (continued)
  • Parameters have default settings for the cognate-matching task, but these are not appropriate for drugname matching
  • Parameter tuning:
    • calculate weights for drugname matching
    • "hill climbing" search against the gold standard
  • Parameters tuned for the drugname task:
    • maximum score
    • insertion/deletion penalty
    • vowel penalty
    • phonological feature values
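The "hill climbing" search mentioned above can be sketched generically: perturb one parameter at a time and keep any change that improves an objective (here, the objective and parameter names are hypothetical stand-ins; in the study it would be a precision/recall score against the gold standard).

```python
import random

def hill_climb(params, objective, step=0.05, iters=200, seed=0):
    # greedy local search: nudge a random parameter up or down by `step`
    # and keep the change only if the objective improves
    rng = random.Random(seed)
    best = dict(params)
    best_score = objective(best)
    for _ in range(iters):
        name = rng.choice(list(best))
        candidate = dict(best)
        candidate[name] += rng.choice((-step, step))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# hypothetical one-parameter objective, maximized at vowel_penalty = 1.0
best, score = hill_climb({"vowel_penalty": 0.0},
                         lambda p: -(p["vowel_penalty"] - 1.0) ** 2)
```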

  19. Strengths and Weaknesses
  • ALINE matches: ultram/voltaren, nasarel/nizoral, lortab/luride
  • DICE matches: lanoksin/lasiks, gelpad/hypergel, levodopa/methyldopa
  • LCSR matches: edekrin/euleksin, verelan/virilon, nefroks/nifereks

  20. Strengths and Weaknesses (continued)
  • ALINE
    • Highest interpolated precision; easily tuned to the task; matches similar-sounding words that differ in their initial characters (ultram/voltaren)
    • Misses some words with a high shared-bigram count (lanoksin/lasiks), and the weight-tuning process may overfit the data (bupivacaine/ropivacaine vs. brevital/revia)
  • DICE
    • Matches parts of words (bigrams) to detect confusable names that would otherwise be dissimilar (gelpad/hypergel)
    • Misses similar-sounding names (ultram/voltaren) that share no bigrams
  • LCSR
    • Matches words where the number of shared bigrams/sounds is small (edekrin/euleksin)
    • Misses similar-sounding names (lortab/luride) with low subsequence overlap

  21. Conclusion
  • Experimentation with different algorithms and their combinations against the gold standard
  • Fine-tuning based on comparisons with the gold standard (e.g., re-weighting of phonological features)
  • A strong foundation for search modules that automate the minimization of medication errors
  • Solution: a combined approach that benefits from the strengths of all the algorithms (increased recall), without severe degradation in precision (false positives)
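One simple way to realize the union-style combination proposed here is to score each candidate pair with every algorithm and take the maximum, so that any pair flagged by a single matcher is flagged by the combination. A minimal sketch, using the zantac/xanax and zantac/contac scores reported on slide 13 (the table layout is an assumption):

```python
def union_score(scores_by_algorithm, pair):
    # a pair is as suspicious as the most suspicious individual
    # algorithm says it is; this unions the retrieval sets
    return max(scores[pair] for scores in scores_by_algorithm.values())

scores = {
    "EDIT":  {("zantac", "contac"): 0.667, ("zantac", "xanax"): 0.500},
    "DICE":  {("zantac", "contac"): 0.600, ("zantac", "xanax"): 0.222},
    "ALINE": {("zantac", "contac"): 0.639, ("zantac", "xanax"): 0.792},
}

print(union_score(scores, ("zantac", "xanax")))   # 0.792 (from ALINE)
print(union_score(scores, ("zantac", "contac")))  # 0.667 (from EDIT)
```

Taking the maximum raises recall; the precision cost depends on how comparable the individual scores are, which is why the scores must be normalized and thresholds tuned against the gold standard.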
