
Automatic String Matching for Reduction of Drug Name Confusion





Presentation Transcript


  1. Automatic String Matching for Reduction of Drug Name Confusion
  Bonnie J. Dorr, University of Maryland
  Greg Kondrak, University of Alberta
  December 4, 2003

  2. Study Method for Testing Drug Name Similarity
  • Overview: phonological string matching for ranking similarity between drug names
  • Validation of study method: precision and recall against a gold standard
  • Optimal design of study: an interface for assessing the appropriateness of a newly proposed drug name
  • Strengths and weaknesses: each algorithm retrieves, and misses, correct answers that the others do not

  3. Overview: Drugname Matching
  • String matching to rank similarity between drug names
  • Two classes of string matching:
    • orthographic: compare strings in terms of spelling, without reference to sound
    • phonological: compare strings on the basis of a phonetic representation
  • Two methods of matching:
    • distance: how far apart are two strings?
    • similarity: how close are two strings?

  4. Orthographic and Phonological Distance/Similarity
  • Orthographic: Levenshtein/string-edit (distance); LCSR, DICE (similarity)
  • Phonological: Soundex (distance); ALINE (similarity)
  • Distance vs. similarity: dist(w1, w2) is comparable to 1 − sim(w1, w2)

  5. Orthographic Distance: Levenshtein (String Edit)
  • Levenshtein/string-edit distance: count the number of steps it takes to transform one string into the other
  • Examples:
    • Distance between zantac and contac is 2.
    • Distance between zantac and xanax is 3.
  • For a "global distance", divide by the length of the longer string:
    • 2/max(6, 6) = 2/6 = .33
    • 3/max(5, 6) = 3/6 = .5
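The distance on this slide can be sketched in a few lines of Python. This is a generic textbook implementation for illustration, not the authors' code; `normalized_distance` is a name introduced here for the slide's "global distance".

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a
    # and b[:j]; each step costs 1 (insert, delete, or substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    # the slide's "global distance": divide by the longer string's length
    return levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("zantac", "contac"))          # 2
print(levenshtein("zantac", "xanax"))           # 3
print(normalized_distance("zantac", "contac"))  # 0.333...
```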

  6. Orthographic Similarity: LCSR, DICE
  • LCSR: double the length of the longest common subsequence and divide by the total number of characters in the two strings
    • zantac and contac: (2 ∙ 4)/12 = 8/12 = .67
    • zantac and xanax: (2 ∙ 3)/11 = 6/11 = .55
  • DICE: double the number of shared bigrams and divide by the total number of bigrams in the two strings
    • {za,an,nt,ta,ac} and {co,on,nt,ta,ac}: (2 ∙ 3)/(5+5) = 6/10 = .6
    • {za,an,nt,ta,ac} and {xa,an,na,ax}: (2 ∙ 1)/(5+4) = 2/9 = .22
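Both similarity measures are simple to state in code. The sketch below uses standard formulations of LCS and the Dice coefficient over bigram sets and is meant only to reproduce the slide's numbers, not to represent the study's implementation.

```python
def lcs_length(a, b):
    # classic dynamic program for longest-common-subsequence length
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    # double the LCS length, divide by the total number of characters
    return 2 * lcs_length(a, b) / (len(a) + len(b))

def bigrams(s):
    # set of adjacent character pairs, e.g. "zantac" -> {za,an,nt,ta,ac}
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    # double the number of shared bigrams, divide by the total bigram count
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

print(lcsr("zantac", "contac"))  # 0.666...
print(dice("zantac", "xanax"))   # 0.222...
```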

  7. Phonological Distance: Soundex
  • Soundex: transform all but the first consonant to numeric codes, delete 0s, and truncate the resulting string to 4 characters
  • Character conversion: 0 = (a,e,h,i,o,u,w,y); 1 = (b,f,p,v); 2 = (c,g,j,k,q,s,x,z); 3 = (d,t); 4 = (l); 5 = (m,n); 6 = (r)
  • Examples:
    • Match: king and khyngge (k52, k52)
    • Mismatch: knight and night (k523, n23)
    • Match: pulpit and phlebotomy (p413, p413)
    • Mismatch: zantac and contac (z532, c532)
    • Mismatch: zantac and xanax (z532, x52)
  • Alternative: compare syllable count, initial/final sounds, stress locations. Misses sefotan (3 syllables) vs. seftin (2), and gelpad/hypergel.
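A minimal sketch of the Soundex variant the slide describes (keep the first letter, code the rest, collapse repeated codes, drop 0s, truncate to 4 characters). It reproduces the slide's examples; the standard archival Soundex differs in details such as zero-padding, so treat this as an illustration of the slide's rules only.

```python
# code table from the slide: index in the list is the numeric code
CODES = {c: d for d, letters in enumerate(
    ["aehiouwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(word):
    word = word.lower()
    # code every letter after the first, collapse runs of the same code,
    # then delete the 0s (vowels and h/w/y)
    codes = [str(CODES[c]) for c in word[1:]]
    collapsed = [d for i, d in enumerate(codes) if i == 0 or codes[i - 1] != d]
    digits = "".join(d for d in collapsed if d != "0")
    # keep the first letter and truncate the result to 4 characters
    return (word[0] + digits)[:4]

print(soundex("king"), soundex("khyngge"))   # k52 k52   (match)
print(soundex("knight"), soundex("night"))   # k523 n23  (mismatch)
print(soundex("zantac"), soundex("xanax"))   # z532 x52  (mismatch)
```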

  8. Phonological Similarity: ALINE
  • Uses phonological features to compare two words by their sounds (Kondrak, 2000)
    • x# → k(s): +consonantal, +velar, +stop, −voice
    • #x → z: +consonantal, +alveolar, +fricative, +voice
  • Uses the entire string, vowels, and decomposable features
  • Originally developed for identifying cognates in the vocabularies of related languages (colour vs. couleur)
  • Feature weights can be tuned for a specific application
  • Phonological similarity of two words: the optimal match between their phonological features
  • Example pair: zantac, xanax

  9. ALINE Example: Osmitrol and Esmolol (S = 58)
  • Identifies identical pronunciation of different letters.
  • Identifies non-identical but similar sounds.

  10. The vocal tract

  11. Places of Articulation: Numerical Values
  bilabial: 1.0; labiodental: 0.95; dental: 0.9; alveolar: 0.85; retroflex: 0.8; palato-alveolar: 0.75; palatal: 0.7; velar: 0.6; uvular: 0.5

  12. ALINE Features: Weights and Values

  13. Validation: Comparison of Outputs
  • EDIT: 0.667 zantac contac; 0.500 zantac xanax; 0.333 xanax contac
  • DICE: 0.600 zantac contac; 0.222 zantac xanax; 0.000 xanax contac
  • LCSR: 0.667 zantac contac; 0.545 zantac xanax; 0.364 xanax contac
  • ALINE: 0.792 zantac xanax; 0.639 zantac contac; 0.486 xanax contac

  14. Validation: Precision and Recall
  • Precision and recall against an online gold standard: USP Quality Review, March 2001
  • 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced)
  • Example ranking (using DICE; + marks a true confusion pair, - a pair not in the gold standard):
    + 0.889 atgam ratgam
    + 0.875 herceptin perceptin
    - 0.870 zolmitriptan zolomitriptan
    + 0.857 quinidine quinine
    - 0.857 cytosar cytosar-u
    + 0.842 amantadine rimantadine
    …
    - 0.800 erythrocin erythromycin
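Precision and recall at a score threshold can be computed as below. This is a generic sketch, not the study's evaluation code; the toy scores mix values from this slide with a made-up low score for illustration.

```python
def precision_recall(scored_pairs, true_pairs, threshold):
    # pairs scoring at or above the threshold are predicted "confusable"
    predicted = {pair for pair, score in scored_pairs if score >= threshold}
    tp = len(predicted & true_pairs)                  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_pairs)
    return precision, recall

# toy illustration: atgam/ratgam and quinidine/quinine are true confusion
# pairs on this slide; the 0.4 score is invented to show a missed pair
scored = [(("atgam", "ratgam"), 0.889),
          (("cytosar", "cytosar-u"), 0.857),
          (("quinidine", "quinine"), 0.4)]
gold = {("atgam", "ratgam"), ("quinidine", "quinine")}

print(precision_recall(scored, gold, 0.5))  # (0.5, 0.5)
```

Sweeping the threshold from high to low traces out the precision-at-recall curves compared on the next slides.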

  15. Validation: Comparison of Precision at Different Recall Values

  16. Validation: Precision of Techniques with Phonetic Transcription

  17. Optimal Design of Study
  • Develop and use a web-based interface that allows applicants to enter newly proposed names
  • The interface displays a set of scores produced by each approach individually, as well as combined scores based on the union of all the approaches
  • The applicant compares the score to a pre-determined threshold to assess appropriateness
  • In advance, run experiments with different algorithms and their combinations against the gold standard to:
    • determine the "appropriateness" threshold
    • fine-tune: calculate weights for drugname matching

  18. Optimal Design of Study (continued)
  • Parameters have default settings for the cognate-matching task, but these are not appropriate for drugname matching
  • Parameter tuning:
    • calculate weights for drugname matching
    • "hill climbing" search against the gold standard
  • Parameters tuned for the drugname task:
    • maximum score
    • insertion/deletion penalty
    • vowel penalty
    • phonological feature values
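The "hill climbing" search mentioned above can be sketched generically: perturb one parameter at a time and keep any change that improves an objective (here, the objective and parameter names are hypothetical stand-ins; in the study it would be a precision/recall score against the gold standard).

```python
import random

def hill_climb(params, objective, step=0.05, iters=200, seed=0):
    # greedy local search: nudge a random parameter up or down by `step`
    # and keep the change only if the objective improves
    rng = random.Random(seed)
    best = dict(params)
    best_score = objective(best)
    for _ in range(iters):
        name = rng.choice(list(best))
        candidate = dict(best)
        candidate[name] += rng.choice((-step, step))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# hypothetical one-parameter objective, maximized at vowel_penalty = 1.0
best, score = hill_climb({"vowel_penalty": 0.0},
                         lambda p: -(p["vowel_penalty"] - 1.0) ** 2)
```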

  19. Strengths and Weaknesses
  • ALINE matches: ultram/voltaren, nasarel/nizoral, lortab/luride
  • DICE matches: lanoksin/lasiks, gelpad/hypergel, levodopa/methyldopa
  • LCSR matches: edekrin/euleksin, verelan/virilon, nefroks/nifereks

  20. Strengths and Weaknesses (continued)
  • ALINE
    • Highest interpolated precision; easily tuned to the task; matches similar-sounding words that differ in their initial characters (ultram/voltaren)
    • Misses some words with a high shared-bigram count (lanoksin/lasiks), and the weight-tuning process may overfit the data (bupivacaine/ropivacaine vs. brevital/revia)
  • DICE
    • Matches parts of words (bigrams) to detect confusable names that would otherwise be dissimilar (gelpad/hypergel)
    • Misses similar-sounding names (ultram/voltaren) that share no bigrams
  • LCSR
    • Matches words where the number of shared bigrams/sounds is small (edekrin/euleksin)
    • Misses similar-sounding names (lortab/luride) with low subsequence overlap

  21. Conclusion
  • Experimentation with different algorithms and their combinations against the gold standard
  • Fine-tuning based on comparisons with the gold standard (e.g., re-weighting of phonological features)
  • A strong foundation for search modules that automate the minimization of medication errors
  • Solution: a combined approach that benefits from the strengths of all the algorithms (increased recall), without severe degradation in precision (false positives)
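One simple way to realize the union-style combination proposed here is to score each candidate pair with every algorithm and take the maximum, so that any pair flagged by a single matcher is flagged by the combination. A minimal sketch, using the zantac/xanax and zantac/contac scores reported on slide 13 (the table layout is an assumption):

```python
def union_score(scores_by_algorithm, pair):
    # a pair is as suspicious as the most suspicious individual
    # algorithm says it is; this unions the retrieval sets
    return max(scores[pair] for scores in scores_by_algorithm.values())

scores = {
    "EDIT":  {("zantac", "contac"): 0.667, ("zantac", "xanax"): 0.500},
    "DICE":  {("zantac", "contac"): 0.600, ("zantac", "xanax"): 0.222},
    "ALINE": {("zantac", "contac"): 0.639, ("zantac", "xanax"): 0.792},
}

print(union_score(scores, ("zantac", "xanax")))   # 0.792 (from ALINE)
print(union_score(scores, ("zantac", "contac")))  # 0.667 (from EDIT)
```

Taking the maximum raises recall; the precision cost depends on how comparable the individual scores are, which is why the scores must be normalized and thresholds tuned against the gold standard.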
