A Comparison of String Matching Distance Metrics for Name-Matching Tasks

A Comparison of String Matching Distance Metrics for Name-Matching Tasks. William Cohen, Pradeep RaviKumar, Stephen Fienberg. Motivating Example. List of people and some attributes compiled by one source Updates by another source need to be merged Need to locate matching records

Presentation Transcript


  1. A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg

  2. Motivating Example • List of people and some attributes compiled by one source • Updates by another source need to be merged • Need to locate matching records • Forcing exact match not sufficient • Typographical errors (letter “B” vs. letter “V”) • Scanning errors (letter “I” vs. numeral “1”) • Such errors exceed 20% in some cases • Decide when two records match → Decide when two strings (or words) are identical

  3. History – String Matching • Statistics • Treat as a classification problem [Fellegi & Sunter] • Use of other prior knowledge • String represented as a feature vector • Databases • No prior knowledge • Use of distance functions – edit distance, Monge & Elkan, TFIDF • Knowledge-intensive approaches • User interaction [Hernandez & Stolfo] • Artificial Intelligence • Learn the parameters of the edit distance functions • Combine the results of different distance functions • Compare string matching distance functions for the task of name matching

  4. Edit Distance • Number of edit operations needed to go from string s to string t • Operations: insert, delete, substitute • Levenshtein: assigns unit cost to each operation • Distance (“smile”, “mile”) = 1 • Distance (“meet”, “meat”) = 1 • Computed by dynamic programming • Reordering of words can be misleading • “Cohen, William” vs. “William Cohen”
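
The dynamic-programming computation mentioned on this slide can be sketched as follows. This is a minimal illustration of unit-cost Levenshtein distance; the function name `levenshtein` is chosen here for illustration and is not taken from the presentation.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between s and t."""
    # previous[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, t_char in enumerate(t, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("smile", "mile"))  # 1
print(levenshtein("meet", "meat"))   # 1
```

Only two rows of the table are kept at a time, so memory grows with the length of `t` rather than with the product of both lengths.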

  5. Edit Distance • Monge-Elkan: assigns relatively lower cost to a sequence of insertions or deletions • Cost A + B*(n – 1) for a run of n insertions or deletions (B < A) • Other methods assign decreasing costs to subsequent insertions
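
The gap cost A + B*(n – 1) can be illustrated with a generic affine-gap edit distance. The sketch below is a standard three-table dynamic program, not the Monge-Elkan function itself: the parameter names (`open_cost` for A, `extend_cost` for B, `sub_cost`) and their default values are assumptions made for the example.

```python
import math


def affine_gap_distance(s, t, open_cost=1.0, extend_cost=0.5, sub_cost=1.0):
    """Edit distance where a run of n insertions or deletions costs
    open_cost + extend_cost * (n - 1)."""
    n, m = len(s), len(t)
    inf = math.inf
    # match[i][j]: best cost when s[i-1] is aligned with t[j-1]
    # dele[i][j]:  best cost when s[i-1] is deleted (run of deletions)
    # ins[i][j]:   best cost when t[j-1] is inserted (run of insertions)
    match = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    ins = [[inf] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        dele[i][0] = open_cost + extend_cost * (i - 1)
    for j in range(1, m + 1):
        ins[0][j] = open_cost + extend_cost * (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            match[i][j] = step + min(
                match[i - 1][j - 1], dele[i - 1][j - 1], ins[i - 1][j - 1]
            )
            # Extending an existing run is cheaper than opening a new one.
            dele[i][j] = min(
                match[i - 1][j] + open_cost,
                dele[i - 1][j] + extend_cost,
                ins[i - 1][j] + open_cost,
            )
            ins[i][j] = min(
                match[i][j - 1] + open_cost,
                ins[i][j - 1] + extend_cost,
                dele[i][j - 1] + open_cost,
            )
    return min(match[n][m], dele[n][m], ins[n][m])


# Deleting the run "cde" costs 1.0 + 0.5 * 2 = 2.0 rather than 3.0.
print(affine_gap_distance("abcdef", "abf"))
```

Setting `open_cost` and `extend_cost` both to 1 recovers the unit-cost distance, which is a convenient sanity check.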

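  6. Edit Distance • Jaro (s, t) • Let s’ be the characters in s that are common with t • Let t’ be the characters in t that are common with s • Let T(s’, t’) be half the number of transpositions for s’ and t’ • Jaro(s, t) = 1/3 · (|s’|/|s| + |t’|/|t| + (|s’| − T(s’, t’))/|s’|)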
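
A minimal sketch of the Jaro score built from these three quantities, averaging |s’|/|s|, |t’|/|t| and (|s’| − T)/|s’|. The slide does not say how far apart two "common" characters may be; the window used here (half the longer length, minus one) is the one found in many implementations and is an assumption of this example.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Assumed matching window: characters count as common only if their
    # positions differ by at most this much.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    for i, s_char in enumerate(s):
        low = max(0, i - window)
        high = min(len(t), i + window + 1)
        for j in range(low, high):
            if not t_matched[j] and t[j] == s_char:
                s_matched[i] = True
                t_matched[j] = True
                break
    s_common = [c for c, hit in zip(s, s_matched) if hit]  # s'
    t_common = [c for c, hit in zip(t, t_matched) if hit]  # t'
    common = len(s_common)
    if common == 0:
        return 0.0
    # T: half the number of positions where s' and t' disagree.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (
        common / len(s)
        + common / len(t)
        + (common - half_transpositions) / common
    ) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```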

  7. Improvements to Jaro • McLaughlin • Exact match – weight of 1.0 • Similar characters – weight of 0.3 • Scanning error (“I” vs. “1”) • Typographical error (“B” vs. “V”) • Pollock and Zamora • Error rates increase as the position in string moves to the right • Adjust output of Jaro by fixed amount depending upon how many of the first 4 characters match
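
The prefix-based adjustment can be sketched as below. The slide only says the Jaro output is adjusted depending on how many of the first 4 characters match; the specific form used here, `score + p * scale * (1 - score)` with `scale = 0.1`, is the commonly used Winkler-style adjustment and is an assumption of this example, as are the function names.

```python
def common_prefix_length(s: str, t: str, limit: int = 4) -> int:
    """Number of leading characters shared by s and t, capped at limit."""
    count = 0
    for a, b in zip(s[:limit], t[:limit]):
        if a != b:
            break
        count += 1
    return count


def prefix_adjusted(score: float, s: str, t: str,
                    scale: float = 0.1, limit: int = 4) -> float:
    """Raise a base similarity score for strings that agree at the start.

    score is any similarity in [0, 1], for example a Jaro score.
    """
    p = common_prefix_length(s, t, limit)
    return score + p * scale * (1.0 - score)


# Two leading characters match, so 0.5 becomes 0.5 + 2 * 0.1 * 0.5 = 0.6.
print(prefix_adjusted(0.5, "abcd", "abxx"))
```

Because the boost is proportional to `1 - score`, the adjusted value never exceeds 1.0 as long as `limit * scale` is at most 1.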

  8. Term Based • Treat strings s & t as bags S and T of words • Examples • Jaccard similarity = |S∩T| / |S∪T| • TFIDF
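
A minimal sketch of Jaccard similarity over word sets. The tokenizer (lowercasing and splitting on non-word characters) is an assumption of this example; the slide does not specify one.

```python
import re


def word_set(s: str) -> set:
    """Lowercased words of s, with punctuation dropped (assumed tokenizer)."""
    return set(re.findall(r"\w+", s.lower()))


def jaccard(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over the word sets of s and t."""
    s_words, t_words = word_set(s), word_set(t)
    union = s_words | t_words
    if not union:
        return 0.0
    return len(s_words & t_words) / len(union)


print(jaccard("Cohen, William", "William Cohen"))  # 1.0
print(jaccard("William Cohen", "William Cohon"))   # one shared word out of three
```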

  9. Term Based • Words may be weighted to make the common words count less • Advantages • Exploits frequency information • Ordering of words doesn’t matter (Cohen, William vs. William Cohen) • Disadvantages • Sensitive to errors in spelling (Cohen vs. Cohon) and abbreviations (Univ. vs. University) • Ordering of words ignored (City National Bank vs. National City Bank)
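
The frequency weighting described on this slide can be sketched as a TFIDF cosine similarity. The weighting used here, log(TF + 1) times a smoothed log inverse document frequency, is one common scheme and an assumption of this example; the corpus, tokenizer and function names are also illustrative.

```python
import math
import re
from collections import Counter


def tokenize(s: str) -> list:
    return re.findall(r"\w+", s.lower())


def document_frequencies(corpus: list) -> Counter:
    """How many names in the corpus contain each word."""
    df = Counter()
    for name in corpus:
        df.update(set(tokenize(name)))
    return df


def tfidf_vector(s: str, df: Counter, corpus_size: int) -> dict:
    """Unit-length vector of word weights; rarer words weigh more."""
    tf = Counter(tokenize(s))
    weights = {
        # Smoothed IDF (assumption) so unseen and ubiquitous words stay positive.
        w: math.log(count + 1) * math.log(1 + corpus_size / (1 + df[w]))
        for w, count in tf.items()
    }
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}


def tfidf_similarity(s: str, t: str, corpus: list) -> float:
    df = document_frequencies(corpus)
    s_vec = tfidf_vector(s, df, len(corpus))
    t_vec = tfidf_vector(t, df, len(corpus))
    return sum(weight * t_vec[w] for w, weight in s_vec.items() if w in t_vec)


names = ["William Cohen", "Pradeep Ravikumar", "Stephen Fienberg", "William Smith"]
# Word order is ignored, so this pair scores (approximately) 1.0.
print(tfidf_similarity("Cohen, William", "William Cohen", names))
# Only the comparatively common word "William" is shared here.
print(tfidf_similarity("William Cohen", "William Smith", names))
```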

  10. Hybrid Distance Functions • Recursive Matching • Let s = (a1, a2, …, aK) and t = (b1, b2, …, bL) be the substrings (e.g., words) of s and t • sim(s, t) = (1/K) · Σi maxj sim’(ai, bj) • sim’ is the level-two matching function
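
A minimal sketch of recursive matching: each word of s is scored against its best-matching word of t using a secondary function sim’, and the best scores are averaged over the K words of s. The secondary function used here, `difflib.SequenceMatcher.ratio` from the Python standard library, is only a stand-in chosen for the example; the presentation does not prescribe it.

```python
import re
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Stand-in for the level-two function sim' (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_match(s: str, t: str, secondary=char_similarity) -> float:
    """Average, over the words of s, of the best secondary score in t."""
    a_words = re.findall(r"\w+", s.lower())
    b_words = re.findall(r"\w+", t.lower())
    if not a_words or not b_words:
        return 0.0
    best_scores = [max(secondary(a, b) for b in b_words) for a in a_words]
    return sum(best_scores) / len(a_words)


# Word order does not matter, and a misspelled word still contributes.
print(recursive_match("Cohen, William", "William Cohen"))
print(recursive_match("William Cohon", "William Cohen"))
```

Note that the score is normalized by the number of words in the first argument, so it is not symmetric in general.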

  11. Blocking / Pruning Methods • Comparing all pairs is too expensive when lists are large • A pair (s, t) is a candidate for a match if s and t share some substring v that appears in at most a fraction f of all names • Using a v of length 4 and f = 1% finds, on average, 99% of the correct pairs
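
The candidate rule can be sketched with an index from length-n substrings to the names that contain them. The function names, the case folding, and the choice to compute the fraction over both lists together are assumptions of this example.

```python
from collections import defaultdict


def substrings(name: str, n: int) -> set:
    """All length-n substrings of the lowercased name."""
    lowered = name.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def candidate_pairs(left: list, right: list, n: int = 4, f: float = 0.01) -> set:
    """Index pairs (i, j) where left[i] and right[j] share a length-n
    substring that occurs in at most a fraction f of all names."""
    total = len(left) + len(right)
    left_index = defaultdict(set)
    right_index = defaultdict(set)
    for i, name in enumerate(left):
        for v in substrings(name, n):
            left_index[v].add(i)
    for j, name in enumerate(right):
        for v in substrings(name, n):
            right_index[v].add(j)
    pairs = set()
    for v, left_ids in left_index.items():
        right_ids = right_index.get(v)
        if not right_ids:
            continue
        # Skip substrings that are too common to be discriminating.
        if len(left_ids) + len(right_ids) > f * total:
            continue
        pairs.update((i, j) for i in left_ids for j in right_ids)
    return pairs


left = ["William Cohen", "Stephen Fienberg"]
right = ["Wiliam Cohen", "Steven Fienberg", "Pradeep Ravikumar"]
# A large f is used here only because the toy lists are tiny.
print(sorted(candidate_pairs(left, right, n=4, f=0.5)))  # [(0, 0), (1, 1)]
```

Only the surviving pairs would then be passed to a more expensive distance function.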

  12. Results - Metric • Output of each algorithm is a list of candidate pairs ranked by distance • Non-interpolated average precision of a ranking • Other metrics used • Interpolated precision
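
Non-interpolated average precision can be sketched as follows, using the standard definition: the precision at each rank where a correct pair appears, averaged over the number of correct pairs. The input format (a list of booleans in ranked order) and the optional `total_correct` parameter are assumptions of this example.

```python
def average_precision(ranked_is_correct: list, total_correct: int = None) -> float:
    """Non-interpolated average precision of a ranked list of candidate pairs.

    ranked_is_correct[i] is True when the pair at rank i + 1 is a true match.
    total_correct defaults to the number of correct pairs in the ranking.
    """
    if total_correct is None:
        total_correct = sum(ranked_is_correct)
    if total_correct == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_is_correct, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_correct


# Correct pairs at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision([True, False, True]))
```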

  13. Results - Matching • Term based: TFIDF most accurate • Edit distance based: Monge-Elkan most accurate • Jaro as accurate as Monge-Elkan, but much faster • Combine TFIDF and Jaro