

  1. Record Linkage/ Duplicate Elimination Sunita Sarawagi Sunita@iitb.ac.in

  2. The de-duplication problem
  Given a list of semi-structured records, find all records that refer to the same entity.
  • Example applications:
  • Data warehousing: merging name/address lists. Entity: person, household
  • Automatic citation databases (Citeseer): references. Entity: paper

  3. Example
  Link together people from several data sources: taxpayers, land records, passport, transport, telephone.
  Duplicates:
  SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014 ,
  DERYCK ,D ,SOZA ,03 ,GERA VILLA ,,VIMAN NAGAR PUNE ,411014
  Non-duplicates:
  CHAFEKAR ,RAMCHANDRA ,DAMODAR ,SHOP 8 ,H NO 509 NARAYAN PETH PUNE 411030
  CHITRAV ,RAMCHANDRA ,D ,FLAT 5 ,H NO 2105 SADASHIV PETH PUNE 411 030

  4. Challenges
  • Errors and inconsistencies in data
  • Spotting duplicates can be hard because they may be spread far apart: they may not be group-able using obvious keys
  • Domain-specific: existing manual approaches require retuning with every new domain

  5. How are such problems created?
  • Human factors: incorrect data entry; ambiguity during data transformations
  • Application factors: erroneous applications populating databases; faulty database design (constraints not enforced)
  • Obsolescence: the real world is dynamic

  6. Database-level linkage
  (Figure: probabilistic links between three sources -- a patent database in XML with inventor, title, assignee, abstract, and inventor list; computer science publications (DBLP) with authors, titles, and abstracts; and medical publications with authors and titles.)
  Exploit the various kinds of information in decreasing order of information content and increasing order of computational overhead.

  7. Multi-table data
  (Figure: an interlinked schema.)
  • Vendor (vendor-id, name, address, joining-date)
  • Parts (part-id, description, importance, replacement frequency)
  • Replacement (part-id, ship-id, date)
  • Ship (ship-id, name, weight, capacity)
  • Trip-log (ship-id, trip start date, trip length, destination)
  • Orders (part-id, vendor-id, order date, price)
  Duplicates can be in all of the interlinked tables. Goal: resolving them simultaneously could be better.

  8. Outline
  • Part I: Motivation, similarity measures (90 min): data quality, applications; linkage methodology, core measures; learning core measures; linkage-based measures
  • Part II: Efficient algorithms for approximate join (60 min)
  • Part III: Clustering/partitioning algorithms (30 min)

  9. String similarity measures
  • Token-based. Examples: Jaccard, TF-IDF cosine similarity. Suitable for large documents
  • Character-based. Examples: edit distance and variants like Levenshtein and Jaro-Winkler; Soundex. Suitable for short strings with spelling mistakes
  • Hybrids

  10. Token based
  • Tokens/words: 'AT&T Corporation' -> 'AT&T', 'Corporation'
  • Similarity: various measures of overlap of two sets S, T
  • Jaccard(S,T) = |S ∩ T| / |S ∪ T| (sketch below)
  • Example: S = 'AT&T Corporation' -> 'AT&T', 'Corporation'; T = 'AT&T Corp' -> 'AT&T', 'Corp'; Jaccard(S,T) = 1/3
  • Variants: weights attached to each token
  • Useful for large strings, e.g., web documents
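
A minimal Python sketch of the unweighted Jaccard measure above; tokenization by whitespace is an illustrative assumption (the slide's tokens keep punctuation such as 'AT&T' intact):

```python
def jaccard(s_tokens, t_tokens):
    """Jaccard similarity of two token sets: |S intersect T| / |S union T|."""
    s, t = set(s_tokens), set(t_tokens)
    if not s and not t:
        return 1.0
    return len(s & t) / len(s | t)

# Tokenize by whitespace, as in the slide's example.
S = "AT&T Corporation".split()
T = "AT&T Corp".split()
print(jaccard(S, T))  # 0.333... : one shared token out of three distinct tokens
```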

  11. Cosine similarity with TF-IDF weights
  • Sets are transformed to vectors, with each term as a dimension
  • Similarity: dot product of the two vectors, each normalized to unit length, i.e., the cosine of the angle between them
  • Term weight = TF-IDF: log(tf + 1) * log(idf), where
  • tf: frequency of the term in document d
  • idf: number of documents / number of documents containing the term
  • Intuitively, rare terms are more important
  • Widely used in traditional IR
  • Example: 'AT&T Corporation', 'AT&T Corp', or 'AT&T Inc': low weights for 'Corporation', 'Corp', 'Inc'; higher weight for 'AT&T'
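
A small sketch of this measure using the slide's log(tf + 1) * log(idf) weighting and unit-length normalization; the six-record corpus is purely illustrative:

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Weight each term by log(tf + 1) * log(idf), then normalize the
    resulting sparse vector to unit length, per the definition above."""
    n = len(corpus)
    vec = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        vec[term] = math.log(tf + 1) * math.log(n / df)
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

corpus = [s.split() for s in
          ["AT&T Corporation", "AT&T Corp", "IBM Corporation",
           "Oracle Corporation", "Intel Corporation", "IBM Corp"]]
u = tfidf_vector(corpus[0], corpus)
v = tfidf_vector(corpus[1], corpus)
print(cosine(u, v))  # dominated by the shared 'AT&T', whose IDF is higher
                     # than that of the common tokens 'Corporation'/'Corp'
```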

  12. Edit Distance [G98]
  • Given two strings S, T, edit(S,T) is the minimum-cost sequence of operations transforming S into T
  • Character operations: I (insert), D (delete), R (replace)
  • Example: edit(Error, Eror) = 1, edit(great, grate) = 2
  • Folklore dynamic programming algorithm to compute edit() (sketch below): O(m²), versus O(2m log m) for token-based measures
  • Several variants (gaps, weights) -- becomes NP-complete easily
  • Varying costs of operations can be learnt [RY97]
  • Observations: suitable for common typing mistakes on small strings (Comprehensive vs Comprenhensive); problematic for specific domains
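
The folklore dynamic program mentioned above, sketched with unit costs for insert, delete, and replace; only two rows of the table are kept, so memory is O(|t|):

```python
def edit_distance(s, t):
    """Classic dynamic program: O(|s|*|t|) time, unit costs for the
    I (insert), D (delete), and R (replace) character operations."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))      # distances from s[:0] to every t[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                         # delete s[i-1]
                         cur[j - 1] + 1,                      # insert t[j-1]
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # replace
        prev = cur
    return prev[n]

print(edit_distance("Error", "Eror"))   # 1
print(edit_distance("great", "grate"))  # 2
```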

  13. Edit Distance with affine gaps
  • Differences between 'duplicates' are often due to abbreviations or whole-word insertions: under plain edit distance, IBM Corp. is closer to ATT Corp. than to IBM Corporation; John Smith vs John Edward Smith vs John E. Smith
  • Allow sequences of mismatched characters (gaps) in the alignment of two strings
  • Penalty uses the affine cost model: cost(g) = s + e·l, where s is the cost of opening a gap, e the cost of extending it, and l the length of the gap
  • A similar dynamic programming algorithm applies (sketch below)
  • Parameters are domain-dependent and learnable, e.g., [BM03, MBP05]
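
A sketch of the affine-gap dynamic program (a Gotoh-style recurrence with three matrices); the mismatch and gap costs below are illustrative placeholders, since the slide notes the parameters are domain-dependent and learnable:

```python
INF = float("inf")

def affine_gap_distance(s, t, mismatch=1.0, gap_open=1.0, gap_ext=0.1):
    """Global alignment distance with affine gap penalty
    cost(gap of length l) = gap_open + gap_ext * l."""
    m, n = len(s), len(t)
    # M: ends in a (mis)match; P: ends in a gap in t; Q: ends in a gap in s
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    P = [[INF] * (n + 1) for _ in range(m + 1)]
    Q = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        P[i][0] = gap_open + gap_ext * i
    for j in range(1, n + 1):
        Q[0][j] = gap_open + gap_ext * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = min(M[i-1][j-1], P[i-1][j-1], Q[i-1][j-1]) + sub
            P[i][j] = min(M[i-1][j] + gap_open + gap_ext, P[i-1][j] + gap_ext)
            Q[i][j] = min(M[i][j-1] + gap_open + gap_ext, Q[i][j-1] + gap_ext)
    return min(M[m][n], P[m][n], Q[m][n])

# A run of inserted characters ('oration', 'Edward ') counts as ONE gap, so a
# whole-word difference is penalized far less than under plain edit distance:
print(affine_gap_distance("IBM Corp", "IBM Corporation"))      # 1.7
print(affine_gap_distance("John Smith", "John Edward Smith"))  # 1.7
```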

  14. Approximate edit-distance: Jaro rule
  • Given strings s = a1,…,aK and t = b1,…,bL
  • ai in s is common with t if there is a bj in t such that ai = bj and i − H ≤ j ≤ i + H, where H = min(|s|,|t|)/2
  • Let s' = a1',…,aK'' (and t' = b1',…,bL'') be the characters of s (t) common with t (s)
  • Ts',t' = half the number of transposed positions between s' and t'
  • Jaro(s,t) = (1/3) · (|s'|/|s| + |t'|/|t| + (|s'| − Ts',t')/|s'|)
  • Martha vs Marhta: H = 3, s' = Martha, t' = Marhta, two transposed characters so Ts',t' = 1; Jaro(Martha, Marhta) = (1 + 1 + 5/6)/3 ≈ 0.94
  • Jonathan vs Janathon (H = 4): s' = jnathn, t' = jnathn, Ts',t' = 0; Jaro(Jonathan, Janathon) = (6/8 + 6/8 + 1)/3 ≈ 0.83
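
A sketch of the Jaro rule as defined above. Implementations differ slightly in how common characters are bookkept, so exact values can vary; here the transposition count is half the number of positions where the two common-character sequences disagree:

```python
def jaro(s, t):
    """Jaro similarity: a character of s is 'common' if it occurs in t
    within a window of H = min(|s|,|t|)//2 positions, as defined above."""
    if not s or not t:
        return 0.0
    H = min(len(s), len(t)) // 2
    matched_t = [False] * len(t)
    s_common = []
    for i, a in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not matched_t[j] and t[j] == a:
                matched_t[j] = True
                s_common.append(a)
                break
    t_common = [b for j, b in enumerate(t) if matched_t[j]]
    m = len(s_common)
    if m == 0:
        return 0.0
    # transpositions: half the positions where the common sequences disagree
    T = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - T) / m) / 3

print(jaro("MARTHA", "MARHTA"))      # ~0.944
print(jaro("JONATHAN", "JANATHON"))  # lower; exact value depends on matching details
```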

  15. Hybrids [CRF03]
  • Example: 'Edward, John' vs 'Jon Edwerd'
  • Let S = {a1,…,aK}, T = {b1,…,bL} be sets of terms, and sim'() some secondary similarity function
  • C(t,S,T) = {w ∈ S such that ∃ v ∈ T with sim'(w,v) > t}
  • D(w,T) = max over v ∈ T of sim'(w,v), for w ∈ C(t,S,T)
  • Soft TF-IDF: sTFIDF(S,T) = Σ over w ∈ C(t,S,T) of V(w,S) · V(w,T) · D(w,T), where V(w,·) is the normalized TF-IDF weight (sketch below)
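
A compact sketch of this soft TF-IDF combination. The standard library's difflib ratio stands in for sim' ([CRF03] pairs TF-IDF with Jaro-Winkler, which would slot in the same way), the weight dictionaries are assumed to be precomputed unit-normalized TF-IDF weights (slide 11), and crediting the weight of the closest matching token in T follows common implementations:

```python
from difflib import SequenceMatcher

def sim2(w, v):
    """Stand-in secondary similarity; [CRF03] uses Jaro-Winkler here."""
    return SequenceMatcher(None, w, v).ratio()

def soft_tfidf(S, T, wS, wT, sim=sim2, t=0.8):
    """For each token w of S whose closest token v in T has sim(w,v) > t,
    add V(w,S) * V(v,T) * D(w,T) to the score."""
    score = 0.0
    for w in S:
        v = max(T, key=lambda x: sim(w, x))  # arg max of sim'(w, v) over T
        d = sim(w, v)                        # D(w, T)
        if d > t:
            score += wS[w] * wT[v] * d
    return score

S, T = ["Edward", "John"], ["Jon", "Edwerd"]
wS = {"Edward": 0.707, "John": 0.707}        # illustrative unit-norm weights
wT = {"Jon": 0.707, "Edwerd": 0.707}
print(soft_tfidf(S, T, wS, wT))              # high despite zero exact overlap
```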

  16. Soundex Encoding
  • A phonetic algorithm that indexes names by their sound when pronounced in English
  • A code consists of the first letter of the name followed by three digits; the digits encode similar-sounding consonants
  • Remove all W, H
  • B, F, P, V encoded as 1; C, G, J, K, Q, S, X, Z as 2; D, T as 3; L as 4; M, N as 5; R as 6; remove vowels
  • Concatenate the first letter of the string with the first 3 digits, padding with zeros (sketch below)
  • Example: great and grate become G6EA3 and G6A3E, and then both G630
  • More recent: Metaphone, Double Metaphone, etc.
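
A sketch of the recipe above; collapsing adjacent equal digits and treating H/W as transparent between consonants follow common Soundex practice, and zero-padding yields four-character codes:

```python
CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
         **dict.fromkeys("DT", "3"), "L": "4",
         **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name):
    """First letter + consonant digits; vowels dropped, runs of the same
    digit collapsed (H and W do not break a run), zero-padded to length 4."""
    name = name.upper()
    first = name[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in name[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("great"), soundex("grate"))  # G630 G630 -- identical codes
```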

  17. Learning similarity functions
  • Per attribute: term-based (vector space) or edit-based
  • Edit-based: learning the constants in character-level distance measures like Levenshtein distance
  • Useful for short strings with systematic errors (e.g., OCR output) or domain-specific variation (e.g., 'st.' vs 'street')
  • Multi-attribute records: useful when the relative importance of a match along different attributes is highly domain-dependent
  • Example: a comparison-shopping website. A match on title is more indicative for books than for electronics; a difference in price is less indicative for books than for electronics

  18. Machine Learning approach
  Given examples of duplicate and non-duplicate pairs, learn to predict whether a pair is a duplicate or not.
  Input features:
  • Various kinds of similarity functions between attributes
  • Edit distance, Soundex, n-grams on text attributes
  • Absolute difference on numeric attributes
  • These capture domain-specific knowledge on comparing data

  19. The learning approach
  (Figure: example labeled pairs, e.g., Record 1-Record 2 marked D and Record 4-Record 5 marked D, Record 1-Record 3 marked N, are mapped through similarity functions f1, f2, …, fn into labeled feature vectors; a classifier, here a decision tree whose nodes test features such as All-Ngrams at 0.48, AuthorTitleNgrams at 0.4, AuthorEditDist at 0.8, PageMatch at 0.5, TitleIsNull < 1, and YearDifference > 1, is trained on them and then applied to the feature vectors of unlabeled pairs, Records 6-11, to label each pair Duplicate or Non-Duplicate.)
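
A minimal sketch of the pipeline in the figure, assuming scikit-learn for the classifier (the ALIAS system on slide 22 uses its own engine); the feature values echo the figure's tables, and each row is one record pair described by its similarity features:

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[1.0, 0.4, 0.2],   # similarity features f1..fn for a labeled pair
           [0.0, 0.1, 0.3],
           [0.3, 0.4, 0.4]]
y_train = [1, 0, 1]           # 1 = duplicate, 0 = non-duplicate

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

X_unlabeled = [[0.7, 0.1, 0.6], [0.0, 0.1, 0.1]]
print(clf.predict(X_unlabeled))  # predicted labels for the unlabeled pairs
```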

  20. Experiences with the learning approach
  • Too much manual search in preparing training data
  • Hard to spot challenging and covering sets of duplicates in large lists
  • Even harder to find close non-duplicates that capture the nuances: examine instances that are similar on one attribute but dissimilar on another
  • Active learning is a generalization of this!

  21. The active learning approach
  (Figure: a small set of example labeled pairs, mapped through similarity functions f1, f2, …, fn, trains a classifier; an active learner selects informative pairs from the unlabeled list, Records 6-11, for the user to label, and the new labels are added to the training set.)

  22. The ALIAS deduplication system
  • Interactive discovery of the deduplication function using active learning
  • Efficient active learning on large lists using novel indexing mechanisms
  • Efficient application of the learnt function on large lists using a novel cluster-based evaluation engine and a cost-based optimizer

  23. Experimental analysis
  • 250 references from Citeseer, giving 32,000 pairs of which only 150 are duplicates
  • Citeseer's script used to segment references into author, title, year, page, and rest
  • 20 text and integer similarity functions
  • Average of 20 runs
  • Default classifier: decision tree
  • Initial labeled set: just two pairs

  24. Benefits of active learning
  • Active learning is much better than random selection: with only 100 actively selected instances, 97% accuracy versus only 30% for random
  • Committee-based selection is close to optimal

  25. Analyzing selected instances
  • Fraction of duplicates among selected instances: 44%, starting from only 0.5% in the data
  • Is the gain merely due to the increased fraction of duplicates? Replacing the non-duplicates in the selected set with random non-duplicates drops accuracy to only 40%, so no: the specific instances chosen matter

  26. Finding all duplicate pairs in large lists
  • Input: a large list of records R with string attributes
  • Output: all pairs (S,T) of records in R that satisfy a similarity criterion, e.g.:
  • Jaccard(S,T) > 0.7
  • Overlapping tokens(S,T) > 5
  • TF-IDF cosine(S,T) > 0.8
  • Edit distance(S,T) < k
  • More complicated similarity functions use these as filters (high recall, low precision)
  • Naive method: compute the similarity score for each record pair. I/O- and CPU-intensive; not scalable to millions of records
  • Goal: reduce the O(n²) cost to O(n·w), where w << n, by reducing the number of pairs on which similarity is computed

  27. General template for similarity functions
  (Figure: each measure is expressed in terms of two sets r and s, their common tokens, and a threshold.)

  28. Approximating edit distance [GIJ+01]
  • Count filter: EditDistance(s,t) ≤ d implies |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) − (d−1)·q − 1 (sketch below)
  • q-grams: sequences of q characters of a field
  • 'AT&T Corporation', 3-grams: {'AT&', 'T&T', '&T ', 'T C', ' Co', 'Cor', 'orp', 'rpo', 'por', 'ora', 'rat', 'ati', 'tio', 'ion'}
  • Typically q = 3, which yields large q-gram sets
  • Approximate large q-gram sets by smaller sets
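
A sketch of q-gram extraction and the count filter above. The padding convention (so a length-L string yields L + q − 1 q-grams) matches the bound from [GIJ+01]; using sets rather than bags of q-grams is a simplification:

```python
def qgrams(s, q=3):
    """q-grams with padding so a string of length L yields L + q - 1 grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def passes_count_filter(s, t, d, q=3):
    """Cheap filter: if edit distance <= d, the strings must share at least
    max(|s|,|t|) - (d-1)*q - 1 q-grams; pairs failing this bound need not
    be compared with the expensive dynamic program."""
    common = len(qgrams(s, q) & qgrams(t, q))
    return common >= max(len(s), len(t)) - (d - 1) * q - 1

print(passes_count_filter("AT&T Corporation", "AT&T Corporation", d=1))  # True
print(passes_count_filter("AT&T Corporation", "IBM", d=1))               # False
```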

  29. Pair-Count
  (Figure: data records feed an inverted index of per-token lists t1, t2, …; self-joining the lists populates a pair-count table.)
  • Step 1: pass over the data; for each token, create the list of sets that contain it (an inverted index)
  • Step 2: self-join each list to generate pairs of sets, count pair occurrences, and output pairs with count > T (sketch below)
  • The pair-counting table can be large, making this too memory-intensive
  • Not good when list lengths are highly skewed (Broder et al., WWW 1997)
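
A sketch of Pair-Count as described above; the in-memory pair-counting dictionary is exactly the structure the slide warns can grow too large:

```python
from collections import defaultdict
from itertools import combinations

def pair_count(records, T):
    """Pair-Count sketch: build an inverted index over tokens, self-join
    each token's list, and keep record pairs sharing more than T tokens.
    records: dict mapping record id -> token set."""
    inverted = defaultdict(list)      # token -> list of record ids
    for rid, tokens in records.items():
        for tok in tokens:
            inverted[tok].append(rid)
    counts = defaultdict(int)         # (rid1, rid2) -> common-token count
    for rids in inverted.values():
        for r1, r2 in combinations(sorted(rids), 2):
            counts[(r1, r2)] += 1
    return [pair for pair, c in counts.items() if c > T]

records = {1: {"at&t", "corp"}, 2: {"at&t", "corporation"}, 3: {"ibm", "corp"}}
print(pair_count(records, T=0))  # pairs sharing at least one token
```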

  30. Probe-Count
  (Figure: a record's tokens w1, w2, …, wk probe the inverted index; the retrieved lists are merged through a heap.)
  • Step 1: create an inverted index
  • Step 2: for each record, probe the inverted lists of its tokens and merge them to find the record ids present in at least T of the lists

  31. Threshold-sensitive list merge
  (Figure: the lists to be merged, sorted by increasing size; all but the T−1 largest are organized in a heap, with T = 3.)
  • Sort the lists by increasing size; except for the T−1 largest, organize the rest in a heap
  • Pop from the heap successively; search in the large lists in increasing order of size
  • Use lower bounds to terminate early [SK04, CGK06]
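
A simplified sketch of the Probe-Count merge using a plain heap over all lists; the threshold-sensitive refinements of [SK04, CGK06] (keeping only the small lists in the heap, early termination via lower bounds) are omitted for clarity:

```python
import heapq

def probe_count(query_tokens, inverted, T):
    """Merge the sorted inverted lists of a record's tokens with a heap and
    report record ids that appear in at least T of the lists."""
    lists = [inverted[tok] for tok in query_tokens if inverted.get(tok)]
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists)]  # (rid, list, pos)
    heapq.heapify(heap)
    results, cur, count = [], None, 0
    while heap:
        rid, i, pos = heapq.heappop(heap)
        if rid == cur:
            count += 1
        else:
            if cur is not None and count >= T:
                results.append(cur)
            cur, count = rid, 1
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if cur is not None and count >= T:
        results.append(cur)
    return results

# inverted index from the Pair-Count sketch; find rids sharing >= 2 tokens
inverted = {"at&t": [1, 2], "corp": [1, 3], "corporation": [2]}
print(probe_count({"at&t", "corp", "corporation"}, inverted, T=2))  # [1, 2]
```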

  32. Summary of the pair-creation step
  • Can be extended to the weighted case fitting the general framework
  • More complicated similarity functions use set-similarity functions as filters
  • Set sizes can be reduced through techniques like MinHash (weighted versions also exist)
  • Small sets (average set size < 20), most database entities with word tokens: use as-is
  • Large sets (web documents, sets of q-grams): use MinHash or random projection

  33. Creating partitions
  (Figure: two partitionings of an example graph over records 1-10; the correlation-clustering partition has 3 disagreements.)
  • Transitive closure (sketch below). Danger: unrelated records collapse into a single cluster
  • Correlation clustering (Bansal et al., 2002): partition to minimize total disagreements, i.e., edges across partitions plus missing edges within partitions
  • More appealing than ordinary clustering: no magic constants (number of clusters, similarity thresholds, diameter, etc.), and it extends to real-valued scores
  • NP-hard: many approximation algorithms exist
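
A union-find sketch of the transitive-closure option above; the example shows how pairwise duplicate judgments chain into clusters, which is also exactly the collapsing danger the slide warns about:

```python
def transitive_closure(n_records, duplicate_pairs):
    """Union-find partitioning: every pair judged duplicate merges its two
    clusters, so duplicate-of is made transitive."""
    parent = list(range(n_records))

    def find(x):
        # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in range(n_records):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())

print(transitive_closure(6, [(0, 1), (1, 2), (4, 5)]))
# [[0, 1, 2], [3], [4, 5]] -- records 0 and 2 merge without being compared
```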

  34. Empirical results on data partitioning
  (Figure: results on digital cameras, camcorders, and luggage; from Bilenko et al., 2005.)
  • Setup: online comparison shopping
  • Fields: name, model, description, price
  • Learner: online perceptron
  • Complete-link clustering performs much better than single-link clustering (transitive closure)
  • An open issue: when to stop merging clusters

  35. References
  • [CGGM04] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD Conference 2003: 313-324
  • [CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data Cleaning in Microsoft SQL Server 2005. SIGMOD Conference 2005: 918-920
  • [CGK06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006
  • [CRF03] William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003: 73-78
  • [J89] M. A. Jaro: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-420, 1989
  • [ME97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997
  • [RY97] E. Ristad, P. Yianilos: Learning String Edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
  • [SK04] Sunita Sarawagi, Alok Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD Conference 2004: 743-754
  • [W99] William E. Winkler: The State of Record Linkage and Current Research Problems. IRS publication R99/04 (http://www.census.gov/srd/www/byname.html)
  • [Y02] William E. Yancey: BigMatch: A Program for Extracting Probable Matches from a Large File for Record Linkage. RRC 2002-01, Statistical Research Division, U.S. Bureau of the Census
