**Efficient Record Linkage in Large Data Sets** Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003

**Motivation ** • Correlate data from different data sources (e.g., data integration) • Data is often dirty • Needs to be cleansed before being used • Example: • A hospital needs to merge patient records from different data sources • They have different formats, typos, and abbreviations

**Example** Table R Table S • Find records from different datasets that could be the same entity

**Another Example** • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

**Record linkage ** Problem statement: “Given two relations, identify the potentially matched records • Efficiently and • Effectively”

**Challenges ** • How to define good similarity functions? • Many functions proposed (edit distance, cosine similarity, …) • Domain knowledge is critical • Names: “Wall Street Journal” and “LA Times” • Address: “Main Street” versus “Main St” • How to do matching efficiently • Offline join version • Online (interactive) search • Nearest search • Range search

**Outline** • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work

**Single-attribute Case** • Given • two sets of strings, R and S • a similarity function f between strings (metric space) • Reflexive: f(s1,s2) = 0 iff s1=s2 • Symmetric: f(s1,s2) = d(s2, s1) • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) • a threshold k • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k. R S

**Nested-loop?** • Not desirable for large data sets • 5 hours for 30K strings!

**Our 2-step approach** • Step 1: map strings (in a metric space) to objects in a Euclidean space • Step 2: do a similarity join in the Euclidean space

**Advantages** • Applicable to many metric similarity functions • Use edit distance as an example • Other similarity functions also tried, e.g., q-gram-based similarity • Open to existing algorithms • Mapping techniques • Join techniques

**Step 1** Map strings into a high-dimensional Euclidean space Metric Space Euclidean Space

**Example: Edit Distance** • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2

**Mapping: StringMap** • Input: A list of strings • Output: Points in a high-dimensional Euclidean space that preserve the original distances well • A variation of FastMap • Each step greedily picks two strings (pivots) to form an axis • All axes are orthogonal

**Can it preserve distances?** • Data Sources: • IMDB star names: 54,000 • German names: 132,000 • Distribution of string lengths:

**Can it preserve distances?** • Use data set 1 (54K names) as an example • k=2, d=20 • Use k’=5.2 to differentiate similar and dissimilar pairs.

**Choose Dimensionality d** Increase d? • Good : • better to differentiate similar pairs from dissimilar ones. • Bad : • Step 1: Efficiency ↓ • Step 2: “curse of dimensionality”

**# of pairs within distance w** Cost= # of similar pairs Choose dimensionality d using sampling • Sample 1Kx1K strings, find their similar pairs (within distance k) • Calculate maximum of their new distances w • Define “Cost” of finding a similar pair:

**Choose Dimensionality d** d=15 ~ 25

**Choose new threshold k’ ** • Closely related to the mapping property • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. • Choose k’ using sampling • Sample 1Kx1K strings, find similar pairs • Calculate their maximum new distance as k’ • repeat multiple times, choose their maximum

**New threshold k’ in step 2** d=20

**Step 2: Similarity Join** • Input: Two sets of points in Euclidean space. • Output: Pairs of two points whose distance is less than new threshold k’. • Many join algorithms can be used

**Example** • Adopted an algorithm by Hjaltason and Samet. • Building two R-Trees. • Traverse two trees, find points whose distance is within k’. • Pruning during traversal (e.g., using MinDist).

**Final processing** • Among the pairs produced from the similarity-join step, check their edit distance. • Return those pairs satisfying the threshold k

**Running time**

**Recall** • Recall: (#of found similar pairs)/(#of all similar pairs)

**Outline** • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work

**Multi-attribute linkage** • Example: title + name + year • Different attributes have different similarity functions and thresholds • Consider merge rules in disjunctive format:

**Evaluation strategies** • Many ways to evaluate rules • Finding an optimal one: NP-hard • Heuristics: • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

**Summary** • A novel two-step approach to record linkage. • Many existing mapping and join algorithms can be adopted • Applicable to many distance metrics. • Time and space efficient. • Multi-attribute case studied

**Related work** • Learning similarity functions: [Sarawagi and Bhamidipaty, 2003] • Efficient merge and purge: [Hernandez and Stolfo, 1995] • String edit-distance join using DBMS: [Gravano et al, 2001]

**The Flamingo Project on Data Cleansing**