320 likes | 429 Views
Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li , Sharad Mehrotra University of California, Irvine. DASFAA, Kyoto, Japan, March 2003. Motivation . Correlate data from different data sources (e.g., data integration) Data is often dirty
E N D
Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003
Motivation • Correlate data from different data sources (e.g., data integration) • Data is often dirty • Needs to be cleansed before being used • Example: • A hospital needs to merge patient records from different data sources • They have different formats, typos, and abbreviations
Example Table R Table S • Find records from different datasets that could be the same entity
Another Example • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
Record linkage Problem statement: “Given two relations, identify the potentially matched records • Efficiently and • Effectively”
Challenges • How to define good similarity functions? • Many functions proposed (edit distance, cosine similarity, …) • Domain knowledge is critical • Names: “Wall Street Journal” and “LA Times” • Address: “Main Street” versus “Main St” • How to do matching efficiently • Offline join version • Online (interactive) search • Nearest search • Range search
Outline • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work
Single-attribute Case • Given • two sets of strings, R and S • a similarity function f between strings (metric space) • Reflexive: f(s1,s2) = 0 iff s1=s2 • Symmetric: f(s1,s2) = d(s2, s1) • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) • a threshold k • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k. R S
Nested-loop? • Not desirable for large data sets • 5 hours for 30K strings!
Our 2-step approach • Step 1: map strings (in a metric space) to objects in a Euclidean space • Step 2: do a similarity join in the Euclidean space
Advantages • Applicable to many metric similarity functions • Use edit distance as an example • Other similarity functions also tried, e.g., q-gram-based similarity • Open to existing algorithms • Mapping techniques • Join techniques
Step 1 Map strings into a high-dimensional Euclidean space Metric Space Euclidean Space
Example: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2
Mapping: StringMap • Input: A list of strings • Output: Points in a high-dimensional Euclidean space that preserve the original distances well • A variation of FastMap • Each step greedily picks two strings (pivots) to form an axis • All axes are orthogonal
Can it preserve distances? • Data Sources: • IMDB star names: 54,000 • German names: 132,000 • Distribution of string lengths:
Can it preserve distances? • Use data set 1 (54K names) as an example • k=2, d=20 • Use k’=5.2 to differentiate similar and dissimilar pairs.
Choose Dimensionality d Increase d? • Good : • better to differentiate similar pairs from dissimilar ones. • Bad : • Step 1: Efficiency ↓ • Step 2: “curse of dimensionality”
# of pairs within distance w Cost= # of similar pairs Choose dimensionality d using sampling • Sample 1Kx1K strings, find their similar pairs (within distance k) • Calculate maximum of their new distances w • Define “Cost” of finding a similar pair:
Choose Dimensionality d d=15 ~ 25
Choose new threshold k’ • Closely related to the mapping property • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. • Choose k’ using sampling • Sample 1Kx1K strings, find similar pairs • Calculate their maximum new distance as k’ • repeat multiple times, choose their maximum
Step 2: Similarity Join • Input: Two sets of points in Euclidean space. • Output: Pairs of two points whose distance is less than new threshold k’. • Many join algorithms can be used
Example • Adopted an algorithm by Hjaltason and Samet. • Building two R-Trees. • Traverse two trees, find points whose distance is within k’. • Pruning during traversal (e.g., using MinDist).
Final processing • Among the pairs produced from the similarity-join step, check their edit distance. • Return those pairs satisfying the threshold k
Recall • Recall: (#of found similar pairs)/(#of all similar pairs)
Outline • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work
Multi-attribute linkage • Example: title + name + year • Different attributes have different similarity functions and thresholds • Consider merge rules in disjunctive format:
Evaluation strategies • Many ways to evaluate rules • Finding an optimal one: NP-hard • Heuristics: • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.
Summary • A novel two-step approach to record linkage. • Many existing mapping and join algorithms can be adopted • Applicable to many distance metrics. • Time and space efficient. • Multi-attribute case studied
Related work • Learning similarity functions: [Sarawagi and Bhamidipaty, 2003] • Efficient merge and purge: [Hernandez and Stolfo, 1995] • String edit-distance join using DBMS: [Gravano et al, 2001]