efficient record linkage in large data sets l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Efficient Record Linkage in Large Data Sets PowerPoint Presentation
Download Presentation
Efficient Record Linkage in Large Data Sets

Loading in 2 Seconds...

play fullscreen
1 / 32

Efficient Record Linkage in Large Data Sets - PowerPoint PPT Presentation


  • 281 Views
  • Uploaded on

Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li , Sharad Mehrotra University of California, Irvine. DASFAA, Kyoto, Japan, March 2003. Motivation . Correlate data from different data sources (e.g., data integration) Data is often dirty

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Efficient Record Linkage in Large Data Sets


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
    Presentation Transcript
    1. Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003

    2. Motivation • Correlate data from different data sources (e.g., data integration) • Data is often dirty • Needs to be cleansed before being used • Example: • A hospital needs to merge patient records from different data sources • They have different formats, typos, and abbreviations

    3. Example Table R Table S • Find records from different datasets that could be the same entity

    4. Another Example • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

    5. Record linkage Problem statement: “Given two relations, identify the potentially matched records • Efficiently and • Effectively”

    6. Challenges • How to define good similarity functions? • Many functions proposed (edit distance, cosine similarity, …) • Domain knowledge is critical • Names: “Wall Street Journal” and “LA Times” • Address: “Main Street” versus “Main St” • How to do matching efficiently • Offline join version • Online (interactive) search • Nearest search • Range search

    7. Outline • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work

    8. Single-attribute Case • Given • two sets of strings, R and S • a similarity function f between strings (metric space) • Reflexive: f(s1,s2) = 0 iff s1=s2 • Symmetric: f(s1,s2) = d(s2, s1) • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) • a threshold k • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k. R S

    9. Nested-loop? • Not desirable for large data sets • 5 hours for 30K strings!

    10. Our 2-step approach • Step 1: map strings (in a metric space) to objects in a Euclidean space • Step 2: do a similarity join in the Euclidean space

    11. Advantages • Applicable to many metric similarity functions • Use edit distance as an example • Other similarity functions also tried, e.g., q-gram-based similarity • Open to existing algorithms • Mapping techniques • Join techniques

    12. Step 1 Map strings into a high-dimensional Euclidean space Metric Space Euclidean Space

    13. Example: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2

    14. Mapping: StringMap • Input: A list of strings • Output: Points in a high-dimensional Euclidean space that preserve the original distances well • A variation of FastMap • Each step greedily picks two strings (pivots) to form an axis • All axes are orthogonal

    15. Can it preserve distances? • Data Sources: • IMDB star names: 54,000 • German names: 132,000 • Distribution of string lengths:

    16. Can it preserve distances? • Use data set 1 (54K names) as an example • k=2, d=20 • Use k’=5.2 to differentiate similar and dissimilar pairs.

    17. Choose Dimensionality d Increase d? • Good : • better to differentiate similar pairs from dissimilar ones. • Bad : • Step 1: Efficiency ↓ • Step 2: “curse of dimensionality”

    18. # of pairs within distance w Cost= # of similar pairs Choose dimensionality d using sampling • Sample 1Kx1K strings, find their similar pairs (within distance k) • Calculate maximum of their new distances w • Define “Cost” of finding a similar pair:

    19. Choose Dimensionality d d=15 ~ 25

    20. Choose new threshold k’ • Closely related to the mapping property • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. • Choose k’ using sampling • Sample 1Kx1K strings, find similar pairs • Calculate their maximum new distance as k’ • repeat multiple times, choose their maximum

    21. New threshold k’ in step 2 d=20

    22. Step 2: Similarity Join • Input: Two sets of points in Euclidean space. • Output: Pairs of two points whose distance is less than new threshold k’. • Many join algorithms can be used

    23. Example • Adopted an algorithm by Hjaltason and Samet. • Building two R-Trees. • Traverse two trees, find points whose distance is within k’. • Pruning during traversal (e.g., using MinDist).

    24. Final processing • Among the pairs produced from the similarity-join step, check their edit distance. • Return those pairs satisfying the threshold k

    25. Running time

    26. Recall • Recall: (#of found similar pairs)/(#of all similar pairs)

    27. Outline • Motivation of record linkage • Single-attribute case: two-step approach • Multi-attribute linkage • Conclusion and related work

    28. Multi-attribute linkage • Example: title + name + year • Different attributes have different similarity functions and thresholds • Consider merge rules in disjunctive format:

    29. Evaluation strategies • Many ways to evaluate rules • Finding an optimal one: NP-hard • Heuristics: • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

    30. Summary • A novel two-step approach to record linkage. • Many existing mapping and join algorithms can be adopted • Applicable to many distance metrics. • Time and space efficient. • Multi-attribute case studied

    31. Related work • Learning similarity functions: [Sarawagi and Bhamidipaty, 2003] • Efficient merge and purge: [Hernandez and Stolfo, 1995] • String edit-distance join using DBMS: [Gravano et al, 2001]

    32. The Flamingo Project on Data Cleansing