Efficient record linkage in large data sets
Download
1 / 32

Efficient Record Linkage in Large Data Sets - PowerPoint PPT Presentation


  • 261 Views
  • Uploaded on

Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li , Sharad Mehrotra University of California, Irvine. DASFAA, Kyoto, Japan, March 2003. Motivation . Correlate data from different data sources (e.g., data integration) Data is often dirty

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Efficient Record Linkage in Large Data Sets' - ivanbritt


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Efficient record linkage in large data sets l.jpg

Efficient Record Linkage in Large Data Sets

Liang Jin, Chen Li, Sharad Mehrotra

University of California, Irvine

DASFAA, Kyoto, Japan, March 2003


Motivation l.jpg
Motivation

  • Correlate data from different data sources (e.g., data integration)

    • Data is often dirty

    • Needs to be cleansed before being used

  • Example:

    • A hospital needs to merge patient records from different data sources

    • They have different formats, typos, and abbreviations


Example l.jpg
Example

Table R

Table S

  • Find records from different datasets that could be the same entity


Another example l.jpg
Another Example

  • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)

  • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981


Record linkage l.jpg
Record linkage

Problem statement:

“Given two relations, identify the potentially matched records

  • Efficiently and

  • Effectively”


Challenges l.jpg
Challenges

  • How to define good similarity functions?

    • Many functions proposed (edit distance, cosine similarity, …)

    • Domain knowledge is critical

      • Names: “Wall Street Journal” and “LA Times”

      • Address: “Main Street” versus “Main St”

  • How to do matching efficiently

    • Offline join version

    • Online (interactive) search

      • Nearest search

      • Range search


Outline l.jpg
Outline

  • Motivation of record linkage

  • Single-attribute case: two-step approach

  • Multi-attribute linkage

  • Conclusion and related work


Single attribute case l.jpg
Single-attribute Case

  • Given

    • two sets of strings, R and S

    • a similarity function f between strings (metric space)

      • Reflexive: f(s1,s2) = 0 iff s1=s2

      • Symmetric: f(s1,s2) = d(s2, s1)

      • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3)

    • a threshold k

  • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.

R

S


Nested loop l.jpg
Nested-loop?

  • Not desirable for large data sets

  • 5 hours for 30K strings!


Our 2 step approach l.jpg
Our 2-step approach

  • Step 1: map strings (in a metric space) to objects in a Euclidean space

  • Step 2: do a similarity join in the Euclidean space


Advantages l.jpg
Advantages

  • Applicable to many metric similarity functions

    • Use edit distance as an example

    • Other similarity functions also tried, e.g., q-gram-based similarity

  • Open to existing algorithms

    • Mapping techniques

    • Join techniques


Step 1 l.jpg
Step 1

Map strings into a high-dimensional Euclidean space

Metric Space

Euclidean Space


Example edit distance l.jpg
Example: Edit Distance

  • A widely used metric to define string similarity

  • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2

  • Example:

    s1: Tom Hanks

    s2: Ton Hank

    ed(s1,s2) = 2


Mapping stringmap l.jpg
Mapping: StringMap

  • Input: A list of strings

  • Output: Points in a high-dimensional Euclidean space that preserve the original distances well

  • A variation of FastMap

    • Each step greedily picks two strings (pivots) to form an axis

    • All axes are orthogonal


Can it preserve distances l.jpg
Can it preserve distances?

  • Data Sources:

    • IMDB star names: 54,000

    • German names: 132,000

  • Distribution of string lengths:


Can it preserve distances16 l.jpg
Can it preserve distances?

  • Use data set 1 (54K names) as an example

  • k=2, d=20

    • Use k’=5.2 to differentiate similar and dissimilar pairs.


Choose dimensionality d l.jpg
Choose Dimensionality d

Increase d?

  • Good :

    • better to differentiate similar pairs from dissimilar ones.

  • Bad :

    • Step 1: Efficiency ↓

    • Step 2: “curse of dimensionality”


Choose dimensionality d using sampling l.jpg

# of pairs within distance w

Cost=

# of similar pairs

Choose dimensionality d using sampling

  • Sample 1Kx1K strings, find their similar pairs (within distance k)

  • Calculate maximum of their new distances w

  • Define “Cost” of finding a similar pair:



Choose new threshold k l.jpg
Choose new threshold k’

  • Closely related to the mapping property

  • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.

  • Choose k’ using sampling

    • Sample 1Kx1K strings, find similar pairs

    • Calculate their maximum new distance as k’

    • repeat multiple times, choose their maximum



Step 2 similarity join l.jpg
Step 2: Similarity Join

  • Input: Two sets of points in Euclidean space.

  • Output: Pairs of two points whose distance is less than new threshold k’.

  • Many join algorithms can be used


Example23 l.jpg
Example

  • Adopted an algorithm by Hjaltason and Samet.

    • Building two R-Trees.

    • Traverse two trees, find points whose distance is within k’.

    • Pruning during traversal (e.g., using MinDist).


Final processing l.jpg
Final processing

  • Among the pairs produced from the similarity-join step, check their edit distance.

  • Return those pairs satisfying the threshold k



Recall l.jpg
Recall

  • Recall: (#of found similar pairs)/(#of all similar pairs)


Outline27 l.jpg
Outline

  • Motivation of record linkage

  • Single-attribute case: two-step approach

  • Multi-attribute linkage

  • Conclusion and related work


Multi attribute linkage l.jpg
Multi-attribute linkage

  • Example: title + name + year

  • Different attributes have different similarity functions and thresholds

  • Consider merge rules in disjunctive format:


Evaluation strategies l.jpg
Evaluation strategies

  • Many ways to evaluate rules

  • Finding an optimal one: NP-hard

  • Heuristics:

    • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.

    • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.


Summary l.jpg
Summary

  • A novel two-step approach to record linkage.

  • Many existing mapping and join algorithms can be adopted

  • Applicable to many distance metrics.

  • Time and space efficient.

  • Multi-attribute case studied


Related work l.jpg
Related work

  • Learning similarity functions: [Sarawagi and Bhamidipaty, 2003]

  • Efficient merge and purge: [Hernandez and Stolfo, 1995]

  • String edit-distance join using DBMS: [Gravano et al, 2001]



ad