Loading in 5 sec....

Efficient Record Linkage in Large Data SetsPowerPoint Presentation

Efficient Record Linkage in Large Data Sets

- 262 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Efficient Record Linkage in Large Data Sets' - ivanbritt

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Efficient Record Linkage in Large Data Sets

Liang Jin, Chen Li, Sharad Mehrotra

University of California, Irvine

DASFAA, Kyoto, Japan, March 2003

Motivation

- Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- Needs to be cleansed before being used

- Example:
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations

Another Example

- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Record linkage

Problem statement:

“Given two relations, identify the potentially matched records

- Efficiently and
- Effectively”

Challenges

- How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, …)
- Domain knowledge is critical
- Names: “Wall Street Journal” and “LA Times”
- Address: “Main Street” versus “Main St”

- How to do matching efficiently
- Offline join version
- Online (interactive) search
- Nearest search
- Range search

Outline

- Motivation of record linkage
- Single-attribute case: two-step approach
- Multi-attribute linkage
- Conclusion and related work

Single-attribute Case

- Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
- Reflexive: f(s1,s2) = 0 iff s1=s2
- Symmetric: f(s1,s2) = d(s2, s1)
- Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3)

- a threshold k

- Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.

R

S

Nested-loop?

- Not desirable for large data sets
- 5 hours for 30K strings!

Our 2-step approach

- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space

Advantages

- Applicable to many metric similarity functions
- Use edit distance as an example
- Other similarity functions also tried, e.g., q-gram-based similarity

- Open to existing algorithms
- Mapping techniques
- Join techniques

Example: Edit Distance

- A widely used metric to define string similarity
- Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example:
s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

Mapping: StringMap

- Input: A list of strings
- Output: Points in a high-dimensional Euclidean space that preserve the original distances well
- A variation of FastMap
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal

Can it preserve distances?

- Data Sources:
- IMDB star names: 54,000
- German names: 132,000

- Distribution of string lengths:

Can it preserve distances?

- Use data set 1 (54K names) as an example
- k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.

Choose Dimensionality d

Increase d?

- Good :
- better to differentiate similar pairs from dissimilar ones.

- Bad :
- Step 1: Efficiency ↓
- Step 2: “curse of dimensionality”

Cost=

# of similar pairs

Choose dimensionality d using sampling- Sample 1Kx1K strings, find their similar pairs (within distance k)
- Calculate maximum of their new distances w
- Define “Cost” of finding a similar pair:

Choose Dimensionality d

d=15 ~ 25

Choose new threshold k’

- Closely related to the mapping property
- Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.
- Choose k’ using sampling
- Sample 1Kx1K strings, find similar pairs
- Calculate their maximum new distance as k’
- repeat multiple times, choose their maximum

Step 2: Similarity Join

- Input: Two sets of points in Euclidean space.
- Output: Pairs of two points whose distance is less than new threshold k’.
- Many join algorithms can be used

Example

- Adopted an algorithm by Hjaltason and Samet.
- Building two R-Trees.
- Traverse two trees, find points whose distance is within k’.
- Pruning during traversal (e.g., using MinDist).

Final processing

- Among the pairs produced from the similarity-join step, check their edit distance.
- Return those pairs satisfying the threshold k

Recall

- Recall: (#of found similar pairs)/(#of all similar pairs)

Outline

- Motivation of record linkage
- Single-attribute case: two-step approach
- Multi-attribute linkage
- Conclusion and related work

Multi-attribute linkage

- Example: title + name + year
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format:

Evaluation strategies

- Many ways to evaluate rules
- Finding an optimal one: NP-hard
- Heuristics:
- Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
- Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

Summary

- A novel two-step approach to record linkage.
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics.
- Time and space efficient.
- Multi-attribute case studied

Related work

- Learning similarity functions: [Sarawagi and Bhamidipaty, 2003]
- Efficient merge and purge: [Hernandez and Stolfo, 1995]
- String edit-distance join using DBMS: [Gravano et al, 2001]

Download Presentation

Connecting to Server..