Searching and Integrating Information on the Web

Seminar 3: Data Cleansing

Professor Chen Li

UC Irvine


Paper readings

  • Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995

  • Approximate String Joins in a Database (Almost) for Free: Gravano et al., VLDB 2001

  • Efficient Record Linkage in Large Data Sets: Jin, Li, and Mehrotra, DASFAA 2003

  • Interactive Deduplication Using Active Learning: Sarawagi and Bhamidipaty, KDD 2002


Motivation

  • Correlate data from different data sources (e.g., data integration)

    • Data is often dirty

    • Needs to be cleansed before being used

  • Example:

    • A hospital needs to merge patient records from different data sources

    • They have different formats, typos, and abbreviations


Example

(Sample tables R and S omitted.)

  • Find records from different datasets that could be the same entity


Another Example

  • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)

  • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981


Record linkage

Problem statement:

“Given two relations, identify the potentially matched records

  • Efficiently and

  • Effectively”


Challenges

  • How to define good similarity functions?

    • Many functions proposed (edit distance, cosine similarity, …)

    • Domain knowledge is critical

      • Names: “Wall Street Journal” and “LA Times”

      • Address: “Main Street” versus “Main St”

  • How to do matching efficiently

    • Offline join version

    • Online (interactive) search

      • Nearest search

      • Range search


Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication


Edit Distance

  • A widely used metric to define string similarity

  • ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) needed to change s1 into s2

  • Example:

    s1: Tom Hanks

    s2: Ton Hank

    ed(s1,s2) = 2
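
    As a reference point, a minimal dynamic-programming sketch of this definition (standard Levenshtein distance, space-optimized to one row):

    def edit_distance(s1, s2):
        """Minimum number of insertions, deletions, and substitutions
        needed to change s1 into s2 (Levenshtein distance)."""
        m, n = len(s1), len(s2)
        dp = list(range(n + 1))          # distances from s1[:0] to s2[:j]
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i       # prev holds the (i-1, j-1) value
            for j in range(1, n + 1):
                cur = dp[j]
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                dp[j] = min(dp[j] + 1,        # deletion
                            dp[j - 1] + 1,    # insertion
                            prev + cost)      # match / substitution
                prev = cur
        return dp[n]

    print(edit_distance("Tom Hanks", "Ton Hank"))   # 2, as on the slide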


Approximate String Joins

  • We want to join tuples with “similar” string fields

  • Similarity measure: Edit Distance

  • Each Insertion, Deletion, Replacement increases distance by one


Focus: Approximate String Joins over Relational DBMSs

  • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K

  • Solve the problem in a database-friendly way

    (if possible with an existing "vanilla" RDBMS)


Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:

  • Join data outside of DBMS

  • Join data via user-defined functions (UDFs) inside the DBMS


Approximate String Joins outside of a DBMS

  • Export data

  • Join outside of DBMS

  • Import the result

    Main advantage:

    We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

    Disadvantages:

  • Substantial amounts of data to be exported/imported

  • Cannot be easily integrated with further processing steps in the DBMS


Approximate String Joins with UDFs

  • Write a UDF to check if two strings match within distance K

  • Write an SQL statement that applies the UDF to the string pairs

    SELECT R.stringAttr, S.stringAttr

    FROM R, S

    WHERE edit_distance(R.stringAttr, S.stringAttr, K)

    Main advantage:

    Ease of implementation

    Main disadvantage:

    UDF applied to entire cross-product of relations


Our Approach: Approximate String Joins over an Unmodified RDBMS

  • Preprocess data and generate auxiliary tables

  • Perform join exploiting standard RDBMS capabilities

    Advantages

  • No modification of underlying RDBMS needed.

  • Can leverage the RDBMS query optimizer.

  • Much more efficient than the approach based on naive UDFs


Intuition and Roadmap

  • Intuition:

    • Similar strings have many common substrings

    • Use exact joins to perform approximate joins (current DBMSs are good at exact joins)

    • A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

  • Roadmap:

    • Break strings into substrings of length q (q-grams)

    • Perform an exact join on the q-grams

    • Find candidate string pairs based on the results

    • Check only candidate pairs with a UDF to obtain final answer


What is a “Q-gram”?

  • Q-gram: A sequence of q characters of the original string

    Example for q=3

    vacations

    {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    String with length L → L + q - 1 q-grams

  • Similar strings have many common q-grams
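
    A small sketch of q-gram extraction with the # and $ padding shown above:

    def qgrams(s, q=3):
        """All q-grams of s, padded with # and $ so that a string of
        length L yields exactly L + q - 1 q-grams."""
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        return [padded[i:i + q] for i in range(len(s) + q - 1)]

    print(qgrams("vacations"))
    # ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']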


Q-grams and Edit Distance Operations

  • With no edits: L + q - 1 common q-grams

  • Replacement: (L + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}

  • Insertion: (Lmax + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}

  • Deletion: (Lmax + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}


Number of Common Q-grams and Edit Distance

  • For Edit Distance = K, there could be at most K replacements, insertions, deletions

  • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common

  • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
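
    For example, with q = 3 and K = 2, two strings of length 9 must share at least (9 + 3 - 1) - 2·3 = 5 q-grams to pass the filter.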


Using a DBMS for Q-gram Joins

  • If we have the q-grams in the DBMS, we can perform this counting efficiently.

  • Create auxiliary tables with tuples of the form:

    <sid, strlen, qgram>

    and join these tables

  • A GROUP BY – HAVING COUNT clause can perform the counting / filtering


Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q

The result is the set of string pairs with enough common q-grams to guarantee that no true match is missed (no false negatives).
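
    A runnable end-to-end sketch of this filter with SQLite (the table names RQ and SQ and the sample strings are ours; SQLite's two-argument max() is the scalar maximum used in the HAVING clause):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RQ (sid INTEGER, strlen INTEGER, qgram TEXT);
        CREATE TABLE SQ (sid INTEGER, strlen INTEGER, qgram TEXT);
    """)

    def load(table, strings, q=3):
        # Decompose each string into padded q-grams (see the earlier sketch)
        for sid, s in enumerate(strings):
            padded = "#" * (q - 1) + s + "$" * (q - 1)
            rows = [(sid, len(s), padded[i:i + q])
                    for i in range(len(s) + q - 1)]
            conn.executemany("INSERT INTO %s VALUES (?, ?, ?)" % table, rows)

    q, K = 3, 2
    load("RQ", ["vacations", "databases"])
    load("SQ", ["vacatlions", "database"])

    # COUNT FILTERING; strlen is constant per sid, so the bare column
    # reference in HAVING is safe in SQLite
    candidates = conn.execute("""
        SELECT R.sid, S.sid
        FROM RQ R, SQ S
        WHERE R.qgram = S.qgram
        GROUP BY R.sid, S.sid
        HAVING COUNT(*) >= max(R.strlen, S.strlen) + ? - 1 - ? * ?
    """, (q, K, q)).fetchall()
    print(candidates)   # candidate sid pairs, still to be verified with a UDF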


Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as LENGTH FILTERING


Exploiting Q-gram Positions for Filtering

  • Consider strings aabbzzaacczz and aacczzaabbzz

  • Strings are at edit distance 4

  • Strings have identical q-grams for q=3

    Problem: the matching q-grams sit at very different positions in the two strings

    • Either the matching q-grams do not "originate" from the same position, or

    • Too many edit operations "caused" spurious q-grams in different parts of the strings to match


POSITION FILTERING - Filtering using positions

  • Keep the position of the q-grams <sid, strlen, pos, qgram>

  • Do not match q-grams that are more than K positions away

    SELECT R.sid, S.sid

    FROM R, S

    WHERE R.qgram=S.qgram

    AND abs(R.strlen - S.strlen)<=K

    AND abs(R.pos - S.pos)<=K

    GROUP BY R.sid, S.sid

    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

    We refer to this filter as POSITION FILTERING


The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sid

FROM R1, S1, R, S

WHERE R1.sid=R.sid

AND S1.sid=S.sid

AND R.qgram=S.qgram

AND abs(strlen(R1.string)–strlen(S1.string))<=K

AND abs(R.pos - S.pos)<=K

GROUP BY R1.sid, S1.sid, R1.string, S1.string

HAVING COUNT(*) >=

(max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q


Summary of 1st paper

  • Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression

  • Our technique does not require modifying the underlying RDBMS


Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication


Single-attribute Case

  • Given

    • two sets of strings, R and S

    • a similarity function f between strings (metric space)

      • Reflexive: f(s1,s2) = 0 iff s1=s2

      • Symmetric: f(s1,s2) = f(s2,s1)

      • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3)

    • a threshold k

  • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.


Nested-loop?

  • Not desirable for large data sets

  • 5 hours for 30K strings!


Our 2-step approach

  • Step 1: map strings (in a metric space) to objects in a Euclidean space

  • Step 2: do a similarity join in the Euclidean space


Advantages

  • Applicable to many metric similarity functions

    • Use edit distance as an example

    • Other similarity functions also tried, e.g., q-gram-based similarity

  • Open to existing algorithms

    • Mapping techniques

    • Join techniques


Step 1

Map strings into a high-dimensional Euclidean space

(Figure: strings in a metric space are mapped to points in a Euclidean space.)


Mapping: StringMap

  • Input: A list of strings

  • Output: Points in a high-dimensional Euclidean space that preserve the original distances well

  • A variation of FastMap

    • Each step greedily picks two strings (pivots) to form an axis

    • All axes are orthogonal
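
    An illustrative sketch only: a simplified FastMap-style loop (the law-of-cosines projection is FastMap's; StringMap's actual pivot-selection heuristic and bookkeeping differ), usable with any metric dist such as the edit_distance sketched earlier:

    import random

    def string_map(strings, dist, d=20):
        """Map each string to a d-dimensional point whose Euclidean
        distances roughly preserve the original metric distances."""
        points = {s: [] for s in strings}

        def residual(s, t):
            # Distance still unexplained by the axes chosen so far
            r2 = dist(s, t) ** 2 - sum((x - y) ** 2
                                       for x, y in zip(points[s], points[t]))
            return max(r2, 0.0) ** 0.5

        for _ in range(d):
            # Greedily pick two far-apart pivot strings to form an axis
            a = random.choice(strings)
            b = max(strings, key=lambda t: residual(a, t))
            a = max(strings, key=lambda t: residual(b, t))
            d_ab = residual(a, b)
            for s in strings:
                # FastMap's law-of-cosines projection onto the (a, b) axis
                x = 0.0 if d_ab == 0 else (
                    residual(a, s) ** 2 + d_ab ** 2 - residual(b, s) ** 2
                ) / (2 * d_ab)
                points[s].append(x)
        return points

    Usage would look like points = string_map(names, edit_distance, d=20); note the sketch recomputes distances freely and is meant to show the idea, not the paper's optimized algorithm.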


Can it preserve distances?

  • Data Sources:

    • IMDB star names: 54,000

    • German names: 132,000

  • Distribution of string lengths (histograms omitted)


Can it preserve distances?

  • Use data set 1 (54K names) as an example

  • k=2, d=20

    • Use k’=5.2 to differentiate similar and dissimilar pairs.


Choose Dimensionality d

Increase d?

  • Good :

    • better to differentiate similar pairs from dissimilar ones.

  • Bad :

    • Step 1: Efficiency ↓

    • Step 2: “curse of dimensionality”


Choose Dimensionality d Using Sampling

  • Sample 1Kx1K strings, find their similar pairs (within distance k)

  • Calculate maximum of their new distances w

  • Define the "Cost" of finding a similar pair:

    Cost = (# of pairs within distance w) / (# of similar pairs)


Choose Dimensionality d

Result: d = 15 ~ 25


Choose new threshold k’

  • Closely related to the mapping property

  • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.

  • Choose k’ using sampling

    • Sample 1Kx1K strings, find similar pairs

    • Calculate their maximum new distance as k’

    • Repeat multiple times and choose the maximum


New threshold k’ in step 2

(Figure: k' measured for d = 20.)


Step 2: Similarity Join

  • Input: Two sets of points in Euclidean space.

  • Output: pairs of points whose distance is within the new threshold k'.

  • Many join algorithms can be used


Example

  • Adopted an algorithm by Hjaltason and Samet.

    • Build two R-trees.

    • Traverse the two trees, finding pairs of points whose distance is within k'.

    • Prune during traversal (e.g., using MinDist).
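
    The paper adopts the R-tree join of Hjaltason and Samet; as a stand-in for quick experiments (not the paper's method), a k-d tree join from SciPy yields the same within-k' pairs:

    import numpy as np
    from scipy.spatial import cKDTree

    def similarity_join(points_r, points_s, k_prime):
        """All (i, j) with Euclidean distance(points_r[i], points_s[j]) <= k_prime."""
        tree_r = cKDTree(np.asarray(points_r))
        tree_s = cKDTree(np.asarray(points_s))
        # For each point of tree_r, indices of tree_s points within radius k_prime
        matches = tree_r.query_ball_tree(tree_s, r=k_prime)
        return [(i, j) for i, js in enumerate(matches) for j in js]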


Final processing

  • Among the pairs produced from the similarity-join step, check their edit distance.

  • Return those pairs satisfying the threshold k


Running time


Recall

  • Recall: (# of found similar pairs) / (# of all similar pairs)


Multi-attribute linkage

  • Example: title + name + year

  • Different attributes have different similarity functions and thresholds

  • Consider merge rules in disjunctive format:
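
    For instance, a rule might look like this (an illustrative rule, not one from the slides):

    (ed(title1, title2) <= 2 AND ed(name1, name2) <= 1)
    OR (ed(title1, title2) <= 1 AND |year1 - year2| <= 1)  →  merge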


Evaluation strategies

  • Many ways to evaluate rules

  • Finding an optimal one: NP-hard

  • Heuristics:

    • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.

    • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.


Summary of 2nd paper

  • A novel two-step approach to record linkage.

  • Many existing mapping and join algorithms can be adopted

  • Applicable to many distance metrics.

  • Time and space efficient.

  • Multi-attribute case studied


Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication

Problems with Existing Deduplication Methods

  • Matching functions: calculate similarity scores and thresholds

    • Tedious coding

  • Learning-based methods: require a large (static) training set for accuracy

    • Difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function


New Approach

  • Relegate the task of finding the deduplication function to a machine learning algorithm

  • Design goal:

    • Fewer training instances

    • Interactive response

    • Fast convergence

    • High accuracy

    → Design an interactive deduplication system called ALIAS

ALIAS: Active Learning led Interactive Alias Suppression



The ALIAS System

  • A learning-based method

  • Exploits existing similarity functions

  • Uses Active Learning

    - an active learner picks the unlabeled instances that carry the most information for the training set

  • Produces a deduplication function that can identify duplicates


Overall Architecture

(Architecture diagram: the labeled pairs L and the unlabeled database D each pass through a Mapper that applies the similarity functions F, yielding mapped training pairs Lp, which initialize the training data T, and mapped unlabeled pairs Dp. The Learner trains classifiers on T, selects n instances S from Dp for labeling, folds the user's feedback on S back into T, and outputs a predicate for the uncertain region, supported by similarity indices.)

Primary Inputs for ALIAS

  • A set of initial training pairs (L)

    • less than 10 labeled records

    • arranged in pairs of duplicates and non-duplicates

  • A set of similarity functions (F)

    • Ex: word-match, qgram-match…

    • Compute similarity scores between two records based on any subset of attributes

    • The learner finds the right way to combine these scores into the final deduplication function

  • A database of unlabeled records (D)

  • Number of classifiers (<5)


Mapped Labeled Instances

  • Take r1, r2 from input L

  • Record r1(a1, a2, a3)

  • Record r2(a1, a2, a3)

  • Use similarity functions f1, f2…fn to compute similarity scores s1, s2…sn between r1 and r2

  • New record

    • r1&r2 (s1, s2…sn, y/n)

    • y = duplicate;

    • n = non-duplicate

  • Put new record in Lp
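
    A minimal sketch of this mapping step, treating each record as a single string and using two stand-in similarity functions (the names and functions are ours, not ALIAS's API):

    import difflib

    def word_match(a, b):
        # Fraction of shared words (Jaccard similarity on word sets)
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    def char_match(a, b):
        # Character-level similarity in [0, 1]
        return difflib.SequenceMatcher(None, a, b).ratio()

    SIM_FUNCS = [word_match, char_match]          # plays the role of F

    def map_pair(r1, r2, label=None):
        """Map a record pair to (s1, ..., sn) plus an optional y/n label."""
        scores = [f(r1, r2) for f in SIM_FUNCS]
        return scores + [label] if label is not None else scores

    # A labeled pair goes to Lp; omit the label for unlabeled pairs in Dp
    print(map_pair("Tom Hanks", "Ton Hank", label="y"))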


Mapped Unlabeled Instances

  • Take r1, r2 from D x D

  • Record r1(a1, a2, a3)

  • Record r2(a1, a2, a3)

  • Use similarity functions f1, f2…fn to compute similarity scores between r1 and r2

  • New record

    • r1&r2 (s1, s2…sn)

    • No y/n field

  • Put new record in Dp


Active Learner

(Same architecture diagram as above, now highlighting the Learner: it trains classifiers on the training data T and selects a set S of n instances from Dp for the user to label.)


ALIAS Algorithm

  • 1. Input: L, D, F.

  • 2. Create pairs Lp from the labeled data L and F.

  • 3. Create pairs Dp from the unlabeled data D and F.

  • 4. Initial training set T = Lp

  • 5. Loop until user satisfaction

    • Train classifier C using T.

    • Use C to select a set S of n instances from Dp for labeling.

    • If S is empty, exit loop.

    • Collect user feedback on the labels of S.

    • Add S to T and remove S from Dp.

  • 6. Output classifier C
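
    A condensed sketch of steps 4-6 (the scikit-learn classifier committee, 0/1 label encoding, round budget, and std-based disagreement measure are our assumptions for illustration):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def alias(Lp, labels, Dp, ask_user, n=5, max_rounds=20):
        """Active-learning loop: labels are 0/1 (non-duplicate/duplicate);
        ask_user(x) plays the role of user feedback."""
        T_X, T_y = list(Lp), list(labels)            # step 4: T = Lp
        pool = list(Dp)
        committee = [DecisionTreeClassifier(), GaussianNB()]
        for _ in range(max_rounds):                  # step 5
            for c in committee:                      # train classifiers on T
                c.fit(np.array(T_X), np.array(T_y))
            if not pool:
                break                                # S is empty: exit loop
            votes = np.array([c.predict(np.array(pool)) for c in committee])
            disagreement = votes.std(axis=0)         # 0 where members agree
            S = np.argsort(-disagreement)[:n]        # n most uncertain instances
            for i in sorted(S, reverse=True):        # label S; move S to T
                x = pool.pop(i)
                T_X.append(x)
                T_y.append(ask_user(x))
        return committee[0]                          # step 6: output classifier C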


The Indexing Component

  • Purpose:

    • Avoid mapping all pairs of records in D x D

  • 3 Methods:

    • Grouping

    • Sampling

    • Indexing


The Indexing Component (cont.)

  • Grouping

    • Example: group records in D according to the field year of publication

    • Mapped pairs are formed only within records of a group.

  • Sampling

    • Sample in units of a group instead of individual records.


The Indexing Component (cont.)

  • Indexing

    • A similarity function:

      “fraction of common words between two text attributes >=0.4”

    • We can create an index on the words of the text attributes


The Learning Component

  • Contains a number of classifiers

  • A classifier is a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or Support Vector Machine (SVM), that classifies instances

  • A classifier is trained using a training data set


Criteria for a classifier

  • Accuracy

  • Interpretability

  • Indexability

  • Efficient training


Accuracy of a Classifier

  • Measured by the mean F of recall r and precision p (F = 2rp / (r + p))

  • r = fraction of true duplicates correctly classified

  • p = fraction of instances labeled duplicate that actually are duplicates


Accuracy (cont.)

  • Example

    • A case with 1% duplicates; a classifier labels all pairs as non-duplicates

      recall r = 0 → mean F = 0 → accuracy = 0%

    • A case with 1% duplicates; a classifier identifies all duplicates correctly but misclassifies 1% of non-duplicates

      recall r = 1, p = 0.5 → F = 0.667 → accuracy = 66.7%

    → If we didn't use r and p, both classifiers would have 99% accuracy!


Criteria for a classifier (cont.)

  • Interpretability

    • Final deduplication rule is easy to understand and interpret

  • Indexability

    • Final deduplication rule has indexable predicates

  • Efficient training

    • Fast to train


Comparison of Different Classifiers


Active Learning

The goal is to seek out from the unlabeled pool the instances that, when labeled, will strengthen the classifier at the fastest possible rate.


A Simple Example

  • Assume all instances on the line are unlabeled except a and b

  • Suppose a is at coordinate 0 and b is at coordinate 1

  • Any unlabeled point x to the left of a or to the right of b has no effect in reducing the region of uncertainty

  • Including the midpoint m (the most uncertain instance) in the training set reduces the size of the uncertain region by half


How to select an unlabeled instance

  • Uncertainty

    • The instance about which the learner was most unsure was also the instance for which the expected reduction in confusion was the largest

    • Uncertainty score

      • the disagreement among the predictions the instance gets from a committee of N classifiers (see the sketch after this list)

      • A sure duplicate or non-duplicate would get the same prediction from all members

  • Representativeness

    • An uncertain instance that represents a larger number of unlabeled instances has a greater impact on the classifier
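
    One common way to quantify committee disagreement is vote entropy (this particular formula is our choice for illustration, not necessarily the paper's):

    from math import log

    def uncertainty(votes):
        """Vote entropy of a committee's predictions on one instance:
        0 when all members agree (a sure duplicate or non-duplicate),
        maximal when the committee is evenly split."""
        n = len(votes)
        ent = 0.0
        for label in set(votes):
            p = votes.count(label) / n
            ent -= p * log(p, 2)
        return ent

    print(uncertainty(["dup", "dup", "dup"]))        # 0.0: certain
    print(uncertainty(["dup", "non-dup", "dup"]))    # ~0.92: uncertain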


Example

  • 3 similarity functions: word match f1, qgram match f2, string edit distance f3.

  • Take from Mapped Unlabeled Instances (Dp)

    • r1&r2 (s1, s2…sn)

    • r3&r4 (s1, s2…sn)

    • s1, s2…sn: scores using functions f1, f2, f3

  • 3 classifiers: D-tree, Naïve Bayes, SVM


How to Combine Uncertainty and Representativeness

  • Proposed two approaches

  • 1st approach: weighted sum

    • Cluster the unlabeled instances

    • Estimate the density of points around each instance

    • Score each instance by a weighted sum of its density and its uncertainty value

    • Select the n highest-scoring instances (a small sketch follows below)

  • 2nd approach: sampling
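
    A tiny sketch of the first approach's scoring (the weight w and the toy numbers are assumptions):

    def score(density, uncertainty, w=0.5):
        """Weighted sum trading representativeness (density) against uncertainty."""
        return w * density + (1 - w) * uncertainty

    # (density, uncertainty) estimates for three unlabeled instances
    instances = [(0.9, 0.1), (0.4, 0.8), (0.7, 0.6)]
    n = 2
    picked = sorted(instances, key=lambda du: score(*du), reverse=True)[:n]
    print(picked)   # the n highest-scoring instances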


Conclusion

ALIAS:

  • Makes deduplication much easier (fewer training instances)

  • Provides interactive responses to the user

  • Achieves high accuracy
