Searching and Integrating Information on the Web

Seminar 3: Data Cleansing

Professor Chen Li

UC Irvine


Paper readings

  • Efficient Merge and Purge, Hernandez and Stolfo, SIGMOD 1995

  • Approximate String Joins in a Database (Almost) for Free, Gravano et al., VLDB 2001

  • Efficient Record Linkage in Large Data Sets, Jin, Li, and Mehrotra, DASFAA 2003

  • Interactive Deduplication Using Active Learning, Sarawagi and Bhamidipaty, KDD 2002

Seminar 3


Motivation

  • Correlate data from different data sources (e.g., data integration)

    • Data is often dirty

    • Needs to be cleansed before being used

  • Example:

    • A hospital needs to merge patient records from different data sources

    • They have different formats, typos, and abbreviations

Example

Table R

Table S

  • Find records from different datasets that could be the same entity

Another Example

  • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)

  • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Record linkage

Problem statement:

“Given two relations, identify the potentially matched records

  • Efficiently and

  • Effectively”

Challenges

  • How to define good similarity functions?

    • Many functions proposed (edit distance, cosine similarity, …)

    • Domain knowledge is critical

      • Names: “Wall Street Journal” and “LA Times”

      • Address: “Main Street” versus “Main St”

  • How to do matching efficiently?

    • Offline join version

    • Online (interactive) search

      • Nearest search

      • Range search

Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication

Edit Distance

  • A widely used metric to define string similarity

  • ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) needed to change s1 into s2

  • Example:

    s1: Tom Hanks

    s2: Ton Hank

    ed(s1,s2) = 2
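The definition above is the standard dynamic program; a minimal Python sketch (not part of the original slides):

```python
def ed(s1: str, s2: str) -> int:
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(ed("Tom Hanks", "Ton Hank"))  # → 2 (substitute m→n, delete s)
```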

Approximate String Joins

  • We want to join tuples with “similar” string fields

  • Similarity measure: Edit Distance

  • Each insertion, deletion, or replacement increases the distance by one

Focus: Approximate String Joins over Relational DBMSs

  • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K

  • Solve the problem in a database-friendly way

    (if possible with an existing "vanilla" RDBMS)

Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:

  • Join data outside of DBMS

  • Join data via user-defined functions (UDFs) inside the DBMS

Approximate String Joins outside of a DBMS

  • Export data

  • Join outside of DBMS

  • Import the result

    Main advantage:

    We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

    Disadvantages:

  • Substantial amounts of data to be exported/imported

  • Cannot be easily integrated with further processing steps in the DBMS

Approximate String Joins with UDFs

  • Write a UDF to check if two strings match within distance K

  • Write an SQL statement that applies the UDF to the string pairs

    SELECT R.stringAttr, S.stringAttr

    FROM R, S

    WHERE edit_distance(R.stringAttr, S.stringAttr, K)

    Main advantage:

    Ease of implementation

    Main disadvantage:

    UDF applied to entire cross-product of relations
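To make the trade-off concrete, here is a runnable sketch of the UDF route using Python's sqlite3 module as a stand-in DBMS (the table contents and the `within_k` helper are illustrative; note the UDF is still evaluated over the full cross-product):

```python
import sqlite3

def within_k(s1, s2, k):
    """UDF: 1 if edit distance between s1 and s2 is at most k, else 0."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return int(prev[-1] <= k)

conn = sqlite3.connect(":memory:")
conn.create_function("edit_distance", 3, within_k)
conn.executescript("""
    CREATE TABLE R(stringAttr TEXT);
    CREATE TABLE S(stringAttr TEXT);
    INSERT INTO R VALUES ('Tom Hanks'), ('Meg Ryan');
    INSERT INTO S VALUES ('Ton Hank'), ('Mag Rian');
""")
K = 2
rows = conn.execute(
    "SELECT R.stringAttr, S.stringAttr FROM R, S "
    "WHERE edit_distance(R.stringAttr, S.stringAttr, ?)", (K,)).fetchall()
print(rows)  # both near-duplicate pairs survive
```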

Our Approach: Approximate String Joins over an Unmodified RDBMS

  • Preprocess data and generate auxiliary tables

  • Perform join exploiting standard RDBMS capabilities

    Advantages

  • No modification of underlying RDBMS needed.

  • Can leverage the RDBMS query optimizer.

  • Much more efficient than the approach based on naive UDFs

Intuition and Roadmap

  • Intuition:

    • Similar strings have many common substrings

    • Use exact joins to perform approximate joins (current DBMSs are good at exact joins)

    • A good candidate set can be verified for false positives

      [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

  • Roadmap:

    • Break strings into substrings of length q (q-grams)

    • Perform an exact join on the q-grams

    • Find candidate string pairs based on the results

    • Check only candidate pairs with a UDF to obtain final answer

What is a “Q-gram”?

  • Q-gram: A sequence of q characters of the original string

    Example for q=3

    vacations

    {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    String with length L → L + q - 1 q-grams

  • Similar strings have many common q-grams
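The padding convention above translates directly into code; a small sketch (the function name `qgrams` is mine):

```python
def qgrams(s: str, q: int = 3) -> list:
    """All q-grams of s, padded with q-1 '#' prefixes and q-1 '$' suffixes."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vacations"))
# → ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']
print(len(qgrams("vacations")) == len("vacations") + 3 - 1)  # → True: L + q - 1
```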

Q-grams and Edit Distance Operations

  • With no edits: L + q - 1 common q-grams

  • Replacement: (L + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}

  • Insertion: (Lmax + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}

  • Deletion: (Lmax + q – 1) - q common q-grams

    Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

    Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}

Number of Common Q-grams and Edit Distance

  • For Edit Distance = K, there could be at most K replacements, insertions, deletions

  • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common

  • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
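The bound is easy to check outside the DBMS as well; a sketch (helper names are mine) that keeps only pairs sharing at least max(len1, len2) + q − 1 − Kq q-grams:

```python
from collections import Counter

def qgrams(s, q=3):
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def count_filter_pass(s1, s2, q=3, k=1):
    """Lower bound on shared q-grams for strings within edit distance k;
    pairs below the bound can be safely discarded (no false dismissals)."""
    bound = max(len(s1), len(s2)) + q - 1 - k * q
    # multiset intersection of the two q-gram bags
    common = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
    return common >= bound

# "vacalions" is one substitution away and shares exactly the bound of 8 q-grams
print(count_filter_pass("vacations", "vacalions"))  # → True
print(count_filter_pass("vacations", "holidays"))   # → False
```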

Using a DBMS for Q-gram Joins

  • If we have the q-grams in the DBMS, we can perform this counting efficiently.

  • Create auxiliary tables with tuples of the form:

    <sid, strlen, qgram>

    and join these tables

  • A GROUP BY – HAVING COUNT clause can perform the counting / filtering

Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q

The result is the set of string pairs with enough common q-grams to ensure that we will not have false negatives.

Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as LENGTH FILTERING

Exploiting Q-gram Positions for Filtering

  • Consider strings aabbzzaacczz and aacczzaabbzz

  • Strings are at edit distance 4

  • Strings have identical q-grams for q=3

    Problem: Matching q-grams that are at different positions in both strings

    • Either the q-grams do not "originate" from the same position, or

    • Too many edit operations "caused" spurious q-grams at various parts of strings to match

POSITION FILTERING: Filtering Using Positions

  • Keep the position of the q-grams <sid, strlen, pos, qgram>

  • Do not match q-grams that are more than K positions away

    SELECT R.sid, S.sid

    FROM R, S

    WHERE R.qgram=S.qgram

    AND abs(R.strlen - S.strlen)<=K

    AND abs(R.pos - S.pos)<=K

    GROUP BY R.sid, S.sid

    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

    We refer to this filter as POSITION FILTERING
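Since SQLite happens to support a two-argument scalar `max`, the whole pipeline (positional q-gram tables plus count, length, and position filtering) can be exercised end to end; a sketch with made-up data:

```python
import sqlite3

def qgrams(s, q=3):
    """Positional q-grams (1-based) with '#'/'$' padding."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
    CREATE TABLE S(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
""")
q, K = 3, 1
for table, strings in (("R", {1: "vacations"}), ("S", {1: "vacalions", 2: "holidays"})):
    for sid, s in strings.items():
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
                         [(sid, len(s), pos, g) for pos, g in qgrams(s, q)])

pairs = conn.execute(f"""
    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= {K}          -- length filtering
      AND abs(R.pos - S.pos) <= {K}                -- position filtering
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= max(R.strlen, S.strlen) + {q} - 1 - {K} * {q}  -- count filtering
""").fetchall()
print(pairs)  # only the ("vacations", "vacalions") pair survives as a candidate
```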

The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sid

FROM R1, S1, R, S

WHERE R1.sid=R.sid

AND S1.sid=S.sid

AND R.qgram=S.qgram

AND abs(strlen(R1.string)–strlen(S1.string))<=K

AND abs(R.pos - S.pos)<=K

GROUP BY R1.sid, S1.sid, R1.string, S1.string

HAVING COUNT(*) >=

(max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q

Summary of 1st Paper

  • Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression

  • Our technique does not require modifying the underlying RDBMS

Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication

Single-attribute Case

  • Given

    • two sets of strings, R and S

    • a similarity function f between strings (metric space)

      • Reflexive: f(s1,s2) = 0 iff s1=s2

      • Symmetric: f(s1,s2) = f(s2,s1)

      • Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3)

    • a threshold k

  • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.

Nested-loop?

  • Not desirable for large data sets

  • 5 hours for 30K strings!

Our 2-step approach

  • Step 1: map strings (in a metric space) to objects in a Euclidean space

  • Step 2: do a similarity join in the Euclidean space

Advantages

  • Applicable to many metric similarity functions

    • Use edit distance as an example

    • Other similarity functions also tried, e.g., q-gram-based similarity

  • Open to existing algorithms

    • Mapping techniques

    • Join techniques

Step 1

Map strings into a high-dimensional Euclidean space

    Metric Space → Euclidean Space

Mapping: StringMap

  • Input: A list of strings

  • Output: Points in a high-dimensional Euclidean space that preserve the original distances well

  • A variation of FastMap

    • Each step greedily picks two strings (pivots) to form an axis

    • All axes are orthogonal
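One axis of the mapping can be sketched with the standard FastMap projection formula; the pivots below are fixed by hand, whereas StringMap greedily picks a far-apart pivot pair per axis (all strings are illustrative):

```python
# One axis of the mapping, using the standard FastMap projection.
def ed(s1, s2):
    """Plain edit distance (row-by-row dynamic program)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def axis_coordinate(s, pivot_a, pivot_b):
    d_ab = ed(pivot_a, pivot_b)
    # project s onto the axis through the two pivot strings
    return (ed(s, pivot_a) ** 2 + d_ab ** 2 - ed(s, pivot_b) ** 2) / (2 * d_ab)

strings = ["Tom Hanks", "Ton Hank", "Meg Ryan"]
coords = [axis_coordinate(s, "Tom Hanks", "Meg Ryan") for s in strings]
# pivot_a lands at 0, pivot_b at d(a,b); "Ton Hank" stays close to "Tom Hanks"
```

Repeating this for d pivot pairs (with residual distances, as in FastMap) yields the d-dimensional Euclidean points used in step 2.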

Can it preserve distances?

  • Data Sources:

    • IMDB star names: 54,000

    • German names: 132,000

  • Distribution of string lengths (chart omitted)

Can it preserve distances? (cont.)

  • Use data set 1 (54K names) as an example

  • k=2, d=20

    • Use k’=5.2 to differentiate similar and dissimilar pairs.

Choose Dimensionality d

Increase d?

  • Good:

    • Better at differentiating similar pairs from dissimilar ones

  • Bad:

    • Step 1: efficiency ↓

    • Step 2: “curse of dimensionality”

Choose dimensionality d using sampling

  • Sample 1K x 1K strings, find their similar pairs (within distance k)

  • Calculate the maximum w of their new distances

  • Define the “Cost” of finding a similar pair:

    Cost = (# of pairs within distance w) / (# of similar pairs)

Choose Dimensionality d (cont.)

d=15 ~ 25

Choose new threshold k’

  • Closely related to the mapping property

  • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.

  • Choose k’ using sampling

    • Sample 1Kx1K strings, find similar pairs

    • Calculate their maximum new distance as k’

    • repeat multiple times, choose their maximum

New threshold k’ in step 2

d=20

Step 2: Similarity Join

  • Input: Two sets of points in Euclidean space.

  • Output: Pairs of two points whose distance is less than new threshold k’.

  • Many join algorithms can be used

Example

  • Adopted an algorithm by Hjaltason and Samet.

    • Build two R-trees.

    • Traverse the two trees, finding point pairs whose distance is within k’.

    • Prune during traversal (e.g., using MinDist).

Final processing

  • Among the pairs produced from the similarity-join step, check their edit distance.

  • Return those pairs satisfying the threshold k

Running time

Recall

  • Recall = (# of found similar pairs) / (# of all similar pairs)

Multi-attribute linkage

  • Example: title + name + year

  • Different attributes have different similarity functions and thresholds

  • Consider merge rules in disjunctive format:

Evaluation strategies

  • Many ways to evaluate rules

  • Finding an optimal one: NP-hard

  • Heuristics:

    • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.

    • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

Summary of 2nd Paper

  • A novel two-step approach to record linkage.

  • Many existing mapping and join algorithms can be adopted

  • Applicable to many distance metrics.

  • Time and space efficient.

  • Multi-attribute case studied

Outline

  • Supporting string-similarity joins using RDBMS

  • Using mapping techniques

  • Interactive deduplication

Problems with Existing Deduplication Methods

  • Matching functions

    • Calculate similarity scores and thresholds

    • Tedious coding

  • Learning-based methods

    • Require a large training set for accuracy (static training set)

    • Difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function

New Approach

  • Relegate the task of finding the deduplication function to a machine learning algorithm

  • Design goal:

    • Fewer training instances

    • Interactive response

    • Fast convergence

    • High accuracy

       ⇒ Design an interactive deduplication system called ALIAS

    ALIAS: Active Learning led Interactive Alias Suppression


ALIAS System

  • A learning-based method

  • Exploit existing similarity functions

  • Use Active Learning

    - an active learner picks the unlabeled instances with the most information gain to add to the training set

  • Produce a deduplication function that can identify duplicates

Overall Architecture

(Architecture diagram: labeled pairs L and the unlabeled database D pass through the mapper with similarity functions F, yielding mapped sets Lp and Dp; the learner trains classifiers on training data T, uses similarity indices and a predicate for the uncertain region to restrict Dp, selects n instances for labeling, and feeds the user's labels back into T.)

Primary Inputs for ALIAS


  • A set of initial training pairs (L)

    • less than 10 labeled records

    • arranged in pairs of duplicates and non-duplicates

  • A set of similarity functions (F)

    • Ex: word-match, qgram-match…

    • To compute similarity scores between two records based on any subset of attributes.

    • Learner will find the right way of combining those scores to get the final deduplication function

  • A database of unlabeled records (D)

  • Number of classifiers (<5)

Mapped Labeled Instances


  • Take r1, r2 from input L

  • Record r1(a1, a2, a3)

  • Record r2(a1, a2, a3)

  • Use similarity functions f1, f2…fn to compute similarity scores s between r1 and r2

  • New record

    • r1&r2 (s1, s2…sn, y/n)

    • y = duplicate;

    • n = non-duplicate

  • Put new record in Lp
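The mapper step above can be sketched directly: a record pair plus the similarity functions F becomes one mapped record of scores, with the y/n field appended for labeled pairs. The two similarity functions below are simple stand-ins, not the actual ALIAS similarity library:

```python
def word_match(a, b):
    """Fraction of shared words (Jaccard on word sets) - a stand-in f1."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def prefix_agreement(a, b):
    """Fraction of positions where the two strings agree - a stand-in f2."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b), 1)

F = [word_match, prefix_agreement]

def map_pair(r1, r2, label=None):
    scores = [f(r1, r2) for f in F]
    return scores + [label] if label is not None else scores

# A labeled pair goes to Lp; an unlabeled pair (no y/n field) goes to Dp
lp_record = map_pair("P. Bernstein, D. Chiu",
                     "Philip A. Bernstein, Dah-Ming W. Chiu", "y")
dp_record = map_pair("Main Street", "Main St")
```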


Mapped Unlabeled Instances


Take r1, r2 from D x D

  • Record r1(a1, a2, a3)

  • Record r2(a1, a2, a3)

  • Use similarity functions f1, f2…fn to compute similarity scores between r1 and r2

  • New record

    • r1&r2 (s1, s2…sn)

    • No y/n field

  • Put new record in Dp


Active Learner

(Architecture diagram repeated, highlighting the learner: classifiers are trained on training data T, and the learner selects a set S of n instances from Dp for the user to label.)

ALIAS Algorithm

  • 1. Input: L, D, F.

  • 2. Create pairs Lp from the labeled data L and F.

  • 3. Create pairs Dp from the unlabeled data D and F.

  • 4. Initial training set T = Lp

  • 5. Loop until user satisfaction

    • Train classifier C using T.

    • Use C to select a set S of n instances from Dp for labeling.

    • If S is empty, exit loop.

    • Collect user feedback on the labels of S.

    • Add S to T and remove S from Dp.

  • 6. Output classifier C
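The loop above can be sketched with a toy committee; everything below (one-dimensional similarity scores, perturbed threshold "stumps", the simulated oracle) is a stand-in for the paper's D-tree/NB/SVM committee and for real user feedback:

```python
import random

def train_stump(T):
    """Toy 'classifier': a threshold between the labeled classes,
    perturbed slightly so committee members can disagree.
    Assumes T contains at least one 'y' and one 'n' example."""
    dups = [s for s, y in T if y == "y"]
    nons = [s for s, y in T if y == "n"]
    t = (min(dups) + max(nons)) / 2 + random.uniform(-0.05, 0.05)
    return lambda s, t=t: "y" if s >= t else "n"

def alias(Lp, Dp, oracle, n=2, rounds=3, committee=3):
    T, pool = list(Lp), list(Dp)
    for _ in range(rounds):
        clfs = [train_stump(T) for _ in range(committee)]   # train on T
        def disagreement(s):                                # uncertainty score
            votes = [c(s) for c in clfs]
            return min(votes.count("y"), votes.count("n"))
        pool.sort(key=disagreement, reverse=True)
        S, pool = pool[:n], pool[n:]                        # select n instances
        if not S:
            break
        T += [(s, oracle(s)) for s in S]                    # user feedback on S
    return train_stump(T)                                   # output classifier

random.seed(0)
Lp = [(0.9, "y"), (0.1, "n")]               # initial labeled pairs (<10)
Dp = [0.48, 0.52, 0.2, 0.8, 0.45, 0.55]     # unlabeled mapped instances
C = alias(Lp, Dp, oracle=lambda s: "y" if s >= 0.5 else "n")
```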

The Indexing Component

  • Purpose:

    • Avoid mapping all pairs of records in D x D

  • 3 Methods:

    • Grouping

    • Sampling

    • Indexing

The Indexing Component (cont.)

  • Grouping

    • Example: group records in D according to the field year of publication

    • Mapped pairs are formed only within records of a group.

  • Sampling

    • Sample in units of a group instead of individual records.
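Grouping can be sketched in a few lines (the year attribute and the citation records are made up): pairs are generated only within a group, shrinking the candidate set:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical citation records: (title, year of publication)
records = [
    ("Using Semi-Joins to Solve Relational Queries", 1981),
    ("Using Semi-Joins to Solve Relational Querys", 1981),
    ("Efficient Record Linkage in Large Data Sets", 2003),
]

# Group by year; mapped pairs are formed only within records of a group
groups = defaultdict(list)
for title, year in records:
    groups[year].append(title)

pairs = [p for grp in groups.values() for p in combinations(grp, 2)]
print(len(pairs))  # → 1, versus 3 pairs from the full cross-product
```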

The Indexing Component (cont.)

  • Indexing

    • A similarity function:

      “fraction of common words between two text attributes >=0.4”

    • We can create an index on the words of the text attributes

The Learning Component

  • Contains a number of classifiers

  • A classifier: a machine learning algorithm such as decision tree (D-tree), naïve Bayes (NB), or Support Vector Machine (SVM) used to classify instances

  • A classifier is trained using a training data set

Criteria for a classifier

  • Accuracy

  • Interpretability

  • Indexability

  • Efficient training

Accuracy of a Classifier

  • Measured by the mean F of recall r and precision p

  • r = fraction of duplicates correctly classified

  • p = fraction correct amongst all instances actually labeled duplicate

Accuracy (cont.)

  • Example

    • A case with 1% duplicates; a classifier labels all pairs as non-duplicates

      recall r = 0 ⇒ mean F = 0 ⇒ accuracy = 0%

    • A case with 1% duplicates; a classifier identifies all duplicates correctly but misclassifies 1% of the non-duplicates

      recall r = 1, p = 0.5 ⇒ F = 0.667 ⇒ accuracy = 66.7%

      ⇒ If we don’t use r and p, then both classifiers have 99% accuracy!
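The slide's numbers follow if F is taken as the harmonic mean (F1) of recall r and precision p; a quick check:

```python
# F as the harmonic mean of recall r and precision p
def f_measure(r, p):
    return 0.0 if r + p == 0 else 2 * r * p / (r + p)

# Classifier 1: labels everything non-duplicate → r = 0, so F = 0
# Classifier 2: finds all duplicates but mislabels an equal number of
# non-duplicates → r = 1, p = 0.5
print(f_measure(0, 0))              # → 0.0
print(round(f_measure(1, 0.5), 3))  # → 0.667
```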

Criteria for a classifier (cont.)

  • Interpretability

    • Final deduplication rule is easy to understand and interpret

  • Indexability

    • Final deduplication rule has indexable predicates

  • Efficient training

    • Fast to train

Active Learning

The goal is to seek out from the unlabeled pool the instances which when labeled will help strengthen the classifier at the fastest possible rate.

A Simple Example

Most uncertain instance

  • Assume all are unlabeled except a and b

  • Suppose a-coordinate = 0, b-coordinate = 1

  • Any unlabeled point x to the left of a or to the right of b will have no effect in reducing the region of uncertainty

  • By including m in the training set, the size of the uncertain region will reduce by half

How to select an unlabeled instance

  • Uncertainty

    • The instance about which the learner was most unsure was also the instance for which the expected reduction in confusion was the largest

    • Uncertainty score

      • the disagreement among the predictions the instance gets from a committee of N classifiers

      • A sure duplicate or non-duplicate would get the same prediction from all members

  • Representativeness

    • An uncertain instance representing a larger number of unlabeled instances has a greater impact on the classifier

Example

  • 3 similarity functions: word match f1, qgram match f2, string edit distance f3.

  • Take from Mapped Unlabeled Instances (Dp)

    • r1&r2 (s1, s2…sn)

    • r3&r4 (s1, s2…sn)

    • s1, s2…sn: scores using functions f1, f2, f3

  • 3 classifiers: D-tree, Naïve Bayes, SVM


How to Combine Uncertainty and Representativeness

  • Proposed two approaches

  • 1st approach: weighted sum

    • Cluster the unlabeled instances

    • Estimate the density of points around each instance

    • Each instance is scored using a weighted sum of its density and uncertainty value

    • n highest scoring instances selected

  • 2nd approach: sampling

Conclusion

ALIAS:

  • Makes deduplication much easier (fewer training instances)

  • Provides interactive responses to the user

  • High accuracy
