Searching and Integrating Information on the Web

Seminar 3: Data Cleansing

Professor Chen Li

UC Irvine

Paper readings
  • Efficient Merge and Purge: Hernandez and Stolfo, SIGMOD 1995
  • Approximate String Joins in a Database (Almost) for Free: Gravano et al., VLDB 2001
  • Efficient Record Linkage in Large Data Sets: Jin, Li, and Mehrotra, DASFAA 2003
  • Interactive Deduplication Using Active Learning: Sarawagi and Bhamidipaty, KDD 2002

Motivation
  • Correlate data from different data sources (e.g., data integration)
    • Data is often dirty
    • Needs to be cleansed before being used
  • Example:
    • A hospital needs to merge patient records from different data sources
    • They have different formats, typos, and abbreviations

Example

(Two example tables, R and S, are shown on the slide; their contents are not preserved in this transcript.)

  • Find records from different datasets that could be the same entity

Another Example
  • P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)
  • Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Record linkage

Problem statement:

“Given two relations, identify the potentially matched records

  • efficiently and
  • effectively”

Challenges
  • How to define good similarity functions?
    • Many functions proposed (edit distance, cosine similarity, …)
    • Domain knowledge is critical
      • Names: “Wall Street Journal” and “LA Times”
      • Address: “Main Street” versus “Main St”
  • How to do matching efficiently?
    • Offline join version
    • Online (interactive) search
      • Nearest-neighbor search
      • Range search

Outline
  • Supporting string-similarity joins using an RDBMS
  • Using mapping techniques
  • Interactive deduplication

Edit Distance
  • A widely used metric to define string similarity
  • ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to transform s1 into s2
  • Example:

s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

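As an illustration (not from the slides), here is a minimal dynamic-programming sketch of this definition in Python; the function name edit_distance is ours:

def edit_distance(s1: str, s2: str) -> int:
    # Classic Levenshtein DP with two rolling rows; unit cost for
    # insertion, deletion, and substitution.
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from "" to prefixes of s2
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

assert edit_distance("Tom Hanks", "Ton Hank") == 2   # the slide's example
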
Approximate String Joins
  • We want to join tuples with “similar” string fields
  • Similarity measure: Edit Distance
  • Each insertion, deletion, or replacement increases the distance by one

Focus: Approximate String Joins over Relational DBMSs
  • Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K
  • Solve the problem in a database-friendly way

(if possible with an existing "vanilla" RDBMS)

Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:

  • Join data outside of DBMS
  • Join data via user-defined functions (UDFs) inside the DBMS

Approximate String Joins outside of a DBMS
  • Export data
  • Join outside of DBMS
  • Import the result

Main advantage:

We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

Disadvantages:

  • Substantial amounts of data to be exported/imported
  • Cannot be easily integrated with further processing steps in the DBMS

Approximate String Joins with UDFs
  • Write a UDF to check if two strings match within distance K
  • Write an SQL statement that applies the UDF to the string pairs

SELECT R.stringAttr, S.stringAttr
FROM R, S
WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage:

Ease of implementation

Main disadvantage:

UDF applied to entire cross-product of relations

Our Approach: Approximate String Joins over an Unmodified RDBMS
  • Preprocess data and generate auxiliary tables
  • Perform join exploiting standard RDBMS capabilities

Advantages

  • No modification of the underlying RDBMS needed
  • Can leverage the RDBMS query optimizer
  • Much more efficient than the naive UDF approach

Intuition and Roadmap
  • Intuition:
    • Similar strings have many common substrings
    • Use exact joins to perform approximate joins (current DBMSs are good at exact joins)

    • A good candidate set can be verified for false positives

[Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

  • Roadmap:
    • Break strings into substrings of length q (q-grams)
    • Perform an exact join on the q-grams
    • Find candidate string pairs based on the results
    • Check only candidate pairs with a UDF to obtain final answer

What is a “Q-gram”?
  • Q-gram: A sequence of q characters of the original string

Example for q=3

vacations

{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

String with length L → L + q - 1 q-grams

  • Similar strings have many common q-grams

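A small Python sketch of q-gram extraction with the padding convention used above (q-1 '#' characters in front, q-1 '$' characters at the end); the function name is ours:

def qgrams(s: str, q: int = 3):
    # Pad so every character of s appears in exactly q q-grams.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vacations"))
# ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']
# len("vacations") = 9, q = 3  ->  9 + 3 - 1 = 11 q-grams, as stated above
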
Q-grams and Edit Distance Operations
  • With no edits: L + q - 1 common q-grams
  • Replacement: (L + q - 1) - q common q-grams

Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}

  • Insertion: (Lmax + q - 1) - q common q-grams

Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}

  • Deletion: (Lmax + q - 1) - q common q-grams

Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}

Number of Common Q-grams and Edit Distance
  • For edit distance = K, there can be at most K replacements, insertions, and deletions in total
  • Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common
  • Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)

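The bound is easy to state in code (a sketch; the helper name is ours). Note that for short strings the bound can drop to zero or below, in which case count filtering prunes nothing:

def min_common_qgrams(len1: int, len2: int, q: int, k: int) -> int:
    # Strings within edit distance k must share at least this many q-grams.
    return max(len1, len2) + q - 1 - k * q

print(min_common_qgrams(9, 10, 3, 1))  # 9: "vacations" vs. "vacatlions" above
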
Using a DBMS for Q-gram Joins
  • If we have the q-grams in the DBMS, we can perform this counting efficiently.
  • Create auxiliary tables with tuples of the form:

<sid, strlen, qgram>

and join these tables

  • A GROUP BY / HAVING COUNT(*) clause can perform the counting and filtering

Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

The result is the set of string pairs with enough common q-grams to guarantee that there are no false negatives.

Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram AND abs(R.strlen - S.strlen) <= K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as LENGTH FILTERING

Exploiting Q-gram Positions for Filtering
  • Consider strings aabbzzaacczz and aacczzaabbzz
  • Strings are at edit distance 4
  • Strings have identical q-grams for q=3

Problem: Matching q-grams that are at different positions in both strings

    • Either q-grams do not "originate" from same q-gram, or
    • Too many edit operations "caused" spurious q-grams at various parts of strings to match

POSITION FILTERING - Filtering using positions
  • Keep the position of the q-grams <sid, strlen, pos, qgram>
  • Do not match q-grams that are more than K positions away

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram = S.qgram
  AND abs(R.strlen - S.strlen) <= K
  AND abs(R.pos - S.pos) <= K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as POSITION FILTERING

The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sid
FROM R1, S1, R, S
WHERE R1.sid = R.sid
  AND S1.sid = S.sid
  AND R.qgram = S.qgram
  AND abs(strlen(R1.string) - strlen(S1.string)) <= K
  AND abs(R.pos - S.pos) <= K
GROUP BY R1.sid, S1.sid, R1.string, S1.string
HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q

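To make the idea concrete, here is a hedged end-to-end sketch in Python with SQLite (the qgrams helper and the sample strings are ours, and the query uses the stored strlen column instead of a strlen() function; SQLite's two-argument max() plays the role of the scalar max over the two lengths):

import sqlite3

Q, K = 3, 1

def qgrams(s, q=Q):
    # Positional q-grams with '#'/'$' padding, as in the earlier slides.
    p = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, p[i:i + q]) for i in range(len(p) - q + 1)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1(sid INTEGER, string TEXT);
    CREATE TABLE S1(sid INTEGER, string TEXT);
    CREATE TABLE R(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
    CREATE TABLE S(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
""")
for strings, t1, t2 in ((["vacations", "tickets"], "R1", "R"),
                        (["vacatlions", "tinkers"], "S1", "S")):
    for sid, s in enumerate(strings):
        conn.execute(f"INSERT INTO {t1} VALUES (?, ?)", (sid, s))
        conn.executemany(f"INSERT INTO {t2} VALUES (?, ?, ?, ?)",
                         [(sid, len(s), pos, g) for pos, g in qgrams(s)])

rows = conn.execute("""
    SELECT R1.string, S1.string
    FROM R1, S1, R, S
    WHERE R1.sid = R.sid AND S1.sid = S.sid
      AND R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= ?
      AND abs(R.pos - S.pos) <= ?
    GROUP BY R1.sid, S1.sid
    HAVING COUNT(*) >= max(R.strlen, S.strlen) + ? - 1 - ? * ?
""", (K, K, Q, K, Q)).fetchall()
print(rows)  # [('vacations', 'vacatlions')]: candidates only; each pair
             # must still be verified with a true edit-distance check
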
Summary of 1st paper
  • Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression
  • Our technique does not require modifying the underlying RDBMS

Outline
  • Supporting string-similarity joins using an RDBMS
  • Using mapping techniques
  • Interactive deduplication

Single-attribute Case
  • Given
    • two sets of strings, R and S
    • a similarity function f between strings (a metric space):
      • Reflexive: f(s1, s2) = 0 iff s1 = s2
      • Symmetric: f(s1, s2) = f(s2, s1)
      • Triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)
    • a threshold k
  • Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.

Nested-loop?
  • Not desirable for large data sets
  • 5 hours for 30K strings!

Our 2-step approach
  • Step 1: map strings (in a metric space) to objects in a Euclidean space
  • Step 2: do a similarity join in the Euclidean space

Advantages
  • Applicable to many metric similarity functions
    • Use edit distance as an example
    • Other similarity functions also tried, e.g., q-gram-based similarity
  • Open to existing algorithms
    • Mapping techniques
    • Join techniques

Step 1

Map strings into a high-dimensional Euclidean space


Mapping: StringMap
  • Input: A list of strings
  • Output: Points in a high-dimensional Euclidean space that preserve the original distances well
  • A variation of FastMap
    • Each step greedily picks two strings (pivots) to form an axis
    • All axes are orthogonal

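A hedged sketch of one FastMap-style axis (the heart of StringMap); the pivot-finding heuristic here is simplified, and the actual algorithm repeats this d times on residual distances so the axes stay orthogonal:

def project_axis(strings, dist):
    # Pick two far-apart pivots a, b (simple two-pass heuristic).
    a = strings[0]
    b = max(strings, key=lambda s: dist(a, s))
    a = max(strings, key=lambda s: dist(b, s))
    d_ab = dist(a, b)
    if d_ab == 0:
        return {s: 0.0 for s in strings}
    # FastMap coordinate from the law of cosines; for the next axis the
    # residual distance is dist'(x, y)^2 = dist(x, y)^2 - (cx - cy)^2.
    return {o: (dist(a, o) ** 2 + d_ab ** 2 - dist(b, o) ** 2) / (2 * d_ab)
            for o in strings}

# e.g., the first of the d coordinates: project_axis(names, edit_distance)
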
Can it preserve distances?
  • Data Sources:
    • IMDB star names: 54,000
    • German names: 132,000
  • Distribution of string lengths (chart not preserved in this transcript)

Can it preserve distances?
  • Use data set 1 (54K names) as an example
  • k=2, d=20
    • Use k’=5.2 to differentiate similar and dissimilar pairs.

Choose Dimensionality d

Increase d?

  • Good :
    • better to differentiate similar pairs from dissimilar ones.
  • Bad :
    • Step 1: Efficiency ↓
    • Step 2: “curse of dimensionality”


Choose Dimensionality d Using Sampling
  • Sample 1K x 1K strings and find their similar pairs (within distance k)
  • Calculate the maximum w of their new (Euclidean) distances
  • Define the “cost” of finding a similar pair:

Cost = (# of pairs within distance w) / (# of similar pairs)

Choose Dimensionality d

d=15 ~ 25

Choose new threshold k’
  • Closely related to the mapping property
  • Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.
  • Choose k’ using sampling
    • Sample 1Kx1K strings, find similar pairs
    • Calculate their maximum new distance as k’
    • Repeat multiple times and choose the maximum

Step 2: Similarity Join
  • Input: Two sets of points in Euclidean space.
  • Output: Pairs of two points whose distance is less than new threshold k’.
  • Many join algorithms can be used

Example
  • Adopted an algorithm by Hjaltason and Samet:
    • Build two R-trees
    • Traverse the two trees, finding pairs of points whose distance is within k’
    • Prune during traversal (e.g., using MinDist)

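For intuition only: the same "all pairs within k'" step can be sketched with SciPy's k-d tree standing in for the paper's R-tree join (function and variable names ours):

import numpy as np
from scipy.spatial import cKDTree

def similarity_join(points_r, points_s, k_prime):
    # Report all (i, j) with ||points_r[i] - points_s[j]|| <= k_prime.
    tree_r = cKDTree(np.asarray(points_r))
    tree_s = cKDTree(np.asarray(points_s))
    neighbors = tree_r.query_ball_tree(tree_s, r=k_prime)
    return [(i, j) for i, js in enumerate(neighbors) for j in js]
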
Final processing
  • Among the pairs produced from the similarity-join step, check their edit distance.
  • Return those pairs satisfying the threshold k

Running time

(The running-time comparison chart is not preserved in this transcript.)

Recall
  • Recall = (# of found similar pairs) / (# of all similar pairs)

Multi-attribute linkage
  • Example: title + name + year
  • Different attributes have different similarity functions and thresholds
  • Consider merge rules in disjunctive format: a pair matches if any one conjunct of per-attribute similarity predicates is satisfied (the concrete rule on the slide is not preserved in this transcript; see the sketch below)

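The sketch below shows the disjunctive format with hypothetical attributes and thresholds (everything in it is illustrative, not from the slides):

# A disjunction of conjuncts; each conjunct ANDs per-attribute tests.
# Attributes and thresholds below are made up for illustration.
RULE = [
    [("title", 2), ("name", 1)],   # ed(title) <= 2 AND ed(name) <= 1
    [("title", 1), ("year", 0)],   # ed(title) <= 1 AND year exactly equal
]

def rule_match(r, s, ed, rule=RULE):
    # Records r, s are dicts keyed by attribute name.
    return any(all(ed(r[attr], s[attr]) <= k for attr, k in conjunct)
               for conjunct in rule)
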
Evaluation strategies
  • Many ways to evaluate rules
  • Finding an optimal one: NP-hard
  • Heuristics:
    • Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
    • Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

Summary of 2nd paper
  • A novel two-step approach to record linkage
  • Many existing mapping and join algorithms can be adopted
  • Applicable to many distance metrics
  • Time- and space-efficient
  • Multi-attribute case studied

Outline
  • Supporting string-similarity joins using an RDBMS
  • Using mapping techniques
  • Interactive deduplication

Problems with Existing Deduplication Methods

  • Matching functions:
    • calculate similarity scores and compare them against thresholds
    • tedious coding
  • Learning-based methods:
    • require a large training set for accuracy (static training set)
    • it is difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function

New Approach
  • Relegate the task of finding the deduplication function to a machine learning algorithm
  • Design goal:
    • Fewer training instances
    • Interactive response
    • Fast convergence
    • High accuracy

⇒ Design an interactive deduplication system called ALIAS

ALIAS: Active Learning led Interactive Alias Suppression

ALIAS SYSTEM
  • A learning-based method
  • Exploits existing similarity functions
  • Uses active learning:

- an active learner picks the unlabeled instances that carry the most information gain for the training set

  • Produces a deduplication function that can identify duplicates

Overall Architecture

(Architecture diagram: labeled pairs L and similarity functions F feed a Mapper that produces mapped training pairs Lp; the unlabeled database D is likewise mapped to Dp; the learner trains classifiers on the training data T, selects n instances S from Dp for the user to label, and folds the feedback back into T; similarity indices and a predicate for the uncertain region limit which pairs are generated.)

Primary Inputs for ALIAS

  • A set of initial training pairs (L)
    • fewer than 10 labeled records
    • arranged in pairs of duplicates and non-duplicates
  • A set of similarity functions (F)
    • e.g., word-match, qgram-match, …
    • compute similarity scores between two records based on any subset of attributes
    • the learner will find the right way of combining those scores into the final deduplication function
  • A database of unlabeled records (D)
  • Number of classifiers (< 5)

Mapped Labeled Instances

  • Take a pair r1, r2 from the input L
    • record r1(a1, a2, a3)
    • record r2(a1, a2, a3)
  • Use similarity functions f1, f2, …, fn to compute similarity scores between r1 and r2
  • Form a new record r1&r2 (s1, s2, …, sn, y/n)
    • y = duplicate; n = non-duplicate
  • Put the new record in Lp

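A minimal sketch of this mapping step (names ours; any of the similarity functions in F can be plugged in):

def map_pair(r1, r2, similarity_functions, label=None):
    # One similarity score per function in F; labeled pairs (from L)
    # carry a y/n flag, unlabeled pairs (from D) do not.
    scores = [f(r1, r2) for f in similarity_functions]
    return scores + [label] if label is not None else scores
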
Mapped Unlabeled Instances

  • Take a pair r1, r2 from D x D
    • record r1(a1, a2, a3)
    • record r2(a1, a2, a3)
  • Use similarity functions f1, f2, …, fn to compute similarity scores between r1 and r2
  • Form a new record r1&r2 (s1, s2, …, sn)
    • no y/n field
  • Put the new record in Dp

Active Learner

(The slide repeats the architecture diagram, highlighting the learner: classifiers are trained on T, a set S of n instances is selected from Dp for labeling, and the user's feedback flows back into the training data.)

ALIAS Algorithm
  • 1. Input: L, D, F.
  • 2. Create pairs Lp from the labeled data L and F.
  • 3. Create pairs Dp from the unlabeled data D and F.
  • 4. Initial training set T = Lp
  • 5. Loop until user satisfaction
    • Train classifier C using T.
    • Use C to select a set S of n instances from Dp for labeling.
    • If S is empty, exit loop.
    • Collect user feedback on the labels of S.
    • Add S to T and remove S from Dp.
  • 6. Output classifier C

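A hedged sketch of one round of this loop with a three-member committee, using scikit-learn stand-ins for the D-tree/NB/SVM classifiers mentioned later; the real ALIAS selection also weighs representativeness:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

def alias_round(X_train, y_train, X_pool, n=10):
    # Train the committee on the current training set T (labels 0/1).
    committee = [DecisionTreeClassifier(), GaussianNB(), LinearSVC()]
    votes = np.array([clf.fit(X_train, y_train).predict(X_pool)
                      for clf in committee])
    # Disagreement score: pairs where the committee splits are the most
    # uncertain; a unanimous vote scores 0.
    dup_fraction = votes.mean(axis=0)
    uncertainty = np.minimum(dup_fraction, 1 - dup_fraction)
    return np.argsort(-uncertainty)[:n]  # indices of S, to be user-labeled
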
The Indexing Component
  • Purpose:
    • Avoid mapping all pairs of records in D x D
  • 3 Methods:
    • Grouping
    • Sampling
    • Indexing

The Indexing Component
  • Grouping
    • Example: group records in D according to the field year of publication
    • Mapped pairs are formed only within records of a group.
  • Sampling
    • Sample in units of a group instead of individual records.

The Indexing Component
  • Indexing
    • A similarity function:

“fraction of common words between two text attributes >=0.4”

    • We can create an index on the words of the text attributes

The Learning Component

  • Contains a number of classifiers
  • A classifier: a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or a Support Vector Machine (SVM), that classifies instances
  • A classifier is trained using a training data set

Criteria for a classifier
  • Accuracy
  • Interpretability
  • Indexability
  • Efficient training

Accuracy of a Classifier

  • Measured by the F-measure, the harmonic mean of recall r and precision p: F = 2pr / (p + r)
    • r = fraction of duplicates correctly classified
    • p = fraction correct among all instances actually labeled duplicate

Accuracy (cont.)
  • Example:
    • A case with 1% duplicates; a classifier labels all pairs as non-duplicates:

recall r = 0 → F = 0 → accuracy = 0%

    • A case with 1% duplicates; a classifier identifies all duplicates correctly but misclassifies 1% of the non-duplicates:

recall r = 1, p = 0.5 → F = 0.667 → accuracy = 66.7%

→ If we didn't use r and p, both classifiers would score 99% plain accuracy!

Criteria for a classifier (cont.)
  • Interpretability
    • Final deduplication rule is easy to understand and interpret
  • Indexability
    • Final deduplication rule has indexable predicates
  • Efficient training
    • Fast to train

Active Learning

The goal is to seek out, from the unlabeled pool, the instances which, when labeled, will help strengthen the classifier at the fastest possible rate.

A Simple Example

(The accompanying figure, showing labeled points and the most uncertain instance m, is not preserved in this transcript.)

  • Assume all instances are unlabeled except a and b
  • Suppose the r-coordinate = 0 and the b-coordinate = 1
  • Any unlabeled point x to the left of r or to the right of b has no effect in reducing the region of uncertainty
  • By including m in the training set, the size of the uncertain region is reduced by half

How to select an unlabeled instance
  • Uncertainty
    • The instance about which the learner was most unsure was also the instance for which the expected reduction in confusion was the largest
    • Uncertainty score
      • the disagreement among the predictions the instance get from a committee of N classifiers
      • A sure duplicate or non-duplicate would get the same prediction from all members
  • Representativeness
    • An uncertain instance that represents a larger number of unlabeled instances has a greater impact on the classifier

Example
  • 3 similarity functions: word match f1, qgram match f2, string edit distance f3.
  • Take from Mapped Unlabeled Instances (Dp)
    • r1&r2 (s1, s2…sn)
    • r3&r4 (s1, s2…sn)
    • s1, s2…sn: scores using functions f1, f2, f3
  • 3 classifiers: D-tree, Naïve Bayes, SVM

How to Combine Uncertainty and Representativeness
  • Proposed two approaches
  • 1st approach: weighted sum
    • Cluster the unlabeled instances
    • Estimate the density of points around each instance
    • Score each instance by a weighted sum of its density and its uncertainty value
    • Select the n highest-scoring instances
  • 2nd approach: sampling

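A sketch of the weighted-sum scoring in the first approach (the weight alpha and the density estimate are ours; the paper derives these from clustering the unlabeled instances):

import numpy as np

def select_for_labeling(uncertainty, density, n=10, alpha=0.5):
    # Higher score = dense region AND high committee disagreement.
    score = alpha * density + (1 - alpha) * uncertainty
    return np.argsort(-score)[:n]
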
Conclusion

ALIAS:

  • Makes deduplication much easier (fewer training instances)
  • Provides interactive response to the user
  • High accuracy
