Loading in 5 sec....

Searching and Integrating Information on the WebPowerPoint Presentation

Searching and Integrating Information on the Web

- 62 Views
- Uploaded on
- Presentation posted in: General

Searching and Integrating Information on the Web

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Searching and Integrating Information on the Web

Seminar 3: Data Cleansing

Professor Chen Li

UC Irvine

- Efficient merge and purge: Hernandez and Stolfo, SIGMOD 1995
- Approximate String Joins in a Database (Almost) for Free, Gravano et al, VLDB 2001,
- Efficient Record Linkage in Large Data Sets, Liang Jin, Chen Li, Sharad Mehrotra, DASFAA, 2003
- Sunita Sarawagi Anuradha Bhamidipaty, Interactive Deduplication Using Active Learning. Sarawagi and Bhamidipaty, KDD 2003

Seminar 3

- Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- Needs to be cleansed before being used

- Example:
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations

Seminar 3

Table R

Table S

- Find records from different datasets that could be the same entity

Seminar 3

- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981)
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Seminar 3

Problem statement:

“Given two relations, identify the potentially matched records

- Efficiently and
- Effectively”

Seminar 3

- How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, …)
- Domain knowledge is critical
- Names: “Wall Street Journal” and “LA Times”
- Address: “Main Street” versus “Main St”

- How to do matching efficiently
- Offline join version
- Online (interactive) search
- Nearest search
- Range search

Seminar 3

- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication

Seminar 3

- A widely used metric to define string similarity
- Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example:
s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

Seminar 3

- We want to join tuples with “similar” string fields
- Similarity measure: Edit Distance
- Each Insertion, Deletion, Replacement increases distance by one

Seminar 3

- Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K
- Solve the problem in a database-friendly way
(if possible with an existing "vanilla" RDBMS)

Seminar 3

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:

- Join data outside of DBMS
- Join data via user-defined functions (UDFs) inside the DBMS

Seminar 3

- Export data
- Join outside of DBMS
- Import the result
Main advantage:

We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

Disadvantages:

- Substantial amounts of data to be exported/imported
- Cannot be easily integrated with further processing steps in the DBMS

Seminar 3

- Write a UDF to check if two strings match within distance K
- Write an SQL statement that applies the UDF to the string pairs
SELECT R.stringAttr, S.stringAttr

FROM R, S

WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage:

Ease of implementation

Main disadvantage:

UDF applied to entire cross-product of relations

Seminar 3

- Preprocess data and generate auxiliary tables
- Perform join exploiting standard RDBMS capabilities
Advantages

- No modification of underlying RDBMS needed.
- Can leverage the RDBMS query optimizer.
- Much more efficient than the approach based on naive UDFs

Seminar 3

- Intuition:
- Similar strings have many common substrings
- Use exact joins to perform approximate joins (current
DBMSs are good for exact joins)

- A good candidate set can be verified for false positives
[Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

- Roadmap:
- Break strings into substrings of length q (q-grams)
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only candidate pairs with a UDF to obtain final answer

Seminar 3

- Q-gram: A sequence of q characters of the original string
Example for q=3

vacations

{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

String with length L → L + q - 1 q-grams

- Similar stringshave a many common q-grams

Seminar 3

- With no edits: L + q - 1 common q-grams
- Replacement: (L + q – 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}

Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$}

- Insertion: (Lmax + q – 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}

Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$}

- Deletion: (Lmax + q – 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}

Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$}

Seminar 3

- For Edit Distance = K, there could be at most K replacements, insertions, deletions
- Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common
- Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)

Seminar 3

- If we have the q-grams in the DBMS, we can perform this counting efficiently.
- Create auxiliary tables with tuples of the form:
<sid, strlen, qgram>

and join these tables

- A GROUP BY – HAVING COUNT clause can perform the counting / filtering

Seminar 3

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q

The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives.

Seminar 3

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as LENGTH FILTERING

Seminar 3

- Consider strings aabbzzaacczz and aacczzaabbzz
- Strings are at edit distance 4
- Strings have identical q-grams for q=3
Problem: Matching q-grams that are at different positions in both strings

- Either q-grams do not "originate" from same q-gram, or
- Too many edit operations "caused" spurious q-grams at various parts of strings to match

Seminar 3

- Keep the position of the q-grams <sid, strlen, pos, qgram>
- Do not match q-grams that are more than K positions away
SELECT R.sid, S.sid

FROM R, S

WHERE R.qgram=S.qgram

AND abs(R.strlen - S.strlen)<=K

AND abs(R.pos - S.pos)<=K

GROUP BY R.sid, S.sid

HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as POSITION FILTERING

Seminar 3

SELECT R1.string, S1.string, R1.sid, S1.sid

FROM R1, S1, R, S,

WHERE R1.sid=R.sid

AND S1.sid=S.sid

AND R.qgram=S.qgram

AND abs(strlen(R1.string)–strlen(S1.string))<=K

AND abs(R.pos - S.pos)<=K

GROUP BY R1.sid, S1.sid, R1.string, S1.string

HAVING COUNT(*) >=

(max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q

Seminar 3

- Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression
- Our technique does not require modifying the underlying RDBMS

Seminar 3

- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication

Seminar 3

- Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
- Reflexive: f(s1,s2) = 0 iff s1=s2
- Symmetric: f(s1,s2) = d(s2, s1)
- Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3)

- a threshold k

- Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k.

R

S

Seminar 3

- Not desirable for large data sets
- 5 hours for 30K strings!

Seminar 3

- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space

Seminar 3

- Applicable to many metric similarity functions
- Use edit distance as an example
- Other similarity functions also tried, e.g., q-gram-based similarity

- Open to existing algorithms
- Mapping techniques
- Join techniques

Seminar 3

Map strings into a high-dimensional Euclidean space

Metric Space

Euclidean Space

Seminar 3

- Input: A list of strings
- Output: Points in a high-dimensional Euclidean space that preserve the original distances well
- A variation of FastMap
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal

Seminar 3

- Data Sources:
- IMDB star names: 54,000
- German names: 132,000

- Distribution of string lengths:

Seminar 3

- Use data set 1 (54K names) as an example
- k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.

Seminar 3

Increase d?

- Good :
- better to differentiate similar pairs from dissimilar ones.

- Bad :
- Step 1: Efficiency ↓
- Step 2: “curse of dimensionality”

Seminar 3

# of pairs within distance w

Cost=

# of similar pairs

- Sample 1Kx1K strings, find their similar pairs (within distance k)
- Calculate maximum of their new distances w
- Define “Cost” of finding a similar pair:

Seminar 3

d=15 ~ 25

Seminar 3

- Closely related to the mapping property
- Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’.
- Choose k’ using sampling
- Sample 1Kx1K strings, find similar pairs
- Calculate their maximum new distance as k’
- repeat multiple times, choose their maximum

Seminar 3

d=20

Seminar 3

- Input: Two sets of points in Euclidean space.
- Output: Pairs of two points whose distance is less than new threshold k’.
- Many join algorithms can be used

Seminar 3

- Adopted an algorithm by Hjaltason and Samet.
- Building two R-Trees.
- Traverse two trees, find points whose distance is within k’.
- Pruning during traversal (e.g., using MinDist).

Seminar 3

- Among the pairs produced from the similarity-join step, check their edit distance.
- Return those pairs satisfying the threshold k

Seminar 3

Seminar 3

- Recall: (#of found similar pairs)/(#of all similar pairs)

Seminar 3

- Example: title + name + year
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format:

Seminar 3

- Many ways to evaluate rules
- Finding an optimal one: NP-hard
- Heuristics:
- Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
- Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

Seminar 3

- A novel two-step approach to record linkage.
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics.
- Time and space efficient.
- Multi-attribute case studied

Seminar 3

- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication

Seminar 3

Matching Functions

Calculate similarity scores, thresholds

Tedious coding

Learning-based Methods

Require large-sized training set for accuracy

(static training set)

Difficult to provide a covering and challenging training set that will bring out the subtlety of deduplication function

Seminar 3

- Relegate the task of finding the deduplication function to a machine learning algorithm
- Design goal:
- Less training instances
- Interactive response
- Fast convergence
- High accuracy
Design an interactive deduplication system called ALIAS

ALIAS: Active Learning led Interactive Alias SuppressionSeminar 3

- A learned-based method
- Exploit existing similarity functions
- Use Active Learning
- an active learner actively picks unlabeled instances with the most information gain in the training set

- Produce a deduplication function that can identify duplicates

Seminar 3

L

Feedback

From user

Lp

T

Training data

Mapper

F

Similarity

Functions

Train

classifiers

Dp

Mapper

Select n

instances

for labeling

S

Learner

D

Predicate for uncertain region

Similarity

Indices

Seminar 3

L

- A set of initial training pairs (L)
- less than 10 labeled records
- arranged in pairs of duplicates and non-duplicates

- A set of similarity functions (F)
- Ex: word-match, qgram-match…
- To compute similarities scores between 2 records based on any subset of attributes.
- Learner will find the right way of combining those scores to get the final deduplication function

- A database of unlabeled records(D)
- Number of classifiers (<5)

F

Similarity

Functions

D

Seminar 3

L

Lp

- Take r1, r2 from input L
- Record r1(a1, a2, a3)
- Record r2(a1, a2, a3)
- Use similarity functions f1, f2…fn to compute similarity scores s between r1 and r2
- New record
- r1&r2 (s1, s2…sn, y/n)
- y = duplicate;
- n = non-duplicate

- Put new record in Lp

Mapper

Similarity

Functions

F

D

Seminar 3

L

Lp

Take r1, r2 from D x D

- Record r1(a1, a2, a3)
- Record r2(a1, a2, a3)
- Use similarity functions f1, f2…fn to compute similarity scores between r1 and r2
- New record
- r1&r2 (s1, s2…sn)
- No y/n field

- Put new record in Dp

Mapper

Similarity

Functions

F

Mapper

Dp

D

Seminar 3

L

Feedback

From user

Lp

T

Training data

Mapper

Similarity

Functions

F

Train

Classifiers

Dp

Mapper

Select set

S of n

instances

for labeling

s

Learner

D

Seminar 3

- 1. Input: L, D, F.
- 2. Create pairs Lp from the labeled data L and F.
- 3. Create pairs Dp from the unlabeled data D and F.
- 4. Initial training set T = Lp
- 5. Loop until user satisfaction
- Train classier C using T.
- Use C to select a set S of n instances from Dp for labeling.
- If S is empty, exit loop.
- Collect user feedback on the labels of S.
- Add S to T and remove S from Dp.

- 6. Output classifier C

Seminar 3

- Purpose:
- Avoid mapping all pairs of records in D x D

- 3 Methods:
- Grouping
- Sampling
- Indexing

Seminar 3

- Grouping
- Example: group records in D according to the field year of publication
- Mapped pairs are formed only within records of a group.

- Sampling
- Sample in units of a group instead of individual records.

Seminar 3

- Indexing
- A similarity function:
“fraction of common words between two text attributes >=0.4”

- we can create an index on the words of the text attributes

- A similarity function:

Seminar 3

Contain a number of classifiers

A classifier: a machine learning algorithm such as decision tree (D-tree), naïve Bayes (NB), Support Vector Machine (SVM)…to classify instances

A classifier is trained using a training data set

Seminar 3

- Accuracy
- Interpretability
- Indexability
- Efficient training

Seminar 3

Measured by the mean F of recall r and precision p

r = fraction of duplicates correctly classified

p = fraction correct amongst all instances actually libeled duplicate

Seminar 3

- Example
- A case with 1% duplicates; a classifier labels all pairs as non-duplicates
recall r = 0 mean F = 0 accuracy = 0%

- A case with 1% duplicates; a classifier identifies all duplicates correctly but misclassifies 1% non-duplicates
recall r = 1, p = 0.5 F = 0.667 accuracy = 66.7%

If we don’t use r and p, then they both have 99% accuracy!

- A case with 1% duplicates; a classifier labels all pairs as non-duplicates

Seminar 3

- Interpretability
- Final deduplication rule is easy to understand and interpret

- Indexability
- Final deduplication rule has indexable predicates

- Efficient training
- Fast to train

Seminar 3

Seminar 3

The goal is to seek out from the unlabeled pool the instances which when labeled will help strengthen the classifier at the fastest possible rate.

Seminar 3

Most uncertain instance

- Assume all are unlabeled except a and b
- Suppose r-coordinate = 0, b-coordinate = 1
- Any unlabeled point x to the left of r and to the right of b will have no effect in reducing the region of uncertainty
- By including m in the training set, the size of the uncertain region will reduce by half

Seminar 3

- Uncertainty
- The instance about which the learner was most unsure was also the instance for which the expected reduction in confusion was the largest
- Uncertainty score
- the disagreement among the predictions the instance get from a committee of N classifiers
- A sure duplicate or non-duplicate would get the same prediction from all members

- Representativeness
- An uncertain instance representing a larger number of unlabeled instances has greater impact to the classifier

Seminar 3

- 3 similarity functions: word match f1, qgram match f2, string edit distance f3.
- Take from Mapped Unlabeled Instances (Dp)
- r1&r2 (s1, s2…sn)
- r3&r4 (s1, s2…sn)
- s1, s2…sn: scores using functions f1, f2, f3

- 3 classifiers: D-tree, Naïve Bayes, SVM

selected

Seminar 3

- Proposed two approaches
- 1st approach: weighted sum
- Cluster the unlabeled instances
- Estimate the density of points around it
- The instances are scored using a weighted sum of its density and uncertainty value
- n highest scoring instances selected

- 2nd approach: sampling

Seminar 3

ALIAS:

- Makes deduplication much easier (less training instances)
- Provide interaction response to the user
- High accuracy

Seminar 3