Frequency-aware Similarity Measures

Date: 2011/12/26

Source: Dustin Lange et al. (CIKM’11)

Advisor: Jia-ling, Koh

Speaker: Jiun Jia, Chiou


Outline

  • Introduction
  • Composing similarity
  • Exploiting frequencies
  • Partitioning strategies
  • Experiment
  • Conclusion


Introduction
  • Propose a novel comparison method that partitions the data using value frequency information and then automatically determines similarity measures for each individual partition.

  • The method partitions compared record pairs according to the frequencies of attribute values:

Partition 1 contains all pairs with rare names.

Partition 2 contains all pairs with medium-frequency names.

Partition 3 contains all pairs with frequent names.



  • Schufa is a credit rating agency that stores data on about 66 million citizens; the data is reported by banks, insurance agencies, etc.

Queries about the rating of an individual must be answered as precisely as possible.

  • To ensure the quality of the data, it is necessary to detect and fuse duplicates.

  • Why is Arnold Schwarzenegger always a duplicate?

In a person table of U.S. citizens, this name is very rare. If we find several Arnold Schwarzeneggers in it, it is very likely that these are duplicates.

  • The authors argue that for rows with rare names, address and date-of-birth similarity are less important than for rows with frequent names.

Attributes compared: person's name, birth date, address.

  • Determining the similarity (or distance) of two records in a database is a well-known but challenging problem.
  • The problem comprises two main difficulties:

1. The values themselves are unreliable: outdated values, sloppy data or query entries.

This calls for devising sophisticated similarity measures.

2. The amount of data might be very large, thus prohibiting exhaustive comparisons.

This calls for efficient algorithms and indexes that avoid comparing each entry with all other entries.

Composing Similarity
  • Base Similarity Measures

Define $\mathrm{Sim}_p(r_1, r_2)$ with $\mathrm{Sim}_p : (R \times R) \to [0, 1] \subset \mathbb{R}$, each responsible for calculating the similarity of a specific attribute p of the compared records r1 and r2 from a set R of records.

Examples:

SimName: Jaro-Winkler distance

SimBirthDate: relative distance

SimAddress: Euclidean distance

Base measures can also test for equality (e.g., for email addresses) or compare boolean values (e.g., for gender).


Jaro-Winkler distance

Jaro distance $d_j$ for strings $s_1$ and $s_2$:

$$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

m: the number of matching characters.

t: half the number of transpositions.

Jaro-Winkler distance $d_w$:

$$d_w = d_j + \ell \, p \, (1 - d_j)$$

$d_j$: the Jaro distance for strings $s_1$ and $s_2$

$\ell$: the length of the common prefix at the start of the strings, up to a maximum of 4 characters

p: a constant scaling factor

p should not exceed 0.25, otherwise the distance can become larger than 1.

The standard value for this constant in Winkler's work is p = 0.1.


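Jaro-Winkler distance: examples

Example 1: s1 = MARTHA, s2 = MARHTA

m = 6, |s1| = 6, |s2| = 6

t = 2/2 = 1 (H/T and T/H)

$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = 0.944$

$\ell = 3$, standard weight p = 0.1

$d_w = 0.944 + (3 \cdot 0.1 \cdot (1 - 0.944)) = 0.961$

Example 2: s1 = DWAYNE, s2 = DUANE

m = 4, |s1| = 6, |s2| = 5

t = 0

$d_j = \frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4 - 0}{4}\right) = 0.822$

$\ell = 1$, standard weight p = 0.1

$d_w = 0.822 + (1 \cdot 0.1 \cdot (1 - 0.822)) = 0.84$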
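
The two worked examples can be checked with a short Python sketch. It is only an illustration: the matching window of floor(max(|s1|, |s2|) / 2) - 1 comes from the usual definition of the Jaro distance and is not stated on the slides, and the function names are chosen here for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity d_j = (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Assumption: the usual matching window of the Jaro definition.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler d_w = d_j + l * p * (1 - d_j), l = common prefix length (max 4)."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))   # 0.84
```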

  • Composition of Base Similarity Measures

Integrate the base similarity measures into an overall judgement to calculate the overall similarity of two records.

  • This can be viewed as a classification problem: the classes are isSimilar and isDissimilar.
  • The features are the results of the base similarity measures.
  • To derive a general model:

Employ machine learning techniques, given enough training data for supervised learning methods.

Techniques considered: logistic regression, decision trees, and SVM (support vector machine).
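
To make the composition step concrete, here is a minimal pure-Python sketch of logistic regression over base similarity values. The training pairs, the learning rate, and all function names are invented for illustration; the slides do not describe the training setup.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def composite_similarity(features, weights, bias):
    """Overall similarity in (0, 1) from the base similarity values."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = composite_similarity(x, weights, bias) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Toy data (hypothetical): (SimName, SimBirthDate, SimAddress) per record pair,
# label 1 = isSimilar, 0 = isDissimilar.
toy_pairs = [(0.95, 0.9, 0.8), (0.9, 1.0, 0.7), (0.3, 0.2, 0.1), (0.4, 0.1, 0.3)]
toy_labels = [1, 1, 0, 0]
weights, bias = train_logistic(toy_pairs, toy_labels)

# A pair is classified as isSimilar when the composite score exceeds 0.5.
print(composite_similarity((0.92, 0.95, 0.75), weights, bias) > 0.5)
```

A decision tree or SVM would consume the same feature vectors; only the learned model changes.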

Exploiting frequencies
  • Frequency Function

Determine the value frequencies of the selected attributes for two compared records.

Define a frequency function $f : R \times R \to \mathbb{N}$ over the selected attributes (here FirstName and LastName).

Goal: partition the data according to the name frequencies.

  • Several data quality problems:

1. Swapping of first and last name

2. Typos (e.g., Arnold, Arnnold)

3. Combining two attributes (e.g., Schwarzenegger is more distinguishing than Arnold)
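
One plausible way to build a frequency function that tolerates these three problems is sketched below. The slides do not give the exact definition, so the design here is an assumption: names are counted in one pool regardless of attribute (tolerates swaps), the rarer of a record's two names is used (the more distinguishing value), and the larger value across the two records is taken (a typo makes a name look artificially rare). All names in the code are hypothetical.

```python
from collections import Counter


def build_name_counts(records):
    """Count every name value, pooling first and last names together."""
    counts = Counter()
    for first, last in records:
        counts[first.lower()] += 1
        counts[last.lower()] += 1
    return counts


def pair_frequency(r1, r2, counts):
    """Assumed frequency function f(r1, r2) for records given as (first, last)."""
    def record_frequency(record):
        # The rarer name is the more distinguishing one.
        return min(counts[record[0].lower()], counts[record[1].lower()])
    # A typo can create a spuriously rare value, so take the larger of the two.
    return max(record_frequency(r1), record_frequency(r2))


records = [
    ("Arnold", "Schwarzenegger"),
    ("Arnnold", "Schwarzenegger"),
    ("John", "Smith"),
    ("John", "Miller"),
    ("Jane", "Smith"),
]
counts = build_name_counts(records)
print(pair_frequency(("John", "Smith"), ("Jane", "Smith"), counts))  # 2
```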



[Figure: example frequency tables for the first names Josh, Kevin, Jack and the last names Powell, Johnson, Wills.]

  • Frequency-enriched Models

One way to exploit frequency distributions is to alter the models learned with the machine learning techniques:

1. Manually add rules to the models.

2. Integrate the frequencies directly into the machine learning models.

Example for option 1 with logistic regression:

"If the frequency of the name value is below 10, then increase the weight of the name similarity by 10% and appropriately decrease the weights of the other similarity functions."

Drawback: manually defining such rules is cumbersome and error-prone.

For option 2, the frequency value is normalized before it enters the model, where M is the maximum frequency in the data set.
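
The manual rule quoted above can be sketched as follows. The slides only say the other weights are decreased "appropriately"; rescaling them so that the sum of all weights stays unchanged is an assumption made here, as are the function and key names.

```python
def adjust_weights(weights, frequency, name_key="name", threshold=10, boost=0.10):
    """Apply the rule: for rare names, raise the name weight by `boost`.

    Assumption: the remaining weights are scaled down proportionally so that
    the total weight is preserved.
    """
    if frequency >= threshold:
        return dict(weights)
    total = sum(weights.values())
    new_name_weight = weights[name_key] * (1 + boost)
    rest = total - weights[name_key]
    scale = (total - new_name_weight) / rest if rest else 1.0
    return {
        key: (new_name_weight if key == name_key else w * scale)
        for key, w in weights.items()
    }


base_weights = {"name": 0.5, "birth_date": 0.3, "address": 0.2}
print(adjust_weights(base_weights, frequency=3))
```

Each such rule hard-codes a threshold and a boost, which illustrates why maintaining them by hand does not scale.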

Partitioning strategies
  • Partition compared record pairs into n partitions using the determined frequencies.
  • Number of partitions:

Too large: results in small partitions (e.g., frequencies 0 to 10), which leads to overfitting.

Too small: results in large partitions (e.g., frequencies 0 to 100), which are too coarse for discovering frequency-specific differences.

  • Define partitions:
  • The entire frequency space is divided into non-overlapping, continuous partitions by a set of thresholds:

$$Ɵ_0 < Ɵ_1 < \dots < Ɵ_n$$

with $Ɵ_0 = 0$ and $Ɵ_n = M + 1$, where M is the maximum frequency in the data set.

  • The partitions are defined as frequency ranges $I_i$:

$$I_i = [Ɵ_i, Ɵ_{i+1})$$

  • A partition covers a set of record pairs. A record pair (r1, r2) falls into a partition $[Ɵ_i, Ɵ_{i+1})$ iff the frequency function value for this pair lies in the partition's range:

$$Ɵ_i \le f(r_1, r_2) < Ɵ_{i+1}$$
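
Given a sorted threshold list, the partition of a record pair can be looked up with a binary search. This is a minimal sketch; the function name and the example thresholds are made up for illustration.

```python
from bisect import bisect_right


def partition_index(frequency, thresholds):
    """Return i such that thresholds[i] <= frequency < thresholds[i + 1].

    `thresholds` is the sorted list [theta_0, ..., theta_n] with
    theta_0 = 0 and theta_n = M + 1.
    """
    if not thresholds[0] <= frequency < thresholds[-1]:
        raise ValueError("frequency outside the covered frequency space")
    return bisect_right(thresholds, frequency) - 1


# Hypothetical thresholds for a data set with maximum frequency M = 50.
thresholds = [0, 3, 10, 51]
print(partition_index(2, thresholds))   # 0, i.e. range [0, 3)
print(partition_index(3, thresholds))   # 1, i.e. range [3, 10)
print(partition_index(50, thresholds))  # 2, i.e. range [10, 51)
```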
  • Random partitioning:

Randomly pick several thresholds Ɵi ∈ {0, …, M + 1}.

The number of thresholds in each partitioning is also randomly chosen.

A maximum of 20 partitions is allowed in one partitioning.

  • Equi-depth partitioning:

Divide the frequency space into e partitions. Each partition contains the same number of tuples from the original data set R.

e ∈ {2, …, 20}, i.e., partitionings with 2 up to 20 partitions.
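
Both strategies amount to choosing a threshold list. The sketch below is one possible reading; the function names are hypothetical, and with tied frequency values the equi-depth partitions are only approximately equal in size.

```python
import random


def random_thresholds(max_freq, max_partitions=20, rng=random):
    """Random partitioning: a random number of randomly placed thresholds."""
    n_partitions = rng.randint(1, max_partitions)
    n_inner = min(n_partitions - 1, max_freq)
    inner = rng.sample(range(1, max_freq + 1), n_inner)
    return [0] + sorted(inner) + [max_freq + 1]


def equi_depth_thresholds(frequencies, e):
    """Equi-depth partitioning: cut the sorted frequencies into e equal chunks."""
    values = sorted(frequencies)
    max_freq = values[-1]
    # The first value of each chunk becomes a threshold; duplicates collapse.
    inner = {values[(i * len(values)) // e] for i in range(1, e)}
    inner.discard(0)
    return [0] + sorted(inner) + [max_freq + 1]


toy_frequencies = [1, 1, 2, 2, 3, 5, 8, 13, 21, 34]
print(equi_depth_thresholds(toy_frequencies, 2))  # [0, 5, 35]
print(random_thresholds(34, rng=random.Random(0)))
```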


  • Greedy partitioning:

Define a list of threshold candidates C = {Ɵ0, …, Ɵn} by dividing the frequency space into segments with the same number of tuples (similar to equi-depth partitioning, but with a fixed, large e = 50).

1. Learn a partition for the first candidate thresholds: [Ɵ0, Ɵ1).

2. Learn a second partition that extends the current partition by moving its upper threshold to the next threshold candidate: [Ɵ0, Ɵ2).

3. Continue in the same way with [Ɵ0, Ɵ3), and so on.

  • Compare both partitions using F-measure.
  • Greedy partitioning (continued):
  • If the extended partition achieves better performance, the process is repeated for the next threshold slot.
  • If not, the smaller partition is kept and a new partition is started at its upper threshold; another iteration starts with this new partition.
  • This process is repeated until all threshold candidates have been processed.
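
The greedy procedure can be sketched as a single pass over the threshold candidates. The `evaluate(lo, hi)` callback is hypothetical: it stands for learning a similarity measure on the pairs in [lo, hi) and returning its F-measure, which the slides do not spell out. The scores in the usage example are invented.

```python
def greedy_partitioning(candidates, evaluate):
    """Return the thresholds chosen from the sorted `candidates` list."""
    thresholds = [candidates[0]]
    start, upper = candidates[0], candidates[1]
    best = evaluate(start, upper)
    for nxt in candidates[2:]:
        score = evaluate(start, nxt)
        if score > best:
            # The extended partition performs better: keep growing it.
            upper, best = nxt, score
        else:
            # Keep the smaller partition and start a new one at its upper threshold.
            thresholds.append(upper)
            start, upper = upper, nxt
            best = evaluate(start, upper)
    thresholds.append(upper)
    return thresholds


# Toy F-measure values (invented) for the ranges visited by the algorithm.
toy_scores = {
    (0, 1): 0.50, (0, 2): 0.60, (0, 3): 0.55,
    (2, 3): 0.40, (2, 4): 0.50, (2, 5): 0.70,
}
result = greedy_partitioning([0, 1, 2, 3, 4, 5], lambda lo, hi: toy_scores[(lo, hi)])
print(result)  # [0, 2, 5], i.e. partitions [0, 2) and [2, 5)
```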






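[Figure: greedy partitioning example over the candidate ranges [0, 1), [0, 2), [0, 3), then [2, 3), [2, 4), [2, 5); one step is annotated with F = 0.6894.]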

  • Genetic Partitioning Algorithm
  • Initialization:

Create an initial population consisting of several random partitionings. These partitionings are created as described above with the random partitioning approach.

  • Growth:

Learn one composite similarity function for each partition in the current set of partitionings.

  • Selection:

For each partition, determine the maximum F-measure that can be achieved by choosing an appropriate threshold for the similarity function.

Rank the partitionings by weighted F-measure and select the top five.

  • Reproduction:

Build pairs of the selected best individuals and combine them to create new individuals.

a) Recombination:

First create the union of the thresholds of both partitionings. For each threshold, randomly decide whether to keep it in the result partitioning or not. Both decisions have equal chances.

b) Mutation:

Randomly decide whether to add another new (also randomly picked) threshold and whether to delete a (randomly picked) threshold from the current threshold list.

Define a minimum partition size (set to 20 record pairs). Randomly created partitionings with partitions that are too small are discarded.








[Figure: example partitionings produced by reproduction: [0, 1), [1, 3), [3, 4) and [0, 2), [2, 4), [4, 5).]

  • Termination:
  • The resulting partitionings are evaluated and added to the set of evaluated partitionings.
  • The selection and reproduction phases are repeated until a certain number of iterations is reached or until no significant improvement can be measured.
  • A minimum F-measure improvement of 0.001 after 5 iterations is required.
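
The genetic loop can be sketched as below. The `fitness` callback is hypothetical: it is assumed to learn one similarity function per partition, reject partitionings with partitions below the minimum size, and return the weighted F-measure. The toy fitness in the usage example is invented and only exercises the control flow.

```python
import random


def recombine(a, b, rng):
    """Union of both threshold lists; each inner threshold is kept with chance 1/2."""
    inner = sorted(set(a[1:-1]) | set(b[1:-1]))
    kept = [t for t in inner if rng.random() < 0.5]
    return [a[0]] + kept + [a[-1]]


def mutate(thresholds, rng):
    """Randomly add one new threshold and/or delete one existing threshold."""
    lo, hi = thresholds[0], thresholds[-1]
    inner = set(thresholds[1:-1])
    if rng.random() < 0.5 and hi - lo > 1:
        inner.add(rng.randrange(lo + 1, hi))
    if rng.random() < 0.5 and inner:
        inner.discard(rng.choice(sorted(inner)))
    return [lo] + sorted(inner) + [hi]


def genetic_partitioning(population, fitness, rng, top_k=5,
                         max_iter=50, min_gain=0.001, patience=5):
    evaluated = {tuple(p): fitness(tuple(p)) for p in population}
    history = [max(evaluated.values())]
    for _ in range(max_iter):
        # Selection: the top_k partitionings evaluated so far.
        parents = sorted(evaluated, key=evaluated.get, reverse=True)[:top_k]
        # Reproduction: recombine every pair of parents, then mutate the child.
        for i, a in enumerate(parents):
            for b in parents[i + 1:]:
                child = tuple(mutate(recombine(list(a), list(b), rng), rng))
                if child not in evaluated:
                    evaluated[child] = fitness(child)
        history.append(max(evaluated.values()))
        # Termination: stop if the best score gained less than min_gain
        # over the last `patience` iterations.
        if len(history) > patience and history[-1] - history[-1 - patience] < min_gain:
            break
    best = max(evaluated, key=evaluated.get)
    return list(best), evaluated[best]


def toy_fitness(partitioning):
    # Invented score that simply prefers partitionings with three partitions.
    return 1.0 / (1 + abs(len(partitioning) - 4))


start_population = [[0, 10, 51], [0, 30, 51], [0, 2, 4, 6, 8, 51]]
best, score = genetic_partitioning(start_population, toy_fitness, random.Random(42))
print(best, score)
```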

Experiment

Evaluation on Schufa Data Set

The data set consists of two parts: a person data set and a query data set.

Record pairs are built in the form (query, correct result) or (query, incorrect result).

Evaluation on DBLP Data Set (bibliographic database for computer science)

Paper pairs fall into four groups:

(1) Two papers from the same author.

(2) Two papers from the same author with different name aliases.

(3) Two papers from different authors with the same name.

(4) Two papers from different authors with different names.

For each paper pair, the matching task is to decide whether the two papers were written by the same author.
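
Both data sets yield labelled pairs, so a similarity threshold can be chosen per partition to maximize F-measure, as in the selection step of the genetic algorithm. The following is a minimal sketch with invented scores and labels; the function names are illustrative.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as the decision threshold; keep the best F-measure."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy similarity scores (invented) and whether each pair is a true match.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_matches = [True, True, False, True, False]
threshold, f_value = best_threshold(toy_scores, toy_matches)
print(threshold, round(f_value, 3))  # 0.4 0.857
```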

Conclusion
  • The paper introduces a novel approach for improving composite similarity measures.
  • Divide a data set consisting of record pairs into partitions according to frequencies of selected attributes.
  • Learn optimal similarity measures for each partition.
  • Experiments on different real-world data sets showed that partitioning the data can improve learning results and that genetic partitioning performs better than several other partitioning strategies.

Thank you for listening!