Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li

Speaker: Razvan Belet

Outline
  • Motivating Scenarios
  • Background Knowledge
  • Parallel Set-Similarity Join
    • Self Join
    • R-S Join
  • Evaluation
  • Conclusions
  • Strengths & Weaknesses
Scenario: Detecting Plagiarism
  • Before publishing a journal issue, editors have to make sure that none of the hundreds of papers to be included is plagiarized
Scenario: Near-Duplicate Elimination
  • The archive of a search engine can contain multiple copies of the same page
  • Reasons: re-crawling, different hosts serving redundant copies of the same page, etc.
Problem Statement


Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) > λ

  • Solution:
  • Similarity Join
Motivation (2)
  • Some of the collections are enormous:
    • Google N-gram database: ~1 trillion records
    • GenBank: 416 GB of data
    • Facebook: 400 million active users

Processing data at this scale calls for a parallel, distributed approach => MapReduce

Outline
  • Motivating Scenarios
  • Background Knowledge
  • Parallel Set-Similarity Join
    • Self Join
    • R-S Join
  • Evaluation
  • Conclusions
Background Knowledge
  • Set-Similarity Join
      • Join
      • Similarity Join
      • Set-Similarity Join
Background Knowledge: Join
  • Logical operator heavily used in Databases
  • Whenever records in two tables need to be associated => use a JOIN
  • Associates records of the two input tables based on a predicate (pred)

Consider this information need: for each employee, find the department he works in.

[Slide shows sample tables: Employees and Departments]

Background Knowledge: Join
  • Example: for each employee, find the department he works in

JOIN with pred: EMPLOYEES.DepID = DEPARTMENTS.DepartmentID

Background Knowledge: Similarity Join
  • Special type of join, in which the predicate (pred) is a similarity metric/function: sim(obj1,obj2)
  • Returns the pair (obj1, obj2) if pred holds: sim(obj1, obj2) > threshold

[Slide: tables T1 and T2 combined by a Similarity Join with pred: sim(T1.c, T2.c) > threshold]

Background Knowledge: Similarity Join
  • Examples of sim(obj1,obj2) functions:

sim(paper1, paper2) = |S_i ∩ T_j| / |S_i ∪ T_j|   (Jaccard similarity)

where S_i is the set of most common words in page i and T_j is the set of most common words in page j.
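A minimal Python sketch of this metric (assuming the Jaccard formulation reconstructed above; the word sets are toy examples):

    def jaccard(s, t):
        # |intersection| / |union| of the two word sets
        s, t = set(s), set(t)
        return len(s & t) / len(s | t)

    # Two toy "most common words" sets: overlap 2, union 4 => 0.5
    print(jaccard({"parallel", "set", "join"}, {"parallel", "join", "mapreduce"}))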

Similarity Join
  • sim(obj1, obj2), where obj1, obj2 can be documents, records in DB tables, user profiles, images, etc.
  • A particular class of similarity joins: the (string/text-) similarity join, where obj1, obj2 are strings/texts
  • Many real-world applications => of particular interest

Similarity Join with pred: sim(T1.Name, T2.Name) > 2, where sim(T1.Name, T2.Name) = #common words


Set-Similarity Join (SSJoin)
  • SSJoin: a powerful primitive for supporting (string-)similarity joins

  • Input: 2 collections of sets
  • Goal: Identify all pairs of highly similar sets

[Slide: set collections S1, S2, …, Sn and T1, T2, …, Tn are fed into SSJoin with pred: sim(Si, Ti) > 0.3]

Set-Similarity Join

  • How can a (string-)similarity join be reduced to an SSJoin?

  • Example:

[Slide: a SimilarityJoin with pred sim(T1.Name, T2.Name) > 0.5 is computed based on an SSJoin over the corresponding word sets]

Set-Similarity Join
  • Most SSJoin algorithms are signature-based:

INPUT: set collections R and S, and threshold λ

1. For each r ∈ R, generate signature-set Sign(r)
2. For each s ∈ S, generate signature-set Sign(s)
3. Generate all candidate pairs (r, s), r ∈ R, s ∈ S, satisfying Sign(r) ∩ Sign(s) ≠ ∅
4. Output any candidate pair (r, s) satisfying Sim(r, s) ≥ λ

Steps 1-3: filtering phase; step 4: post-filtering phase
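A compact Python sketch of this skeleton (sign, sim and the inputs are placeholders for a concrete signature scheme and similarity function; candidate generation is done naively here, whereas real implementations group by signature):

    def ssjoin(R, S, sign, sim, threshold):
        # Steps 1-2: generate a signature-set for every input set.
        sig_r = [sign(r) for r in R]
        sig_s = [sign(s) for s in S]
        # Step 3 (filtering phase): keep pairs with overlapping signatures.
        candidates = [(i, j) for i in range(len(R)) for j in range(len(S))
                      if sig_r[i] & sig_s[j]]
        # Step 4 (post-filtering phase): verify the actual similarity.
        return [(i, j) for i, j in candidates if sim(R[i], S[j]) >= threshold]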

Set-Similarity Join
  • Signatures:
    • Have a filtering effect: the SSJoin algorithm compares only candidate pairs, not all pairs (in the post-filtering phase)
    • Determine the efficiency of the SSJoin algorithm: the fewer candidate pairs, the better
    • Ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever Sim(r, s) ≥ λ
Set-Similarity Join : Signatures Example
  • One possible signature scheme: Prefix-filtering
    • Compute Global Ordering of Tokens:

Marat …W. Safin ... Rafael ... Nadal ...P. … Smith …. John

    • Compute Signature of each input set: take the prefix of length n

Sign({John, W., Smith})=[W., Smith]

Sign({Marat,Safin})=[Marat, Safin]

Sign({Rafael, P., Nadal})=[Rafael,Nadal]
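A small Python sketch of this signature scheme (global_rank maps each token to its position in the global ordering above; the prefix length n is a parameter):

    def prefix_signature(record_set, global_rank, n):
        # Sort the set's tokens by the global ordering, then keep the first n.
        ordered = sorted(record_set, key=lambda t: global_rank[t])
        return ordered[:n]

    rank = {"Marat": 0, "W.": 1, "Safin": 2, "Rafael": 3,
            "Nadal": 4, "P.": 5, "Smith": 6, "John": 7}
    print(prefix_signature({"John", "W.", "Smith"}, rank, 2))   # ['W.', 'Smith']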

Set-Similarity Join
  • Filtering phase: before doing the actual SSJoin, cluster/group the candidates
  • Run the SSJoin on each cluster => less workload

[Slide: likely-similar sets land in the same cluster/bucket, e.g. {Smith, John} with {John, W., Smith} in bucket 1; {Safin, Marat, Michailowitsc} with {Marat, Safin} in bucket 2; {Nadal, Rafael, Parera} with {Rafael, P., Nadal} in bucket N]

Outline
  • Motivating Scenarios
  • Background Knowledge
  • Parallel Set-Similarity Join
    • Self Join
    • R-S Join
  • Evaluation
  • Conclusions
  • Strengths & Weaknesses
Parallel Set-Similarity Join
  • Method comprises 3 stages:

  • Stage I (Token Ordering): compute data statistics for good signatures
  • Stage II (RID-Pair Generation): group candidates based on signature & compute the SSJoin
  • Stage III (Record Join): generate the actual pairs of joined records

Explanation of input data
  • RID = Row ID
  • a : join column
  • “A B C” is a string:
    • Address: “14th Saarbruecker Strasse”
    • Name: “John W. Smith”
Stage I: Data Statistics

[Stage diagram repeated; Stage I (Token Ordering) highlighted. Two alternatives: Basic Token Ordering (BTO) and One-Phase Token Ordering (OPTO)]

Token Ordering
  • Creates a global ordering of the tokens in the join column, based on their frequency

[Slide: sample input table with columns RID, a, b, c; the tokens of join column a are ordered by frequency into a global ordering]

Basic Token Ordering(BTO)
  • 2 MapReduce cycles:
    • 1st: computing token frequencies
    • 2nd: ordering the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
  • map:
    • tokenize the join-column value of each record
    • emit each token with an occurrence count of 1
  • reduce:
    • for each token, compute the total count (frequency)
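In Python-style pseudocode, this first cycle is the classic word-count pattern (a sketch, not the authors' Hadoop code; tokenize is an assumed helper that splits the join-column value):

    def map_cycle1(rid, record):
        for token in tokenize(record.join_value):
            yield (token, 1)            # one occurrence of this token

    def reduce_cycle1(token, counts):
        yield (token, sum(counts))      # total frequency of the token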
Basic Token Ordering – 2nd MapReduce cycle
  • map:
    • interchange key with value => emit (frequency, token)
  • reduce (use only 1 reducer):
    • emit the values (the tokens, now in frequency order)
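Sketched the same way (MapReduce sorts map output by key, so swapping key and value and routing everything to one reducer yields the tokens in frequency order):

    def map_cycle2(token, frequency):
        yield (frequency, token)            # framework now sorts by frequency

    def reduce_cycle2(frequency, tokens):   # job configured with 1 reducer
        for token in tokens:
            yield token                     # emitted in ascending frequency order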
One-Phase Token Ordering (OPTO)
  • alternative to Basic Token Ordering (BTO):
    • Uses only one MapReduce Cycle (less I/O)
    • In-memory token sorting, instead of using a reducer
OPTO – Details

Uses a tear-down (cleanup) hook to order the tokens in memory

  • map:
    • tokenize the join-column value of each record
    • emit each token with an occurrence count of 1
  • reduce:
    • for each token, compute the total count (frequency)
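A sketch of the OPTO reduce side under these assumptions (frequencies are accumulated in memory and sorted once in the tear-down hook, replacing the second cycle; emit stands for the framework's output call):

    counts = {}

    def reduce_opto(token, partial_counts):
        counts[token] = counts.get(token, 0) + sum(partial_counts)

    def tear_down():
        # Runs once after the last reduce() call: order tokens in memory.
        for token in sorted(counts, key=counts.get):
            emit(token)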
Stage II: Group Candidates & Compute SSJoin

[Stage diagram repeated; Stage II (RID-Pair Generation) highlighted. Grouping alternatives: Individual Tokens vs. Grouped Tokens; kernel alternatives: Basic Kernel vs. PPJoin+]

RID-Pair Generation
  • scans the original input data (records)
  • outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
  • consists of only one MapReduce cycle

Uses the global ordering of tokens obtained in the previous stage.

RID-Pair Generation: Map Phase
  • scan input records and for each record:
    • project it on RID & join attribute
    • tokenize it
    • extract prefix according to global ordering of tokens obtained in the Token Ordering stage
    • route tokens to appropriate reducer
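A sketch of this map phase (tokenize as before; global_rank and prefix_length come from Stage I; the emitted key determines the reducer the record is routed to):

    def map_rid_pairs(record):
        projected = (record.rid, record.join_value)   # projection on RID & join attribute
        tokens = tokenize(record.join_value)
        prefix = sorted(tokens, key=lambda t: global_rank[t])[:prefix_length]
        for token in prefix:
            yield (token, projected)    # token (or its group) routes the record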
Grouping/Routing Strategies
  • Goal: distribute candidates to the right reducers to minimize the reducers' workload
  • Similar to hashing (projected) records into their corresponding candidate buckets
  • Each reducer handles one or more candidate buckets
  • 2 routing strategies:
    • using Individual Tokens
    • using Grouped Tokens

Routing: using individual tokens

  • Treats each token as a key
  • For each record, generates a (key, value) pair for each of its prefix tokens: (token, projected record)
  • Example: given the global ordering, the record (1, "A B C") has a prefix of length 2: A, B
    => generate/emit 2 (key, value) pairs:
      • (A, (1, "A B C"))
      • (B, (1, "A B C"))
Grouping/Routing: using individual tokens
  • Advantage:
    • high-quality grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
  • Disadvantage:
    • high replication of data (the same records might be checked for similarity in multiple reducers, i.e., redundant work)
Routing: Using Grouped Tokens
  • Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
  • For each record, generates a (key, value) pair for each group of its prefix tokens:
Routing: Using Grouped Tokens
  • Example: given the global ordering, the record (1, "A B C") has a prefix of length 2: A, B
  • Suppose A, B belong to group X and C belongs to group Y
    => generate/emit 2 (key, value) pairs:
      • (X, (1, "A B C"))
      • (Y, (1, "A B C"))
Grouping/Routing: Using Grouped Tokens
  • The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner

[Slide: tokens A, B, C, D, E, F, G assigned round-robin to Group1, Group2, Group3]

  • Groups will be balanced w.r.t. the sum of the frequencies of the tokens belonging to each group
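A sketch of this grouping (a round-robin pass over the tokens in frequency order, which roughly balances the per-group frequency sums):

    def group_tokens(tokens_by_frequency, num_groups):
        # tokens_by_frequency: tokens sorted by descending frequency
        groups = [[] for _ in range(num_groups)]
        for i, token in enumerate(tokens_by_frequency):
            groups[i % num_groups].append(token)    # round-robin assignment
        return groups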
Grouping/Routing: Using Grouped Tokens
  • Advantage:
    • Replication of data is not so pervasive
  • Disadvantage:
    • The quality of the grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity)
RID-Pair Generation: Reduce Phase
  • This is the core of the entire method
  • Each reducer processes one/more buckets
  • In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate

If the similarity of the 2 candidates >= threshold => output their RIDs together with their similarity

[Slide: a bucket of candidates processed by one reducer]

RID-Pair Generation: Reduce Phase
  • Computing the similarity of the candidates in a bucket comes in 2 flavors:
      • Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
      • Indexed Kernel: uses a PPJoin+ index
RID-Pair Generation: Basic Kernel
  • Straightforward method for finding candidates satisfying the join predicate
  • Quadratic complexity: O(#candidates²)

reduce:
  foreach candidate in bucket:
    foreach cand in bucket \ {candidate}:
      if sim(candidate, cand) >= threshold:
        emit((candidate.RID, cand.RID), sim)

RID-Pair Generation: PPJoin+
  • Uses a special index data structure
  • Not so straightforward to implement
  • Much more efficient

reduce:
  foreach current_candidate in bucket:
    probe the PPJoin+ index with the join-attribute value of current_candidate
      => a list of RIDs satisfying the join predicate
    add current_candidate to the PPJoin+ index
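A simplified sketch of this probe-then-insert pattern (a plain prefix-token inverted index, not the real PPJoin+ structure, which additionally applies positional and suffix filtering):

    def indexed_kernel(bucket, sim, threshold, prefix_len):
        index, results = {}, []         # token -> candidates inserted so far
        for rid, tokens in bucket:      # tokens sorted by the global ordering
            seen = set()
            # Probe: only candidates sharing a prefix token can be similar.
            for token in tokens[:prefix_len]:
                for other_rid, other_tokens in index.get(token, ()):
                    if other_rid not in seen:
                        seen.add(other_rid)
                        s = sim(tokens, other_tokens)
                        if s >= threshold:
                            results.append(((other_rid, rid), s))
            # Insert: register the candidate under its prefix tokens.
            for token in tokens[:prefix_len]:
                index.setdefault(token, []).append((rid, tokens))
        return results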

Stage III: Generate pairs of joined records

[Stage diagram repeated; Stage III (Record Join) highlighted. Two alternatives: Basic Record Join (BRJ) and One-Phase Record Join (OPRJ)]

Record Join
  • Until now we only have pairs of RIDs, but we need actual records
  • Use the RID pairs generated in the previous stage to join the actual records
  • Main idea:
    • bring in the rest of each record (everything except the RID, which we already have)
  • 2 approaches:
    • Basic Record Join (BRJ)
    • One-Phase Record Join (OPRJ)
Record Join: Basic Record Join
  • Uses 2 MapReduce cycles
    • 1st cycle: fills in the record information for each half of each pair
    • 2nd cycle: brings together the previously filled-in records
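A sketch of the two BRJ cycles (simulated as plain functions; in the first cycle both the records and the Stage II RID pairs are keyed by RID; values is assumed materialized as a list):

    # Cycle 1: attach the record to each half of each RID pair.
    def map_brj1(item):
        if item.kind == "record":
            yield (item.rid, ("record", item.data))
        else:                               # item.pair is a (rid1, rid2) pair
            for rid in item.pair:
                yield (rid, ("pair", item.pair))

    def reduce_brj1(rid, values):
        record = next(data for tag, data in values if tag == "record")
        for tag, pair in values:
            if tag == "pair":
                yield (pair, (rid, record))  # one filled-in half per pair

    # Cycle 2: identity map; the reducer glues the two halves together.
    def reduce_brj2(pair, halves):
        yield (pair, [rec for _, rec in halves])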
Record Join: One Phase Record Join
  • Uses only one MapReduce cycle: every map task loads the full RID-pair list in memory and emits the joined record halves directly
r s join
R-S Join
  • Challenge: we now have 2 different record sources => 2 different input streams
  • MapReduce can work on only 1 input stream
  • The 2nd and 3rd stages are affected
  • Solution: extend the (key, value) pairs so that they include a relation tag for each record
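A sketch of the tagging idea (prefix_tokens is an assumed helper; each map output value carries the name of its source relation so a reducer can keep the two streams apart):

    def map_tagged(record, relation):       # relation is "R" or "S"
        for token in prefix_tokens(record):
            yield (token, (relation, record.rid, record.join_value))

    def reduce_tagged(token, values):
        r_side = [v for v in values if v[0] == "R"]
        s_side = [v for v in values if v[0] == "S"]
        # ... verify only (r, s) cross pairs, never R-R or S-S pairs ...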
Outline
  • Motivating Scenarios
  • Background Knowledge
  • Parallel Set-Similarity Join
    • Self Join
    • R-S Join
  • Evaluation
  • Conclusions
  • Strengths & Weaknesses
Evaluation
  • Cluster: 10-node IBM x3650, running Hadoop
  • Data sets:
      • DBLP: 1.2M publications
      • CITESEERX: 1.3M publications
      • Only the header of each paper is considered (i.e., author, title, date of publication, etc.)
      • Data size synthetically increased (by various factors)
  • Measure:
      • Absolute running time
      • Speedup
      • Scaleup
Self-Join running time
  • Best algorithm: BTO-PK-OPRJ
  • Most expensive stage: the RID-pair generation
Self-Join Speedup
  • Fixed data size, vary the cluster size
  • Best time: BTO-PK-OPRJ
Self-Join Scaleup
  • Increase data size and cluster size together by the same factor
  • Best time: BTO-PK-OPRJ
R-S Join Performance
  • Mostly the same behavior as in the self-join case
Outline
  • Motivating Scenarios
  • Background Knowledge
  • Parallel Set-Similarity Join
    • Self Join
    • R-S Join
  • Evaluation
  • Conclusions
  • Strengths & Weaknesses
Conclusions
  • An efficient way of computing Set-Similarity Joins
  • Useful in many data cleaning scenarios
  • SSJoin and MapReduce: one solution for huge datasets
  • Very efficient when based on prefix-filtering and PPJoin+
  • Scales up nicely
Strengths & Weaknesses
  • Strengths:
    • More efficient than a single-node/local SSJoin
    • More failure-tolerant than a single-node SSJoin
    • Uses powerful filtering methods (routing strategies)
    • Uses the PPJoin+ index (a data structure optimized for SSJoin)
  • Weaknesses:
    • This implementation is applicable only to string-based input data
    • Assumes the token dictionary and the RID-pair list fit in main memory
    • Repeated tokenization of the input records
    • Evaluation based on synthetically increased data

Thank you!

Questions?