Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica, Michael J. Carey, Chen Li

Presentation Transcript


  1. Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li. Speaker: Razvan Belet

  2. Outline • Motivating Scenarios • Background Knowledge • Parallel Set-Similarity Join • Self Join • R-S Join • Evaluation • Conclusions • Strengths & Weaknesses

  3. Scenario: Detecting Plagiarism • Before publishing a journal, editors have to make sure that none of the hundreds of papers to be included in it is plagiarized

  4. Scenario: Near-duplicate elimination • The archive of a search engine can contain multiple copies of the same page • Reasons: re-crawling, different hosts holding the same redundant copies of a page, etc.

  5. Problem Statement • Given two collections of objects/items/records, a similarity metric sim(o1,o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1,o2) > λ • Solution: Similarity Join

  6. Motivation (2) • Some of the collections are enormous: • Google N-gram database: ~1 trillion records • GenBank: 416 GB of data • Facebook: 400 million active users • Try to process this data in a parallel, distributed way => MapReduce

  7. Outline • Motivating Scenarios • Background Knowledge • Parallel Set-Similarity Join • Self Join • R-S Join • Evaluation • Conclusions

  8. Background Knowledge • Set-Similarity Join • Join • Similarity Join • Set-Similarity Join

  9. Background Knowledge: Join • Logical operator heavily used in databases • Whenever records in 2 tables need to be associated => use a JOIN • Associates records in the 2 input tables based on a predicate (pred) • Consider this information need: for each employee, find the department he works in (input tables: Employees, Departments)

  10. Background Knowledge: Join • Example: for each employee, find the department he works in • JOIN with predicate pred: EMPLOYEES.DepID = DEPARTMENTS.DepartmentID

  11. Background Knowledge: Similarity Join • Special type of join, in which the predicate (pred) is a similarity metric/function: sim(obj1,obj2) • Return pair (obj1, obj2) if pred holds: sim(obj1,obj2) > threshold • Similarity join of tables T1 and T2 with predicate pred: sim(T1.c, T2.c) > threshold

  12. Background Knowledge: Similarity Join • Examples of sim(obj1,obj2) functions: sim(paper1,paper2) = [formula not captured in the transcript], where Si = most common words in page i and Tj = most common words in page j

  13. Similarity Join • sim(obj1,obj2): obj1, obj2 can be documents, records in DB tables, user profiles, images, etc. • Particular class of similarity joins: (string/text-)similarity join, where obj1, obj2 are strings/texts • Many real-world applications => of particular interest • Example: similarity join with predicate pred: sim(T1.Name, T2.Name) > 2, where sim(T1.Name, T2.Name) = number of common words

  14. Set-Similarity Join (SSJoin) • SSJoin: a powerful primitive for supporting (string-)similarity joins • Input: 2 collections of sets, each set of the form {word1, word2, …, wordn} • Goal: identify all pairs of highly similar sets • Example: collections S1={…}, S2={…}, …, Sn={…} and T1={…}, T2={…}, …, Tn={…}; SSJoin with predicate pred: sim(Si,Tj) > 0.3

  15. Set-Similarity Join (SSJoin) • How can a (string-)similarity join be reduced to an SSJoin? • Example: a similarity join with predicate pred: sim(T1.Name, T2.Name) > 0.5 is evaluated based on an SSJoin
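
One way to picture the reduction is to turn every string into a set of tokens and compare the sets. The sketch below is only an illustration: it assumes whitespace word tokenization and Jaccard similarity as sim, neither of which is named on the slide, and the helper names `tokenize`, `jaccard` and `string_similarity_join` are hypothetical.

```python
def tokenize(text: str) -> frozenset[str]:
    """Turn a string into a set of lower-cased word tokens (assumed tokenizer)."""
    return frozenset(text.lower().split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def string_similarity_join(t1: list[str], t2: list[str], threshold: float):
    """Naive all-pairs join: compare the token sets of every pair of strings."""
    sets1 = [(s, tokenize(s)) for s in t1]
    sets2 = [(s, tokenize(s)) for s in t2]
    return [
        (s1, s2)
        for s1, x in sets1
        for s2, y in sets2
        if jaccard(x, y) > threshold
    ]


pairs = string_similarity_join(
    ["John W. Smith", "Marat Safin"],
    ["Smith John", "Rafael Nadal"],
    threshold=0.5,
)
```

Here "John W. Smith" and "Smith John" share 2 of 3 distinct tokens, so they are the only pair above the 0.5 threshold. The all-pairs loop is what the signature-based techniques on the following slides try to avoid.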

  16. Set-Similarity Join • Most SSJoin algorithms are signature-based: • INPUT: set collections R and S and threshold λ • Filtering phase: 1. For each r ∈ R, generate signature-set Sign(r) 2. For each s ∈ S, generate signature-set Sign(s) 3. Generate all candidate pairs (r, s), r ∈ R, s ∈ S, satisfying Sign(r) ∩ Sign(s) ≠ ∅ • Post-filtering phase: 4. Output any candidate pair (r, s) satisfying Sim(r, s) ≥ λ
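
The four steps can be sketched as a small in-memory function. This is a sketch under stated assumptions, not the authors' implementation: the signature scheme `sign` and the similarity `sim` are passed in as parameters, Jaccard is an assumed choice of Sim, and the example uses the trivial scheme "every token is a signature".

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed choice of Sim)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature_join(R, S, sign, sim, threshold):
    """R and S map record ids to token sets; sign maps a set to its signatures."""
    # Steps 1-2: generate signatures; index S by signature for fast lookup.
    index = defaultdict(set)
    for sid, s in S.items():
        for sig in sign(s):
            index[sig].add(sid)
    # Step 3: candidate pairs are those sharing at least one signature.
    candidates = {
        (rid, sid)
        for rid, r in R.items()
        for sig in sign(r)
        for sid in index.get(sig, ())
    }
    # Step 4 (post-filtering): verify each candidate with the real similarity.
    return sorted(p for p in candidates if sim(R[p[0]], S[p[1]]) >= threshold)


R = {"r1": {"john", "w.", "smith"}, "r2": {"marat", "safin"}}
S = {"s1": {"smith", "john"}, "s2": {"rafael", "nadal"}}
# Trivial signature scheme: every token is a signature (correct but unselective).
result = signature_join(R, S, sign=lambda s: s, sim=jaccard, threshold=0.5)
```

With the trivial scheme only pairs that share a token become candidates, so `r2` is never compared with anything. A more selective scheme shrinks the candidate set further.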

  17. Set-Similarity Join • Signatures: • Have a filtering effect: the SSJoin algorithm compares only candidates, not all pairs (in the post-filtering phase) • Determine the efficiency of the SSJoin algorithm: the smaller the number of candidate pairs, the better • Ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever Sim(r, s) ≥ λ

  18. Set-Similarity Join: Signatures Example • One possible signature scheme: prefix filtering • Compute a global ordering of tokens: Marat … W. Safin … Rafael … Nadal … P. … Smith … John • Compute the signature of each input set: take the prefix of length n • Sign({John, W., Smith}) = [W., Smith] • Sign({Marat, Safin}) = [Marat, Safin] • Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
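
The signatures in this example can be reproduced with a few lines of code. The slide only says "prefix of length n"; the sketch below assumes the usual prefix length for a Jaccard threshold t, namely |x| - ceil(t * |x|) + 1, and a threshold of 0.5, which happens to yield the signatures shown. `GLOBAL_ORDER` lists only the tokens visible on the slide, and all names are hypothetical.

```python
import math

# Tokens in the global order shown on the slide (elided tokens omitted).
GLOBAL_ORDER = ["Marat", "W.", "Safin", "Rafael", "Nadal", "P.", "Smith", "John"]
RANK = {token: position for position, token in enumerate(GLOBAL_ORDER)}


def prefix_length(set_size: int, threshold: float) -> int:
    """Prefix length for a Jaccard threshold (assumed formula, see lead-in)."""
    return set_size - math.ceil(threshold * set_size) + 1


def prefix_signature(tokens: set[str], threshold: float) -> list[str]:
    """Sort the tokens by the global ordering and keep only the prefix."""
    ordered = sorted(tokens, key=RANK.__getitem__)
    return ordered[: prefix_length(len(ordered), threshold)]


sig_smith = prefix_signature({"John", "W.", "Smith"}, 0.5)
sig_safin = prefix_signature({"Marat", "Safin"}, 0.5)
sig_nadal = prefix_signature({"Rafael", "P.", "Nadal"}, 0.5)
```

Two sets whose similarity reaches the threshold must share at least one prefix token, which is what makes the prefix a valid signature.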

  19. Set-Similarity Join • Filtering phase: before doing the actual SSJoin, cluster/group the candidates • Run the SSJoin on each cluster => less workload • Example buckets: cluster/bucket 1: {Smith, John}, {John, W., Smith}; cluster/bucket 2: {Safin, Marat, Michailowitsc}, {Marat, Safin}; …; cluster/bucket N: {Nadal, Rafael, Parera}, {Rafael, P., Nadal}

  20. Outline • Motivating Scenarios • Background Knowledge • Parallel Set-Similarity Join • Self Join • R-S Join • Evaluation • Conclusions • Strengths & Weaknesses

  21. Parallel Set-Similarity Join • Method comprises 3 stages: • Stage I: Token Ordering (compute data statistics for good signatures) • Stage II: RID-Pair Generation (group candidates based on signature & compute SSJoin) • Stage III: Record Join (generate actual pairs of joined records)

  22. Explanation of input data • RID = Row ID • a : join column • “A B C” is a string: • Address: “14th Saarbruecker Strasse” • Name: “John W. Smith”

  23. Stage I: Data Statistics • Stage I (Token Ordering) computes data statistics for good signatures • 2 alternatives: Basic Token Ordering, One Phase Token Ordering

  24. Token Ordering • Creates a global ordering of the tokens in the join column, based on their frequency (figure: input table with columns RID, a, b, c and the resulting global ordering)

  25. Basic Token Ordering(BTO) • 2 MapReduce cycles: • 1st : computing token frequencies • 2nd: ordering the tokens by their frequencies

  26. Basic Token Ordering – 1st MapReduce cycle • map: tokenize the join value of each record; emit each token with number of occurrences 1 • reduce: for each token, compute the total count (frequency)

  27. Basic Token Ordering – 2nd MapReduce cycle • map: interchange key with value, so that the frequency becomes the key • reduce (use only 1 reducer): emits the value, i.e. the tokens in order of frequency
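
The two cycles can be simulated in memory to see how the data flows. This is a sketch, not Hadoop code: `run_mapreduce` is a hypothetical stand-in for one MapReduce cycle with a single reducer, ascending frequency order and the alphabetical tie-break are assumptions, and the three input records are made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for one cycle: map, shuffle (keys sorted), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# 1st cycle: count how often each token occurs in the join column.
def count_map(record):
    rid, join_value = record
    for token in join_value.split():
        yield token, 1


def count_reduce(token, counts):
    yield token, sum(counts)


# 2nd cycle: swap key and value so the shuffle sorts by frequency.
def swap_map(pair):
    token, frequency = pair
    yield frequency, token


def emit_reduce(frequency, tokens):
    # Emit only the value (the token); ties are broken alphabetically (assumption).
    yield from sorted(tokens)


records = [(1, "A B C"), (2, "B C D"), (3, "C D")]
frequencies = run_mapreduce(records, count_map, count_reduce)
ordering = run_mapreduce(frequencies, swap_map, emit_reduce)
```

Using a single reducer in the second cycle is what turns the per-key sort of the shuffle into one total order over all tokens.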

  28. One Phase Token Ordering (OPTO) • Alternative to Basic Token Ordering (BTO): • Uses only one MapReduce cycle (less I/O) • In-memory token sorting, instead of using a reducer

  29. OPTO – Details • map: tokenize the join value of each record; emit each token with number of occurrences 1 • reduce: for each token, compute the total count (frequency) • Use the tear_down method to order the tokens in memory
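
A minimal in-memory sketch of the one-phase variant follows. It assumes that all token counts fit in the memory of a single reducer and that tokens are ordered by ascending frequency; the class `OnePhaseTokenOrdering` and the driver `run_job` are hypothetical names that only mimic the map / reduce / tear_down life cycle.

```python
from collections import Counter, defaultdict


class OnePhaseTokenOrdering:
    """Single-cycle token ordering; assumes the token counts fit in memory."""

    def __init__(self):
        self.freq = Counter()

    def map(self, record):
        rid, join_value = record
        for token in join_value.split():
            yield token, 1

    def reduce(self, token, counts):
        # Keep the total in memory instead of emitting it.
        self.freq[token] = sum(counts)

    def tear_down(self):
        # Runs once after the last reduce call: sort in memory and emit.
        return sorted(self.freq, key=lambda t: (self.freq[t], t))


def run_job(job, records):
    """Hypothetical driver standing in for the MapReduce framework."""
    shuffled = defaultdict(list)
    for record in records:
        for token, one in job.map(record):
            shuffled[token].append(one)
    for token, counts in shuffled.items():
        job.reduce(token, counts)
    return job.tear_down()


ordering = run_job(OnePhaseTokenOrdering(), [(1, "A B C"), (2, "B C D"), (3, "C D")])
```

The trade-off is one fewer pass over the data against the requirement that the whole token dictionary fits in one reducer's memory.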

  30. Stage II: Group Candidates & Compute SSJoin • Stage II (RID-Pair Generation) groups candidates based on signature and computes the SSJoin • Routing alternatives: Individual Tokens Grouping, Grouped Tokens Grouping • Kernel alternatives: Basic Kernel, PPJoin

  31. RID-Pair Generation • scans the original input data (records) • outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim) • consists of only one MapReduce cycle • uses the global ordering of tokens obtained in the previous stage

  32. RID-Pair Generation: Map Phase • scan input records and for each record: • project it on RID & join attribute • tokenize it • extract prefix according to global ordering of tokens obtained in the Token Ordering stage • route tokens to appropriate reducer

  33. Grouping/Routing Strategies • Goal: distribute candidates to the right reducers to minimize reducers’ workload • Like hashing (projected) records to the corresponding candidate-buckets • Each reducer handles one/more candidate-buckets • 2 routing strategies: • Using Individual Tokens • Using Grouped Tokens

  34. Routing: using individual tokens • Treats each token as a key • For each record, generates a (key, value) pair for each of its prefix tokens: (token, (projected) record) • Example: • Given the global ordering (figure), the record with RID 1 and join value “A B C” • => prefix of length 2: A,B • => generate/emit 2 (key,value) pairs: • (A, (1,A B C)) • (B, (1,A B C))
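
The map side of this routing strategy can be sketched as a generator. The global ordering is not preserved in the transcript, so the sketch assumes A < B < C; the prefix length is passed in as a parameter, and the name `map_individual_tokens` is hypothetical.

```python
def map_individual_tokens(record, rank, prefix_len):
    """Emit one (token, projected record) pair per prefix token."""
    rid, join_value = record
    # Order the record's tokens by the global ordering, then keep the prefix.
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (rid, join_value)


rank = {"A": 0, "B": 1, "C": 2}  # assumed global ordering A < B < C
emitted = list(map_individual_tokens((1, "A B C"), rank, prefix_len=2))
```

Each emitted key ends up at the reducer responsible for that token, so the record is replicated once per prefix token.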

  35. Grouping/Routing: using individual tokens • Advantage: • high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer) • Disadvantage: • high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)

  36. Routing: Using Grouped Tokens • Multiple tokens mapped to one synthetic key (different tokens can be mapped to the same key) • For each record, generates a (key, value) pair for each of the groups of its prefix tokens

  37. Routing: Using Grouped Tokens • Example: • Given the global ordering (figure), “A B C” => prefix of length 2: A,B • Suppose A,B belong to group X and C belongs to group Y • Both prefix tokens map to group X, and C is not in the prefix => generate/emit 1 (key,value) pair: • (X, (1,A B C))

  38. Grouping/Routing: Using Grouped Tokens • The groups of tokens (X,Y) are formed by assigning tokens to groups in a round-robin manner (figure: tokens A, D, F, B, G, E, C distributed over Group1, Group2, Group3) • Groups will be balanced w.r.t. the sum of frequencies of the tokens belonging to each group
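
Round-robin grouping and the resulting routing can be sketched as follows. The token list, the number of groups and the numeric group ids are made up for illustration, and both function names are hypothetical; the sketch emits one pair per distinct group among the prefix tokens, following the rule stated for grouped tokens.

```python
def round_robin_groups(global_order, num_groups):
    """Assign tokens, taken in global (frequency) order, to groups in turn."""
    return {token: i % num_groups for i, token in enumerate(global_order)}


def map_grouped_tokens(record, rank, group_of, prefix_len):
    """Emit one (group, projected record) pair per distinct prefix-token group."""
    rid, join_value = record
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    seen_groups = set()
    for token in tokens[:prefix_len]:
        group = group_of[token]
        if group not in seen_groups:  # several prefix tokens may share a group
            seen_groups.add(group)
            yield group, (rid, join_value)


global_order = ["A", "B", "C", "D", "E", "F", "G"]  # assumed ordering
rank = {token: i for i, token in enumerate(global_order)}
group_of = round_robin_groups(global_order, num_groups=3)

same_group = list(map_grouped_tokens((1, "A D G"), rank, group_of, prefix_len=2))
two_groups = list(map_grouped_tokens((2, "A B C"), rank, group_of, prefix_len=2))
```

In the first call both prefix tokens (A and D) fall into group 0, so the record is sent out only once; in the second call the prefix tokens A and B fall into different groups, so it is sent twice.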

  39. Grouping/Routing: Using Grouped Tokens • Advantage: • Replication of data is not so pervasive • Disadvantage: • Quality of grouping is not so high (records having no chance of being similar are sent to the same reducer which checks their similarity)

  40. RID-Pair Generation: Reduce Phase • This is the core of the entire method • Each reducer processes one/more buckets of candidates • In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate • If the similarity of the 2 candidates >= threshold => output their ids and also their similarity

  41. RID-Pair Generation: Reduce Phase • Computing similarity of the candidates in a bucket comes in 2 flavors: • Basic Kernel : uses 2 nested loops to verify each pair of candidates in the bucket • Indexed Kernel : uses a PPJoin+ index

  42. RID-Pair Generation: Basic Kernel • Straightforward method for finding candidates satisfying the join predicate • Quadratic complexity: O(#candidates^2) • reduce: for each candidate in bucket: for each cand in bucket \ {candidate}: if sim(candidate, cand) >= threshold: emit((candidateRID, candRID), sim)
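
A runnable sketch of this nested-loop kernel is shown below. It assumes Jaccard similarity and represents a bucket as a list of (RID, token set) pairs; unlike the pseudocode on the slide, it visits each unordered pair once instead of both (a, b) and (b, a). The names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def basic_kernel_reduce(bucket, threshold, sim=jaccard):
    """Verify every pair of candidates in one bucket: O(n^2) comparisons."""
    for (rid1, x), (rid2, y) in combinations(bucket, 2):
        similarity = sim(x, y)
        if similarity >= threshold:
            yield (rid1, rid2), similarity


bucket = [(1, {"A", "B", "C"}), (2, {"A", "B", "D"}), (3, {"A", "E", "F"})]
results = list(basic_kernel_reduce(bucket, threshold=0.5))
```

Records 1 and 2 share 2 of 4 distinct tokens (similarity 0.5), so they are the only pair emitted.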

  43. RID-Pair Generation: PPJoin+ • Uses a special index data structure • Not so straightforward to implement • Much more efficient • reduce: probe the PPJoinIndex with the join attribute value of current_candidate => a list of RIDs satisfying the join predicate; then add current_candidate to the PPJoinIndex
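
The probe-then-insert pattern can be illustrated with a much simpler structure than PPJoin+. The sketch below is not PPJoin+: it is a plain inverted index over prefix tokens, without the positional and suffix filtering that PPJoin+ adds. It assumes Jaccard similarity and the usual Jaccard prefix length, and all names are hypothetical.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def indexed_kernel_reduce(bucket, rank, threshold):
    """Probe an inverted index on prefix tokens, verify, then index the record."""
    index = defaultdict(list)  # prefix token -> positions in `seen`
    seen = []                  # records already added to the index
    results = []
    for rid, tokens in bucket:
        ordered = sorted(tokens, key=rank.__getitem__)
        # Assumed Jaccard prefix length; float rounding is ignored in this sketch.
        prefix = ordered[: len(ordered) - math.ceil(threshold * len(ordered)) + 1]
        # Probe: earlier records that share at least one prefix token.
        candidates = set()
        for token in prefix:
            candidates.update(index.get(token, ()))
        for position in sorted(candidates):
            other_rid, other_tokens = seen[position]
            similarity = jaccard(tokens, other_tokens)
            if similarity >= threshold:
                results.append(((other_rid, rid), similarity))
        # Insert the current record so that later records can find it.
        seen.append((rid, tokens))
        for token in prefix:
            index[token].append(len(seen) - 1)
    return results


rank = {token: i for i, token in enumerate("ABCDEF")}
bucket = [(1, {"A", "B", "C"}), (2, {"A", "B", "D"}), (3, {"A", "E", "F"})]
results = indexed_kernel_reduce(bucket, rank, threshold=0.5)
```

Only records that share a prefix token are ever compared, which is where the saving over the nested-loop kernel comes from when buckets are large.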

  44. Stage III: Generate pairs of joined records • Stage III (Record Join) generates the actual pairs of joined records • 2 alternatives: Basic Record Join, One Phase Record Join

  45. Record Join • Until now we have only pairs of RIDs, but we need the actual records • Use the RID pairs generated in the previous stage to join the actual records • Main idea: • bring in the rest of each record (everything except the RID, which we already have) • 2 approaches: • Basic Record Join (BRJ) • One-Phase Record Join (OPRJ)

  46. Record Join: Basic Record Join • Uses 2 MapReduce cycles • 1st cycle: fills in the record information for each half of each pair • 2nd cycle: brings together the previously filled in records
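
The two cycles of the Basic Record Join can be sketched in memory. This is an illustration under assumptions: records are (RID, record) pairs, the previous stage delivers ((RID1, RID2), similarity) pairs, and the function name `basic_record_join` and the sample data are hypothetical.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """Join full records using the RID pairs produced by the previous stage."""
    # 1st cycle, map + shuffle: key records and RID pairs by RID.
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records:
        by_rid[rid]["record"] = record
    for (rid1, rid2), similarity in rid_pairs:
        by_rid[rid1]["pairs"].append(((rid1, rid2), similarity))
        by_rid[rid2]["pairs"].append(((rid1, rid2), similarity))
    # 1st cycle, reduce: fill in the record for one half of each pair.
    halves = defaultdict(list)
    for rid, group in by_rid.items():
        for pair, similarity in group["pairs"]:
            halves[pair].append((rid, group["record"], similarity))
    # 2nd cycle, reduce: bring the two filled-in halves of each pair together.
    joined = []
    for pair in sorted(halves):
        (_, record_a, similarity), (_, record_b, _) = sorted(halves[pair])
        joined.append((record_a, record_b, similarity))
    return joined


records = [(1, "John W. Smith"), (2, "Smith John"), (3, "Marat Safin")]
joined = basic_record_join(records, [((1, 2), 0.67)])
```

Record 3 takes part in no RID pair, so it is read in the first cycle but never reaches the second one.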

  47. Record Join: One Phase Record Join • Uses only one MapReduce cycle

  48. R-S Join • Challenge: we now have 2 different record sources => 2 different input streams • MapReduce can work on only 1 input stream • 2nd and 3rd stage affected • Solution: extend the (key, value) pairs so that they include a relation tag for each record
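
The effect of the relation tag can be sketched with two small helpers. The slide does not say where the tag is placed; this sketch assumes it travels in the value, uses the literal tags "R" and "S", and all names and sample records are hypothetical.

```python
def tag_records(records, relation):
    """Map side: attach the relation tag ("R" or "S") to each (rid, value) record."""
    return [(relation, rid, value) for rid, value in records]


def cross_relation_pairs(bucket):
    """Reduce side: pair only records that come from different relations."""
    r_side = [(rid, value) for tag, rid, value in bucket if tag == "R"]
    s_side = [(rid, value) for tag, rid, value in bucket if tag == "S"]
    return [(r_rid, s_rid) for r_rid, _ in r_side for s_rid, _ in s_side]


# Both relations are merged into one input stream; the tag keeps them apart.
bucket = tag_records([(1, "A B C"), (2, "A E")], "R") + tag_records([(7, "A B D")], "S")
candidate_pairs = cross_relation_pairs(bucket)
```

Records 1 and 2 are never paired with each other because both carry the tag "R"; the candidate pairs would still have to be verified against the similarity threshold.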

  49. Outline • Motivating Scenarios • Background Knowledge • Parallel Set-Similarity Join • Self Join • R-S Join • Evaluation • Conclusions • Strengths & Weaknesses

  50. Evaluation • Cluster: 10-node IBM x3650, running Hadoop • Data sets: • DBLP: 1.2M publications • CITESEERX: 1.3M publications • Consider only the header of each paper (i.e., author, title, date of publication, etc.) • Data size synthetically increased (by various factors) • Measure: • Absolute running time • Speedup • Scaleup
