1 / 32

Similarity join problem with Pass-Join-K using Hadoop

Similarity join problem with Pass-Join-K using Hadoop. ---BY Yu Haiyang. Outline. Background The introduction of Pass-Join-K Combining Pass-Join-K with Hadoop. Background. Similarity join: Find all similar pairs from two sets. Data Cleaning. Query Relaxation Spellchecking.

Download Presentation

Similarity join problem with Pass-Join-K using Hadoop

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Similarity join problem with Pass-Join-K using Hadoop ---BY Yu Haiyang

  2. Outline • Background • The introduction of Pass-Join-K • Combining Pass-Join-K with Hadoop http://datamining.xmu.edu.cn

  3. Background • Similarity join: Find all similar pairs from two sets. • Data Cleaning. • Query Relaxation • Spellchecking “PO BOX 23, Main St.” “P.O. Box 23, Main St” “imformation” “information” http://datamining.xmu.edu.cn

  4. Background How to define similarity? Jaccard distance Cosine distance Edit distance 2014/11/13 http://datamining.xmu.edu.cn 4/32

  5. Background Edit distance The minimum number of edit operations (insertion, deletion, and substitution) to transform one string to another. Insertion Bod Body Substitution Baby Body 2014/11/13 http://datamining.xmu.edu.cn 5/32

  6. Background How does the edit distance compare with other two? Accuracy: {“abcdefg”,”gfedcba”} Verification time: O(mn) -> O(m+n) 2014/11/13 http://datamining.xmu.edu.cn 6/32

  7. Background • Find similar pairs • We have two string sets ,one is {vldb,sigmod,….} ,the other is {pvldb,icde,…}. • Find some candidate pairs , and then verify these pairs. {<vldb,pvldb>,<vldb,icde>,<vldb,..>,<sigmod,pvldb>,<sigmod,icde>,….} <vldb,pvldb> Yes <vldb,icde> No http://datamining.xmu.edu.cn

  8. Background So we have to: Finding candidate pairs. There are O(N2) if we do not prune some pairs. verifying these pairs. O(mn) 2014/11/13 http://datamining.xmu.edu.cn 8/32

  9. Introduction of Pass-Join-K • Some obvious pruning techniques • Length –based: threshold = 2,<“ab”,”abcee”> • Shift-based: <“abcd”,”cdef”> http://datamining.xmu.edu.cn

  10. Introduction of Pass-Join-K Partition-based pruning technique We suppose the threshold tau = 2, K=2and we have a pair <“abcdefghijk”,”abdefghk”> 2014/11/13 http://datamining.xmu.edu.cn 10/32

  11. Introduction of Pass-Join-K Partition Scheme We have seen that the longer the substrings are, the harder they could be marched. So we break the string into tau+k parts and each part while its length equals length/(tau+k) or length/(tau+k)+1. 2014/11/13 http://datamining.xmu.edu.cn 11/32

  12. Introduction of Pass-Join-K Partition Scheme So we break the string into tau+k parts and each part while its length equals length/(tau+k) or length/(tau+k)+1. 2014/11/13 http://datamining.xmu.edu.cn 12/32

  13. Introduction of Pass-Join-K Partition Scheme r = “abcdefghijk” s = “abdefghk” def L11 1 3 4 2 r r r r 2014/11/13 http://datamining.xmu.edu.cn 13/32

  14. Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; a b d e f g h k 2014/11/13 http://datamining.xmu.edu.cn 14/32

  15. Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 15/32

  16. Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 16/32

  17. Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; 2014/11/13 http://datamining.xmu.edu.cn 17/32

  18. Introduction of Pass-Join-K Substring Selection Here we suppose tau = 3 and k = 1; a b d e f g h k 2014/11/13 http://datamining.xmu.edu.cn 18/32

  19. Introduction of Pass-Join-K Substring Selection So what we do is to deduce the number of substrings. More pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 2014/11/13 http://datamining.xmu.edu.cn 19/32

  20. Introduction of Pass-Join-K Verification DP( Dynamic programming) D(m,n)=max(D(m,n-1)+1,D(m-1,n)+1,D(m-1,n-1)+flag) where flag = 1 when sm=rn , s and r are both strings. 2014/11/13 http://datamining.xmu.edu.cn 20/32

  21. Introduction of Pass-Join-K Verification Here we suppose tau = 3 and k = 1; Tauleft = 3 Tauright = 3-3=0 2014/11/13 http://datamining.xmu.edu.cn 21/32

  22. Combining Pass-Join-K with Hadoop Inverted index tree in hadoop (abc, 1, 11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r) L11 1 3 4 2 r r r r 2014/11/13 http://datamining.xmu.edu.cn 22/32

  23. Combining Pass-Join-K with Hadoop Substrings in hadoop Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate some records such as (a,1,5,s),(a,2,6,s)(a,3,7,s),(ab,1,8,s),…,(ab,1,11,s),… 2014/11/13 http://datamining.xmu.edu.cn 23/32

  24. Combining Pass-Join-K with Hadoop Substrings in hadoop Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records where m is the average number that substring for each segment, such as (a,1,5,s),(a,1,6,s)(a,1,7,s),(ab,1,8,s),…,(ab,1,11,s),… 2014/11/13 http://datamining.xmu.edu.cn 24/32

  25. Combining Pass-Join-K with Hadoop Data flows in hadoop 2014/11/13 http://datamining.xmu.edu.cn 25/32

  26. Combining Pass-Join-K with Hadoop How to improve the performance ? We have known that as k increased , the pairs we need to verity would be decrease. As k increased, more than (tau+k+1)/(tau+k) records should be translated in Mapper phase. 2014/11/13 http://datamining.xmu.edu.cn 26/32

  27. Combining Pass-Join-K with Hadoop Here we have 2 ways to improve our algorithm. Finding a dataset that the candidate pairs number are large enough or making tau are large enough. Decreasing the data which were generated in Mapper phase. 2014/11/13 http://datamining.xmu.edu.cn 27/32

  28. Combining Pass-Join-K with Hadoop Decrease the data flows 2014/11/13 http://datamining.xmu.edu.cn 28/32

  29. Combining Pass-Join-K with Hadoop Decrease the data flows The inverted index record was formulated as (substring,segmentNumber, LengthInf, Id, flag) Each record’s length is length(substring)+4*sizeof(int), and substring sometimes could be so long. Hash(substring) -> integer, then record length is 5*sizeof(int) 2014/11/13 http://datamining.xmu.edu.cn 29/32

  30. Combining Pass-Join-K with Hadoop Decrease the data flows The substring would generate some similar records such as (a,1,5,s),(a,1,6,s)(a,1,7,s)… Each substring would generate tau+k similar segments, so we combine them as ,for example, (a,1,5,7,s). So we make the (tau+k)*4*sizeof(int) to 5*sizeof(int). 2014/11/13 http://datamining.xmu.edu.cn 30/32

  31. Combining Pass-Join-K with Hadoop Decrease the data flows So by using two steps we have seen before, we have reduced the (length(substring)+4*sizeof(int))*(tau+k) to 5 times sizeof(int) 2014/11/13 http://datamining.xmu.edu.cn 31/32

  32. Email: yhycai@gmail.com Thanks for patience http://datamining.xmu.edu.cn

More Related