
Large-scale Similarity Join with Edit-distance Constraints


Presentation Transcript


  1. Large-scale Similarity Join with Edit-distance Constraints, by Yu Haiyang

  2. Outline: Background; the introduction of Pass-Join-K; combining Pass-Join-K with Hadoop. 2014/10/21 http://datamining.xmu.edu.cn

  3. Background. Similarity join: find all similar pairs from two sets. Applications: data cleaning, query relaxation, spellchecking.

  4. Background. How to define similarity? Jaccard distance (bag-of-words model), cosine distance, edit distance.

  5. Background. Edit distance: the minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another. Insertion: Bod -> Body. Substitution: Baby -> Body (two substitutions, a -> o and b -> d).

  6. Background. How does edit distance compare with the other two? Accuracy: edit distance is order-sensitive, e.g. {"abcdefg", "gfedcba"} contain the same characters in reverse order. Verification time: O(m+n) -> O(mn) for strings of lengths m and n.
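
To make the accuracy point concrete, here is a minimal sketch of a set-based measure applied to the slide's pair. Treating each character as a token is an assumption (the slide does not say how strings are tokenized), and `char_jaccard` is a hypothetical helper name. The measure ignores order, so the reversed pair looks identical, although its edit distance is 6 (only the middle "d" lines up).

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' character sets (order is ignored).

    Assumption: each character is one token; the slides do not specify this.
    """
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


# The reversed pair shares every character, so the set-based score is maximal.
print(char_jaccard("abcdefg", "gfedcba"))
```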

  7. Background. Find similar pairs: we have two string sets, one is {vldb, sigmod, …} and the other is {pvldb, icde, …}. Find candidate pairs, then verify them. Candidates: {<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}. Verification: <vldb,pvldb> Yes; <vldb,icde> No.

  8. Background. So we have to: (1) find candidate pairs, of which there are O(N^2) if we do not prune any; (2) verify these pairs, at O(mn) per pair.

  9. Outline: Background; the introduction of Pass-Join-K; combining Pass-Join-K with Hadoop.

  10. Introduction of Pass-Join-K. Partition-based pruning technique: suppose the threshold tau = 2 and K = 1, and we have the pair <"abcde", "ace">.

  11. Introduction of Pass-Join-K. Partition-based pruning technique: suppose the threshold tau = 2 and K = 2, and we have the pair <"abcdefghijk", "abdefghk">.

  12. Introduction of Pass-Join-K. Some obvious pruning techniques. Length-based: with threshold = 2, <"ab", "abcee"> (the lengths differ by 3, so the pair cannot be within the threshold). Shift-based: <"abcd", "cdef">.
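
The two filters on this slide can be sketched as below. The length filter follows directly from the definition of edit distance. The shift filter is one possible reading of "shift-based", not necessarily the authors' exact rule: if a shared segment starts at offset `p` in `r` and offset `q` in `s`, the text to its left needs at least |q - p| edits and the text to its right at least |(len(s) - len(r)) - (q - p)| edits. The function names, the 0-based offsets, and the reuse of threshold 2 for the second example are all assumptions.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """A pair can only be within distance tau if the lengths differ by at most tau."""
    return abs(len(r) - len(s)) <= tau


def passes_shift_filter(r: str, s: str, p: int, q: int, tau: int) -> bool:
    """Check one segment match (offset p in r, offset q in s; 0-based, assumed).

    Lower bound on the edits forced by aligning the matched segment:
    the left parts differ in length by |shift|, the right parts by |delta - shift|.
    A False result discards this particular match, not necessarily the whole pair.
    """
    shift = q - p
    delta = len(s) - len(r)
    return abs(shift) + abs(delta - shift) <= tau


# Slide examples (threshold 2 is assumed for the shift example as well).
print(passes_length_filter("ab", "abcee", 2))        # lengths differ by 3
print(passes_shift_filter("abcd", "cdef", 2, 0, 2))  # "cd" is shifted by 2 on both sides
```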

  13. Introduction of Pass-Join-K. Partition scheme: we have seen that the longer the substrings are, the harder they are to match. So we break the string into tau+k parts, each of length floor(length/(tau+k)) or floor(length/(tau+k))+1.
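
A minimal sketch of this partition scheme follows. Placing the longer segments first is an assumption chosen to reproduce the segments abc, def, ghi, jk shown on slide 25; the slides do not state the ordering rule. `partition` and `count_matching_segments` are hypothetical names. The second helper illustrates why tau+k parts are used: each edit can damage at most one segment, so a pair within distance tau must share at least k segments.

```python
def partition(s: str, tau: int, k: int) -> list[tuple[int, int, str]]:
    """Split s into tau + k segments: (segment_number, start_offset, text).

    Assumes len(s) >= tau + k so that no segment is empty.
    """
    n_seg = tau + k
    if len(s) < n_seg:
        raise ValueError("string is shorter than the number of segments")
    base, extra = divmod(len(s), n_seg)
    segments, start = [], 0
    for i in range(n_seg):
        # The first `extra` segments get one additional character (assumed order).
        length = base + 1 if i < extra else base
        segments.append((i + 1, start, s[start:start + length]))
        start += length
    return segments


def count_matching_segments(r: str, s: str, tau: int, k: int) -> int:
    """Number of segments of r that occur somewhere in s (no position filter)."""
    return sum(1 for _, _, text in partition(r, tau, k) if text in s)


print(partition("abcdefghijk", 3, 1))
# Pairs from slides 10 and 11: a pair stays a candidate only if the count is >= k.
print(count_matching_segments("abcde", "ace", 2, 1))
print(count_matching_segments("abcdefghijk", "abdefghk", 2, 2))
```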

  14. Introduction of Pass-Join-K. Partition scheme.

  15. Introduction of Pass-Join-K. Substring selection: here we suppose tau = 3 and k = 1. String: a b d e f g h k.

  16. Introduction of Pass-Join-K. Substring selection: here we suppose tau = 3 and k = 1.

  17. Introduction of Pass-Join-K. Substring selection: here we suppose tau = 3 and k = 1.

  18. Introduction of Pass-Join-K. Substring selection: here we suppose tau = 3 and k = 1.

  19. Introduction of Pass-Join-K. Substring selection: here we suppose tau = 3 and k = 1. String: a b d e f g h k.

  20. Introduction of Pass-Join-K. Substring selection: so what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 ("Pass-Join-K: A Similarity Join Algorithm with Multi-Segment Matching").
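
As a rough illustration of reducing the number of substrings, the sketch below keeps only substrings of s whose start lies within tau positions of the segment's start in the indexed string. This simple window is an assumption standing in for the slides' (lost) figures; the paper may select fewer substrings. The segment layout again puts longer segments first, as on slide 25, and both function names are hypothetical.

```python
def segment_layout(length: int, tau: int, k: int) -> list[tuple[int, int]]:
    """(start, seg_len) of every segment for an indexed string of the given length."""
    n_seg = tau + k
    base, extra = divmod(length, n_seg)
    layout, start = [], 0
    for i in range(n_seg):
        seg_len = base + 1 if i < extra else base  # longer segments first (assumed)
        layout.append((start, seg_len))
        start += seg_len
    return layout


def select_substrings(s: str, length: int, tau: int, k: int) -> set[tuple[int, str]]:
    """(segment_number, substring) probes of s against indexed strings of `length`.

    A segment that survives at most tau edits can move by at most tau positions,
    so only start offsets in [p - tau, p + tau] are considered.
    """
    selected = set()
    for seg_no, (p, seg_len) in enumerate(segment_layout(length, tau, k), start=1):
        lo = max(0, p - tau)
        hi = min(len(s) - seg_len, p + tau)
        for q in range(lo, hi + 1):
            selected.add((seg_no, s[q:q + seg_len]))
    return selected


probes = select_substrings("abdefghk", 11, 3, 1)
print(sorted(probes))
```

For s = "abdefghk" and indexed strings of length 11, this window keeps 14 probes, where scanning every substring of the right length for every segment would produce 25.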

  21. Introduction of Pass-Join-K. Verification by DP (dynamic programming): D(m,n) = min(D(m,n-1)+1, D(m-1,n)+1, D(m-1,n-1)+flag), where flag = 0 when s_m = r_n and flag = 1 otherwise; s and r are both strings.
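
The recurrence can be sketched as a standard two-row dynamic program. This is the textbook edit-distance computation, not the optimized verification that Pass-Join-K may use; `edit_distance` is a hypothetical name.

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance via D(i,j) = min(D(i,j-1)+1, D(i-1,j)+1, D(i-1,j-1)+flag)."""
    m, n = len(s), len(r)
    prev = list(range(n + 1))  # D(0, j) = j: insert j characters
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # D(i, 0) = i: delete i characters
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + flag)
        prev = curr
    return prev[n]


print(edit_distance("vldb", "pvldb"))  # one insertion
print(edit_distance("vldb", "icde"))
```

Keeping only two rows uses O(n) memory; the running time stays O(mn), which is why pruning candidates before verification matters.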

  22. Introduction of Pass-Join-K. Verification: here we suppose tau = 3 and k = 1; tau_left = 3, tau_right = 3 - 3 = 0.

  23. Outline: Background; the introduction of Pass-Join-K; combining Pass-Join-K with Hadoop.

  24. Combining Pass-Join-K with Hadoop. Big data: big files; large numbers of files.

  25. Combining Pass-Join-K with Hadoop. Inverted index tree in Hadoop: (abc,1,11,r,IFlag), (def,2,11,r,IFlag), (ghi,3,11,r,IFlag), (jk,4,11,r,IFlag). [Figure: index tree with root L11, child nodes 1, 3, 4, 2, each leading to r.]

  26. Combining Pass-Join-K with Hadoop. Substrings in Hadoop: suppose tau = 3, k = 1, and s = "abdefghk", length(s) = 8. We have to generate records such as (a,1,5,s,SFlag), (a,2,6,s,SFlag), (a,3,7,s,SFlag), (ab,1,8,s,SFlag), …, (ab,1,11,s,SFlag), …

  27. Combining Pass-Join-K with Hadoop. Data flows in Hadoop.

  28. Combining Pass-Join-K with Hadoop. Big data: big files; large numbers of files.

  29. Combining Pass-Join-K with Hadoop. Record format: [segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID].
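
The record format suggests a join keyed on (segmentString, segmentNumber, stringLength). The sketch below simulates that data flow in plain Python rather than with the Hadoop API, so every function name is hypothetical. Several simplifications are assumptions: FLAG is carried in the value instead of the key, DirNumber is omitted because the slides do not define it, the probe window is the simple tau-wide one rather than the paper's selection, and K = 1, so one shared key is enough to make a pair a candidate.

```python
from collections import defaultdict

TAU, K = 3, 1  # values used on the slides


def layout(length):
    """(segment_number, start, seg_len) per segment; longer segments first (assumed)."""
    base, extra = divmod(length, TAU + K)
    out, start = [], 0
    for i in range(TAU + K):
        seg_len = base + 1 if i < extra else base
        out.append((i + 1, start, seg_len))
        start += seg_len
    return out


def map_index(rid, r):
    """Map side for indexed strings: one IFlag record per segment of r."""
    if len(r) < TAU + K:
        return  # very short strings are ignored in this sketch
    for seg_no, start, seg_len in layout(len(r)):
        yield (r[start:start + seg_len], seg_no, len(r)), ("IFlag", rid)


def map_probe(sid, s):
    """Map side for probe strings: SFlag records for every compatible length."""
    for length in range(max(TAU + K, len(s) - TAU), len(s) + TAU + 1):
        for seg_no, start, seg_len in layout(length):
            lo, hi = max(0, start - TAU), min(len(s) - seg_len, start + TAU)
            for q in range(lo, hi + 1):
                yield (s[q:q + seg_len], seg_no, length), ("SFlag", sid)


def run_join(r_set, s_set):
    """Group records by key (the shuffle), then pair IFlag and SFlag ids (the reduce)."""
    groups = defaultdict(list)
    for rid, r in r_set.items():
        for key, value in map_index(rid, r):
            groups[key].append(value)
    for sid, s in s_set.items():
        for key, value in map_probe(sid, s):
            groups[key].append(value)
    candidates = set()
    for values in groups.values():
        rids = [i for flag, i in values if flag == "IFlag"]
        sids = [i for flag, i in values if flag == "SFlag"]
        candidates.update((rid, sid) for rid in rids for sid in sids)
    return candidates


print(run_join({"r": "abcdefghijk"}, {"s": "abdefghk"}))
```

The output is a set of candidate pairs only; each pair would still need verification against the threshold, and K > 1 would additionally require counting the shared keys per pair.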

  30. Email: yhycai@gmail.com. Thanks for your patience.
