
Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints

Chuan Xiao, Wei Wang, Xuemin Lin. University of New South Wales & NICTA, Australia.


Presentation Transcript


  1. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints • Chuan Xiao, Wei Wang, Xuemin Lin • University of New South Wales & NICTA, Australia • CSE@UNSW

  2. Motivation • Data Cleaning • typo: ‘Steven Spielberg’ vs ‘Stephen Spielburg’ • multiple representation: ‘harbor’ vs ‘harbour’ • Bioinformatics • DNA/protein sequence • AAAGTCTGAC… • AAACTCTGAC…

  3. More Applications • Detect spam, e.g., a spam email template: “Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>” • Identify plagiarism, e.g., two near-duplicate answers to “Q. What are the advantages of RAID5 over RAID4?” • Answer 1: “A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.” • Answer 2: “A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.”

  4. Outline • Motivation • Problem Definition • Algorithms • Experiments • Conclusions

  5. Edit Similarity Join • Focus on similarity join on strings with edit distance threshold (d) • edit distance ≤ d ⇒ two strings are similar • Problem Definition • Given two collections of strings S and T, the edit similarity join problem is to compute { <s, t> | s ∈ S, t ∈ T, ed(s, t) ≤ d } • Consider the self-join case here
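
The problem definition on this slide can be made concrete with a brute-force baseline. The sketch below is only an illustration, not the Ed-Join algorithm: it assumes unit-cost insertions, deletions, and substitutions, and the function names `edit_distance` and `naive_self_join` are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(t) + 1))        # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                         # cost of deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def naive_self_join(strings: list[str], d: int) -> list[tuple[str, str]]:
    """Return every unordered pair of strings whose edit distance is at most d."""
    return [
        (strings[i], strings[j])
        for i in range(len(strings))
        for j in range(i + 1, len(strings))
        if edit_distance(strings[i], strings[j]) <= d
    ]
```

Calling `naive_self_join(["Austria", "Australia", "Australiana", "New_Zealand", "New_Sealand"], 1)` verifies all ten pairs and keeps only the New_Zealand / New_Sealand pair, which is the quadratic cost the filtering techniques aim to avoid.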

  6. Outline • Motivation • Problem Definition • Algorithms • Experiments • Conclusions

  7. q-gram Based Filtering [Gravano et al. VLDB01] • Naïve algorithm • compute edit distance: O(n²) time complexity • do this for N²/2 pairs • q-gram based filtering • filter-and-refine • length filtering • |len(s) − len(t)| ≤ d • Example (q=3): ‘New_Zealand’ yields the q-grams New, ew_, w_Z, _Ze, Zea, eal, ala, lan, and
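
A minimal sketch of the two building blocks named on this slide: q-gram extraction and the length filter. The function names are hypothetical, and counting positions from 1 is an assumption made here for readability.

```python
def positional_qgrams(s: str, q: int) -> list[tuple[str, int]]:
    """Return (q-gram, start position) pairs, with positions counted from 1."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def passes_length_filter(s: str, t: str, d: int) -> bool:
    """Strings within edit distance d differ in length by at most d."""
    return abs(len(s) - len(t)) <= d
```

For ‘New_Zealand’ and q=3, `positional_qgrams` returns the nine q-grams listed on the slide, from `("New", 1)` to `("and", 9)`.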

  8. Matching q-grams • count filtering • at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) − q + 1 − q*d • d edit operations destroy at most q*d q-grams ⇒ similar strings share most q-grams • position filtering • positions of common q-grams should be within d • Implemented on RDBMS • best performance when small q, such as q=2,3
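
The count and position filters can be sketched as follows. This is a simplified illustration under stated assumptions: both counting functions return upper bounds on the number of matching q-grams (the position-aware one does not enforce a one-to-one matching), so the filter never discards a true result but may keep some extra candidates. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d, as given on the slide."""
    return max(len(s), len(t)) - q + 1 - q * d


def common_qgrams(s: str, t: str, q: int) -> int:
    """Size of the multiset intersection of the two q-gram collections."""
    return sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())


def common_qgrams_within(s: str, t: str, q: int, d: int) -> int:
    """Number of q-grams of s that also occur in t at a position at most d away."""
    positions_in_t = defaultdict(list)
    for j, gram in enumerate(qgrams(t, q)):
        positions_in_t[gram].append(j)
    return sum(
        1
        for i, gram in enumerate(qgrams(s, q))
        if any(abs(i - j) <= d for j in positions_in_t.get(gram, ()))
    )


def passes_count_and_position_filter(s: str, t: str, q: int, d: int) -> bool:
    shared = min(common_qgrams(s, t, q), common_qgrams_within(s, t, q, d))
    return shared >= count_lower_bound(s, t, q, d)
```

With d=1 and q=2, ‘Austria’ and ‘Australia’ share five q-grams but need LB = 6, so the pair is pruned, while ‘Australia’ / ‘Australiana’ and ‘New_Zealand’ / ‘New_Sealand’ each share exactly the required eight and are kept for verification.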

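  9. Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07] • Bottleneck: generating candidate pairs which share at least LB(s,t) matching q-grams • Prefix Filter • sort q-grams by global ordering, such as idf • split each sorted q-gram array (Qs, Qt) of l q-grams into a prefix of q*d+1 q-grams and a suffix of l − q*d − 1 = LB(s,t) − 1 q-grams • if the two prefixes share no q-gram, at most LB(s,t) − 1 q-grams can match, so the pair cannot be a result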
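
A minimal sketch of prefix-filter candidate generation with an inverted index. Several choices here are assumptions made for illustration: q-grams are ordered by document frequency with ties broken alphabetically (a stand-in for the idf ordering mentioned on the slide), duplicate q-grams within a string are collapsed, and strings with fewer than q*d+1 q-grams are not given special treatment. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def prefix_filter_candidates(strings: list[str], q: int, d: int) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose q-gram prefixes share a q-gram."""
    # Document frequency of each q-gram defines one global ordering (rarest first).
    df = Counter(g for s in strings for g in set(qgrams(s, q)))
    prefix_len = q * d + 1
    index = defaultdict(list)   # q-gram -> ids of earlier strings with it in their prefix
    candidates = set()
    for sid, s in enumerate(strings):
        ordered = sorted(set(qgrams(s, q)), key=lambda g: (df[g], g))
        for gram in ordered[:prefix_len]:
            for other in index[gram]:
                candidates.add((other, sid))
            index[gram].append(sid)
    return sorted(candidates)
```

On the five strings Austria, Australia, Australiana, New_Zealand, New_Sealand with d=1 and q=2, this produces three candidate pairs, (0, 1), (1, 2), and (3, 4), instead of the ten pairs a naive join would verify.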

  10. All-Pairs-Ed Algorithm [Bayardo et al. WWW07] • Pipeline: indexed record set → prefix filter (Cand-1 generation) → count filter (Cand-2 generation) → edit distance verification → result pairs

  11. Example – All-Pairs-Ed • d=1, q=2, prefix_len = q*d+1 = 3 • a=‘Austria’, Qa={ri, Au, us, …} • b=‘Australia’, Qb={ra, li, Au, …} • c=‘Australiana’, Qc={na, ra, li, …} • d=‘New_Zealand’, Qd={_Z, Ze, Ne, …} • e=‘New_Sealand’, Qe={_S, Se, Ne, …} • after prefix filter: <a,b> <b,c> <d,e> • after count filter: <b,c> <d,e> • after edit distance verification: <d,e>

  12. Ed-Join • Idea • mismatching q-grams provide useful information

  13. Location-Based Filtering • Idea: reduce prefix length • Example, d=1, q=2 • s=‘Austria’ • t=‘Australia’ • (diagram: the prefix q-grams kept in Qs are at locations 5 and 1, those kept in Qt are at locations 5 and 7; label “pruned”)

  14. Minimum Prefix Length • standard prefix length: q*d+1 • minimum prefix length: the shortest prefix of Qs whose q-grams need at least d+1 edit operations to destroy them • found by sequential search • Further optimization: binary search within [d+1, q*d+1] • Example, d=2, q=2: min. prefix len. = 4
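
The minimum prefix length can be sketched as below. This is an illustration under assumptions, not the authors' code: a q-gram at location p is treated as covering characters p to p+q−1, one edit operation is assumed to destroy every q-gram covering a single character position, and the number of operations is then counted greedily from left to right. The names `min_edit_errors` and `min_prefix_length` are hypothetical, and the input is a list of (q-gram, location) pairs already sorted by the global ordering, with locations counted from 1.

```python
def min_edit_errors(positional_qgrams: list[tuple[str, int]], q: int) -> int:
    """Greedy count of edit operations needed to destroy all given q-grams."""
    count = 0
    covered_up_to = 0                      # last character position already hit
    for loc in sorted(loc for _, loc in positional_qgrams):
        if loc > covered_up_to:            # this q-gram survives all earlier operations
            count += 1
            covered_up_to = loc + q - 1    # place the operation on its last character
    return count


def min_prefix_length(ordered_qgrams: list[tuple[str, int]], q: int, d: int) -> int:
    """Shortest prefix whose q-grams need at least d+1 edit operations to destroy."""
    upper = min(q * d + 1, len(ordered_qgrams))
    for k in range(d + 1, upper + 1):      # sequential search; fewer than d+1 q-grams never suffice
        if min_edit_errors(ordered_qgrams[:k], q) >= d + 1:
            return k
    return upper
```

For ‘Austria’ with d=1 and q=2, the ordered q-grams start with ri (location 5) and Au (location 1); they do not overlap, so two operations are needed and a prefix of length 2 suffices. For ‘New_Zealand’, _Z (location 4) and Ze (location 5) overlap, so the prefix must also include Ne (location 1), giving length 3. Because adding q-grams never lowers the count, the sequential loop could be replaced by the binary search the slide mentions.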

  15. Limit of Count/Loc.-Based Filter • Clustered edit operations • s=‘…please submit by Aug…’ • t=‘…please submit by Sep…’ • 4 mismatching q-grams if q=2 ⇒ retained (d=2) • Non-clustered edit operations • s’=‘…please submit by Aug…’ • t’=‘…pleese supmit bi Aug…’ • 6 mismatching q-grams if q=2 ⇒ pruned (d=2) • Clustered edit operations destroy fewer q-grams ⇒ count/location-based filtering less effective

  16. Content-Based Filtering • Probing Window • An edit operation increases L1 distance within the probing window by at most two • L1 distance should be ≤ 2d if ed(s, t) ≤ d
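
A minimal sketch of the content-based check. It assumes the L1 distance is taken between the character frequency counts of the two strings inside the same position range, and it does not show how Ed-Join chooses the probing window; the window bounds are simply passed in. Function names and the concrete example strings are assumptions for illustration.

```python
from collections import Counter


def l1_distance(u: str, v: str) -> int:
    """L1 distance between the character frequency counts of u and v."""
    cu, cv = Counter(u), Counter(v)
    return sum(abs(cu[c] - cv[c]) for c in cu.keys() | cv.keys())


def passes_content_filter(s: str, t: str, d: int, start: int = 0, end=None) -> bool:
    """False means the pair can be pruned: the window's L1 distance exceeds 2*d."""
    return l1_distance(s[start:end], t[start:end]) <= 2 * d
```

For the clustered case, `passes_content_filter("please submit by Aug", "please submit by Sep", 2)` is false because the L1 distance is 6 > 2d = 4, so the pair is pruned without computing the edit distance.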

  17. Select Probing Window • Example, d=3, q=3 • (diagram: one probing window over s and t has L1 = 2; another has L1 = 8 > 2d = 6 ⇒ pruned)

  18. Example – Ed-Join • d=1, q=2 • a=‘Austria’ • b=‘Australia’ • c=‘Australiana’ • d=‘New_Zealand’ • e=‘New_Sealand’ • prefixes of length q*d+1: Qa={ri, Au, us, …} • Qb={ra, li, Au, …} • Qc={na, ra, li, …} • Qd={_Z, Ze, Ne, …} • Qe={_S, Se, Ne, …} • prefixes of minimum length: Qa={ri, Au, …} • Qb={ra, li, …} • Qc={na, ra, …} • Qd={_Z, Ze, Ne, …} • Qe={_S, Se, Ne, …} • after prefix filter: <b,c> <d,e> • after count filter: <b,c> <d,e> • after content-based filter: <d,e> • after edit distance verification: <d,e>

  19. Outline • Motivation • Problem Definition • Algorithms • Experiments • Conclusions

  20. Experiment Settings • Environment • Intel Xeon X3220 2.4GHz CPU, 4GB RAM • Debian 4.1, GCC 4.1.2 with -O3 • Algorithms • All-Pairs-Ed [Bayardo et al. WWW07] • PartEnum [Arasu et al. VLDB06] • Ed-Join / Ed-Join-l • Dataset

  21. Experiment – Large Threshold • UNIREF, Running Time

  22. Experiment – q • TREC, Running Time • q=8 achieves best performance for TREC

  23. Experiment – with PartEnum • (plot panels: d=1, d=2, d=3)

  24. Conclusions • Contributions • an efficient algorithm for edit similarity join • exploit mismatching q-grams • location-based filtering – non-clustered edit ops. • content-based filtering – clustered edit ops. • longer q-grams perform best for stand-alone implementation • Future work • other similarity measures, e.g., used in DNA/protein alignment

  25. Thank you! Additional Materials Available at http://www.cse.unsw.edu.au/~weiw/project/simjoin.html

  26. Related Work • q-gram Based Filtering • L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. • Algorithms for Set Similarity Join • Index-based approaches • S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004. • C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008. • Prefix-based approaches • S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. • R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008. • PartEnum • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

  27. Related Work • Edit Distance Computation • R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974. • W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980. • G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415, 1999. • E. Ukkonen. On approximate string matching. In FCT, 1983.

  28. Experiment – Pruning Power
