
Top-k Set Similarity Joins

Top-k Set Similarity Joins. Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, Univ. of New South Wales, Australia. ICDE ’09. 9 Feb 2011, Taewhi Lee. Based on Chuan Xiao’s presentation slides in ICDE ’09.


Presentation Transcript


  1. Top-k Set Similarity Joins • Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, Univ. of New South Wales, Australia • ICDE ’09 • 9 Feb 2011, Taewhi Lee • Based on Chuan Xiao’s presentation slides in ICDE ’09

  2. Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments

  3. Motivation • Data Cleaning

  4. More Applications • Near-duplicate Web page detection (Example: the headline "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells" on iht.com, Jan 20, 2009, and in the New York Times, Jan 19, 2009)

  5. Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments

  6. (Traditional) Set Similarity Join • Each record is tokenized into a set • Given a collection of records, the set similarity join problem is to find all pairs of records <x,y> such that sim(x,y) ≥ t • Common similarity functions: • jaccard: J(x,y) = |x ∩ y| / |x ∪ y| • cosine: C(x,y) = |x ∩ y| / √(|x| · |y|) • dice: D(x,y) = 2 · |x ∩ y| / (|x| + |y|) • What if t is unknown beforehand?
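
The three similarity functions named on this slide can be written down directly on Python sets. This is a minimal sketch using their usual set-overlap definitions; the function names are illustrative, and non-empty records are assumed.

```python
import math


def jaccard(x: set, y: set) -> float:
    """Overlap divided by the size of the union."""
    return len(x & y) / len(x | y)


def cosine(x: set, y: set) -> float:
    """Overlap divided by the geometric mean of the two set sizes."""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: set, y: set) -> float:
    """Twice the overlap divided by the sum of the two set sizes."""
    return 2 * len(x & y) / (len(x) + len(y))


# Two five-token records that share four tokens.
x = set("ABCEF")
y = set("BCDEF")
scores = (jaccard(x, y), cosine(x, y), dice(x, y))
```

For these two records the overlap is 4 and the union has 6 tokens, so jaccard is 4/6 while cosine and dice are both 4/5.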

  7. What If t is Unknown Beforehand? • Example – using jaccard similarity function • w = {A, B, C, D, E} • x = {A, B, C, E, F} • y = {B, C, D, E, F} • z = {B, C, F, G, H} • If t = 0.7 → no results • If t = 0.4 → <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and long running time) • Return the top-k results ranked by their similarity values • if k = 1 → <w,x>
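
The example on this slide can be reproduced with a brute-force sketch that compares every pair. The helper names `threshold_join` and `topk_join` are hypothetical. Note that with these four records, <w,x>, <w,y> and <x,y> all have jaccard similarity 4/6, so which pair is reported for k = 1 depends on tie-breaking; the sketch keeps ties in enumeration order.

```python
from itertools import combinations

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def threshold_join(recs: dict, t: float) -> list:
    """All pairs whose similarity is at least t (brute force)."""
    return [
        (a, b)
        for a, b in combinations(sorted(recs), 2)
        if jaccard(recs[a], recs[b]) >= t
    ]


def topk_join(recs: dict, k: int) -> list:
    """The k most similar pairs as (similarity, a, b), best first (brute force)."""
    scored = [
        (jaccard(recs[a], recs[b]), a, b)
        for a, b in combinations(sorted(recs), 2)
    ]
    # Python's sort is stable, so tied pairs stay in enumeration order.
    scored.sort(key=lambda entry: -entry[0])
    return scored[:k]


no_results = threshold_join(records, 0.7)
many_results = threshold_join(records, 0.4)
best = topk_join(records, 1)
```

Here `no_results` is empty, `many_results` holds the five pairs listed on the slide, and `best` holds <w,x>.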

  8. Top-k Set Similarity Join • Return top-k pairs of records, ranked by similarity scores • Advantages over traditional similarity join • No threshold needs to be specified • Results are output progressively, which benefits interactive applications • Produces the most meaningful results under limited resources/time constraints • Can be stopped at any time, but still guarantees sim(output results) ≥ sim(unseen pairs)

  9. Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments

  10. Straightforward Solution • Start from a certain t, repeat the following steps: • answer traditional sim-join with t as threshold • if # of results ≥ k, stop and output the k results with highest sim • else, decrease t • Example (jaccard, k = 2) • w = {A, B, C, E} • x = {A, B, C, E, F} • y = {B, C, D, E, F} • z = {B, C, F, G, H} • t = 0.9 → no result • t = 0.8 → <w,x> • t = 0.7 → <w,x> • t = 0.6 → <w,x>, <x,y> • Which thresholds shall we enumerate? Only 0.8 and 0.6 matter; between them (at t = 0.7) the results don’t change!
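
The loop described on this slide can be sketched as follows. The brute-force `sim_join` is only a stand-in for a real threshold-based join, and the starting threshold of 0.9 and step of 0.1 are assumptions taken from the example; exact fractions are used so that comparisons such as 4/5 ≥ 0.8 are not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}


def jaccard(a: set, b: set) -> Fraction:
    return Fraction(len(a & b), len(a | b))


def sim_join(recs: dict, t: Fraction) -> list:
    """Stand-in for a traditional similarity join: all pairs with sim >= t."""
    scored = (
        (jaccard(recs[a], recs[b]), a, b)
        for a, b in combinations(sorted(recs), 2)
    )
    return [entry for entry in scored if entry[0] >= t]


def topk_by_decreasing_threshold(recs, k, start=Fraction(9, 10), step=Fraction(1, 10)):
    """Re-run the join with ever lower thresholds until k results are found."""
    t = start
    trace = []  # (threshold, number of results) for each invocation
    while True:
        results = sim_join(recs, t)
        trace.append((t, len(results)))
        if len(results) >= k or t <= 0:
            results.sort(key=lambda entry: -entry[0])
            return results[:k], trace
        t = max(t - step, Fraction(0))


top2, trace = topk_by_decreasing_threshold(records, 2)
```

The trace shows four full join invocations (0, 1, 1 and 2 results), which is the repeated work the rest of the talk sets out to avoid.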

  11. Naïve and Index-Based Algorithms • Naïve Algorithm: • Compare every pair of objects → O(n²) time complexity • Index-based Algorithm [Sarawagi et al. SIGMOD04]: inverted lists • Record Set → Index Construction → Candidate Generation → Verification → Result Pairs
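
The index construction, candidate generation and verification pipeline can be illustrated with a deliberately simplified sketch. It is not the cited algorithm, only the general inverted-list idea: records that share no token never become candidates. The function name `index_join` and the sample records (taken from the earlier example) are illustrative.

```python
from collections import defaultdict

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def index_join(recs: dict, t: float) -> list:
    """Threshold join that finds candidates through inverted lists."""
    index = defaultdict(list)  # token -> names of records indexed so far
    results = []
    for name in sorted(recs):
        tokens = recs[name]
        # Candidate generation: every indexed record sharing at least one token.
        candidates = set()
        for token in tokens:
            candidates.update(index[token])
        # Verification: compute the exact similarity of each candidate pair.
        for other in sorted(candidates):
            if jaccard(recs[other], tokens) >= t:
                results.append((other, name))
        # Index construction: add this record to the list of each of its tokens.
        for token in tokens:
            index[token].append(name)
    return results


pairs = index_join(records, 0.6)
```

In this small example every record shares B and C, so all pairs still become candidates; the prefix filter on the next slide is what cuts the candidate set down.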

  12. Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07] • Sort the tokens by a global ordering • increasing order of document frequency • Only need to index the first few tokens (prefix) for each record • Example: jaccard, t = 0.8 → |x ∩ y| ≥ 4 if |x| = |y| = 5 • (Figure: sorted records x and y with their prefixes marked; if the prefixes share no token, the upper bound on the overlap is O(x,y) = 3 < 4) • Must share at least one token in prefix to be a candidate pair • For jaccard, prefix length = ⌊|x| · (1 – t)⌋ + 1 → each t is associated with a prefix length
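
A minimal sketch of the prefix filter for jaccard, assuming ties in document frequency are broken alphabetically (the slide does not specify a tie-break). The prefix length is computed as |x| − ⌈t·|x|⌉ + 1, which equals ⌊|x|·(1 − t)⌋ + 1, and t is passed as an exact fraction to avoid rounding surprises. The function names and sample records are illustrative.

```python
import math
from collections import Counter
from fractions import Fraction

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}


def prefix_length(size: int, t: Fraction) -> int:
    """Number of leading tokens to keep for jaccard threshold t."""
    return size - math.ceil(t * size) + 1


def prefix_filter_candidates(recs: dict, t: Fraction) -> list:
    """Pairs whose prefixes share at least one token."""
    # Global ordering: increasing document frequency, then alphabetical (assumed).
    df = Counter(token for tokens in recs.values() for token in tokens)
    prefixes = {}
    for name, tokens in recs.items():
        ordered = sorted(tokens, key=lambda token: (df[token], token))
        prefixes[name] = set(ordered[: prefix_length(len(tokens), t)])
    names = sorted(recs)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if prefixes[a] & prefixes[b]
    ]


candidates = prefix_filter_candidates(records, Fraction(4, 5))
```

With t = 0.8 only <w,x> and <x,y> survive as candidates out of six possible pairs; verification then keeps <w,x> (similarity 4/5) and rejects <x,y> (similarity 4/6).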

  13. Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments

  14. Necessary Thresholds • Each prefix is associated with a threshold t • the maximum possible similarity a record can achieve with other records • (Figure: the threshold t for a prefix of record x, illustrated against records y and z)
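
One way to read the threshold attached to a prefix, for jaccard, is to invert the prefix-length formula from the prefix filter slide. The following is a sketch under that assumption, with p denoting the 1-based prefix length; it ignores rounding. The intuition is that if the first shared token is the p-th token of x, at most |x| − p + 1 tokens can overlap while the union has at least |x| tokens.

```latex
% Assumption: tokens follow the global ordering; p is the 1-based prefix length of x.
\[
  p = |x|\,(1 - t) + 1
  \quad\Longleftrightarrow\quad
  t = 1 - \frac{p - 1}{|x|} = \frac{|x| - p + 1}{|x|}
\]
% Example for |x| = 5:
\[
  p = 1 \Rightarrow t = 1.0, \qquad
  p = 2 \Rightarrow t = 0.8, \qquad
  p = 3 \Rightarrow t = 0.6
\]
```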

  15. Event-driven Model • Problem: repeated invocation of the sim-join algorithm • t is decreasing → run the sim-join algorithm in an incremental way • Prefix event <x, A, t>: record x, prefix token A, threshold t • Initialize the prefix length of each record to 1 → <x, A, 1.0> • For each prefix event: • Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into the temp results • Insert x into A’s inverted list • Extend the prefix by one token → maintain prefix events with a max-heap on t • Stop when t ≤ the k-th temp result’s similarity
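
The event-driven loop can be sketched as below for jaccard. This is a simplified illustration, not the authors' implementation: it assumes the event threshold for the p-th token (1-based) of x is 1 − (p − 1)/|x|, breaks document-frequency ties alphabetically, and remembers every verified pair in a set, which is the memory-hungry option the later optimization slides try to avoid. All names are illustrative.

```python
import heapq
from collections import Counter, defaultdict
from fractions import Fraction


def jaccard(a: set, b: set) -> Fraction:
    return Fraction(len(a & b), len(a | b))


def topk_join(recs: dict, k: int) -> list:
    """Top-k pairs by jaccard as (similarity, a, b), driven by prefix events."""
    if k <= 0:
        return []
    df = Counter(token for tokens in recs.values() for token in tokens)
    ordered = {
        name: sorted(tokens, key=lambda token: (df[token], token))
        for name, tokens in recs.items()
    }
    # Max-heap on t, stored as (-t, record, 0-based token position).
    events = [(-Fraction(1), name, 0) for name in ordered if ordered[name]]
    heapq.heapify(events)
    index = defaultdict(list)  # token -> records whose event on it was processed
    verified = set()           # simplification: remember all verified pairs
    top = []                   # min-heap holding the k best temp results
    while events:
        neg_t, name, pos = heapq.heappop(events)
        if len(top) == k and -neg_t <= top[0][0]:
            break  # no unseen pair can beat the k-th temp result
        token = ordered[name][pos]
        for other in index[token]:
            pair = tuple(sorted((name, other)))
            if pair in verified:
                continue
            verified.add(pair)
            entry = (jaccard(recs[name], recs[other]),) + pair
            if len(top) < k:
                heapq.heappush(top, entry)
            else:
                heapq.heappushpop(top, entry)
        index[token].append(name)
        size = len(ordered[name])
        if pos + 1 < size:
            # Extending the prefix by one token lowers the event threshold.
            next_t = 1 - Fraction(pos + 1, size)
            heapq.heappush(events, (-next_t, name, pos + 1))
    return sorted(top, reverse=True)


records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
top2 = topk_join(records, 2)
```

On these records the loop stops when it pops an event with t = 3/5, which is no larger than the second temp result's similarity of 2/3, and returns <w,x> and <x,y>.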

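  16. Topk-join - Example (jaccard, k = 2) • (Figure: prefix events for records w, x, y, z, together with the inverted lists and the temporary results; processing stops once the prefix event’s t = 0.6 ≤ the 2nd temp result’s similarity; some pairs are verified twice)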

  17. Optimizations - Verification • In the above example, (w,x) and (y,z) have been verified twice • How to avoid repeated verification? • Memorizing all verified pairs in a hash table → too much memory consumption • Instead, when a pair is verified for the first time, check whether it will be identified again • Keep only those pairs that will be identified again before the algorithm stops • Guarantees that no pair will be verified twice • (Figure: records x and y; if the k-th temp result’s similarity is 0.7, this pair won’t be identified again)

  18. Optimizations - Indexing • How to reduce inverted list size to save memory? • t is decreasing → calculate the upper bound of similarity for future probings into inverted lists • Don’t insert into the inverted list if upper bound ≤ k-th temp result’s similarity • (Figure: records x and y with a prefix threshold of 0.8; the maximum similarity for future probings is 4/6 = 0.67)

  19. Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments

  20. Experiment Settings • Algorithms • topk-join • pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter-based approach, with t = 0.95, 0.90, 0.85, ... • Measure • Compare topk-join and pptopk (candidate size, running time) • Output results progressively • Dataset

  21. Experiment Results

  22. Experiment Results

  23. Experiment Results

  24. Thank You! Any questions or comments?

  25. Related Work • Index-based approaches • S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004. • C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008. • Prefix-based approaches • S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. • R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008. • PartEnum • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
