Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters

Presented by Fan Deng. Joint work with Davood Rafiei. Outline: motivation; the problem (approximate duplicate detection); existing solutions (caching, Bloom filters); our approach (Stable Bloom filters).

Presentation Transcript


  1. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
     Presented by Fan Deng
     Joint work with Davood Rafiei
     University of Alberta

  2. Outline
     • Motivation
     • The problem
       - Approximate duplicate detection
     • Existing solutions
       - Caching
       - Bloom filters
     • Our approach
       - Stable Bloom filters
       - Results
     • Related work
     • Conclusions

  3. The Motivating Application
     • Duplicate URL detection in Web crawling [Broder et al. WWW03]
       - Web search engines fetch web pages continuously
       - Extract the URLs within each downloaded page
       - Check each URL (duplicate detection): if it has never been seen before, download it; otherwise skip it
     • Problem
       - Huge number of distinct URLs
       - Memory is usually not large enough, and disks are too slow

  4. The Motivating Application
     • Errors are usually acceptable
       - A false positive (false alarm)
         -- A distinct URL is wrongly reported as a duplicate
         -- This URL will not be crawled
       - A false negative (miss)
         -- A duplicate URL is wrongly reported as distinct
         -- This URL will be crawled redundantly or looked up on disk

  5. The Problem: Approximate Duplicate Detection
     • A sequence of elements with order
     • Storage space M (not large enough to store all distinct elements)
     • Continual membership query: appeared before? Yes or No
       Example stream: … d g a f b e a d c b a
     • Our goal
       - Minimize the # of errors
       - Fast

  6. Existing Solutions – Caching
     • Store as many distinct elements as possible in a buffer
     • Duplicate detection process
       - On seeing an element, search the buffer
       - If found, report “duplicate”; otherwise report “distinct”
     • Update the buffer using some replacement policy
       - LRU, FIFO, Random, …

  7. Existing Solutions – Caching
     • False negatives
       - lead to redundant crawling or searching disks
     • Need extra space
       - to speed up the searching
       - to maintain the replacement policy (e.g. LRU)
       - space amount proportional to the buffer size
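The caching approach on slides 6 and 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `LRUDuplicateDetector`, the `capacity` parameter, and the use of `OrderedDict` to combine lookup and LRU ordering are all assumptions made for the example.

```python
from collections import OrderedDict


class LRUDuplicateDetector:
    """Hypothetical LRU buffer used as an approximate duplicate detector."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()  # keys are elements, ordered oldest to newest

    def seen(self, element):
        """Return True for "duplicate", False for "distinct", then update the buffer."""
        if element in self.buffer:
            self.buffer.move_to_end(element)  # mark as most recently used
            return True
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)  # evict the least recently used element
        return False


detector = LRUDuplicateDetector(capacity=2)
results = [detector.seen(x) for x in ["a", "b", "a", "c", "b"]]
print(results)  # [False, False, True, False, False]
```

The last `b` is a false negative: it appeared earlier but was evicted when `c` arrived. The `OrderedDict` also illustrates the extra space noted on slide 7, since it keeps both a hash index and an ordering over the buffered elements.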

  8. Existing Solutions – Bloom Filters
     • A bitmap, originally all “0”
     • Duplicate detection process
       - Hash each incoming element into some bits
       - If any bit is “0”, report “distinct”; otherwise report “duplicate”
     • Update process
       - Set the corresponding bits to “1”
     • Example (bit positions 1 to 6, two hash functions)
       x   h1(x)   h2(x)
       a   1       2
       b   1       3
       c   2       4
       a   1       2

  9. Existing Solutions – Bloom Filters
     [Figure: a bitmap in which every bit is “1”]
     • False positives (false alarms)
     • Bloom filters will become “full”
       - All distinct URLs will be reported as duplicates, and thus skipped!
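The Bloom filter walk-through on slide 8 can be replayed in a few lines. The hash values for a, b, and c are taken from the slide's table; the helper name `check_and_set` and the extra element `d` with hash values (3, 4) are hypothetical, added only to show how a false positive arises.

```python
def check_and_set(bits, positions):
    """Report "duplicate" (True) only if every hashed bit is already 1, then set the bits."""
    duplicate = all(bits[p] == 1 for p in positions)
    for p in positions:
        bits[p] = 1
    return duplicate


# Hash values from the slide's table (1-based bit positions).
SLIDE_HASHES = {"a": (1, 2), "b": (1, 3), "c": (2, 4)}

bits = [0] * 7  # index 0 is unused so that positions 1..6 match the slide
reports = [(x, check_and_set(bits, SLIDE_HASHES[x])) for x in ["a", "b", "c", "a"]]
print(reports)   # [('a', False), ('b', False), ('c', False), ('a', True)]
print(bits[1:])  # [1, 1, 1, 1, 0, 0]

# Hypothetical element d, never inserted, whose bits (3, 4) were set by b and c.
d_is_duplicate = all(bits[p] == 1 for p in (3, 4))
print(d_is_duplicate)  # True, i.e. a false positive
```

Because bits are only ever set, never cleared, the fraction of 1s keeps growing on an unbounded stream, which is the “full” filter problem described on slide 9.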

  10. Our Approach – Stable Bloom Filters (SBF)
     [Figure: example cellmap]
     • Kick “elements” out of the Bloom filter
     • Change bits to “cells” (“cellmap”)

  11. Stable Bloom Filters (SBF)
     • A “cellmap”, originally all “0”
     • Duplicate detection
       - Hash each element into some cells and check those cells
       - If any cell is “0”, report “distinct”; otherwise report “duplicate”
     • Kick “elements” out
       - Randomly choose some cells and decrement them by 1
     • Update the “cellmap”
       - Set the hashed cells to a predefined value Max > 0
       - Use the same hash functions as in the detection stage
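The three SBF steps on slide 11 (detect, kick out, update) might look like the sketch below. Several details are assumptions not stated on the slides: the class and parameter names, hash functions derived from salted SHA-256, and choosing the kicked-out cells independently at random with a seeded generator.

```python
import hashlib
import random


class StableBloomFilter:
    """Hypothetical sketch of an SBF: a cellmap with random decrements."""

    def __init__(self, num_cells, num_hashes, max_value, kick_count, seed=0):
        self.cells = [0] * num_cells
        self.num_hashes = num_hashes
        self.max_value = max_value    # "Max" on the slide
        self.kick_count = kick_count  # cells decremented per incoming element
        self.rng = random.Random(seed)

    def _positions(self, element):
        # Assumption: simulate several hash functions by salting SHA-256.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen(self, element):
        positions = self._positions(element)
        # 1. Detection: "duplicate" only if none of the hashed cells is 0.
        duplicate = all(self.cells[p] > 0 for p in positions)
        # 2. Kick out: decrement randomly chosen cells, never below 0.
        for _ in range(self.kick_count):
            p = self.rng.randrange(len(self.cells))
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # 3. Update: set the hashed cells to Max.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate


sbf = StableBloomFilter(num_cells=1000, num_hashes=3, max_value=3, kick_count=10)
first = sbf.seen("http://example.com/")
second = sbf.seen("http://example.com/")
print(first, second)  # False True
```

The random decrements are what allow stale elements to fade out of the filter, at the price of false negatives when an element's cells reach 0 before it reappears.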

  12. Analytical results
     • SBF will be stable
       - the expected # of “0”s will become a constant after a number of updates
       - converges at an exponential rate
       - monotonic
       - a lower bound on the expected # of “0”s (a function of the SBF size, # of hash functions, max cell value, and kick-out rate)
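The stability claim on slide 12 can be observed empirically with a small simulation. This is an illustration only and does not reproduce the authors' analysis: it assumes a stream of all-distinct elements with ideal hash functions, modeled by drawing cell positions uniformly at random, and the function name and parameter values are hypothetical.

```python
import random


def zero_fraction_trace(num_cells, num_hashes, max_value, kick_count,
                        num_updates, sample_every, seed=0):
    """Return the fraction of zero cells, sampled every `sample_every` updates."""
    rng = random.Random(seed)
    cells = [0] * num_cells
    trace = []
    for step in range(1, num_updates + 1):
        # Kick out: decrement randomly chosen cells, never below 0.
        for _ in range(kick_count):
            p = rng.randrange(num_cells)
            if cells[p] > 0:
                cells[p] -= 1
        # Update: a new distinct element sets its hashed cells to Max.
        for _ in range(num_hashes):
            cells[rng.randrange(num_cells)] = max_value
        if step % sample_every == 0:
            trace.append(cells.count(0) / num_cells)
    return trace


trace = zero_fraction_trace(num_cells=5000, num_hashes=3, max_value=3,
                            kick_count=10, num_updates=20000, sample_every=2000)
for i, fraction in enumerate(trace, start=1):
    print(f"after {i * 2000:>5} updates: {fraction:.3f} of cells are 0")
```

Under these assumptions the printed fraction should drop from its initial value of 1 and then level off, fluctuating around a constant rather than drifting to 0 as it would in a plain Bloom filter.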

  13. Analytical results
     • Two-sided errors
       - false positive rates become constant
       - an upper bound on the false positive rate (a function of 4 parameters)
       - given a false positive rate and SBF size, find the optimal parameters minimizing the # of false negatives (combining empirical results on setting max cell values)

  14. Experiments
     • Experimental comparison between SBF and the caching/buffering method (LRU)
     • URL fingerprint data set, originally obtained from the Internet Archive (~700M URLs)
     • Synthetic data simulating network traffic using Poisson and B-model
     • For a fair comparison, we introduce FPBuffering, which lets caching generate some false positives: if an element is not found in the buffer, report “duplicate” with a certain probability
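The FPBuffering baseline described on slide 14 could be sketched as an LRU buffer that reports a miss as “duplicate” with some probability. The class name, the `fp_probability` parameter, and the choice to still insert the element into the buffer after a forced “duplicate” report are assumptions; the slides do not specify these details.

```python
import random
from collections import OrderedDict


class FPBuffering:
    """Hypothetical LRU buffer that injects false positives on buffer misses."""

    def __init__(self, capacity, fp_probability, seed=0):
        self.capacity = capacity
        self.fp_probability = fp_probability
        self.buffer = OrderedDict()
        self.rng = random.Random(seed)

    def seen(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)  # refresh LRU position
            return True
        # Miss: buffer the element (assumption), evicting the LRU entry if full.
        self.buffer[element] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)
        # Report "duplicate" anyway with the configured probability.
        return self.rng.random() < self.fp_probability


plain = FPBuffering(capacity=2, fp_probability=0.0)
plain_results = [plain.seen(x) for x in ["a", "b", "a"]]
always = FPBuffering(capacity=2, fp_probability=1.0)
always_results = [always.seen(x) for x in ["a", "b", "a"]]
print(plain_results)   # [False, False, True]
print(always_results)  # [True, True, True]
```

Tuning `fp_probability` lets the buffer be matched to a target number of false positives, so that the two methods can then be compared on false negatives alone.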

  15. Experimental Results
     • SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)

  16. Experimental Results

  17. Experimental Results
     • MIN [Broder et al. WWW03], theoretically optimal
       - assumes “the entire sequence of requests is known in advance”
       - beats LRU caching by <5% in most cases
     • The more false positives are allowed, the more SBF gains

  18. Related work
     • Duplicate detection in click streams [Metwally et al. WWW05]
     • URL caching [Broder et al. WWW03]
     • Other variations of Bloom filters
       - Counting Bloom filters [Fan et al. SIGCOMM98]
       - Spectral Bloom filters [Cohen & Matias SIGMOD03]
       - …
     • Fuzzy duplicate detection [Ananthakrishna et al. VLDB02], [Chaudhuri et al. ICDE05], [Weis et al. SIGMOD05]

  19. Conclusions
     • SBF provides a trade-off between false positives and false negatives when space is limited
     • SBF is fast and simple
     • The higher the allowed false positive rate, the more SBF gains

  20. Questions/Comments? Thanks!
