This document explores advanced algorithms for handling massive datasets, focusing on data structures for web crawling and virus detection. It discusses the use of Bloom Filters for tracking the visited URLs of a crawler and for checking file contents against a dictionary of virus checksums, comparing them with brute-force and trie-based methods that use more space or time. It also presents the explicit formula for the optimal number of hash functions and the recurring-minimum technique for improving frequency estimates.
Advanced Algorithms for Massive Datasets: The power of “failing”
Optimal k. We do have an explicit formula for the optimal number of hash functions: k_opt = (m/n) ln 2. For example, with m/n = 8 bits per key, k_opt = 8 ln 2 ≈ 5.54, so in practice one rounds to k = 5 or k = 6.
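As a worked check (not on the original slides), here is the standard derivation behind that formula, under the usual independence assumption on the hash functions:

```latex
% False-positive probability of a Bloom filter with m bits,
% n stored keys and k hash functions (standard approximation):
\[
  p(k) = \left(1 - e^{-kn/m}\right)^{k},
  \qquad
  k_{\mathrm{opt}} = \frac{m}{n}\ln 2,
  \qquad
  p_{\min} = 2^{-k_{\mathrm{opt}}} \approx (0.6185)^{m/n}.
\]
% Example: m/n = 8 gives k_opt = 8 ln 2 ~ 5.54 and p_min ~ 2%.
```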
Crawling. What data structure should we use to keep track of the visited URLs of a crawler?
• URLs are long
• The check should be very fast
• Small errors are acceptable (≈ a page is simply not crawled)
Answer: a Bloom Filter over the crawled URLs (a minimal sketch follows).
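A minimal sketch of this idea, assuming a plain bit-array Bloom filter with the double-hashing (Kirsch–Mitzenmacher) trick; the sizes, the sha256-based hashing, and the should_crawl helper are illustrative choices, not the slides' implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k positions from two 64-bit hashes (double hashing).
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(key))

# ~10 MB of bits for ~10M URLs: m/n = 8, k = 5, as on the previous slide.
visited = BloomFilter(m_bits=8 * 10_000_000, k=5)

def should_crawl(url: str) -> bool:
    if url in visited:
        return False      # a false positive only skips one page: tolerated
    visited.add(url)
    return True
```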
Anti-virus detection. D is a dictionary of virus checksums, each of some given length z; F is the file to be scanned. For each position i, check whether F[i, i+z-1] ∈ D:
• Brute-force check: O(|D| * |F|) time
• Trie check: O(z * |F|) time
• A better solution? Build a BF on D and check whether F[i, i+z-1] ∈ D; if the BF answers YES, “warn the user” or explicitly scan D. This costs O(k * |F|) time, or even better... (see the sketch below)
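A hedged sketch of that pipeline, reusing the BloomFilter class from the crawling sketch above; the Rabin-Karp-style rolling checksum and the build_bf/scan_file helpers are my illustrative choices, not the slides' code:

```python
# Reuses the BloomFilter class defined in the crawling sketch.

def build_bf(checksums, m_bits=1 << 23, k=5):
    """Insert every dictionary checksum into a Bloom filter."""
    bf = BloomFilter(m_bits, k)
    for c in checksums:
        bf.add(str(c))
    return bf

def scan_file(data: bytes, z: int, bf, D: set):
    """Slide a window of length z over the file; O(k) BF probe per position."""
    B, M = 256, (1 << 61) - 1          # base and modulus of the rolling hash
    if len(data) < z:
        return
    pow_z = pow(B, z - 1, M)
    h = 0
    for byte in data[:z]:              # checksum of the first window
        h = (h * B + byte) % M
    for i in range(len(data) - z + 1):
        if str(h) in bf and h in D:    # BF filters; explicit check of D on YES
            print(f"warn the user: signature match at offset {i}")
        if i + z < len(data):          # roll the window one byte forward
            h = ((h - data[i] * pow_z) * B + data[i + z]) % M
```

Here D holds the checksums themselves for the explicit confirmation step, so a BF false positive never produces a false alarm, only a wasted lookup; the total work stays O(k * |F|) thanks to the constant-time roll.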
Recurring minimum for improving the estimate + 2 SBFs. If the minimum among an item's k counters occurs in more than one position (a recurring minimum), it is very likely the correct count; items with a single minimum are additionally tracked in a second, smaller SBF, which is consulted at query time.
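A compact sketch of the two-filter scheme (my simplified rendering of the Spectral Bloom Filter recurring-minimum heuristic, not code from the slides; the original paper's insertion into the secondary filter is more careful than shown here, and the TwoLevelSBF name is hypothetical):

```python
import hashlib

class SBF:
    """Counter-based filter: each key increments k counters."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _pos(self, key: str):
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def counts(self, key: str):
        return [self.counters[p] for p in self._pos(key)]

    def increment(self, key: str):
        for p in self._pos(key):
            self.counters[p] += 1

def recurring_min(counts):
    return counts.count(min(counts)) > 1

class TwoLevelSBF:
    def __init__(self, m: int, k: int):
        self.primary = SBF(m, k)
        self.secondary = SBF(m // 2, k)    # smaller: few items land here

    def insert(self, key: str):
        self.primary.increment(key)
        if not recurring_min(self.primary.counts(key)):
            self.secondary.increment(key)  # single minimum: track it twice

    def estimate(self, key: str) -> int:
        c = self.primary.counts(key)
        if recurring_min(c):
            return min(c)                  # recurring minimum: trust primary
        s = self.secondary.counts(key)
        return min(s) if min(s) > 0 else min(c)
```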