
Detecting Near-Duplicates for Web Crawling

Gurmeet Singh Manku (Google), Arvind Jain (Google), Anish Das Sarma (Stanford University)

Presentation Transcript


  1. Detecting Near-Duplicates for Web Crawling Gurmeet Singh Manku (Google) Arvind Jain (Google) Anish Das Sarma (Stanford University)

  2. What are Near-Duplicates? • Documents with identical content that differ in a small portion of the document • Advertisements • Counters • Timestamps

  3. Near-Duplicates: Why and How? • Why do we want to detect near-duplicates? • Save storage • Search quality • How to determine if a pair of documents are near-duplicates? • Lots of past work (survey in the paper) • Our work: detect near-duplicate webpages during crawl

  4. Simplified Crawl Architecture • [Diagram] The crawler fetches one HTML document at a time from the Web and traverses its links • Each newly-crawled document is checked against the entire Web index: near-duplicate? • Yes: trash • No: insert into the Web index

  5. Near-Duplicate Detection • Why is it hard in a crawl setting? • Scale • Tens of billions of documents indexed • Millions of pages crawled every day • Need to decide quickly!

  6. Single and Batch Modes • [Diagram] Same crawl architecture as before • The near-duplicate check against the entire index takes either a single document OR a batch of documents

  7. Rest of Talk • Simhash overview • Formal definition of the problem • Single and Batch algorithms • Experiments • Conclusions

  8. Simhash [Charikar 02] • Dimensionality-reduction technique • Used for near-duplicate detection • Obtain an f-bit fingerprint for each document • A pair of documents are near-duplicates if and only if their fingerprints are at most k bits apart • We show experimentally that f = 64, k = 3 works well

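  9. Simhash • [Diagram] Doc. → (feature, weight) pairs → (hash, weight) pairs • Each hash (e.g., 100110, 110000, ..., 001001) contributes +w_i at every position where its bit is 1 and −w_i where its bit is 0 • Add the contributions per bit position (e.g., 13, 108, −22, −5, −32, 55) • Take the sign of each sum to get the fingerprint (110001)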
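The pipeline on slide 9 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the names `simhash_from_hashes` and `simhash`, the choice of BLAKE2b as the per-feature hash, the limit f ≤ 64, and the tie-breaking rule (a zero sum yields a 0 bit) are all assumptions made for illustration.

```python
import hashlib

def simhash_from_hashes(hashed, f=64):
    """hashed: iterable of (f-bit hash as int, weight). Returns an f-bit int."""
    sums = [0.0] * f
    for h, w in hashed:
        for pos in range(f):
            # pos 0 is the most significant bit of the f-bit hash.
            bit = (h >> (f - 1 - pos)) & 1
            sums[pos] += w if bit else -w
    fingerprint = 0
    for s in sums:
        # Sign step: positive sum -> 1, otherwise 0 (tie rule is an assumption).
        fingerprint = (fingerprint << 1) | (1 if s > 0 else 0)
    return fingerprint

def simhash(features, f=64):
    """features: iterable of (feature string, weight); assumes f <= 64."""
    hashed = []
    for feature, weight in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        # Keep the top f bits of the 64-bit digest.
        hashed.append((int.from_bytes(digest, "big") >> (64 - f), weight))
    return simhash_from_hashes(hashed, f)
```

Feeding the three example hashes from the slide (100110, 110000, 001001) with unit weights into `simhash_from_hashes(..., f=6)` gives 100000; the slide's sums differ because its weights are not all equal.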

  10. Problem Definition • Input: • Set S of f-bit fingerprints (document index) • Query fingerprint Q (new document) • Output: • Exists near-duplicate, or not • Batch Mode Input: Set of query fingerprints • Running Example: f = 64, k = 3 • (Q1, Q2) near-duplicate ⇔ hamming-distance(simhash(Q1), simhash(Q2)) ≤ k
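As a baseline for the problem definition on slide 10, the check can be written as a linear scan over S. This is only an illustrative sketch; the function names are hypothetical, and a scan of this kind is exactly what the rest of the talk tries to avoid at index scale.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def has_near_duplicate(S, q, k=3):
    """True if some fingerprint in S is within k bits of the query q."""
    return any(hamming_distance(fp, q) <= k for fp in S)
```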

  11. Attempt One • Pre-sorted fingerprints in S • Exact probes with the 64-bit Q • Probe every Q′ with hd(Q, Q′) ≤ k = 3 • $\binom{64}{3}$ probes!
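A rough sketch of Attempt One: enumerate every Q′ within k bits of Q and do an exact lookup for each. The names are hypothetical, and a Python `set` stands in for the pre-sorted table (an assumption). The slide counts the dominant term, $\binom{64}{3} = 41{,}664$; including distances 0, 1, and 2 the enumeration below yields 43,745 probes.

```python
from itertools import combinations
from math import comb

def variants_within(q, f=64, k=3):
    """Yield every f-bit value at Hamming distance <= k from q."""
    for n in range(k + 1):
        for positions in combinations(range(f), n):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield q ^ mask

def probe_all_variants(S, q, f=64, k=3):
    """Exact-probe S (a set standing in for the sorted table) for each variant."""
    return any(v in S for v in variants_within(q, f, k))
```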

  12. Attempt Two • Fingerprints in S • S′: all fingerprints at most k bits away from some fingerprint in S (sorted) • Exact probe with the 64-bit Q • $|S'| \approx |S| \cdot \binom{64}{3}$

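  13. Intuition for Our Approach • Observation 1: Consider $2^d$ f-bit fingerprints in sorted order • Most of the $2^d$ combinations in the d most significant bits exist • Can quickly do an exact probe on the first d′ (≤ d) bits • Observation 2: [Diagram] A Q′ with hd(Q, Q′) = 3 differs from Q in only 3 bit positions, so the remaining stretches of bits are an exact match!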

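  14. Example • [Diagram] The 64-bit Q is split into four 16-bit pieces A, B, C, D • Permuted queries Q1 to Q4 each bring a different piece to the front (e.g., A B C D, D B C A) • Exact search on the leading 16 bits against the correspondingly permuted 64-bit fingerprints in S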

  15. Example: Analysis • 64 bits split into 4 pieces • 4 tables with permuted fingerprints • Exact search on 16 bits • If $2^{34}$ (≈10 billion) fingerprints • Each probe gives $2^{34-16}$ fingerprints
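The four-table example can be sketched as below. This is a simplified illustration under stated assumptions: the permutation is a plain rotation that brings each 16-bit piece to the front, each table is a sorted Python list searched with `bisect`, and all names are hypothetical. Because 3 differing bits cannot touch all 4 pieces, at least one piece matches exactly, so every near-duplicate shows up in at least one table.

```python
from bisect import bisect_left

F, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F // BLOCKS
MASK = (1 << F) - 1

def rotate_left(x, bits):
    """Rotate an F-bit value left by `bits` positions."""
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK

def build_tables(fingerprints):
    """Table i holds every fingerprint rotated so piece i leads, in sorted order."""
    return [sorted(rotate_left(fp, i * BLOCK_BITS) for fp in fingerprints)
            for i in range(BLOCKS)]

def query(tables, q):
    """Return the original fingerprints within K bits of q."""
    hits = set()
    for i, table in enumerate(tables):
        rq = rotate_left(q, i * BLOCK_BITS)
        prefix = rq >> (F - BLOCK_BITS)
        # All entries sharing the leading 16 bits form one contiguous run.
        lo = bisect_left(table, prefix << (F - BLOCK_BITS))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK_BITS))
        for cand in table[lo:hi]:
            # Rotation preserves Hamming distance, so compare rotated values.
            if bin(cand ^ rq).count("1") <= K:
                hits.add(rotate_left(cand, F - i * BLOCK_BITS))
    return hits
```

The candidates in `table[lo:hi]` correspond to the roughly $2^{34-16}$ fingerprints per probe mentioned on the slide; each one still needs the full Hamming-distance check.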

  16. Analysis (contd.) • f bits split into r pieces • $\binom{r}{k}$ tables with permuted fingerprints • Exact search on $f(1-k/r)$ bits • With $2^d$ existing fingerprints, each probe yields $2^{d - f(1-k/r)}$ fingerprints • Running example: f = 64, k = 3, d = 34
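The space/time tradeoff in this analysis is easy to tabulate. The helper below is a hypothetical illustration that just evaluates the slide's expressions for a given r; it returns the base-2 logarithm of the expected matches per probe, which may be fractional or negative.

```python
from math import comb

def tradeoff(f, k, d, r):
    """Return (number of tables, exact-search bits, log2 of matches per probe)."""
    tables = comb(r, k)
    search_bits = f * (1 - k / r)
    log2_matches = d - search_bits
    return tables, search_bits, log2_matches
```

For the running example, `tradeoff(64, 3, 34, 4)` gives 4 tables, 16 search bits, and $2^{18}$ matches per probe, in line with the previous slide.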

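  17. Same Idea Recursively • [Diagram] The 64-bit fingerprint is split into four 16-bit pieces A, B, C, D • For each choice of leading 16-bit piece, the remaining 48 bits are split into four 12-bit pieces • Each table matches exactly on one 16-bit piece and one 12-bit piece (28 bits), leaving 36 bits • 16 tables • $2^{34-28}$ matches/probe against the fingerprints in S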

  18. General Solution • Space (#tables) / Time (#matches) tradeoff • Minimum number of tables, with at most $2^X$ matches per probe? • General solution: $X(f,k,d) = 1$ if $d < X$; otherwise $X(f,k,d) = \min_{r>k} \binom{r}{k} \cdot X\left(\frac{fk}{r},\, k,\, d - \frac{(r-k)f}{r}\right)$
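The recurrence for the minimum number of tables can be explored numerically. The sketch below makes several assumptions that are not on the slide: the match threshold is a separate parameter named `tau` (the slide reuses X for it), each split reduces d by the f(r−k)/r exactly matched bits as in the analysis on slide 16, piece sizes are allowed to be fractional, and r is capped at the current number of bits.

```python
from functools import lru_cache
from math import comb, inf

@lru_cache(maxsize=None)
def min_tables(f, k, d, tau):
    """Minimum tables so that each probe yields fewer than 2**tau matches."""
    if d < tau:
        return 1
    best = inf
    for r in range(k + 1, int(f) + 1):
        tables_here = comb(r, k)
        if tables_here >= best:
            break  # comb(r, k) only grows with r, so no later r can win
        # Exact match on (r - k) of r pieces removes f * (r - k) / r bits from d;
        # the k unmatched pieces (f * k / r bits) are handled recursively.
        sub = min_tables(f * k / r, k, d - (r - k) * f / r, tau)
        best = min(best, tables_here * sub)
    return best
```

With f = 64, k = 3, d = 34, `min_tables(64, 3, 34, 19)` returns 4 (the single-level split into four pieces), and `min_tables(64, 3, 34, 7)` returns 16 (the two-level design with 28 matched bits).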

  19. Compression of Tables • We can efficiently compress tables • In expectation, first d bits are common in successive fingerprints • Exploit this to compress each of the tables • Details in the paper • Brings down space requirements by nearly 50%

  20. Rest of Talk • Simhash overview • Formal definition of the problem • Single algorithm • Batch algorithm • Experiments • Conclusions

  21. Reminder: Batch Problem • Tens of billions of pages indexed • Crawl millions of pages each day • Quickly find all new pages having a near-duplicate in the index

  22. MapReduce Framework • MapReduce framework used within Google • massively parallel • Map phase: • operate individually on a set of objects • Reduce phase • aggregate results of the mapped objects

  23. Batch Algorithm • Suppose 8B existing fingerprints (~32GB after compression): File F • 1M batch query fingerprints (~8MB): File B • F stored in GFS • chunked into pieces of roughly 64MB • each chunk replicated at 3 random nodes • B stored with a much higher replication factor

  24. Batch Algorithm (continued) • Map phase: • Duplicate detection between each chunk Fi and the whole of B • Build multiple tables for B (in memory) • Scan Fi and probe into B • Output near-duplicates in B • Reduce phase: • Merge outputs
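The map and reduce steps can be sketched on a single machine as follows. This is an illustration only, not the MapReduce job described in the talk: the function names are hypothetical, chunks are plain Python lists, and the in-memory tables for B are dictionaries keyed by one 16-bit piece each (an assumption replacing the sorted permuted tables).

```python
from collections import defaultdict

F, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F // BLOCKS

def block(fp, i):
    """Extract the i-th 16-bit piece of a 64-bit fingerprint."""
    return (fp >> (i * BLOCK_BITS)) & ((1 << BLOCK_BITS) - 1)

def build_batch_tables(batch):
    """One table per piece, built in memory over the query batch B."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for q in batch:
        for i in range(BLOCKS):
            tables[i][block(q, i)].append(q)
    return tables

def map_chunk(chunk, tables):
    """Map step: scan one chunk Fi and probe each fingerprint into B's tables."""
    found = set()
    for fp in chunk:
        for i in range(BLOCKS):
            for q in tables[i].get(block(fp, i), ()):
                if bin(fp ^ q).count("1") <= K:
                    found.add(q)
    return found

def reduce_outputs(outputs):
    """Reduce step: merge the per-chunk sets of near-duplicate queries."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged
```

A driver would call `map_chunk` once per chunk of F and pass the resulting sets to `reduce_outputs`; the result is the subset of B that has a near-duplicate somewhere in F.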

  25. Batch Algorithm (continued) • [Diagram: chunks F1, F2, ..., Fn shown alongside copies labeled B1 and B2, followed by a merge step]

  26. Experimental Analysis • Promising preliminary results! • Studied: • Choice of simhash parameters • Distribution of fingerprints

  27. Choice of Simhash Parameters

  28. Distribution of Fingerprints

  29. Distribution of Fingerprints

  30. Summary • Addressed near-duplicate detection in a web-crawling system • Proposed algorithms for single and batch cases • Preliminary experiments to validate our techniques and suitability of simhash • Mini-survey of near-duplicate detection in the paper

  31. Thank you!
