
Detecting Near Duplicates for Web Crawling






Presentation Transcript


  1. Detecting Near Duplicates for Web Crawling Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma Presenter: Siyuan Hua

  2. Outline • Application and why • Algorithm • Google story • Q&A

  3. Application of duplicate detection • Web Documents • Files in a file system • E-mails • Domain-specific corpora

  4. Why? • Web Mirrors • Clustering for “related documents” • Data extraction • Plagiarism • Spam detection • Duplicates in domain-specific corpora

  5. Fingerprinting With Simhash • Simhash maps each document to an f-bit value in which every bit is derived from the document's features • Properties of the simhash value: • The fingerprint of a document is a "hash" of its features • Similar documents have similar (close in Hamming distance) fingerprints
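A minimal sketch of how such a fingerprint can be computed from weighted features; the md5-based feature hashing and unit token weights below are illustrative assumptions, not the paper's exact choices:

```python
import hashlib

def simhash(features, f=64):
    """Illustrative simhash: each weighted feature nudges f counters;
    the sign of each counter becomes one bit of the fingerprint."""
    v = [0] * f
    for feature, weight in features:
        # Hash the feature to an f-bit value (md5 truncated to f bits here).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

# Example: token features with unit weights.
doc = "detecting near duplicates for web crawling"
print(hex(simhash([(tok, 1) for tok in doc.split()])))
```

Because each bit aggregates contributions from all features, flipping a few features flips only a few bits, which is what makes Hamming distance a useful similarity measure here.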

  6. Practical Hamming Distance Problem • Definition: • Given a collection of f-bit fingerprints and a query fingerprint F, identify whether any existing fingerprint differs from F in at most k bits. (In the batch-mode version there is a set of query fingerprints instead of a single query fingerprint.) • Simple solution: • Linear search, O(mn) time • Scale problem: • 1M query fingerprints against 8 billion existing web pages in 100 seconds. • The simple solution requires about 8 × 10^15 comparisons! (impossible in 100 seconds)
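For reference, the ruled-out linear scan is just XOR plus popcount per pair (a sketch; the `fingerprints` collection and k = 3 are assumed for illustration):

```python
def hamming_distance(a, b):
    # Number of bit positions in which two 64-bit fingerprints differ.
    return bin(a ^ b).count("1")

def near_duplicates_linear(query_fp, fingerprints, k=3):
    """Naive O(m*n) scan: compare the query against every stored fingerprint."""
    return [fp for fp in fingerprints if hamming_distance(query_fp, fp) <= k]
```

At 10^6 queries times 8 × 10^9 stored fingerprints this is the 8 × 10^15 comparisons the slide calls impossible within 100 seconds.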

  7. Practical Hamming Distance Problem (Cont.) • Observation: • Pre-compute all F' such that the Hamming distance between F' and F is at most k. With f = 64 and k = 3 that is C(64,1) + C(64,2) + C(64,3) = 43,744 variants, and as many probes, per query fingerprint. Too much time! • Pre-compute all F' such that some existing fingerprint is at most Hamming distance k away from F'; this inflates storage by the same factor. Too much space!
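The blow-up factor quoted above can be checked directly for f = 64 and k = 3:

```python
from math import comb

f, k = 64, 3
variants = sum(comb(f, i) for i in range(1, k + 1))
print(variants)  # 43744 nearby fingerprints per query (64 + 2016 + 41664)
```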

  8. Practical Hamming Distance Problem (Cont.) • Their solution: • Initialization: Build t tables T_1, T_2, ..., T_t. Associated with table T_i are two quantities: an integer p_i and a permutation π_i over the f bit-positions. • Given fingerprint F and an integer k, probe these tables in parallel: • Step 1: Identify all permuted fingerprints in T_i whose top p_i bit-positions match the top p_i bit-positions of π_i(F). • Step 2: For each of the permuted fingerprints identified in Step 1, check whether it differs from π_i(F) in at most k bit-positions. • Example: A 64-bit fingerprint divided into 6 blocks, with 3 of the 6 blocks leading in each table, gives C(6,3) = 20 tables. • Space: Reasonable! Time: Awesome!
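The paper stores each table as a sorted list of permuted fingerprints and probes the top p_i bits; the sketch below substitutes a hash table keyed on the chosen blocks, which realizes the same pigeonhole idea: a fingerprint within distance 3 agrees with the query on at least 3 of the 6 blocks, so at least one of the 20 tables finds it by exact key match. The block sizes and data-structure choices here are assumptions.

```python
from itertools import combinations

BLOCK_SIZES = [11, 11, 11, 11, 10, 10]   # six blocks covering the 64 bits

# (offset, size) of each block inside the 64-bit fingerprint
_offsets, pos = [], 0
for size in BLOCK_SIZES:
    _offsets.append((pos, size))
    pos += size

def _block(fp, idx):
    off, size = _offsets[idx]
    return (fp >> off) & ((1 << size) - 1)

def _key(fp, blocks):
    # Concatenate the chosen blocks' bits into a single lookup key.
    key = 0
    for idx in blocks:
        key = (key << _offsets[idx][1]) | _block(fp, idx)
    return key

def build_tables(fingerprints):
    """One table per choice of 3 of the 6 blocks: C(6,3) = 20 tables."""
    tables = []
    for blocks in combinations(range(6), 3):
        table = {}
        for fp in fingerprints:
            table.setdefault(_key(fp, blocks), []).append(fp)
        tables.append((blocks, table))
    return tables

def query(tables, fp, k=3):
    """Any fingerprint within distance k of fp leaves at least 3 of the
    6 blocks untouched, so some table's key matches it exactly."""
    hits = set()
    for blocks, table in tables:
        for cand in table.get(_key(fp, blocks), ()):
            if bin(cand ^ fp).count("1") <= k:
                hits.add(cand)
    return hits
```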

  9. Practical Hamming Distance Problem (Cont.) • Exploration of design parameters: • (1) A small set of permutations, to avoid a blowup in space requirements • (2) Large values for the various p_i, to avoid checking too many fingerprints in Step 2 • Tradeoff • Increasing the number of tables increases the p_i and hence reduces the query time. Decreasing the number of tables reduces storage requirements, but reduces the p_i and hence increases the query time.
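A back-of-the-envelope reading of the tradeoff, under assumptions consistent with the slides (roughly random fingerprints, about 8B ≈ 2^33 stored fingerprints, and p_i ≈ 31 when three of the six blocks lead): Step 2 examines about 2^(d − p_i) fingerprints per table.

```python
d = 33          # log2 of ~8 billion stored fingerprints (assumed)
p_i = 31        # bits matched in Step 1 when three ~10-11 bit blocks lead
tables = 20     # C(6, 3)

candidates_per_table = 2 ** (d - p_i)      # expected survivors of Step 1
total_checks = tables * candidates_per_table
print(candidates_per_table, total_checks)  # 4 per table, ~80 full comparisons per query
```

Fewer tables would force smaller p_i (fewer leading blocks), so 2^(d − p_i) and hence the query time would grow.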

  10. How Google utilizes this algorithm • Story: • Assume the existing fingerprints are stored in file F and the batch of query fingerprints is stored in file Q. With 8B 64-bit fingerprints, file F occupies 64 GB. • They use GFS files, which are broken into 64-MB chunks. Each chunk is replicated on three (almost) randomly chosen machines in a cluster, and each chunk is stored as a file in the local file system. • F is divided into 64-MB chunks while Q is kept in its entirety. • MapReduce finds all the near-duplicates in parallel.
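A hedged sketch of the batch mode, not Google's actual MapReduce code: here each map task is assumed to build the in-memory tables from the sketch after slide 8 over the whole query batch Q and stream one 64-MB chunk of F through them, and the reduce step merely merges the per-chunk results.

```python
def map_chunk(f_chunk, q_tables, k=3):
    """One map task: stream a 64-MB chunk of existing fingerprints F through
    tables built over the entire query batch Q (assumed to fit in RAM).
    build_tables()/query() are the functions sketched after slide 8."""
    matches = []
    for fp in f_chunk:
        for q in query(q_tables, fp, k):
            matches.append((q, fp))          # (query fingerprint, near-duplicate)
    return matches

def reduce_matches(per_chunk_matches):
    """Reduce step: merge the per-chunk match lists into one sorted output."""
    merged = []
    for matches in per_chunk_matches:
        merged.extend(matches)
    return sorted(merged)
```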

  11. Result and Analysis

  12. Result and Analysis

  13. Thank you very much! Questions?
