Detecting Near Duplicates for Web Crawling

Detecting Near Duplicates for Web Crawling. Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presented by Chintan Udeshi. Introduction. There are many duplicate documents on the web. Many pages differ only in a small portion, for example because of the advertisements displayed.

Presentation Transcript


  1. Detecting Near Duplicates for Web Crawling • Authors: • Gurmeet Singh Manku • Arvind Jain • Anish Das Sarma • Presented by Chintan Udeshi Udeshi-CS572

  2. Introduction • There are many duplicate and near-duplicate documents on the web. • Many pages differ only in a small portion, for example because of the advertisements displayed. • Such pages are redundant from a crawling point of view. • This paper uses Charikar's simhash fingerprinting technique to find near-duplicate documents. • The technique is useful for both online queries (a single fingerprint) and batch queries (many fingerprints).

  3. Advantages of duplicate detection • Saves network bandwidth. • Reduces storage cost. • Improves the quality of search results. • Reduces load on remote hosts.

  4. Challenges in duplicate detection • Scale (billions of pages) • Speed (each near-duplicate decision must be made quickly) • Resources (the system should use as few machines as possible)

  5. FINGERPRINTING WITH SIMHASH • Extract a set of features from the document, along with a weight for each feature. • Use simhash to generate an f-bit fingerprint from the hashed, weighted features of the document. • With simhash, 64-bit fingerprints are good enough for 8B web pages.
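
The simhash computation summarized on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names `feature_hash` and `simhash` are hypothetical, and MD5 is only an assumed choice of per-feature hash. Each feature's hash bits add or subtract the feature's weight in an f-dimensional vector, and the signs of the vector components give the fingerprint bits.

```python
import hashlib


def feature_hash(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (MD5 is an illustrative assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features: dict[str, float], f: int = 64) -> int:
    """Return an f-bit simhash fingerprint of a {feature: weight} mapping."""
    v = [0.0] * f
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            # A 1-bit in the feature hash adds the weight, a 0-bit subtracts it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        # The sign of each component decides the corresponding fingerprint bit.
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because heavy features dominate the sign of each component, documents that share most of their weighted features tend to receive fingerprints that differ in only a few bits.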

  6. Idea behind using the simhash algorithm • Simhash has 2 properties: • A: The fingerprint of a document is a hash of its features. • B: Similar documents have similar hash values. • Our algorithms are designed assuming that Property A holds, i.e. that fingerprints are distributed uniformly at random, and we experimentally measure the impact of the non-uniformity introduced by Property B on real datasets.

  7. Hamming distance problem • Consider a collection of 8B 64-bit fingerprints, occupying 64 GB. • Given a query fingerprint F, we have to decide whether any existing fingerprint differs from F in at most k = 3 bit positions. • The algorithm is different for online queries and batch queries.
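
To make the problem statement concrete, here is a brute-force baseline. It is only a sketch of what has to be decided, not the algorithm from the paper (a linear scan over 8B fingerprints would be far too slow per query); the function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def has_near_duplicate(existing: list[int], query: int, k: int = 3) -> bool:
    """True if some existing fingerprint differs from `query` in at most k bits."""
    return any(hamming_distance(fp, query) <= k for fp in existing)


# Storage check for the slide's figures: 8B fingerprints of 64 bits (8 bytes) each.
STORAGE_BYTES = 8_000_000_000 * 64 // 8  # 64,000,000,000 bytes, i.e. 64 GB
```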

  8. Algorithm for online queries • We build t tables: T1, T2, ..., Tt. • Each table Ti has an associated integer Pi and a permutation of the bit positions. • Table Ti is constructed by applying its permutation to each existing fingerprint and sorting the results. • A query for fingerprint F probes each table in 2 steps: • Identify all permuted fingerprints in Ti whose top Pi bit positions match the top Pi bit positions of the permuted F. • For each fingerprint identified in the first step, check whether it differs from the permuted F in at most k bit positions.
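
The table-based lookup can be sketched as below, under simplifying assumptions that are not stated on the slide: the 64 bits are split into k + 1 equal blocks, the permutation for table i is a rotation that brings block i to the top, and every table compares the same number of top bits. If two fingerprints differ in at most k bits, at least one of the k + 1 blocks must be identical, so every near-duplicate shows up as a candidate in at least one table. The class name `OnlineIndex` is hypothetical.

```python
from bisect import bisect_left

F_BITS = 64
MASK = (1 << F_BITS) - 1


def rotate_left(x: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits (the assumed permutation)."""
    r %= F_BITS
    return ((x << r) | (x >> (F_BITS - r))) & MASK


class OnlineIndex:
    def __init__(self, fingerprints: list[int], k: int = 3):
        self.k = k
        self.t = k + 1                # number of tables
        self.p = F_BITS // self.t     # top bits matched in each table
        # Table i holds every fingerprint rotated so block i is on top, sorted.
        self.tables = [
            sorted(rotate_left(fp, i * self.p) for fp in fingerprints)
            for i in range(self.t)
        ]

    def query(self, fp: int) -> bool:
        shift = F_BITS - self.p
        for i, table in enumerate(self.tables):
            q = rotate_left(fp, i * self.p)
            prefix = q >> shift
            # Step 1: locate the run of entries whose top p bits equal the prefix.
            pos = bisect_left(table, prefix << shift)
            while pos < len(table) and table[pos] >> shift == prefix:
                # Step 2: full comparison of the candidate against the query.
                if bin(table[pos] ^ q).count("1") <= self.k:
                    return True
                pos += 1
        return False
```

With k = 3 this gives 4 tables that each match on the top 16 bits; sorting each table is what lets step 1 use binary search instead of a scan.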

  9. Design parameters for the algorithm • There is a trade-off between the number of tables and the value of Pi (the number of top bits matched in table Ti) chosen for each table. • Increasing the number of tables increases Pi and hence reduces the query time. • Decreasing the number of tables reduces storage requirements, but reduces Pi and thus increases the query time.

  10. Algorithm for batch queries • Files are first broken into 64 MB chunks. • Each chunk is replicated at three randomly chosen machines in a cluster. • Each chunk is stored as a file in the local file system. • First, we solve the Hamming distance problem for each 64 MB chunk. • Then we combine the output from all the chunks to produce the final output.
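
The chunk-then-combine structure can be sketched on a single machine as follows. This is only an illustration of the data flow: it assumes that the file of existing fingerprints is the one split into chunks and that every chunk is compared against the whole query batch, it uses a brute-force comparison inside each chunk, and it leaves out replication and distribution entirely. All names are hypothetical.

```python
from typing import Iterator

# A 64 MB chunk of 8-byte fingerprints holds this many entries.
FINGERPRINTS_PER_CHUNK = 64 * 2**20 // 8


def chunks(fingerprints: list[int], size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of at most `size` fingerprints."""
    for start in range(0, len(fingerprints), size):
        yield fingerprints[start:start + size]


def solve_chunk(chunk: list[int], queries: list[int], k: int = 3) -> set[int]:
    """Queries that have a near-duplicate (at most k differing bits) in this chunk."""
    return {
        q for q in queries
        if any(bin(fp ^ q).count("1") <= k for fp in chunk)
    }


def batch_near_duplicates(existing: list[int], queries: list[int],
                          chunk_size: int, k: int = 3) -> set[int]:
    """Solve each chunk independently, then combine the per-chunk outputs."""
    result: set[int] = set()
    for chunk in chunks(existing, chunk_size):
        result |= solve_chunk(chunk, queries, k)
    return result
```

Because the chunks are independent, the per-chunk step is the part that can run in parallel on different machines.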

  11. Broder's shingle-based fingerprints • Broder's shingle-based fingerprints use Rabin fingerprints. • Given an n-bit message m0, ..., mn-1, view it as a polynomial f(x) over GF(2) whose coefficients are the message bits. The Rabin fingerprint of m is the remainder r(x) after division of f(x) by a fixed irreducible polynomial p(x).
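
The remainder computation behind a Rabin fingerprint can be sketched with integers standing in for polynomials over GF(2). The encoding (bit i of the integer is the coefficient of x^i) and the function name are assumptions made for this example; practical implementations use a high-degree irreducible polynomial and table lookups rather than bit-by-bit division.

```python
def rabin_fingerprint(message: int, poly: int) -> int:
    """Remainder of f(x) divided by p(x) over GF(2).

    `message` encodes f(x) and `poly` encodes p(x); bit i of each integer
    is the coefficient of x^i. Subtraction over GF(2) is XOR.
    """
    degree = poly.bit_length() - 1
    r = message
    while r.bit_length() - 1 >= degree:
        # Cancel the leading term of r with a shifted copy of p(x).
        shift = r.bit_length() - 1 - degree
        r ^= poly << shift
    return r


# Small worked example with p(x) = x^3 + x + 1, which is irreducible over GF(2):
# (x^4 + x^2 + 1) mod p(x) = x + 1
assert rabin_fingerprint(0b10101, 0b1011) == 0b11
```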

  12. Comparison with Broder's shingle-based fingerprints • For the comparison, 6 Rabin fingerprints are calculated per document. • Two documents are then checked to see whether 2 or more of these fingerprints match. • The combined fingerprint takes approximately 24 bytes per document. • Simhash, on the other hand, needs only 64 bits (8 bytes) per document for 8B web pages.
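
The "2 or more of 6" test and the storage difference can be sketched as follows. The position-wise comparison is an assumption of this example (the slide does not say how the six fingerprints are paired up), and the function name is hypothetical.

```python
def shingle_near_duplicate(fps_a: list[int], fps_b: list[int],
                           min_matches: int = 2) -> bool:
    """True if at least `min_matches` fingerprints agree position by position."""
    matches = sum(1 for a, b in zip(fps_a, fps_b) if a == b)
    return matches >= min_matches


# Storage for 8B pages under each scheme, using the per-document sizes above.
PAGES = 8_000_000_000
SHINGLE_BYTES = PAGES * 24   # 192 GB
SIMHASH_BYTES = PAGES * 8    # 64 GB
```

On these figures, simhash fingerprints need one third of the space of the shingle-based ones.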

  13. Experimental results • There is a trade-off between f and k when detecting near-duplicate web pages using simhash. • Topics include: • Choice of parameters • Distribution of fingerprints • Scalability

  14. Choice of parameters • Vary k from 1 to 10. • Classify the resulting pairs of pages into categories: • False positive • True positive • Unknown • There is a trade-off between precision and recall as k varies. • k = 3 gives reasonable results for 64-bit fingerprints.

  15. Distribution of fingerprints (1) • The left side of the distribution does not drop off as rapidly as the right side. • This is because some pages are similar to each other. • So their fingerprints differ by a moderate number of bits.

  16. Distribution of fingerprints (2) • More or less uniform, with spikes in some places. • Reasons for the spikes: • Empty pages. • "File not found" pages. • Similar login pages used by multiple websites.

  17. Nature of the corpus • Corpora fall mainly into 4 types: • Web pages • Files in a file system • E-mail • Domain-specific corpora • This paper mainly involves finding near-duplicates for web pages.

  18. Scalability • For batch mode, a compressed version of file Q occupies almost 32 GB. • Chunks are scanned at a combined rate of approximately 1 GB/s. • So the computation usually finishes in under 100 seconds.

  19. Need to detect duplicates • Web mirrors • Clustering for related-documents queries • Data extraction • Plagiarism • Spam detection • Duplicates in domain-specific corpora

  20. Feature sets per document • Shingles from page content • Document vector from page content • Connectivity information • Anchor text and anchor window • Phrases
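
As an illustration of the first feature type, word-level shingles can be extracted as below. This is a minimal sketch: whitespace tokenization, lowercasing, and the default shingle width are assumptions of the example, not choices stated in the slides.

```python
def shingles(text: str, w: int = 3) -> set[tuple[str, ...]]:
    """Return the set of distinct w-word shingles (contiguous word windows)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# "a rose is a rose is a rose" has five 4-word windows but only three distinct ones.
example = shingles("a rose is a rose is a rose", w=4)
assert len(example) == 3
```

Each shingle could then serve as one feature, with a weight such as its frequency, when computing a fingerprint.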

  21. Future research • Can we classify web pages into categories and search for near-duplicates only within the relevant categories? • Is it feasible to devise algorithms for detecting portions of web pages that contain ads or timestamps? • How sensitive is the simhash algorithm to changes in feature selection and in the assignment of weights to features? • Algorithms for clustering documents. • Can we categorize documents based on language?

  22. Thank you. Q & A?
