
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms


Presentation Transcript


  1. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan

  2. Overview • Two near-duplicate detection algorithms (Broder's and Charikar's) are compared on a very large scale (1.6 billion distinct web pages) • We need to understand the pros and cons of each algorithm in different situations • We need a new approach that detects near-duplicates more accurately

  3. Relation to course material • Discusses the two algorithms introduced in lecture in more detail, and draws important conclusions by comparing experimental results • Broder's algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the min-value vector • Both algorithms follow the general paradigm for finding near-duplicates: generate and compare a signature for each document

  4. Broder's Algorithm • Begin by preprocessing HTML tags and URLs in each document (also done in Charikar's algorithm) • Use m hash functions to fingerprint the shingle sequence, and take the minimum fingerprint under each function, yielding m min-values

  5. Broder's Algorithm • Divide the m min-values into m' groups of l elements each, e.g. m = 84, m' = 6, l = 14 • Concatenate the min-values in each group to reduce the vector from m entries to m' entries • Fingerprint each of the m' entries to generate an m'-dimensional vector (the supershingle vector)
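
The pipeline on slides 4-5 maps onto a few lines of code. Below is a minimal sketch: the shingle length k = 8 and the SHA-1-based 64-bit fingerprints are assumptions standing in for the Rabin fingerprints the paper uses, while the parameters m = 84, m' = 6, l = 14 come from the slide.

```python
import hashlib

def fingerprint(obj, salt=0):
    # 64-bit stand-in for the Rabin fingerprints used in the paper
    data = repr((salt, obj)).encode()
    return int.from_bytes(hashlib.sha1(data).digest()[:8], 'big')

def shingles(tokens, k=8):
    # all k-token shingles of the token sequence
    # (k = 8 is an assumption; the slide does not give the shingle length)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def supershingles(tokens, m=84, m_prime=6):
    # m min-values reduced to m' supershingles (m = 84, m' = 6, l = 14)
    sh = shingles(tokens)
    # one min-value per hash function: smallest fingerprint over all shingles
    minvals = [min(fingerprint(s, salt=i) for s in sh) for i in range(m)]
    l = m // m_prime  # group size, l = 14
    # concatenate each group of l min-values and fingerprint the result
    return [fingerprint(tuple(minvals[j * l:(j + 1) * l]))
            for j in range(m_prime)]
```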

  6. B-Similarity • Definition: the number of identical entries in the supershingle vectors of two pages • Two pages are near-duplicates iff their B-similarity is at least 2, e.g. with m' = 6, pairs with at least 2 agreeing entries are near-duplicates
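
In code, B-similarity is a positional comparison of the two supershingle vectors from the sketch above:

```python
def b_similarity(ss1, ss2):
    # number of positions where the two supershingle vectors agree
    return sum(a == b for a, b in zip(ss1, ss2))

def is_b_near_duplicate(ss1, ss2, threshold=2):
    # near-duplicates iff B-similarity >= 2 (with m' = 6 supershingles)
    return b_similarity(ss1, ss2) >= threshold
```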

  7. Charikar's algorithm • Extract a set of features (meaningful tokens) from a web page; each feature is tagged with a weight • Each feature (token) is projected to a b-bit vector in which every entry takes a value in {-1, 1}

  8. Charikar's algorithm • Sum up the b-bit projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector • Generate the final b-dimensional bit vector by setting each positive entry to 1 and each non-positive entry to 0
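
A minimal sketch of slides 7-8 (this is the simhash construction). Deriving the {-1, +1} projection from a SHA-512 digest is an assumption made for illustration; how features are extracted and weighted is not specified on the slides, so a token-to-weight mapping is taken as given.

```python
import hashlib

def project(feature, b=384):
    # deterministic pseudo-random {-1, +1} vector for a feature,
    # derived from the bits of a cryptographic hash
    digest = hashlib.sha512(feature.encode()).digest()
    bits = int.from_bytes(digest, 'big')  # 512 bits, enough for b = 384
    return [1 if (bits >> i) & 1 else -1 for i in range(b)]

def simhash(weighted_features, b=384):
    # weighted sum of the per-feature {-1, +1} projections
    acc = [0.0] * b
    for feature, weight in weighted_features.items():
        for i, v in enumerate(project(feature, b)):
            acc[i] += weight * v
    # final b-bit vector: 1 for positive entries, 0 for non-positive
    return [1 if x > 0 else 0 for x in acc]
```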

  9. C-Similarity • Definition: the C-similarity of two pages is the number of bits on which their final projections agree • Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold, e.g. b = 384, threshold = 372
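
The corresponding check over the bit vectors produced by the simhash sketch above:

```python
def c_similarity(v1, v2):
    # number of bit positions on which the two final projections agree
    return sum(a == b for a, b in zip(v1, v2))

def is_c_near_duplicate(v1, v2, threshold=372):
    # near-duplicates iff at least 372 of the b = 384 bits agree
    return c_similarity(v1, v2) >= threshold
```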

  10. Comparison of the two algorithms [Comparison table not reproduced in the transcript.] Note: T is the total number of tokens across all web pages; D is the number of web pages.

  11. Comparison of experimental results • Construct a similarity graph in which every page is a node and every near-duplicate pair is an edge • A page is considered a near-duplicate page iff its node is incident to at least one edge
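
A sketch of that definition, assuming the near-duplicate pairs have already been computed; the degree counts are what the log-log plot on the next slide is drawn from.

```python
from collections import defaultdict

def near_duplicate_pages(near_dup_pairs):
    # degree of each node in the similarity graph
    degree = defaultdict(int)
    for u, v in near_dup_pairs:
        degree[u] += 1
        degree[v] += 1
    # a page is a near-duplicate page iff its degree is at least 1,
    # i.e. iff it appears in at least one near-duplicate pair
    return set(degree)
```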

  12. Comparison of experimental results [Figure: distribution of node degree in log-log scale, one panel for B-similarity and one for C-similarity.]

  13. Comparison of experimental results • Precision measurement • Precision on pairs from the same site is low because pages on the same site very often share the same boilerplate text and differ only in the main item in the center of the page

  14. Comparison of experimental results • Term differences in the two algorithms [table not reproduced in the transcript]

  15. Comparison of experimental results [Figure: distribution of term differences, one panel for Broder's algorithm and one for Charikar's algorithm.]

  16. Comparison of experimental results • Error cases [examples not reproduced in the transcript]

  17. A combined algorithm • First use Broder's algorithm to compute all B-similar pairs, then use Charikar's algorithm to filter out pairs whose C-similarity falls below a fixed threshold • The reason: Broder's false positives (pairs with a run of consecutive term differences embedded in large amounts of shared boilerplate text) can be filtered out by Charikar's algorithm • Overall precision improves to 0.79
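
A sketch of the two-pass scheme, reusing supershingles, b_similarity, simhash, and c_similarity from the earlier sketches. The all-pairs candidate search is for clarity only; a 1.6-billion-page run would index the supershingles instead. How tokens and feature weights are produced per page is left open, as on the slides.

```python
from itertools import combinations

def combined_near_duplicates(docs, b_thresh=2, c_thresh=372):
    # docs: page_id -> (token_list, {feature: weight})
    # Pass 1 (Broder): propose candidate pairs by B-similarity
    ss = {p: supershingles(tokens) for p, (tokens, _) in docs.items()}
    candidates = [(p, q) for p, q in combinations(docs, 2)
                  if b_similarity(ss[p], ss[q]) >= b_thresh]
    # Pass 2 (Charikar): project only pages that survived pass 1
    # ("on the fly"), and drop pairs below the C-similarity threshold
    needed = {p for pair in candidates for p in pair}
    proj = {p: simhash(docs[p][1]) for p in needed}
    return [(p, q) for p, q in candidates
            if c_similarity(proj[p], proj[q]) >= c_thresh]
```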

  18. Pros • The experiment is persuasive and reliable enough to support conclusions about the pros and cons of the two algorithms, e.g. large data sample, human evaluation, error-case analysis • The combined approach retains the advantages of both algorithms and avoids large numbers of false positives • In the combined approach, Charikar's algorithm is computed on the fly, which saves considerable space

  19. Cons • The experiment focuses on the precision of the two algorithms but does not gather statistics on recall • The combined algorithm adds time overhead, because finding a near-duplicate pair requires running both algorithms

  20. Improvements • Consider token order in Charikar's algorithm by using shingling • Consider token frequency in Broder's algorithm by weighting shingles by their frequency
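
One hypothetical way to realize both suggestions at once: run Charikar's projection over frequency-weighted shingles, so that shingling restores token order and the counts restore frequency information. This illustrates the proposed direction only; nothing like it is evaluated in the paper.

```python
from collections import Counter

def hybrid_signature(tokens, k=8, b=384):
    # Hypothetical hybrid: simhash (from the earlier sketch) over k-token
    # shingles, weighted by how often each shingle occurs in the page
    counts = Counter(' '.join(tokens[i:i + k])
                     for i in range(len(tokens) - k + 1))
    return simhash(dict(counts), b)
```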
