
Detecting Near-Duplicates for Web Crawling

Presentation By: Fernando Arreola

Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma

Outline
  • De-duplication
  • Goal of the Paper
  • Why is De-duplication Important?
  • Algorithm
  • Experiment
  • Related Work
  • Tying it Back to Lecture
  • Paper Evaluation
  • Questions
De-duplication
  • The process of eliminating near-duplicate web documents in a generic crawl
  • Challenge of near-duplicates:
    • Identifying exact duplicates is easy
      • Use checksums
    • How to identify near-duplicates?
      • Near-duplicates are identical in content but have differences in small areas
        • Ads, counters, and timestamps
Goal of the Paper
  • Present a near-duplicate detection system that improves web crawling
  • Near-duplicate detection system includes:
    • Simhash technique
      • A technique for transforming a web-page into an f-bit fingerprint
    • Solution to Hamming Distance Problem
      • Given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions
Why is De-duplication Important?
  • Elimination of near-duplicates:
    • Saves network bandwidth
      • Do not have to crawl content if similar to previously crawled content
    • Reduces storage cost
      • Do not have to store in local repository if similar to previously crawled content
    • Improves quality of search indexes
      • Local repository used for building search indexes not polluted by near-duplicates
Algorithm: Simhash Technique
  • Convert the web-page to a set of features
    • Using Information Retrieval techniques
      • e.g. tokenization, phrase detection
    • Give a weight to each feature
  • Hash each feature into an f-bit value
  • Maintain an f-dimensional vector
    • Every dimension starts at 0
  • Update the f-dimensional vector with the weight of each feature
    • If the i-th bit of the hash value is zero -> subtract the feature's weight from the i-th vector component
    • If the i-th bit of the hash value is one -> add the feature's weight to the i-th vector component
  • The vector will have positive and negative components
    • The sign (+/-) of each component becomes one bit of the fingerprint
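To make these steps concrete, here is a minimal Python sketch of the technique as just described; the MD5-based feature hash and the (token, weight) input format are stand-ins rather than the paper's actual IR pipeline (the paper uses f = 64 in practice).

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash from (token, weight) feature pairs."""
    v = [0] * f                                   # f-dimensional vector, all zeroes
    for token, weight in features:
        # Hash each feature into an f-bit value (MD5 here is just a stand-in).
        h = int.from_bytes(hashlib.md5(token.encode()).digest(), "big") & ((1 << f) - 1)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight                    # i-th bit is one: add the weight
            else:
                v[i] -= weight                    # i-th bit is zero: subtract the weight
    # The sign of each component becomes one bit of the fingerprint.
    return sum(1 << i for i in range(f) if v[i] > 0)
```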
Algorithm: Simhash Technique (cont.)
  • A very simple example
    • One web-page
      • Web-page text: “Simhash Technique”
    • Reduced to two features
      • “Simhash” -> weight = 2
      • “Technique” -> weight = 4
    • Hash each feature to 4 bits
      • “Simhash” -> 1101
      • “Technique” -> 0110
Algorithm: Simhash Technique (cont.)
  • Start with a vector of all zeroes: [0, 0, 0, 0]


Algorithm: Simhash Technique (cont.)
  • Apply the “Simhash” feature (weight = 2, hash = 1101):

    feature’s f-bit value   calculation   new vector value
    1                       0 + 2          2
    1                       0 + 2          2
    0                       0 - 2         -2
    1                       0 + 2          2

Algorithm: Simhash Technique (cont.)
  • Apply the “Technique” feature (weight = 4, hash = 0110):

    feature’s f-bit value   calculation   new vector value
    0                       2 - 4         -2
    1                       2 + 4          6
    1                       -2 + 4         2
    0                       2 - 4         -2

Algorithm: Simhash Technique (cont.)
  • Final vector: [-2, 6, 2, -2]
  • The signs of the vector values are -, +, +, -
  • Final 4-bit fingerprint = 0110

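The worked example can be replayed with a small variant of the earlier sketch that substitutes the slide's fixed 4-bit hashes for a real hash function (toy_simhash and its hash table are illustrative names, not from the paper):

```python
def toy_simhash(features, hashes, f=4):
    """Simhash with hand-picked f-bit hashes; bit 0 is the leftmost bit."""
    v = [0] * f
    for token, weight in features:
        h = hashes[token]
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1
            v[i] += weight if bit else -weight
    return "".join("1" if x > 0 else "0" for x in v)

print(toy_simhash([("Simhash", 2), ("Technique", 4)],
                  {"Simhash": 0b1101, "Technique": 0b0110}))  # prints 0110
```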
Algorithm: Solution to Hamming Distance Problem
  • Problem: Given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions
  • Solution:
    • Create tables containing the fingerprints
      • Each table has a permutation (π) and a small integer (p) associated with it
      • Apply the permutation associated with each table to its fingerprints
      • Sort the tables
    • Store the tables in the main memory of a set of machines
      • Iterate through the tables in parallel
        • Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
        • For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits
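A single-machine sketch of one table probe follows, assuming each table is sorted by permuted fingerprint so that all entries sharing the query's top p bits form one contiguous run (permute, build_table, and probe are illustrative names; the paper shards such tables across many machines):

```python
from bisect import bisect_left, bisect_right

F_BITS = 64

def permute(fp, perm):
    """Rearrange the bits of fp; perm[i] gives the source position of output bit i."""
    out = 0
    for src in perm:
        out = (out << 1) | ((fp >> (F_BITS - 1 - src)) & 1)
    return out

def build_table(fingerprints, perm):
    """One table: every fingerprint permuted by perm, then sorted."""
    return sorted(permute(fp, perm) for fp in fingerprints)

def probe(table, query_fp, perm, p, k):
    """Find stored fingerprints within Hamming distance k of query_fp."""
    q = permute(query_fp, perm)
    lo = q & ~((1 << (F_BITS - p)) - 1)        # q's top p bits, then all zeroes
    hi = lo | ((1 << (F_BITS - p)) - 1)        # q's top p bits, then all ones
    start, end = bisect_left(table, lo), bisect_right(table, hi)
    # Verify each candidate with an exact check (Hamming distance is
    # unaffected by applying the same permutation to both fingerprints).
    return [e for e in table[start:end] if bin(e ^ q).count("1") <= k]
```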
Algorithm: Solution to Hamming Distance Problem (cont.)
  • A simple example
    • F = 0100 1101
    • k = 3
    • Have a collection of 8 fingerprints
    • Create two tables
Algorithm: Solution to Hamming Distance Problem (cont.)
  • F = 0100 1101
  • First table: π_1(F) = 1101 0100
  • Second table: π_2(F) = 0101 0011
  • A permuted fingerprint in the first table shares the top bits of π_1(F): Match!

Algorithm: Solution to Hamming Distance Problem (cont.)
  • With k = 3, only the matching fingerprint in the first table is a near-duplicate of F
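That final verification is just an XOR followed by a population count; here is the check against a hypothetical matching entry (the slide's stored table values are not reproduced here):

```python
F = 0b01001101
candidate = 0b11001100                 # hypothetical entry from the first table
distance = bin(F ^ candidate).count("1")
print(distance, distance <= 3)         # 2 True -> near-duplicate for k = 3
```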


Algorithm: Compression of Tables
  1. Store the first fingerprint in a block (1024 bytes)
  2. XOR the current fingerprint with the previous one
  3. Append to the block the Huffman code for the position of the most significant 1-bit
  4. Append to the block the bits after the most significant 1-bit
  5. Repeat steps 2-4 until the block is full
  • Comparing to the query fingerprint
    • Use the last fingerprint in each block as its key, and perform interpolation search over the keys to locate and decompress the appropriate block
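A Python sketch of steps 1-5 and their inverse, with one simplification: the Huffman code for the MSB position is replaced here by a fixed-width 6-bit field, and the fingerprints are assumed sorted and distinct so every XOR is nonzero:

```python
def encode_block(fps, f=64):
    """Compress a sorted run of distinct f-bit fingerprints into a bit string
    (the paper sizes these runs so each block fills 1024 bytes)."""
    bits = format(fps[0], f"0{f}b")               # first fingerprint stored whole
    for prev, fp in zip(fps, fps[1:]):
        x = fp ^ prev                             # XOR with the previous fingerprint
        msb = x.bit_length() - 1                  # position of the most significant 1-bit
        bits += format(msb, "06b")                # fixed-width stand-in for the Huffman code
        if msb:                                   # append the bits after the MSB
            bits += format(x & ((1 << msb) - 1), f"0{msb}b")
    return bits

def decode_block(bits, f=64):
    """Invert encode_block, recovering every fingerprint in the block."""
    fps, pos = [int(bits[:f], 2)], f
    while pos < len(bits):
        msb = int(bits[pos:pos + 6], 2); pos += 6
        rest = int(bits[pos:pos + msb], 2) if msb else 0
        pos += msb
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps
```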
Algorithm: Extending to Batch Queries
  • Problem: Find the near-duplicates for a batch of query fingerprints, not just one
  • Solution:
    • Use Google File System (GFS) and MapReduce
      • Create two files
        • File F has the collection of fingerprints
        • File Q has the query fingerprints
      • Store the files in GFS
        • GFS breaks up the files into chunks
      • Use MapReduce to solve the Hamming Distance Problem for each chunk of F for all queries in Q
        • MapReduce allows for a task to be created per chunk
        • Iterate through chunks in parallel
        • Each task produces output of near-duplicates found
      • Produce sorted file from output of each task
        • Remove duplicates if necessary
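A local stand-in for this pipeline is sketched below, with multiprocessing playing the role of MapReduce and in-memory lists standing in for the GFS files F and Q (K, map_task, and batch_query are illustrative names):

```python
from multiprocessing import Pool

K = 3                                       # maximum Hamming distance

def map_task(args):
    """One task per chunk of F: probe every query in Q against that chunk."""
    chunk, queries = args
    return [(q, fp) for q in queries for fp in chunk
            if bin(q ^ fp).count("1") <= K]

def batch_query(fingerprints, queries, n_chunks=4):
    """Split F into chunks, run the tasks in parallel, merge their outputs."""
    size = max(1, len(fingerprints) // n_chunks)
    chunks = [fingerprints[i:i + size] for i in range(0, len(fingerprints), size)]
    with Pool() as pool:
        parts = pool.map(map_task, [(c, queries) for c in chunks])
    # Produce one sorted output, removing duplicates if necessary.
    return sorted({pair for part in parts for pair in part})
```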
Experiment: Parameters
  • 8 billion web pages used
  • k = 1, …, 10
  • Manually tagged pairs as follows:
    • True positives
      • Differ slightly
    • False positives
      • Radically different pairs
    • Unknown
      • Could not be evaluated
Experiment: Results
  • Accuracy
    • Low k value -> a lot of false negatives
    • High k value -> a lot of false positives
    • Best value -> k = 3
      • 75% of near-duplicates reported
      • 75% of reported cases are true positives
  • Running Time
    • Solution to the Hamming Distance Problem: O(log(p))
    • Batch Query + Compression:
      • 32 GB file & 200 tasks -> runs in under 100 seconds
Related Work
  • Clustering related documents
    • Detect near-duplicates to show related pages
  • Data extraction
    • Determine schema of similar pages to obtain information
  • Plagiarism
    • Detect pages that have borrowed from each other
  • Spam
    • Detect spam before user receives it
Tying it Back to Lecture
  • Similarities
    • Indicated importance of de-duplication to save crawler resources
    • Brief summary of several uses for near-duplicate detection
  • Differences
    • Lecture focus:
      • Breadth-first look at algorithms for near-duplicate detection
    • Paper focus:
      • In-depth look at simhash and the Hamming Distance algorithm
        • Includes how to implement and effectiveness
Paper Evaluation: Pros
  • Thorough step-by-step explanation of the algorithm implementation
  • Thorough explanation of how the conclusions were reached
  • Included a brief description of how to improve the simhash + Hamming Distance algorithm
    • Categorize web-pages before running simhash, create algorithm to remove ads or timestamps, etc.
Paper Evaluation: Cons
  • No comparison
    • How much more effective or faster is it than other algorithms?
    • By how much did it improve the crawler?
  • Limited batch queries to a specific technology
    • Implementation required use of GFS
    • An approach not tied to a specific technology might be more broadly applicable