
Near-Duplicates Detection

Naama Kraus

Slides are based on the book Introduction to Information Retrieval by Manning, Raghavan and Schütze

Some slides are courtesy of Kira Radinsky



Why duplicate detection?

  • About 30-40% of the pages on the Web are (near) duplicates of other pages.

    • E.g., mirror sites

  • Search engines try to avoid indexing duplicate pages

    • Save storage

    • Save processing time

    • Avoid returning duplicate pages in search results

      • Improves the user’s search experience

  • The goal: detect duplicate pages



Exact-duplicates detection

  • A naïve approach – detect exact duplicates

    • Map each page to some fingerprint, e.g. a 64-bit hash of its content

    • If two web pages have an equal fingerprint → check whether their content is equal
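A fingerprint here can be any well-mixed 64-bit hash of the page content. A minimal sketch in Python, assuming BLAKE2b as the hash (the slides do not prescribe one):

  import hashlib

  def fingerprint64(page: str) -> int:
      """Map a page to a 64-bit fingerprint (8-byte BLAKE2b digest, assumed)."""
      digest = hashlib.blake2b(page.encode("utf-8"), digest_size=8).digest()
      return int.from_bytes(digest, "big")

  # Equal fingerprints only suggest equality; confirm by comparing the content.
  a, b = "<html>same page</html>", "<html>same page</html>"
  if fingerprint64(a) == fingerprint64(b) and a == b:
      print("exact duplicates")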



Near-duplicates

  • What about near-duplicates?

    • Pages that are almost identical.

    • Common on the Web. E.g., only date differs.

    • Eliminating near duplicates is desired!

  • The challenge

    • How to efficiently detect near duplicates?

    • Exhaustively comparing all pairs of web pages wouldn’t scale.



Shingling

  • The k-shingles of a document d are defined to be the set of all consecutive sequences of k terms in d

    • k is a positive integer

  • E.g., the 4-shingles of “My name is Inigo Montoya. You killed my father. Prepare to die”:

    {
    my name is inigo
    name is inigo montoya
    is inigo montoya you
    inigo montoya you killed
    montoya you killed my
    you killed my father
    killed my father prepare
    my father prepare to
    father prepare to die
    }
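A minimal sketch of shingle extraction in Python; the lowercasing and the tokenization regex are assumptions, since the slides do not specify how terms are produced:

  import re

  def shingles(text: str, k: int = 4) -> set:
      """Return the set of k-shingles (consecutive k-term sequences) of text."""
      terms = re.findall(r"[a-z]+", text.lower())  # assumed tokenization
      return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

  doc = "My name is Inigo Montoya. You killed my father. Prepare to die"
  print(sorted(shingles(doc)))  # prints the nine 4-shingles listed above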



Computing Similarity

  • Intuition: two documents are near-duplicates if their shingle sets are ‘nearly the same’.

  • Measure similarity using Jaccard coefficient

    • Degree of overlap between two sets

  • Denote by S(d) the set of shingles of document d

  • J(S(d1), S(d2)) = |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|

  • If J exceeds a preset threshold (e.g. 0.9), declare d1, d2 near-duplicates.

  • Issue: computation is costly and done pairwise

    • How can we compute Jaccard efficiently?
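Before turning to the efficient method, the definition itself is a one-liner over Python sets (the toy shingle sets are ours):

  def jaccard(s1: set, s2: set) -> float:
      """Jaccard coefficient: |s1 ∩ s2| / |s1 ∪ s2|."""
      union = s1 | s2
      return len(s1 & s2) / len(union) if union else 1.0

  # Two shingle sets sharing 2 of 4 distinct shingles:
  s1 = {"a rose is", "rose is a", "is a rose"}
  s2 = {"a rose is", "rose is a", "is a flower"}
  print(jaccard(s1, s2))  # 0.5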



Hashing shingles

  • Map each shingle to an integer hash value

    • Over a large space, say 64 bits

  • H(di) denotes the set of hash values derived from S(di)

  • Need to detect pairs whose sets H() have a large overlap

    • How can we do this efficiently? In the next slides …
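One possible realization of the shingle-to-integer mapping, again assuming BLAKE2b (any well-mixed 64-bit hash would do):

  import hashlib

  def hash_shingle(shingle: str) -> int:
      """Hash a shingle into the 64-bit integer space."""
      d = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
      return int.from_bytes(d, "big")

  def H(shingle_set: set) -> set:
      """H(d): the set of hash values derived from S(d)."""
      return {hash_shingle(s) for s in shingle_set}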



Permuting

  • Let p be a random permutation over the hash-value space

  • Let P(di) denote the set of permuted hash values in H(di)

  • Let xi be the smallest integer in P(di)



Illustration

[Figure: Document 1’s shingles are hashed to 64-bit values H(shingles) on the number line [0, 2^64); the values are permuted on the number line with p; the min value is picked.]



Key Theorem

  • Theorem: J(S(di), S(dj)) = Pr[xi = xj]

    • xi, xj taken under the same permutation p

  • Intuition: if shingle sets of two documents are ‘nearly the same’ and we randomly permute, then there is a high probability that the minimal values are equal.
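The theorem is easy to check empirically. This small simulation (ours, not from the slides) samples random permutations of a toy universe and compares the rate at which the minima agree with the true Jaccard coefficient:

  import random

  def min_under(perm: dict, hashes: set) -> int:
      """x: the smallest permuted value among a document's hash values."""
      return min(perm[h] for h in hashes)

  universe = list(range(1000))
  h1 = set(random.sample(universe, 300))
  h2 = set(random.sample(universe, 300))
  true_j = len(h1 & h2) / len(h1 | h2)

  trials, agree = 2000, 0
  for _ in range(trials):
      shuffled = universe[:]
      random.shuffle(shuffled)
      perm = dict(zip(universe, shuffled))  # a random permutation p
      agree += min_under(perm, h1) == min_under(perm, h2)

  print(true_j, agree / trials)  # the two values should be close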



Proof (1)

  • View sets S1,S2 as columns of a matrix A

    • one row for each element in the universe.

    • aij = 1 indicates presence of item i in set j

  • Example

    S1 S2
     0  1
     1  0
     1  1
     0  0
     1  1
     0  1

    Jaccard(S1, S2) = 2/5 = 0.4



Proof (2)

  • Let p be a random permutation of the rows of A

  • Denote by P(Sj) the column that results from applying p to the j-th column

  • Let xi be the index of the first row in which the column P(Si) has a 1

    P(S1) P(S2)
      0     1
      1     0
      1     1
      0     0
      1     1
      0     1



Proof (3)

  • For columns Si, Sj, there are four types of rows:

        Si Sj
    A    1  1
    B    1  0
    C    0  1
    D    0  0

  • Let A = # of rows of type A (and similarly B, C, D)

  • Clearly, J(Si, Sj) = A / (A + B + C)



Proof (4)

  • Previous slide: J(Si, Sj) = A / (A + B + C)

  • Claim: Pr[xi = xj] = A / (A + B + C)

  • Why?

    • Look down columns Si, Sj until the first non-type-D row

      • That row determines xi or xj (or both, if they are equal)

    • xi = xj ⟺ that first row is of type A

    • As we picked a random permutation, the probability that the first non-type-D row is of type A is A / (A + B + C)

  • ⇒ Pr[xi = xj] = J(Si, Sj)



Sketches

  • Thus – our Jaccard coefficient test is probabilistic

    • Need to estimate Pr[xi = xj]

  • Method:

    • Pick k (~200) random row permutations p1, …, pk

    • Sketch(di) = the list of di’s x values, one per permutation

      • The list is of length k

  • Jaccard estimation (see the sketch below):

    • The fraction of permutations where the sketch values agree

    • |Sketch(di) ∩ Sketch(dj)| / k, counting agreement position by position
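A runnable sketch of the whole estimation scheme with explicit random permutations over a small universe (names and toy sets are ours):

  import random

  def make_sketch(hashes: set, perms: list) -> list:
      """Sketch(d): the minimum permuted value under each permutation."""
      return [min(p[h] for h in hashes) for p in perms]

  def estimate_jaccard(sk1: list, sk2: list) -> float:
      """Fraction of the k permutations on which the sketch values agree."""
      return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

  universe, k = list(range(10000)), 200
  perms = []
  for _ in range(k):
      shuffled = universe[:]
      random.shuffle(shuffled)
      perms.append(dict(zip(universe, shuffled)))

  h1 = set(range(0, 600))    # true Jaccard(h1, h2) = 400/800 = 0.5
  h2 = set(range(200, 800))
  print(estimate_jaccard(make_sketch(h1, perms), make_sketch(h2, perms)))  # ~0.5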



Example

Sketches (each entry is the original index of the first row containing a 1, scanning the rows in permuted order):

                  S1 S2 S3
Perm 1 = (12345)   1  2  1
Perm 2 = (54321)   4  5  4
Perm 3 = (34512)   3  5  4

Underlying matrix:

     S1 S2 S3
R1    1  0  1
R2    0  1  1
R3    1  0  0
R4    1  0  1
R5    0  1  0

Similarities (fraction of agreeing sketch positions):

1-2  1-3  2-3
0/3  2/3  0/3
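The sketch values above can be reproduced mechanically. Here each permutation is represented as the order in which the rows are scanned (a representation chosen for clarity; it is not from the slides):

  def sketch_value(column: dict, order: list) -> int:
      """Original index of the first row holding a 1, scanning in permuted order."""
      return next(r for r in order if column[r] == 1)

  S1 = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0}
  S2 = {1: 0, 2: 1, 3: 0, 4: 0, 5: 1}
  S3 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0}

  for order in ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [3, 4, 5, 1, 2]):
      print([sketch_value(col, order) for col in (S1, S2, S3)])
  # [1, 2, 1], [4, 5, 4], [3, 5, 4], matching the table above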



Algorithm for Clustering Near-Duplicate Documents

1. Compute the sketch of each document

2. From each sketch, produce a list of <shingle, docID> pairs

3. Group all pairs by shingle value

4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle

5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles>

6. Join any pair of documents whose number of common shingles exceeds a chosen threshold using a “Union-Find” algorithm

7. Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents

Implementation nicely fits the “map-reduce” programming paradigm (a single-machine sketch follows below)
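A minimal single-machine sketch of steps 2-7 in Python (the function name and data structures are ours; a production version would distribute the grouping and counting of steps 2-5 as map-reduce jobs):

  from collections import defaultdict
  from itertools import combinations

  def cluster_near_duplicates(sketches: dict, threshold: int) -> list:
      """sketches: docID -> list of sketch values (step 1 done elsewhere)."""
      # Steps 2-3: group <value, docID> pairs by sketch value.
      docs_by_value = defaultdict(set)
      for doc_id, sketch in sketches.items():
          for value in sketch:
              docs_by_value[value].add(doc_id)

      # Steps 4-5: count common sketch values for every co-occurring pair.
      common = defaultdict(int)
      for doc_ids in docs_by_value.values():
          for a, b in combinations(sorted(doc_ids), 2):
              common[(a, b)] += 1

      # Step 6: union-find joins pairs above the threshold.
      parent = {d: d for d in sketches}

      def find(d):
          while parent[d] != d:
              parent[d] = parent[parent[d]]  # path halving
              d = parent[d]
          return d

      for (a, b), n in common.items():
          if n > threshold:
              parent[find(a)] = find(b)

      # Step 7: each connected component is a cluster of near-duplicates.
      clusters = defaultdict(set)
      for d in sketches:
          clusters[find(d)].add(d)
      return list(clusters.values())

  print(cluster_near_duplicates({"d1": [1, 4, 3], "d2": [2, 5, 5], "d3": [1, 4, 4]}, 1))
  # [{'d1', 'd3'}, {'d2'}]: d1 and d3 share two sketch values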



Implementation Trick

  • Permuting the universe even once is prohibitive

  • Row Hashing

    • Pick P hash functions hk

    • Ordering under hk gives a random permutation of the rows

  • One-pass Implementation (see the sketch after this list)

    • For each Ci and hk, keep a slot for the min-hash value

    • Initialize all slot(Ci, hk) to infinity

    • Scan rows in arbitrary order looking for 1’s

      • Suppose row Rj has 1 in column Ci

      • For each hk,

        • if hk(j) < slot(Ci, hk), then slot(Ci, hk) ← hk(j)
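A runnable version of the one-pass scan (the matrix representation and the function name are ours):

  INF = float("inf")

  def min_hash_slots(rows: dict, n_cols: int, hash_fns: list) -> list:
      """rows: row index j -> tuple of column bits. Returns slots[Ci][hk]."""
      slots = [[INF] * len(hash_fns) for _ in range(n_cols)]
      for j, bits in rows.items():           # scan rows in arbitrary order
          hashed = [h(j) for h in hash_fns]  # compute each hk(j) once
          for i, bit in enumerate(bits):
              if bit:                        # row Rj has a 1 in column Ci
                  for k, hj in enumerate(hashed):
                      if hj < slots[i][k]:
                          slots[i][k] = hj   # slot(Ci, hk) <- hk(j)
      return slots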



Example

Hash functions: h(x) = x mod 5, g(x) = (2x + 1) mod 5

     C1 C2
R1    1  0
R2    0  1
R3    1  1
R4    1  0
R5    0  1

Trace (slot values after scanning each row; ‘-’ = still infinity):

                      C1 slots (h, g)   C2 slots (h, g)
R1: h(1)=1, g(1)=3        (1, 3)            (-, -)
R2: h(2)=2, g(2)=0        (1, 3)            (2, 0)
R3: h(3)=3, g(3)=2        (1, 2)            (2, 0)
R4: h(4)=4, g(4)=4        (1, 2)            (2, 0)
R5: h(5)=0, g(5)=1        (1, 2)            (0, 0)
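Feeding the example into min_hash_slots from the sketch above reproduces the final slot values:

  rows = {1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (1, 0), 5: (0, 1)}
  hash_fns = [lambda x: x % 5, lambda x: (2 * x + 1) % 5]
  print(min_hash_slots(rows, n_cols=2, hash_fns=hash_fns))
  # [[1, 2], [0, 0]]: C1 slots (h, g) = (1, 2), C2 slots (h, g) = (0, 0)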

