
Document duplication (exact or approximate)

Detect near-duplicate documents on the web using shingling, min-hashing, and locality-sensitive hashing: techniques that efficiently identify documents sharing similar content, even when they differ only slightly.


Presentation Transcript


  1. Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

  2. Sec. 19.6 Duplicate documents • The web is full of duplicated content • Few cases of exact duplicates • Many cases of near-duplicates • E.g., the “Last modified” date is the only difference between two copies of a page

  3. Near-Duplicate Detection • Problem • Given a large collection of documents • Identify the near-duplicate documents • Web search engines • Proliferation of near-duplicate documents • Legitimate – mirrors, local copies, updates, … • Malicious – spam, spider-traps, dynamic URLs, … • Mistaken – spider errors • 30% of web-pages are near-duplicates [1997]

  4. Desiderata • Storage: only small sketches of each document • Computation: the fastest possible • Stream Processing: once the sketch is computed, the source is unavailable • Error Guarantees: problem scale → small biases have large impact • need formal guarantees – heuristics will not do

  5. Natural Approaches • Fingerprinting: • only works for exact matches • Karp-Rabin (rolling hash) – collision probability guarantees • MD5 – cryptographically-secure string hashes • Edit-distance • metric for approximate string-matching • expensive – even for one pair of documents • impossible – for billions of web documents • Random Sampling • sample substrings (phrases, sentences, etc.) • hope: similar documents → similar samples • But – even samples of the same document will differ

  6. Karp-Rabin Fingerprints • Consider a binary string A = 1 a1 a2 … am (a leading 1 followed by the m bits a1 … am) • Basic values: choose a prime p in the universe U ≈ 2^64 • Fingerprint: f(A) = A mod p • Rolling hash → given the next window B = 1 a2 … am a_(m+1): f(B) = [2·(A − 2^m − a1·2^(m−1)) + 2^m + a_(m+1)] mod p • Prob[false hit] = Prob[p divides (A − B)] = #prime divisors(A − B) / #primes(U) ≤ log2(A + B) / #primes(U) ≈ (m log U) / U
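Since the deck is slides only, here is a minimal Python sketch of the rolling update above. The prime P, the window length m, and the function names are illustrative choices only; a real implementation would draw p near U ≈ 2^64.

# A minimal sketch of the Karp-Rabin rolling fingerprint described above.
# P is an illustrative prime; the slides suggest drawing p from a universe U ~ 2^64.
P = 2_305_843_009_213_693_951   # the Mersenne prime 2^61 - 1

def fingerprint(bits, p=P):
    """f(A) = A mod p, where A is the binary number 1 a1 a2 ... am (leading 1 prepended)."""
    value = 1
    for b in bits:
        value = (2 * value + b) % p
    return value

def roll(f_a, a1, a_next, m, p=P):
    """Slide the window by one bit:
    f(B) = [2*(A - 2^m - a1*2^(m-1)) + 2^m + a_(m+1)] mod p."""
    two_m, two_m1 = pow(2, m, p), pow(2, m - 1, p)
    return (2 * (f_a - two_m - a1 * two_m1) + two_m + a_next) % p

bits, m = [1, 0, 1, 1, 0, 1, 0, 0, 1], 4
f = fingerprint(bits[:m])
for i in range(m, len(bits)):
    f = roll(f, bits[i - m], bits[i], m)
    assert f == fingerprint(bits[i - m + 1:i + 1])   # rolling update matches recomputation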

  7. Basic Idea [Broder 1997] • Shingling • dissect the document into q-grams (shingles) • represent documents by their shingle-sets • reduce the problem to set intersection [Jaccard] • They are near-duplicates if their (large) shingle-sets intersect enough
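To make the shingling step concrete, a small Python sketch follows. Character q-grams with q = 4 are used purely for illustration (the slides do not fix the shingle granularity), and the function name shingles is hypothetical.

def shingles(text, q=4):
    """Dissect a document into its set of q-grams (shingles).
    Character q-grams are used here; word q-grams work the same way."""
    text = " ".join(text.lower().split())   # normalize case and whitespace
    return {text[i:i + q] for i in range(len(text) - q + 1)}

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
print(len(shingles(doc_a) & shingles(doc_b)))   # many shared shingles -> likely near-duplicates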

  8. #1. Doc Similarity ⇒ Set Intersection • Doc A → shingle-set SA, Doc B → shingle-set SB • Jaccard measure – similarity of SA, SB: sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB| • Claim: A & B are near-duplicates if sim(SA, SB) is high • We need to cope with “Set Intersection”: fingerprints of shingles (for space/time efficiency), min-hash to estimate intersection sizes (further efficiency)
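A small sketch of the Jaccard measure on shingle-sets; the toy shingle values below are made up for illustration.

def jaccard(sa, sb):
    """sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB| (Jaccard measure on shingle-sets)."""
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

sa = {"the quick", "quick brown", "brown fox", "fox jumps"}
sb = {"the quick", "quick brown", "brown fox", "fox jumped"}
print(jaccard(sa, sb))   # 3 shared out of 5 distinct shingles -> 0.6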

  9. #2. Sets of 64-bit fingerprints • Pipeline: Doc → shingling → multiset of shingles → fingerprint → multiset of fingerprints • Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits) • In practice, use 64-bit fingerprints, i.e., U = 2^64 • Prob[collision] ≈ (8q · 64) / 2^64 << 1 • This reduces the space for storing the multi-sets and the time to intersect them, but…
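A sketch of this fingerprinting step under one stated substitution: a truncated 64-bit BLAKE2b hash stands in for the Karp-Rabin fingerprints the slide prescribes, since either yields a compact 64-bit value per shingle.

import hashlib

def fingerprint64(shingle):
    """Map a shingle to a 64-bit value; a truncated BLAKE2b hash stands in here
    for the Karp-Rabin fingerprint used in the slides."""
    return int.from_bytes(hashlib.blake2b(shingle.encode(), digest_size=8).digest(), "big")

shingle_set = {"the quick", "quick brown", "brown fox"}
fp_set = {fingerprint64(s) for s in shingle_set}   # compact 64-bit set: cheaper to store and intersect
print(fp_set)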

  10. Sec. 19.6 #3. Sketch of a document • Sets are large, so their intersection is still too costly • Create a “sketch vector” (of size ~200) for each shingle-set • Documents that share ≥ t (say 80%) of the sketch-elements are claimed to be near-duplicates

  11. Sketching by Min-Hashing • Consider SA, SB ⊆ {0, …, p−1} • Pick a random permutation π of the whole set {0, …, p−1} (such as π(x) = ax + b mod p) • Define a = min{π(SA)}, b = min{π(SB)} • the minimal elements under permutation π • Lemma: Prob[a = b] = |SA ∩ SB| / |SA ∪ SB| = sim(SA, SB)

  12. Strengthening it… • Similarity sketch sk(A) = the k minimal elements under π(SA) • We might also take k permutations and keep the min of each • Note: we can reduce the variance by using a larger k
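Putting slides 11 and 12 together, here is a minimal min-hash sketching example in the "k permutations, one minimum each" variant. The permutations follow the (a·x + b) mod p form suggested on slide 11; the prime p, k = 200, and the toy fingerprint sets are chosen only for illustration.

import random

P = 2_305_843_009_213_693_951   # prime p; pi(x) = (a*x + b) mod p plays the role of a random permutation

def make_permutations(k, seed=42):
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(k)]

def minhash_sketch(fingerprints, perms):
    """sk(A): for each permutation pi_i, keep the minimal element of pi_i(SA)."""
    return [min((a * x + b) % P for x in fingerprints) for a, b in perms]

perms = make_permutations(k=200)
sk_a = minhash_sketch({11, 22, 33, 44, 55}, perms)
sk_b = minhash_sketch({11, 22, 33, 44, 99}, perms)
# The fraction of matching sketch positions estimates the Jaccard similarity (~ 4/6 here).
print(sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a))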

  13. Sec. 19.6 Computing Sketch[i] for Doc 1 [figure: start with the 64-bit fingerprints f(shingles) of Document 1, values in [0, 2^64); permute them with π_i; pick the min value]

  14. Sec. 19.6 Test if Doc1.Sketch[i] = Doc2.Sketch[i] [figure: the two minima A and B over [0, 2^64); are these equal?] Test for 200 random permutations: π1, π2, …, π200

  15. Sec. 19.6 However… [figure: the sketches of Document 1 and Document 2 over [0, 2^64)] A = B iff the shingle with the MIN value in the union of Doc 1 and Doc 2 is common to both (i.e., lies in the intersection) • Claim: this happens with probability Size_of_intersection / Size_of_union

  16. #4. Detecting all duplicates • Brute-force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B • Still, (num docs)^2 is too much computation, even if it is executed in internal memory • Locality-sensitive hashing (LSH) for sk(A) ≈ sk(B) • Sample h elements of sk(A) as an ID (may induce false positives) • Create t IDs (to reduce the false negatives) • If at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them)
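One possible way to realize this LSH step is sketched below: for each of t position-selections of size h (the "IDs"), documents whose sketches agree on those positions fall into the same bucket and become candidate pairs. The parameter values and the function name are illustrative, not from the slides.

import random
from collections import defaultdict

def candidate_pairs(sketches, h=4, t=20, seed=0):
    """For each of t fixed h-samples of sketch positions, bucket documents by the
    sampled values (the ID); documents sharing a bucket in at least one sample are
    candidate near-duplicates, to be verified by comparing their full sketches."""
    rnd = random.Random(seed)
    k = len(next(iter(sketches.values())))
    selections = [rnd.sample(range(k), h) for _ in range(t)]   # t fixed h-selections of positions
    pairs = set()
    for positions in selections:
        buckets = defaultdict(list)                            # one hash table per selection
        for doc, sk in sketches.items():
            buckets[tuple(sk[i] for i in positions)].append(doc)
        for docs in buckets.values():
            pairs.update((docs[i], docs[j]) for i in range(len(docs)) for j in range(i + 1, len(docs)))
    return pairs

sketches = {"A": [1, 2, 3, 4, 5, 6], "B": [1, 2, 3, 4, 5, 9], "C": [7, 8, 9, 1, 2, 3]}
print(candidate_pairs(sketches, h=2, t=5))   # very likely contains ('A', 'B'); may include false positives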

  17. #4. How do you implement this? GOAL: if at least one ID matches with another one (wrt the same h-selection), then A and B are probably near-duplicates (hence compare them). SOL 1: • Create t hash tables (with chaining), one per ID [recall that this is an h-sample of sk()] • Insert the docID in the slots of each ID (using some hash) • Scan every bucket and check which docIDs share ≥ 1 ID. SOL 2: • Sort by each ID, and then check the consecutive equal ones.
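SOL 2 can be sketched as follows: flatten all (ID, docID) pairs, sort by ID, and scan runs of equal IDs. The function name and the toy ID values are illustrative only.

def sol2_candidates(ids_per_doc):
    """Sort the (ID, docID) pairs by ID and scan consecutive equal IDs:
    documents appearing under the same ID are candidate near-duplicates."""
    pairs = sorted((doc_id, doc) for doc, ids in ids_per_doc.items() for doc_id in ids)
    candidates, i = set(), 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                                   # extend the run of equal IDs
        group = sorted(doc for _, doc in pairs[i:j])
        candidates.update((group[a], group[b]) for a in range(len(group)) for b in range(a + 1, len(group)))
        i = j
    return candidates

# Each document carries t IDs (h-samples of its sketch); toy values below.
ids_per_doc = {"A": [(1, 2), (3, 4)], "B": [(1, 2), (5, 6)], "C": [(7, 8), (9, 0)]}
print(sol2_candidates(ids_per_doc))   # {('A', 'B')}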
