Detecting Phrase-Level Duplication on the World Wide Web

Presentation Transcript

  1. Detecting Phrase-Level Duplication on the World Wide Web Fetterly, Manasse, Najork Paper Presentation by: Vinay Goel

  2. Introduction • Problem • Identify instances of “slice and dice” generation • Example • German spammer • 1 million URLs originating from a single IP address (but using many host names) • Pages changed completely on every download • Pages consisted of grammatically well-formed sentences stitched together at random

  3. Goal • Find instances of sentence level synthesis of web pages • More generally, of pages with an unusually large number of popular phrases

  4. The Data • Datasets • DS1 • BFS crawl starting at www.yahoo.com • 151 million HTML pages • DS2 • Large crawl conducted by MSN Search • 96 million HTML pages chosen at random

  5. Finding Phrase Replication • Sampling • Reduce each document to a feature vector • Employ a variant of the shingling algorithm of Broder et al. • Significantly reduces the data volume

  6. Sampling method • Replace all HTML markup by white-space • k-phrases of a document: all sequences of k consecutive words • Treat the document as a circle: last word followed by first word • n word document has exactly n phrases
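
The circular definition of k-phrases above can be sketched in a few lines of Python. This is only an illustration: the function name `k_phrases` and the sample sentence are assumptions, not taken from the presentation.

```python
def k_phrases(words, k):
    """Return the k-phrases of a document treated as a circle of words."""
    n = len(words)
    if n == 0:
        return []
    # Phrase i starts at word i and wraps from the last word back to the first,
    # so an n-word document always yields exactly n phrases.
    return [tuple(words[(i + j) % n] for j in range(k)) for i in range(n)]


words = "the quick brown fox".split()
phrases = k_phrases(words, 3)
for phrase in phrases:
    print(" ".join(phrase))
```

The last phrase wraps around the end of the document, which is why the phrase count equals the word count regardless of k.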

  7. Sampling method • Exploit properties of Rabin fingerprints • Rabin fingerprints support efficient extension and prefix deletion • Fingerprints of distinct bit patterns are distinct with high probability
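
To make "extension" and "prefix deletion" concrete, here is a minimal sketch that uses a simple polynomial rolling hash modulo a prime as a stand-in. It is not a true Rabin fingerprint (which works with polynomials over GF(2) modulo an irreducible polynomial), and the constants `MOD` and `BASE` as well as the function names are assumptions chosen for illustration.

```python
MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime), not from the slides
BASE = 1_000_003     # assumed base, not from the slides


def extend(fp, token):
    """Append one integer token to the sequence a fingerprint covers."""
    return (fp * BASE + token) % MOD


def drop_prefix(fp, token, length):
    """Remove the oldest token from a fingerprint covering `length` tokens."""
    return (fp - token * pow(BASE, length - 1, MOD)) % MOD


def fingerprint(tokens):
    """Fingerprint a whole token sequence from scratch."""
    fp = 0
    for t in tokens:
        fp = extend(fp, t)
    return fp


# Slide a 3-token window from [3, 1, 4] to [1, 4, 1] without recomputing.
window = fingerprint([3, 1, 4])
window = drop_prefix(window, 3, 3)
window = extend(window, 1)
print(window == fingerprint([1, 4, 1]))
```

Because each window is derived from the previous one in constant time, all n phrase fingerprints of a document can be computed in a single pass.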

  8. Computing feature vectors • Fingerprint each word in the document - gives n tokens • Compute fingerprint of each k-token phrase - gives n phrase fingerprints • Apply m different fingerprint functions • Retain the smallest of the n resulting values for each function • Vector of m fingerprints representative of document (elements referred to as shingles)
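
The steps on this slide can be sketched as follows. This sketch substitutes salted BLAKE2b hashes from Python's standard library for Rabin fingerprints, and the names `fp64` and `feature_vector` and the values of `k` and `m` are illustrative assumptions; the slides do not state the parameters used.

```python
import hashlib


def fp64(data: bytes, seed: int) -> int:
    """64-bit fingerprint; each seed acts as a different fingerprint function."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def feature_vector(words, k, m):
    """Return m shingles: the minimum phrase fingerprint under each function."""
    n = len(words)
    if n == 0:
        return []
    # Step 1: fingerprint each word, giving n tokens.
    tokens = [fp64(w.encode("utf-8"), 0).to_bytes(8, "big") for w in words]
    # Step 2: form the n circular k-token phrases.
    phrases = [b"".join(tokens[(i + j) % n] for j in range(k)) for i in range(n)]
    # Steps 3 and 4: apply m functions and keep the smallest value for each.
    return [min(fp64(p, seed) for p in phrases) for seed in range(1, m + 1)]


doc = "pages consisted of well formed sentences stitched together".split()
vec = feature_vector(doc, k=3, m=4)
print(vec)
```

Since the document is treated as a circle, rotating its words leaves the set of phrases, and therefore the feature vector, unchanged.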

  9. Duplicate Suppression • Replication rampant on the web • Clustered all pages in data set into equivalence classes • Each class contains all pages that are exact or near duplicates of one another
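
One way to form such equivalence classes is a union-find pass over the feature vectors. The slide does not say how near duplicates were decided, so the rule used here (two pages are linked when at least `min_matches` vector positions agree) is a hypothetical criterion, and the pairwise loop is only practical for small examples.

```python
def cluster(vectors, min_matches):
    """Group documents into classes of exact or near duplicates."""
    parent = {d: d for d in vectors}

    def find(x):
        # Follow parent links to the class representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    docs = list(vectors)
    for idx, a in enumerate(docs):
        for b in docs[idx + 1:]:
            # Hypothetical near-duplicate test: count agreeing positions.
            matches = sum(x == y for x, y in zip(vectors[a], vectors[b]))
            if matches >= min_matches:
                parent[find(a)] = find(b)

    classes = {}
    for d in docs:
        classes.setdefault(find(d), set()).add(d)
    return list(classes.values())


vectors = {"p": [1, 2, 3], "q": [1, 2, 9], "r": [7, 8, 9]}
classes = cluster(vectors, min_matches=2)
print(sorted(sorted(c) for c in classes))
```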

  10. Popular phrases • Occur in more documents than would be expected by chance • Assumptions: • “Normal” web pages characterized by a generative model • Sought web pages - copying model (need to consider number of phrases, length of typical documents…)

  11. Popular Phrases • Limit attention to the shingles chosen by sampling functions • Phrase is popular if selected as shingle in sufficiently many documents • To determine popular phrases, consider triplets (i,s,d)
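
A small sketch of counting popularity from triplets, assuming (i, s, d) means that shingle s was selected by sampling function i in document d. The function name `popular_shingles` and the `threshold` parameter are hypothetical; the slides do not give the cutoff for "sufficiently many".

```python
from collections import Counter


def popular_shingles(vectors, threshold):
    """Return the (i, s) pairs selected in at least `threshold` documents."""
    counts = Counter()
    for d, vec in vectors.items():
        # Each document contributes one triplet (i, s, d) per sampling function.
        for i, s in enumerate(vec):
            counts[(i, s)] += 1
    return {key: c for key, c in counts.items() if c >= threshold}


vectors = {
    "doc1": [11, 20, 30],
    "doc2": [11, 21, 30],
    "doc3": [11, 22, 31],
}
popular = popular_shingles(vectors, threshold=2)
print(popular)
```

Grouping by the pair (i, s) rather than by s alone keeps the counts comparable, since every document contributes exactly one shingle per function.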

  12. Popular Phrases • First 24 most popular phrases not very interesting • Starting from the 36th phrase, discover phrases caused by machine generated content • Templatic form: common text, “fill in the blank” slots and optional • 60th phrase - instance of idiomatic phrase

  13. Zipfian Distribution

  14. Histogram of popular shingles per doc

  15. Covering set • Covering sets for shingles of each page • Approximate a minimum covering set using a greedy heuristic
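
The greedy heuristic can be sketched as the standard greedy set-cover approximation: repeatedly pick the page that covers the most still-uncovered shingles. The function name `greedy_cover` and the example data are assumptions, and the presentation's exact procedure may differ.

```python
def greedy_cover(target, candidates):
    """Approximate a minimum set of pages whose shingles cover `target`.

    `target` is a set of shingles; `candidates` maps page ids to shingle sets.
    Returns the chosen page ids and any shingles that no page covers.
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Pick the page that covers the most still-uncovered shingles.
        best = max(candidates, key=lambda d: len(uncovered & candidates[d]),
                   default=None)
        if best is None or not (uncovered & candidates[best]):
            break  # nothing left can cover the remaining shingles
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered


target = {1, 2, 3, 4, 5}
candidates = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {9}}
chosen, uncovered = greedy_cover(target, candidates)
print(chosen, uncovered)
```

A page whose shingles are covered by only a few other pages is mostly a copy of them, while a page that needs many covering pages looks stitched together from many sources.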

  16. Distribution of covering set sizes

  17. German spammer

  18. Looking for likely sources

  19. Conclusion • Power law distribution • Popular phrases • Often limited by design choices • Legal disclaimers • Navigational phrases • “fill in the blanks” • More replicated than original content