SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Presenter: Tsai Tzung Ruei. Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. 國立雲林科技大學 National Yunlin University of Science and Technology. SIGIR 2008.



  1. SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections • Presenter: Tsai Tzung Ruei • Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke • 國立雲林科技大學 National Yunlin University of Science and Technology • SIGIR 2008

  2. Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments

  3. Motivation • Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

  4. Objective • Exact duplicates can be avoided during the collection of Web archives, but near duplicates frequently slip into the corpus. The objective is to detect these near duplicates robustly and efficiently.

  5. Methodology • SPOT SIGNATURE • EXTRACTION • MATCHING [Diagram labels: document, Web, Database]

  6. Methodology • SPOT SIGNATURE EXTRACTION • A = {aj(dj, cj)}: each antecedent aj is paired with a spot distance dj and a chain length cj • Example antecedents: a(1,2), an(1,2), the(1,2) and is(1,2) • Text: “At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.” • Result: S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
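
The extraction step on slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stopword list, the function name `spot_signatures`, and the reading of the spot distance d as "take every d-th non-stopword token after the antecedent" are assumptions made for this example.

```python
import re

# Assumed stopword list (hypothetical): just large enough for the slide's sentence.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "off", "for", "from",
             "on", "that", "into", "against", "and", "of"}

# Antecedents from the slide: word -> (spot distance d, chain length c).
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

SENTENCE = (
    "At a rally to kick off a weeklong campaign for the South Carolina "
    "primary, Obama tried to set the record straight from an attack "
    "circulating widely on the Internet that is designed to play into "
    "prejudices against Muslims and fears of terrorism."
)

def spot_signatures(text, antecedents):
    """Return the spot signatures of `text` in document order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        d, c = antecedents[token]
        # Non-stopword tokens that follow this antecedent occurrence.
        content = [t for t in tokens[i + 1:] if t not in STOPWORDS]
        # Take every d-th content token, up to c of them.
        chain = content[d - 1::d][:c]
        if len(chain) == c:  # skip chains cut short by the end of the text
            signatures.append(":".join([token] + chain))
    return signatures

for signature in spot_signatures(SENTENCE, ANTECEDENTS):
    print(signature)
```

With these assumptions the sketch yields the seven signatures listed in the slide's result set S.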

  7. Methodology • SPOT SIGNATURE MATCHING • Jaccard Similarity for Sets • Generalization for Multi-Sets
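
For reference, the standard definitions that the two headings on slide 7 presumably refer to are given below. The notation freq_A(s_j), the number of occurrences of signature s_j in document A, is an assumption of this note rather than something taken from the slide.

```latex
% Jaccard similarity for sets
\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Generalization for multi-sets
\mathrm{sim}(A, B) =
  \frac{\sum_{j} \min\bigl(\mathrm{freq}_A(s_j), \mathrm{freq}_B(s_j)\bigr)}
       {\sum_{j} \max\bigl(\mathrm{freq}_A(s_j), \mathrm{freq}_B(s_j)\bigr)}

% Length-based upper bound that follows from either definition
\mathrm{sim}(A, B) \le \frac{\min(|A|, |B|)}{\max(|A|, |B|)}
```

The last inequality holds because the intersection can be no larger than the smaller document and the union no smaller than the larger one.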

  8. Methodology • SPOT SIGNATURE MATCHING [Diagram labels: Spot Signature, partitions, Inverted Index Pruning, Jaccard Similarity for Sets]

  9. Methodology • Optimal Partitioning
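
The idea behind partitioning can be illustrated with the length bound sim(A, B) ≤ min(|A|, |B|) / max(|A|, |B|): two documents whose length ratio is below τ cannot be near duplicates. The sketch below uses a simple geometric scheme in which each partition spans the lengths from lo to floor(lo / τ). It is an assumption-laden illustration and not necessarily the optimal partitioning from the paper; `partition_bounds` and `partition_index` are hypothetical names.

```python
import math
from bisect import bisect_right

def partition_bounds(max_len, tau):
    """Lower boundaries of length partitions covering 1..max_len.

    Each partition spans [lo, floor(lo / tau)], so any pair of documents
    whose length ratio is at least tau lands in the same partition or in
    two neighboring ones.
    """
    bounds = []
    lo = 1
    while lo <= max_len:
        bounds.append(lo)
        hi = max(lo, math.floor(lo / tau))
        lo = hi + 1
    return bounds

def partition_index(length, bounds):
    """Index of the partition that contains a document of this length."""
    return bisect_right(bounds, length) - 1

bounds = partition_bounds(200, 0.8)
for length in (12, 13, 14):
    print(length, "-> partition", partition_index(length, bounds))
```

Under this scheme a document only has to be compared against documents in its own partition and the next one, which is what makes length-based partitioning a cheap first filter.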

  10. Methodology • Inverted Index Pruning • Example: d1 = {s1:5, s2:4, s3:4}, |d1| = 13; d2 = {s1:8, s2:4}, |d2| = 12; d3 = {s1:4, s2:5, s3:5}, |d3| = 14; τ = 0.8; δ1 = 0; δ2 = |d1| − |d3| = −1
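
The three documents in the slide 10 example can be checked by brute force. The sketch below computes the multi-set Jaccard similarity for every pair and applies the length bound as a cheap pre-filter. It does not reproduce the inverted-index traversal or the δ values from the slide; the function names are hypothetical, and exact fractions are used only to avoid floating-point noise at the threshold.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def multiset_jaccard(a, b):
    """Sum of per-signature minima over sum of maxima (non-empty inputs)."""
    keys = set(a) | set(b)
    overlap = sum(min(a[k], b[k]) for k in keys)  # Counter gives 0 if missing
    union = sum(max(a[k], b[k]) for k in keys)
    return Fraction(overlap, union)

def length_bound(a, b):
    """Upper bound on the similarity, using only the document lengths."""
    len_a, len_b = sum(a.values()), sum(b.values())
    return Fraction(min(len_a, len_b), max(len_a, len_b))

docs = {
    "d1": Counter(s1=5, s2=4, s3=4),  # |d1| = 13
    "d2": Counter(s1=8, s2=4),        # |d2| = 12
    "d3": Counter(s1=4, s2=5, s3=5),  # |d3| = 14
}
tau = Fraction(4, 5)  # tau = 0.8

matches = []
for x, y in combinations(sorted(docs), 2):
    if length_bound(docs[x], docs[y]) < tau:
        continue  # pruned without touching the signatures
    sim = multiset_jaccard(docs[x], docs[y])
    print(x, y, sim)
    if sim >= tau:
        matches.append((x, y))

print(matches)  # [('d1', 'd3')]
```

All three pairs pass the length bound here (12/13, 13/14 and 12/14 are each at least 0.8), so the bound alone prunes nothing in this example; only d1 and d3 reach the threshold, with a similarity of exactly 12/15 = 0.8.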

  11. Experiments • Gold Set of Near Duplicate News Articles • SpotSigs vs. Shingling • Choice of Spot Signatures • SpotSigs vs. Hashing • TREC WT10g • SpotSigs vs. Hashing

  12. Experiments • Gold Set of Near Duplicate News Articles [Figure panels: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing]

  13. Experiments • TREC WT10g • SpotSigs vs. Hashing

  14. Conclusion • MAJOR CONTRIBUTION • SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches. • FUTURE WORK • Future work will focus on efficient access to disk-based index structures and on generalizing the bounding approach to other metrics such as Cosine.

  15. Comments • Advantage • The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient. • Drawback • ….. • Application • information retrieval
