
Near-Duplicate Detection for eRulemaking

Presentation Transcript


  1. Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University {huiyang, callan}@cs.cmu.edu

  2. Presentation Outline • Introduction • Problem Definition • System Architecture • Feature-based Document Retrieval • Similarity-based Clustering • Evaluation and Experimental Results • Related Work • Conclusion and Demo

  3. Introduction - I • U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations. • Some popular regulations attract hundreds of thousands of comments from the general public. • In the late 1990s, the USDA’s national organic standard required manually sorting over 250,000 public comments. • In 2004, the EPA’s proposed “Mercury rule” (USEPA-OAR-2002-0056) attracted over 530,000 email messages. • Processing comments at this scale is very labor-intensive.

  4. Introduction - II • The Problem Is Getting Worse • Many online form letters are now available • Written by special interest groups • Modifying an electronic form letter is extremely easy • Special Interest Groups • Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action. • Provide information and tools to help each individual have the greatest possible impact once a group is assembled. • Moveon.org, http://www.moveon.org • GetActive, http://www.getactive.org

  5. Introduction - III • Public comments will be near-duplicates if created from the same form letter. • Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter. • Goals: • Recognize near-duplicates and organize them • Find the information added by an individual • Find the unique comments • Our research focuses on recognizing and organizing near-duplicates through text mining and clustering, while handling large amounts of data.

  6. Problem Definition - I • What is a near-duplicate? • Pugh declared that “two documents are considered near duplicates if they have more than r features in common”. • Conrad et al. stated that two documents are near-duplicates if they share more than 80% of the terminology defined by human experts. • Our definition is based on the ways near-duplicates are created.

  7. Problem Definition - II • Sources of Public Comments • Written from scratch (unique comments) • Based on a form letter (exact or near-duplicates) • A Category-based Definition • Block edit • Key block • Minor change • Minor change + block edit • Block reordering • Exact

  8. Block Edit

  9. Key Block

  10. Minor Change

  11. Minor Change + Block Edit

  12. Block Reordering

  13. Exact

  14. Presentation Outline • Introduction • Problem Definition • System Architecture • Feature-based Document Retrieval • Similarity-based Clustering • Evaluation and Experimental Results • Related Work • Conclusion and Demo

  15. System Architecture

  16. Feature-based Document Retrieval - I • Goal: obtain a duplicate candidate set for each seed document • Avoids working on the entire dataset • Steps: • Each seed document is broken into chunks • Select the most informative words from each chunk: a text span around term t*, the term with the minimal document frequency in the chunk
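A minimal sketch of this chunk-feature step, assuming access to a precomputed document-frequency table; the chunk size and span width are illustrative assumptions, not the authors’ settings.

```python
def select_chunk_features(seed_text, document_frequency, chunk_size=50, span_width=4):
    """Split a seed document into fixed-size word chunks and, for each chunk,
    return a short text span around the term with the lowest document frequency (t*)."""
    words = seed_text.split()
    features = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        if not chunk:
            continue
        # t*: the rarest term in the chunk (minimal document frequency)
        idx, _ = min(enumerate(chunk),
                     key=lambda pair: document_frequency.get(pair[1].lower(), 1))
        lo = max(0, idx - span_width // 2)
        features.append(" ".join(chunk[lo:lo + span_width]))
    return features
```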

  17. Feature-based Document Retrieval - II • Metadata extraction via information extraction: • email senders • receivers • signatures • docket IDs • delivered dates • email relayers
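The slide does not specify how these fields are extracted, so the following is a hedged, rule-based sketch over raw email text; the field names and the docket-ID pattern are assumptions for illustration only.

```python
import re

def extract_metadata(raw_email):
    """Pull a few header fields and a docket ID out of a raw email comment."""
    metadata = {}
    sender = re.search(r"^From:\s*(.+)$", raw_email, re.MULTILINE)
    receiver = re.search(r"^To:\s*(.+)$", raw_email, re.MULTILINE)
    date = re.search(r"^Date:\s*(.+)$", raw_email, re.MULTILINE)
    # Illustrative docket pattern, e.g. USEPA-OAR-2002-0056
    docket = re.search(r"\b[A-Z]+-[A-Z]+-\d{4}-\d{4}\b", raw_email)
    if sender:
        metadata["sender"] = sender.group(1).strip()
    if receiver:
        metadata["receiver"] = receiver.group(1).strip()
    if date:
        metadata["delivered_date"] = date.group(1).strip()
    if docket:
        metadata["docket_id"] = docket.group(0)
    return metadata
```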

  18. Feature-based Document Retrieval - III • Query Formulation #AND ( docketoar.20020056 router.moveon #OR(“standards proposed by” “will harm thousands” “unborn children for” “coal plants should” “other cleaner alternative” “by 90 by” “with national standards” “available pollution control”) )
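The query above uses a structured #AND/#OR syntax: metadata terms are AND-ed and the seed’s chunk phrases are OR-ed as exact phrases. The helper below is an illustrative assumption showing how such a query string could be assembled; it is not the authors’ code.

```python
def build_query(metadata_terms, phrase_features):
    """Assemble a structured #AND/#OR query from metadata terms and exact phrases."""
    terms = " ".join(metadata_terms)
    phrases = " ".join(f'"{p}"' for p in phrase_features)
    return f"#AND( {terms} #OR( {phrases} ) )"

query = build_query(
    ["docketoar.20020056", "router.moveon"],
    ["standards proposed by", "will harm thousands", "unborn children for"],
)
```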

  19. Presentation Outline • Introduction • Problem Definition • System Architecture • Feature-based Document Retrieval • Similarity-based Clustering • Evaluation and Experimental Results • Related Work • Conclusion and Demo

  20. Similarity-based Clustering - I • Document dissimilarity based on Kullback-Leibler (KL) divergence • KL-divergence, a distributional similarity measure, is one way to measure the similarity of one document (one unigram distribution) given another. • Clustering Algorithm • Soft, non-hierarchical clustering (partitioning) • Single-pass clustering with carefully selected seed documents each time • Close to K-Means • No need to define K beforehand
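A small sketch of the KL-divergence dissimilarity between two documents modeled as unigram distributions; the add-one smoothing over the joint vocabulary is an assumption for illustration.

```python
import math
from collections import Counter

def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed unigram probability for every word in the shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(doc_a_tokens, doc_b_tokens):
    """Dissimilarity of document A given document B: KL(P_A || P_B)."""
    vocabulary = set(doc_a_tokens) | set(doc_b_tokens)
    p = unigram_distribution(doc_a_tokens, vocabulary)
    q = unigram_distribution(doc_b_tokens, vocabulary)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)
```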

  21. Similarity-based Clustering - II [Diagram: duplicate candidate documents d being assigned to Dup Set 1 or Dup Set 2]

  22. Adaptive Thresholding • The cut-off threshold should differ from cluster to cluster • Documents in a cluster are sorted by their document-centroid similarity scores • The sorted scores are sampled at 10-document intervals • If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
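A hedged sketch of this thresholding rule as read from the slide; the exact drop test and the descending ordering of scores are our assumptions.

```python
def adaptive_cutoff(sorted_scores, initial_cutoff, interval=10, drop_fraction=0.05):
    """Sample descending document-centroid scores every `interval` documents and
    reset the cut-off to the score at the start of the first interval whose drop
    exceeds 5% of the initial cut-off threshold."""
    cutoff = initial_cutoff
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > drop_fraction * initial_cutoff:
            cutoff = sorted_scores[start]
            break
    return cutoff
```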

  23. Does Feature-based Document Retrieval Help? • It works efficiently for large clusters • Cuts the number of documents that need to be clustered from the size of the entire dataset to a manageable number • 536,975 -> 10,995 documents • Works poorly for small clusters (especially those containing only a single unique document) • Feature-based retrieval is therefore disabled after most of the big clusters have been found • Assume that most of the remaining unclustered documents are unique • Only similarity-based clustering is used on them (see the sketch below)
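A rough sketch of the resulting two-phase control flow; all helper functions (pick_seed, retrieve_candidates, cluster_from_seed, cluster_all) and the cluster-size threshold are illustrative assumptions, not the authors’ implementation.

```python
def detect_near_duplicates(all_docs, pick_seed, retrieve_candidates,
                           cluster_from_seed, cluster_all, min_big_cluster=50):
    """Phase 1: feature-based retrieval narrows candidates per seed until the
    clusters found become small; Phase 2: cluster the (mostly unique) rest directly."""
    remaining = set(all_docs)
    clusters = []
    while remaining:
        seed = pick_seed(remaining)
        candidates = retrieve_candidates(seed, remaining)   # feature-based retrieval
        new_cluster = cluster_from_seed(seed, candidates)
        clusters.append(new_cluster)
        remaining -= set(new_cluster)
        if len(new_cluster) < min_big_cluster:
            break  # big clusters exhausted; stop using retrieval
    clusters.extend(cluster_all(remaining))                  # similarity-based only
    return clusters
```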

  24. Presentation Outline • Introduction • Problem Definition • System Architecture • Feature-based Document Retrieval • Similarity-based Clustering • Evaluation and Experimental Results • Related Work • Conclusion and Demo

  25. Evaluation Methodology - I • Clustering performance is difficult to evaluate • Lack of manpower to produce ground truth for a large dataset • Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset. • Assessors: two graduate research assistants • Manually organized the documents into clusters of documents that they felt were near-duplicates • Manually went through one of the experimental clustering results pair by pair (comparing document-centroid pairs)

  26. Evaluation Methodology - II • Class j vs. Cluster i (n_ij = number of documents of class j in cluster i; n_i, n_j are the cluster and class sizes; n is the total) • F-measure: p_ij = n_ij / n_i, r_ij = n_ij / n_j, F_ij = 2 p_ij r_ij / (p_ij + r_ij), F_j = max_i {F_ij}, F = Σ_j (n_j / n) F_j • Purity: ρ_i = max_j {p_ij}, ρ = Σ_i (n_i / n) ρ_i • Pairwise measures over document pairs (a = same cluster and same class, b = same cluster only, c = same class only, d = neither, m = a + b + c + d) • Folkes and Mallows index: FM = sqrt( (a / (a + b)) · (a / (a + c)) ) • Kappa: κ = (p(A) − p(E)) / (1 − p(E)), where p(A) = (a + d) / m and p(E) = ((a + b)(a + c) + (c + d)(b + d)) / m²
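These measures can be computed from a cluster-class contingency table and from pairwise agreement counts; the sketch below follows the standard definitions (Cohen 1960 for kappa) and is illustrative rather than the authors’ evaluation code.

```python
from itertools import combinations

def fmeasure_and_purity(n):
    """n[i][j] = number of documents in cluster i belonging to class j."""
    total = sum(sum(row) for row in n)
    cluster_sizes = [sum(row) for row in n]
    class_sizes = [sum(col) for col in zip(*n)]
    F = 0.0
    for j, nj in enumerate(class_sizes):
        best = 0.0
        for i, ni in enumerate(cluster_sizes):
            if n[i][j] == 0:
                continue
            p, r = n[i][j] / ni, n[i][j] / nj
            best = max(best, 2 * p * r / (p + r))
        F += (nj / total) * best                      # F = sum_j (n_j/n) max_i F_ij
    purity = sum((ni / total) * (max(row) / ni)       # rho = sum_i (n_i/n) max_j p_ij
                 for ni, row in zip(cluster_sizes, n) if ni)
    return F, purity

def pairwise_counts(cluster_labels, class_labels):
    """Count document pairs: a = agree-agree, b = cluster only, c = class only, d = neither."""
    a = b = c = d = 0
    for x, y in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[x] == cluster_labels[y]
        same_class = class_labels[x] == class_labels[y]
        if same_cluster and same_class:
            a += 1
        elif same_cluster:
            b += 1
        elif same_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

def folkes_mallows(a, b, c, d):
    return ((a / (a + b)) * (a / (a + c))) ** 0.5

def kappa(a, b, c, d):
    m = a + b + c + d
    pA = (a + d) / m
    pE = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    return (pA - pE) / (1 - pE)
```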

  27. Experimental Results - I

  28. Experimental Results - II

  29. Conclusion • Large Volume Working Set • Duplicate Definition and Automatic evaluation • Feature-based Duplicate Candidate Retrieval • Similarity-based Clustering • Improved Efficiency

  30. Related Work - I • Duplicate detection in other domains: • Databases [Bilenko and Mooney 2003] • to find records referring to the same entity but possibly in different representations • Electronic publishing [Brin et al. 1995] • to detect plagiarism or to identify different versions of the same document • Web search [Chowdhury et al. 2002] [Pugh 2004] • more efficient web crawling • more effective ranking of search results • easier archiving of web documents

  31. Related Work - II • Fingerprinting • Build a compact description of a document, then do pairwise comparisons of document fingerprints • Shingling [Broder et al.] • represents a document as a series of simple numeric encodings for each n-term window • retains every m-th shingle to produce a document sketch (see the sketch below) • super shingles • Selective fingerprinting [Heintze] • selects a subset of the substrings to generate fingerprints • Statistical approach [Chowdhury et al.] • uses n high-idf terms • improved accuracy over shingling • efficient: about one-fifth of the running time of shingling • Fingerprint reliability in a dynamic environment [Conrad et al.] • considers the time factor on the Web
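A minimal sketch of the shingling idea summarized above: hash every n-term window, keep only shingles whose hash is 0 mod m as a compact sketch, and compare documents by sketch overlap. The parameter values and the use of Python’s built-in hash are illustrative assumptions.

```python
def document_sketch(text, n=4, m=8):
    """Hash every n-term window and retain roughly every m-th shingle."""
    words = text.lower().split()
    shingles = {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}
    return {s for s in shingles if s % m == 0}

def resemblance(sketch_a, sketch_b):
    """Estimate document similarity as the Jaccard overlap of the two sketches."""
    if not sketch_a and not sketch_b:
        return 1.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)
```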

  32. References • M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2003), Washington, D.C., August 2003. • S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of SIGMOD 1995, pages 398–409. ACM Press, May 1995. • A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 ’97, pages 391–404. Elsevier Science, April 1997. • J. Callan. eRulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/, 2004. • A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002. • J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46, 1960. • J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM ’03, pages 443–452. ACM Press, November 2003. • N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191–200, November 1996. • W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf, 2003.

  33. Demo • http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/
