Near-Duplicate Detection for eRulemaking

Hui Yang, Jamie Callan

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

{huiyang, callan}@cs.cmu.edu

Presentation Outline
  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo
Introduction - I
  • U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing the final regulations.
  • Some popular regulations attract hundreds of thousands of comments from the general public.
    • In the late 1990s, the USDA’s national organic standard drew over 250,000 public comments that had to be sorted manually.
    • In 2004 the EPA’s proposed “Mercury rule” (USEPA-OAR-2002-0056) attracted over 530,000 email messages.
  • Very labor-intensive
Introduction - II
  • Things are getting worse
    • Many online form letters are available
    • Written by special interest groups
    • Modifying an electronic form letter is extremely easy
  • Special Interest Groups
    • Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action.
    • Provide information and tools to help each individual have the greatest possible impact once a group is assembled.
      • Moveon.org, http://www.moveon.org
      • GetActive, http://www.getactive.org
Introduction - III
  • Public comments will be near-duplicates if created from the same form letter.
  • Near-Duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter.
  • Goal:
    • Recognize near-duplicates and organize them
    • Find the information an individual adds to a form letter
    • Find the unique comments
  • Our research focuses on recognizing and organizing near-duplicates through text mining and clustering, while handling a large amount of data
Problem Definition - I
  • What is a near-duplicate?
    • Pugh states that “two documents are considered near duplicates if they have more than r features in common”.
    • Conrad et al. state that two documents are near duplicates if they share more than 80% of the terminology defined by human experts.
  • Our definition based on the ways to create near-duplicates
Problem Definition - II
  • Sources of Public Comments
    • from scratch (unique comments)
    • based on a form letter (exact or near-duplicates)
  • A Category-based Definition
    • Block edit
    • Key block
    • Minor change
    • Minor change + block edit
    • Exact
Presentation Outline
  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo
Feature-based Document Retrieval - I
  • Goal: retrieve a duplicate candidate set for each seed
    • Avoids working on the entire dataset
  • Steps:
    • Each seed document is broken into chunks
    • Select the most informative words from each chunk
      • a text span around t*, the term with the minimal document frequency in the chunk (see the sketch below)
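
A minimal sketch of this chunk-and-select step, assuming a tokenized corpus; the chunk size, span width, and function names are illustrative assumptions rather than the authors’ actual implementation.

```python
from collections import defaultdict

def build_df(corpus_tokens):
    """Document frequency of every term across the (tokenized) corpus."""
    df = defaultdict(int)
    for tokens in corpus_tokens:
        for term in set(tokens):
            df[term] += 1
    return df

def chunk(tokens, size=50):
    """Break a tokenized seed document into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def informative_span(chunk_tokens, df, width=3):
    """Return a short text span around t*, the chunk term with the minimal
    document frequency (the rarest, hence most informative, term)."""
    star = min(range(len(chunk_tokens)), key=lambda i: df[chunk_tokens[i]])
    lo = max(0, star - width // 2)
    return " ".join(chunk_tokens[lo:lo + width])
```

With width=3 this yields three-word spans like the quoted phrases in the structured query shown on the Query Formulation slide below.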
Feature-based Document Retrieval - II
  • Metadata extraction by information extraction (a minimal extraction sketch follows this list)
      • email senders
      • receivers
      • signatures
      • docket IDs
      • delivered dates
      • email relayers
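
A minimal sketch of how a few of these fields might be pulled from a raw comment email with regular expressions; the patterns and field names are illustrative assumptions, not the extractor actually used in the system.

```python
import re

# Illustrative patterns only; real comment email is messier than this.
FROM_RE   = re.compile(r"^From:\s*(.+)$", re.MULTILINE)
TO_RE     = re.compile(r"^To:\s*(.+)$", re.MULTILINE)
DATE_RE   = re.compile(r"^Date:\s*(.+)$", re.MULTILINE)
DOCKET_RE = re.compile(r"\b([A-Z]{2,6}-[A-Z]{2,4}-\d{4}-\d{4})\b")  # e.g. USEPA-OAR-2002-0056

def extract_metadata(raw_email):
    """Pull header-style fields (sender, receiver, date, docket ID) from raw text."""
    def first(rx):
        m = rx.search(raw_email)
        return m.group(1).strip() if m else None
    return {
        "sender":    first(FROM_RE),
        "receiver":  first(TO_RE),
        "date":      first(DATE_RE),
        "docket_id": first(DOCKET_RE),
    }
```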
Feature-based Document Retrieval - III
  • Query Formulation

#AND ( docketoar.20020056 router.moveon #OR(“standards proposed by” “will harm thousands” “unborn children for” “coal plants should” “other cleaner alternative” “by 90 by” “with national standards” “available pollution control”) )
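
A minimal sketch of how a structured query like the one above might be assembled from the extracted metadata and the informative spans; build_query is a hypothetical helper, and the #AND/#OR syntax simply mirrors the example query rather than any particular retrieval toolkit’s full grammar.

```python
def build_query(metadata_terms, spans):
    """Require all metadata terms (#AND) and let any informative span match (#OR),
    mirroring the structured query shown above."""
    quoted = " ".join('"{}"'.format(s) for s in spans)
    return "#AND( {} #OR({}) )".format(" ".join(metadata_terms), quoted)

# Values taken from the example query on this slide:
query = build_query(
    ["docketoar.20020056", "router.moveon"],
    ["standards proposed by", "will harm thousands", "unborn children for"],
)
```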

Presentation Outline
  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo
Similarity-based Clustering - I
  • Document dissimilarity based on Kullback-Leibler (KL) divergence
    • KL divergence, a distributional measure, is one way to measure the similarity of one document (one unigram distribution) given another (see the sketch below)
  • Clustering Algorithm
    • Soft, non-hierarchical clustering (partitioning)
    • Single-pass clustering with carefully selected seed documents each time
      • Close to k-means
      • No need to define k beforehand
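
A minimal sketch of the dissimilarity measure and the single-pass assignment, assuming tokenized documents; the add-alpha smoothing, the direction of the divergence, and the function names are assumptions rather than the authors’ exact choices.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=0.01):
    """Add-alpha smoothed unigram distribution over a shared vocabulary
    (smoothing keeps the KL divergence finite)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_dissimilarity(doc, centroid):
    """Dissimilarity of a document from a cluster centroid as KL(doc || centroid)."""
    vocab = set(doc) | set(centroid)
    p = unigram_dist(doc, vocab)
    q = unigram_dist(centroid, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def single_pass_cluster(seed, candidates, cutoff):
    """Single-pass assignment: every candidate close enough to the seed
    (dissimilarity below the cut-off) joins the seed's duplicate set."""
    return [d for d in candidates if kl_dissimilarity(d, seed) < cutoff]
```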
Similarity-based Clustering - II

[Diagram: duplicate candidate documents (d) retrieved for each seed are compared against the seeds and assigned by similarity to Dup Set 1 or Dup Set 2.]
Adaptive Thresholding
  • The cut-off threshold should differ from cluster to cluster
    • Documents in a cluster are sorted by their document-centroid similarity scores.
    • Sample the sorted scores at intervals of 10 documents.
    • If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval (see the sketch below).
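
A minimal sketch of one reading of this rule; the parameter names and the choice to set the new cut-off to the score at the interval’s start are assumptions.

```python
def adaptive_threshold(sorted_scores, initial_cutoff, interval=10, drop_frac=0.05):
    """Walk document-centroid similarity scores (sorted high to low) in
    10-document intervals; if the score falls by more than 5% of the initial
    cut-off inside an interval, raise the cut-off to the interval's start."""
    cutoff = initial_cutoff
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > drop_frac * initial_cutoff:
            cutoff = sorted_scores[start]
            break
    return cutoff
```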
Does Feature-based Document Retrieval Help?
  • It is quite efficient for large clusters
    • cuts the number of documents that need to be clustered from the size of the entire dataset to a manageable number
    • 536,975 -> 10,995 documents
  • Bad for small clusters (especially those containing only a single unique document)
  • Disable feature-based retrieval after most of the big clusters have been found.
    • assume that most of the remaining unclustered documents are unique
    • only similarity-based clustering is used on them
Presentation Outline
  • Introduction
  • Problem Definition
  • System Architecture
  • Feature-based Document Retrieval
  • Similarity-based Clustering
  • Evaluation and Experimental Results
  • Related Work
  • Conclusion and Demo
Evaluation Methodology - I
  • Difficult to evaluate clustering performance
    • lack of manpower to produce ground truth for a large dataset
  • Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset.
  • Assessors: two graduate research assistants
    • Manually organized the documents into clusters of documents that they felt were near-duplicates
    • Manually went through one of the experimental clustering results pair by pair (comparing document-centroid pairs)
Evaluation Methodology - II
  • Class j vs. cluster i: n_ij = number of documents of class j in cluster i, n_i = size of cluster i, n_j = size of class j, n = total number of documents
  • F-measure

p_ij = n_ij / n_i ,  r_ij = n_ij / n_j

F_ij = 2 p_ij r_ij / (p_ij + r_ij) ,  F_j = max_i {F_ij} ,  F = Σ_j (n_j / n) F_j

  • Purity

ρ_i = max_j {p_ij} ,  ρ = Σ_i (n_i / n) ρ_i

  • Pairwise measure

Fowlkes and Mallows index, computed over all document pairs with a = pairs in the same class and same cluster, b = same class but different clusters, c = different classes but same cluster, d = different in both, m = a + b + c + d:

FM = a / sqrt((a + b)(a + c))

  • Kappa

κ = (p(A) - p(E)) / (1 - p(E)) ,  p(A) = (a + d) / m ,

p(E) = ((a + b)(a + c) + (c + d)(b + d)) / m^2

(a small computation sketch follows)
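
A minimal sketch that computes these measures from gold class labels and predicted cluster labels; it follows the standard conventions above (F weighted by class size, purity by cluster size), and the helper names are illustrative.

```python
import itertools
from collections import Counter

def f_and_purity(classes, clusters):
    """Overall F-measure and purity of a clustering against gold classes."""
    n = Counter(zip(clusters, classes))   # n[(i, j)]: docs in cluster i with class j
    n_i = Counter(clusters)               # cluster sizes
    n_j = Counter(classes)                # class sizes
    N = len(classes)
    F = 0.0
    for j, size_j in n_j.items():
        best = 0.0
        for i, size_i in n_i.items():
            nij = n[(i, j)]
            if nij == 0:
                continue
            p, r = nij / size_i, nij / size_j
            best = max(best, 2 * p * r / (p + r))
        F += size_j / N * best            # F = sum_j (n_j / n) * max_i F_ij
    purity = sum(max(n[(i, j)] for j in n_j) / N for i in n_i)
    return F, purity

def pairwise_kappa(classes, clusters):
    """Cohen's kappa over document pairs (same/different cluster vs. class)."""
    a = b = c = d = 0
    for (c1, k1), (c2, k2) in itertools.combinations(zip(classes, clusters), 2):
        same_class, same_cluster = c1 == c2, k1 == k2
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    m = a + b + c + d
    pA = (a + d) / m
    pE = ((a + b) * (a + c) + (c + d) * (b + d)) / (m * m)
    return (pA - pE) / (1 - pE)
```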

Conclusion
  • Large Volume Working Set
  • Duplicate Definition and Automatic Evaluation
  • Feature-based Duplicate Candidate Retrieval
  • Similarity-based Clustering
  • Improved Efficiency
Related Work - I
  • Duplicate detection in other domains:
    • databases [Bilenko and Mooney 2003]
      • to find records referring to the same entity but possibly in different representations
    • electronic publishing [Brin et al. 1995]
      • to detect plagiarism or to identify different versions of the same document.
    • web search [Chowdhury et al. 2002] [Pugh 2004]
      • more efficient web crawling
      • more effective search-result ranking
      • easier archiving of web documents
Related Work - II
  • Fingerprinting
    • compute a compact description of a document, then do pair-wise comparison of document fingerprints
  • Shingling [Broder et al.] (a minimal shingling sketch follows this list)
    • represents a document as a series of simple numeric encodings of each n-term window
    • retains every m-th shingle to produce a document sketch
    • super-shingles
  • Selective fingerprinting [Heintze]
    • selects a subset of the substrings to generate fingerprints
  • Statistical approach [Chowdhury et al.]
    • n high-idf terms
    • improved accuracy over shingling
    • efficient: about one-fifth of the running time of shingling
  • Fingerprint reliability in a dynamic environment [Conrad et al.]
    • considers the time factor on the Web
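
A minimal sketch of the shingling idea described above; hashing with MD5 and keeping shingles whose hash is divisible by m (one common way to realize “retain every m-th shingle”) are assumptions, as are the parameter values.

```python
import hashlib

def shingles(tokens, n=4):
    """Numeric encodings of every n-term window (shingles) in a document."""
    return [
        int(hashlib.md5(" ".join(tokens[i:i + n]).encode()).hexdigest(), 16)
        for i in range(len(tokens) - n + 1)
    ]

def sketch(tokens, n=4, m=25):
    """Document sketch: keep only shingles whose hash is divisible by m,
    which retains roughly every m-th shingle on average."""
    return {s for s in shingles(tokens, n) if s % m == 0}

def resemblance(tokens_a, tokens_b):
    """Jaccard overlap of two sketches; high overlap suggests near-duplicates."""
    a, b = sketch(tokens_a), sketch(tokens_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)
```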
References
  • M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003.
  • S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398–409. ACM Press, May 1995.
  • A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 ’97, pages 391–404. Elsevier Science, April 1997.
  • J. Callan. E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004.
  • A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. In ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002.
  • J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
  • J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM’03, pages 443–452. ACM Press, Nov. 2003.
  • N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX electronic Commerce Workshop, pages 191–200, Nov. 1996.
  • W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003.
Demo
  • http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/