
An Unsupervised Approach for the Detection of Outliers in Corpora






  1. An Unsupervised Approach for the Detection of Outliers in Corpora David Guthrie, Louise Guthrie, Yorick Wilks The University of Sheffield

  2. Corpora in CL • Increasingly common in computational linguistics to use textual resources gathered automatically • IR, scraping Web, etc. • Construct corpora from specific blogs, bulletin boards, websites (Wikipedia, RottenTomatoes)

  3. Corpora Can Contain Errors • IR and scraping can lead to errors in precision • Can contain entries that might be considered spam: • Advertising • gibberish messages • (more subtly) information that is an opinion rather than a fact, rants about political figures

  4. Difficult to verify • The quality of corpora has a dramatic impact on the results of QA, ASR, TC, etc. • Creation and validation of corpora has generally relied on humans

  5. Goals • Improve the consistency and quality of corpora • Automatically identify and remove text from corpora that does not belong

  6. Approach • Treat the problem as a type of outlier detection • We aim to find pieces of text in a corpus that differ significantly from the majority of text in that corpus and thus are ‘outliers’

  7. Method • Characterize each piece of text (document, segment, paragraph, …) in our corpus as a vector of features • Use these vectors to construct a matrix, X, which has number of rows equal to the pieces of text in the corpus and number of columns equal to the number of features

  8. Represent each piece of text as a vector of features Feature Matrix X

  9. Characterizing Text • 158 features computed for every piece of text (many of which have been used successfully for genre classification by Biber, Kessler, Argamon, …) • Simple Surface Features • Readability Measures • POS Distributions (RASP) • Vocabulary Obscurity • Emotional Affect (General Inquirer Dictionary)
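As a rough illustration only (not the authors' feature set), a few simple surface features can be computed per piece of text and stacked into the feature matrix X described above; the real system uses 158 features, including readability measures, POS distributions, vocabulary obscurity, and emotional affect.

```python
# Illustrative sketch: a handful of simple surface features standing in
# for the paper's 158 features, stacked into the feature matrix X.
import numpy as np

def surface_features(text):
    # Hypothetical minimal feature set for illustration.
    words = text.split()
    sentences = [s for s in text.split('.') if s.strip()]
    n_words = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1),           # mean sentence length
        sum(len(w) for w in words) / n_words,          # mean word length
        len(set(w.lower() for w in words)) / n_words,  # type-token ratio
    ]

corpus = [
    "Shares rose on Monday. Analysts expect further gains.",
    "Watch out!!! When the fuse contacts the balloon, watch out!!!",
]
X = np.array([surface_features(t) for t in corpus])
print(X.shape)  # one row per piece of text, one column per feature
```

Each row of X is then one piece of text; each column, one feature, matching the matrix on the previous slide.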

  10. Identify outlying Text Feature Matrix X

  11. Outliers are ‘hidden’

  12. SDE • Use the Stahel-Donoho Estimator (SDE) to identify outliers • Project the data down to one dimension and measure the outlyingness of each piece of text in that dimension • For every piece of text, the goal is to find a projection of the data that maximizes its robust z-score • Especially suited to data with a large number of dimensions (features)

  13. Robust z-score of the furthest point is < 3

  14. Robust z-score for the triangles in this projection is > 12 standard deviations

  15. SDE • SD(xi) = max over all unit directions a of |xi·a − median_j(xj·a)| / mad_j(xj·a) • where a is a direction (unit-length vector) • xi·a is the projection of row xi onto direction a • mad is the median absolute deviation
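A minimal sketch of this outlyingness measure, approximating the maximum over all directions with randomly sampled unit vectors (an assumption; the slide does not specify how directions are chosen):

```python
# Sketch of Stahel-Donoho outlyingness via random projections
# (not the authors' exact implementation).
import numpy as np

def mad(v):
    # Median absolute deviation (unnormalized).
    return np.median(np.abs(v - np.median(v)))

def sde_outlyingness(X, n_dirs=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(n)
    for _ in range(n_dirs):
        a = rng.normal(size=p)
        a /= np.linalg.norm(a)           # unit-length direction a
        proj = X @ a                     # project every row onto a
        s = mad(proj)
        if s == 0:
            continue
        z = np.abs(proj - np.median(proj)) / s  # robust z-scores
        scores = np.maximum(scores, z)   # keep each row's worst direction
    return scores
```

Each text's score is its robust z-score in the direction that makes it look most outlying; inliers stay small under every projection, while a genuine outlier stands out in at least one.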

  16. Outliers have a large SD • The scores SD(xi) for each piece of text are then sorted, and all pieces of text above a cutoff are marked as outliers

  17. Experiments • In each experiment we randomly select 50 segments of text from the Gigaword corpus of newswire and insert one piece of text from a different source to act as an ‘outlier’ • Measure the accuracy of automatically identifying the inserted segment as an outlier • We varied the size of the pieces of text from 100 to 1000 words
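The evaluation loop above can be sketched as follows; here `detect_outlier` is an assumed interface (any function that returns the index of the segment it judges most outlying), and the segment pools are plain lists of strings:

```python
# Illustrative sketch of the evaluation protocol from the experiments slide.
import random

def run_trial(newswire_pool, outlier_segment, detect_outlier, rng):
    sample = rng.sample(newswire_pool, 50)      # 50 random newswire segments
    insert_at = rng.randrange(51)               # insert the outlier anywhere
    sample.insert(insert_at, outlier_segment)
    return detect_outlier(sample) == insert_at  # was the insertion found?

def accuracy(newswire_pool, outlier_pool, detect_outlier, trials=200, seed=0):
    rng = random.Random(seed)
    hits = sum(run_trial(newswire_pool, rng.choice(outlier_pool),
                         detect_outlier, rng)
               for _ in range(trials))
    return hits / trials
```

A perfect detector scores 1.0 on this protocol; random guessing scores about 1/51 ≈ 1.96%.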

  18. Anarchist Cookbook • Very different genre from newswire. The writing is much more procedural (e.g. instructions to build telephone phreaking devices) and also very informal (e.g. ``When the fuse contacts the balloon, watch out!!!'') • Randomly insert one segment from the Anarchist Cookbook and attempt to identify it as the outlier • This is repeated 200 times for each segment size (100, 500, and 1,000 words)

  19. Cookbook Results • Remember we are not using any training data, and there is only a 1/51 (1.96%) chance of guessing the outlier correctly

  20. Machine Translations • 35 thousand words of Chinese news articles were hand-picked (Wei Liu) and translated into English using Google's Chinese-to-English translation engine • Similar genre to English newswire, but the translations are far from perfect and so the language use is very odd • 200 test collections are created for each segment size, as before

  21. MT Results

  22. Conclusions and Future Work • Outlier detection can be a valuable tool for corpus linguistics (if we want a homogeneous corpus) • Automatically clean corpora • Does not require training data or human annotation • This method can be used reliably for relatively large pieces of text (1,000 words); the threshold could be adjusted to ensure high precision at the expense of recall • Looking at ways to increase accuracy by more intelligently picking directions for SDE and the cutoff to use for outliers
