Genre-based decomposition of email class noise

This study explores the presence and measurement of class noise in email classification, specifically focusing on genre-based decomposition. The research examines the challenges of classifying emails accurately and proposes methods to create highly accurate models in the presence of noisy labels. The study also analyzes the distribution of class noise among different email genres.

Presentation Transcript


  1. Genre-based decomposition of email class noise. Aleksander Kołcz (Microsoft), Gordon Cormack (University of Waterloo)

  2. Class noise. Noise in data mining takes two forms: measurement/feature noise and class noise (label mis-assignment). Class noise can lead to inferior model induction; if its rate is high enough, models can be quite poor, and if it is present in test data it can lead to the wrong conclusions. In real data, class noise is not really equivalent to flipping class labels uniformly at random.

  3. Email spam detection. A complex problem with data non-stationarity and adversarial play. Close to 100% accuracy is desired, especially when it comes to false positives (misclassifying emails from the boss, payment receipts, etc., can be very costly). Yet people have a hard time classifying even their own email in a consistent way. The problem is more pronounced for community filtering, where the same messages may be considered spam or ham by different users. How can we create and tune highly accurate models when training and test data contain noisy labels?

  4. Class noise in email data. Inter-assessor disagreement on the same corpus can be as high as 5%. A single person can disagree with themselves too (e.g., 0.1-1% is not uncommon). Class noise is not uniform: studies that flip class labels uniformly at random find that machine learning methods cope with such data better than with natural class noise.

  5. Measuring/identifying class noise. Often studied from the perspective of data cleaning/preprocessing. One signal is disagreement within an ensemble of classifiers, mimicking the common data-cleaning procedure of taking votes of human assessors. Another is a confident decision of a single classifier indicating the opposite class, which depends on the classifier’s ability to express confidence (calibration). Inconsistent instances can be removed from the training pool or have their labels “corrected”; down-weighting potentially noisy cases is also an option.
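
A minimal sketch of the ensemble-disagreement idea (not the exact procedure from the talk), using scikit-learn with a hypothetical feature matrix X and label vector y; examples flagged as suspect can then be removed, relabeled, or down-weighted:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def flag_suspect_labels(X, y, n_folds=5):
    """Flag examples whose given label disagrees with the majority of an
    ensemble of cross-validated classifiers (a common cleaning heuristic).
    Assumes non-negative count features (e.g., bag of words) for the NB member."""
    members = [LogisticRegression(max_iter=1000), MultinomialNB(), LinearSVC()]
    # Out-of-fold predictions, so every example is judged by models that
    # never saw it during training.
    preds = np.stack([cross_val_predict(m, X, y, cv=n_folds) for m in members])
    agreeing = (preds == y).sum(axis=0)        # members agreeing with the given label
    return agreeing <= len(members) // 2       # True = suspected class noise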

  6. Specificity of the email domain. Labeling confusion between classes can be problem-dependent; past studies considered pair-wise variation in class confusion for multi-class problems. In email data, messages of a commercial type are easier to confuse with spam than ones of a personal nature. We hypothesize that class noise is heavily dependent on email genre.

  7. Email genres

  8. Genres vs. sub-categories. Genres can generally span the class boundary (e.g., there can be both legitimate and spammy advertisements), whereas sub-categories are considered class-specific. This makes it natural to specify sub-category-specific misclassification costs.

  9. Questions • What is a better way of measuring human disagreement: classifier disagreement or classifier confidence? • How is class noise distributed among genres? • How can knowledge of this distribution be used to our advantage?

  10. Datasets. TREC-05: a combination of Enron data, spam traps, and personal mailbox data (52,790 spam and 39,399 ham), hand-labeled over multiple passes by 2 assessors (this is the gold standard). SpamOrHam: a version of TREC-05 where messages were labeled by multiple assessors (different from those responsible for the gold standard), 342,771 labels in total. CEAS-08: 109,123 spam and 89,451 ham messages labeled by end-users and collected by a large email service provider.

  11. Online filtering setup. Used in the TREC/CEAS filtering competitions. Messages are sorted in arrival order; when making a decision, filters can only use past messages for training. The initial 1,200 messages are not included in the evaluation.

  12. Classifiers: Logistic Regression (online version), DMC (compression-based), ROSVM (online SVM), Bogofilter (an online variant of Naïve Bayes). Features: unigrams and character n-grams. Evaluation: AUC, 1-AUC.
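
A minimal sketch of the online train-on-the-past evaluation loop described above. The hashed character 4-gram features and plain SGD logistic regression are stand-ins for illustration, not the talk's actual filters:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Character 4-grams hashed into a fixed-size feature space.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(4, 4), n_features=2**20)
clf = SGDClassifier(loss="log_loss")  # an online logistic regression

def online_filter(messages, labels, burn_in=1200):
    """Process messages in arrival order: score each message using only past
    messages for training, then learn from its (possibly noisy) label.
    The first `burn_in` messages are excluded from the evaluation."""
    scores, gold = [], []
    classes = np.array([0, 1])                     # ham = 0, spam = 1
    for i, (text, y) in enumerate(zip(messages, labels)):
        x = vectorizer.transform([text])
        if i >= burn_in:
            scores.append(clf.decision_function(x)[0])
            gold.append(y)
        clf.partial_fit(x, [y], classes=classes)   # training strictly follows scoring
    auc = roc_auc_score(gold, scores)
    return auc, 1.0 - auc                          # evaluation: AUC and 1-AUC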

  13. Is classifier agreement consistent with assessor agreement? For the SpamOrHam data, an example is positive if all assessors agree on its class label and negative otherwise. Similarly, a prediction is positive if all members of an ensemble agree on it. Faithfulness in terms of 1-AUC is 42%, which is better than random (50%) but not all that great. Classifiers and humans achieve consensus on different types of content.
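
One plausible way to compute this faithfulness measure, assuming the two unanimity indicators are compared directly (an assumption; the slide does not spell out the computation):

from sklearn.metrics import roc_auc_score

def faithfulness(assessor_unanimous, ensemble_unanimous):
    """1-AUC when ensemble unanimity is used as a binary score for predicting
    assessor unanimity; 0.5 corresponds to a random predictor, lower is better."""
    auc = roc_auc_score(assessor_unanimous, ensemble_unanimous)
    return 1.0 - auc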

  14. Is classifier self-confidence consistent with assessor agreement? Surprisingly, it is much more so than inter-classifier agreement.

  15. Estimating human assessor error rate

  16. Estimating human assessor error rate. Assume all assessors are independent and have the same error rate e. The probability u that 3 assessors are unanimous is u = (1 - e)^3 + e^3, which can be solved for e. The SpamOrHam data allows us to measure u directly, and thus to estimate e. Moreover, this can be done in a class-specific or genre-specific manner.
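
A short check of the algebra, not part of the slides: expanding gives u = 1 - 3e + 3e^2, so e is the smaller root of 3e^2 - 3e + (1 - u) = 0, i.e. e = (3 - sqrt(12u - 3)) / 6. The snippet below reproduces the totals reported on the following slides:

import math

def assessor_error_rate(u):
    """Solve u = (1 - e)^3 + e^3 for the per-assessor error rate e,
    taking the root with e <= 0.5."""
    return (3.0 - math.sqrt(12.0 * u - 3.0)) / 6.0

# 1 - u = 14.9% -> e ≈ 5.2%; ham: 1 - u = 21.6% -> e ≈ 7.8%; spam: 1 - u = 10.1% -> e ≈ 3.5%
for one_minus_u in (0.149, 0.216, 0.101):
    print(round(assessor_error_rate(1.0 - one_minus_u), 3))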

  17. Overall estimate. For SpamOrHam, a single assessor's labeling error rate is 5.2% (which is quite high). If it were their own email, the error rate could be expected to be lower, but most likely we overestimate our own ability to label messages correctly!

  18. Genre breakdown. Totals: 1 - u = 14.9%, e = 5.2%.

  19. Genre breakdown for ham. Totals for ham: 1 - u = 21.6%, e = 7.8%.

  20. Genre breakdown for spam. Totals for spam: 1 - u = 10.1%, e = 3.5%.

  21. Noise vs. prevalence

  22. Lessons learned. The labeling error rate is quite high. Labeling errors vary by class and genre. For a genre spanning the class boundary, labeling errors are lower for the class in which the genre is more prevalent.

  23. Reliability indicators • In ensemble learning it has been suggested to use reliability indicators to help identify the classifier with the highest competence. • Genre information can be thought of as a reliability indicator too. • In this case it indicates the reliability of the class labels provided during model induction.

  24. Using genres in classifying spam. Learn a spam classifier as well as genre classifiers. Use winner-take-all to select the genre for each message. Use the confidence of the winning genre classifier, together with the score of the regular spam classifier, as inputs to a meta-classifier (see the sketch after the diagram below). Other similar setups are possible.

  25. Meta-classifier (diagram): the regular classifier's score and the winner-take-all genre classifier's score feed into a meta-classifier, which produces the final score.
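
A minimal sketch of one plausible realization of this two-stage setup (the base learners and the use of one-vs-rest genre models are assumptions; the talk does not prescribe them):

import numpy as np
from sklearn.linear_model import LogisticRegression

class GenreAwareSpamFilter:
    """Combines a regular spam classifier with per-genre classifiers:
    the spam score and the winner-take-all genre confidence are fed
    into a meta-classifier that produces the final score."""

    def __init__(self, spam_clf, genre_clfs):
        self.spam_clf = spam_clf        # already-trained spam/ham classifier
        self.genre_clfs = genre_clfs    # dict: genre name -> trained one-vs-rest genre classifier
        self.meta_clf = LogisticRegression()

    def _meta_features(self, X):
        spam_score = self.spam_clf.decision_function(X)
        genre_scores = np.column_stack(
            [clf.decision_function(X) for clf in self.genre_clfs.values()])
        winner_confidence = genre_scores.max(axis=1)   # winner-take-all over genres
        return np.column_stack([spam_score, winner_confidence])

    def fit_meta(self, X_heldout, y_heldout):
        # Fit the meta-classifier on held-out data so the base models'
        # training labels are not reused.
        self.meta_clf.fit(self._meta_features(X_heldout), y_heldout)

    def final_score(self, X):
        return self.meta_clf.decision_function(self._meta_features(X))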

  26. Results (1-AUC)%

  27. Effect of genres. The improvements in classification performance are statistically significant (and quite prominent for TREC-05). For CEAS-08 the evaluation used noisy labels, so the actual level of improvement could be higher. Transferring a genre classifier from one corpus to another is surprisingly robust. Genre-membership information could be providing information about the reliability of the target class label.

  28. Conclusions. Class noise in email is biased by the type of content. For genres spanning the class boundary, noise is more prevalent for the class that is less frequent for the genre. Classifier confidence is a good substitute for inter-assessor disagreement. Genre-membership information can be a useful feature for improving classification performance.
