Genre-based decomposition of email class noise

This study explores the presence and measurement of class noise in email classification, specifically focusing on genre-based decomposition. The research examines the challenges of classifying emails accurately and proposes methods to create highly accurate models in the presence of noisy labels. The study also analyzes the distribution of class noise among different email genres.

Presentation Transcript


  1. Genre-based decomposition of email class noise. Aleksander Kołcz (Microsoft), Gordon Cormack (University of Waterloo)

  2. Class noise. Noise in data mining takes two forms: measurement/feature noise and class noise (label mis-assignment). Class noise can lead to inferior model induction; if its rate is high enough, models can be quite poor, and if it is present in test data it can lead to the wrong conclusions. In real data, class noise is not really equivalent to flipping class labels uniformly at random.

  3. Email spam detection. A complex problem with data non-stationarity and adversarial play. Close to 100% accuracy is desired, especially when it comes to false positives (misclassifying emails from the boss, payment receipts, etc., can be very costly). Yet people have a hard time classifying even their own email in a consistent way. The problem is more pronounced for community filtering, where the same messages may be considered spam or ham by different users. How can we create and tune highly accurate models when training and test data contain noisy labels?

  4. Class noise in email data. Inter-assessor disagreement on the same corpus can be as high as 5%. A single person can disagree with themselves too (e.g., 0.1-1% is not uncommon). Class noise is not uniform: studies that flip class labels uniformly at random find that machine learning methods cope with such data better than with natural class noise.

  5. Measuring/identifying class noise. Often studied from the perspective of data cleaning/preprocessing. One signal is disagreement within an ensemble of classifiers, mimicking the common data-cleaning procedure of taking votes of human assessors. Another is a confident decision of a single classifier indicating the opposite class, which depends on the classifier’s ability to express confidence (calibration). Inconsistent instances can be removed from the training pool or have their labels “corrected”; down-weighting potentially noisy cases is also an option.
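
A minimal sketch of the ensemble-disagreement idea (not the exact procedure from the talk), using scikit-learn with a hypothetical feature matrix X and label vector y; examples flagged as suspect can then be removed, relabeled, or down-weighted:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def flag_suspect_labels(X, y, n_folds=5):
    """Flag examples whose given label disagrees with the majority of an
    ensemble of cross-validated classifiers (a common cleaning heuristic).
    Assumes non-negative count features (e.g., bag of words) for the NB member."""
    members = [LogisticRegression(max_iter=1000), MultinomialNB(), LinearSVC()]
    # Out-of-fold predictions, so every example is judged by models that
    # never saw it during training.
    preds = np.stack([cross_val_predict(m, X, y, cv=n_folds) for m in members])
    agreeing = (preds == y).sum(axis=0)        # members agreeing with the given label
    return agreeing <= len(members) // 2       # True = suspected class noise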

  6. Specificity of the email domain. Labeling confusion between classes can be problem-dependent; past studies considered pair-wise variation in class confusion for multi-class problems. In email data, messages of a commercial type are easier to confuse with spam than ones of a personal nature. We hypothesize that class noise is heavily dependent on email genre.

  7. Email genres

  8. Genres vs. sub-categories. Genres can generally span the class boundary (e.g., there can be both legitimate and spammy advertisements), whereas sub-categories are considered class-specific. This makes it natural to specify sub-category-specific misclassification costs.

  9. Questions • What is a better way of measuring human disagreement: classifier disagreement or classifier confidence? • How is class noise distributed among genres? • How can knowledge of this distribution be used to our advantage?

  10. Datasets. TREC-05: a combination of Enron data, spam traps, and personal mailbox data (52,790 spam and 39,399 ham), hand-labeled over multiple passes by 2 assessors (this is the gold standard). SpamOrHam: a version of TREC-05 where messages were labeled by multiple assessors (different from those responsible for the gold standard), 342,771 labels in total. CEAS-08: 109,123 spam and 89,451 ham messages labeled by end-users and collected by a large email service provider.

  11. Online filtering setup. Used in the TREC/CEAS filtering competitions. Messages are sorted in arrival order; when making a decision, filters can only use past messages for training. The initial 1,200 messages are not included in the evaluation.

  12. Classifiers: Logistic Regression (online version), DMC (compression-based), ROSVM (online SVM), Bogofilter (an online variant of Naïve Bayes). Features: unigrams and character n-grams. Evaluation: AUC, 1-AUC.
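
A minimal sketch of the online train-on-the-past evaluation loop described above. The hashed character 4-gram features and plain SGD logistic regression are stand-ins for illustration, not the talk's actual filters:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Character 4-grams hashed into a fixed-size feature space.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(4, 4), n_features=2**20)
clf = SGDClassifier(loss="log_loss")  # an online logistic regression

def online_filter(messages, labels, burn_in=1200):
    """Process messages in arrival order: score each message using only past
    messages for training, then learn from its (possibly noisy) label.
    The first `burn_in` messages are excluded from the evaluation."""
    scores, gold = [], []
    classes = np.array([0, 1])                     # ham = 0, spam = 1
    for i, (text, y) in enumerate(zip(messages, labels)):
        x = vectorizer.transform([text])
        if i >= burn_in:
            scores.append(clf.decision_function(x)[0])
            gold.append(y)
        clf.partial_fit(x, [y], classes=classes)   # training strictly follows scoring
    auc = roc_auc_score(gold, scores)
    return auc, 1.0 - auc                          # evaluation: AUC and 1-AUC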

  13. Is classifier agreement consistent with assessor agreement? For the SpamOrHam data, an example is positive if all assessors agree on its class label and negative otherwise. Similarly, a prediction is positive if all members of an ensemble agree on it. Faithfulness in terms of 1-AUC is 42%, which is better than random (50%) but not all that great. Classifiers and humans achieve consensus on different types of content.
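
One plausible way to compute this faithfulness measure, assuming the two unanimity indicators are compared directly (an assumption; the slide does not spell out the computation):

from sklearn.metrics import roc_auc_score

def faithfulness(assessor_unanimous, ensemble_unanimous):
    """1-AUC when ensemble unanimity is used as a binary score for predicting
    assessor unanimity; 0.5 corresponds to a random predictor, lower is better."""
    auc = roc_auc_score(assessor_unanimous, ensemble_unanimous)
    return 1.0 - auc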

  14. Is classifier self-confidence consistent with assessor agreement? Surprisingly, it is much more so than inter-classifier agreement.

  15. Estimating human assessor error rate

  16. Estimating human assessor error rate. Assume all assessors are independent and have the same error rate e. The probability u that 3 assessors are unanimous is u = (1 - e)^3 + e^3, which can be solved for e. The SpamOrHam data allows us to measure u directly, and thus to estimate e. Moreover, this can be done in a class-specific or genre-specific manner.
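
A short check of the algebra, not part of the slides: expanding gives u = 1 - 3e + 3e^2, so e is the smaller root of 3e^2 - 3e + (1 - u) = 0, i.e. e = (3 - sqrt(12u - 3)) / 6. The snippet below reproduces the totals reported on the following slides:

import math

def assessor_error_rate(u):
    """Solve u = (1 - e)^3 + e^3 for the per-assessor error rate e,
    taking the root with e <= 0.5."""
    return (3.0 - math.sqrt(12.0 * u - 3.0)) / 6.0

# 1 - u = 14.9% -> e ≈ 5.2%; ham: 1 - u = 21.6% -> e ≈ 7.8%; spam: 1 - u = 10.1% -> e ≈ 3.5%
for one_minus_u in (0.149, 0.216, 0.101):
    print(round(assessor_error_rate(1.0 - one_minus_u), 3))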

  17. Overall estimate. For SpamOrHam, a single assessor's labeling error rate is 5.2% (which is quite high). If it were their own email, the error rate could be expected to be lower, but most likely we overestimate our own ability to label messages correctly!

  18. Genre breakdown. Totals: 1 - u = 14.9%, e = 5.2%.

  19. Genre breakdown for ham. Totals for ham: 1 - u = 21.6%, e = 7.8%.

  20. Genre breakdown for spam. Totals for spam: 1 - u = 10.1%, e = 3.5%.

  21. Noise vs. prevalence

  22. Lessons learned. The labeling error rate is quite high. Labeling errors vary by class and genre. For a genre spanning the class boundary, labeling errors are lower for the class in which the genre is more prevalent.

  23. Reliability indicators • In ensemble learning it has been suggested to use reliability indicators to help identify the classifier with the highest competence. • Genre information can be thought of as a reliability indicator too. • In this case it indicates the reliability of the class labels provided during model induction.

  24. Using genres in classifying spam. Learn a spam classifier as well as genre classifiers. Use winner-take-all to select the genre for each message. Use the confidence of the winning genre classifier, together with the score of the regular spam classifier, as inputs to a meta-classifier (see the sketch after the diagram below). Other similar setups are possible.

  25. Meta-classifier (diagram): the regular classifier's score and the winner-take-all genre classifier's score feed into a meta-classifier, which produces the final score.
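
A minimal sketch of one plausible realization of this two-stage setup (the base learners and the use of one-vs-rest genre models are assumptions; the talk does not prescribe them):

import numpy as np
from sklearn.linear_model import LogisticRegression

class GenreAwareSpamFilter:
    """Combines a regular spam classifier with per-genre classifiers:
    the spam score and the winner-take-all genre confidence are fed
    into a meta-classifier that produces the final score."""

    def __init__(self, spam_clf, genre_clfs):
        self.spam_clf = spam_clf        # already-trained spam/ham classifier
        self.genre_clfs = genre_clfs    # dict: genre name -> trained one-vs-rest genre classifier
        self.meta_clf = LogisticRegression()

    def _meta_features(self, X):
        spam_score = self.spam_clf.decision_function(X)
        genre_scores = np.column_stack(
            [clf.decision_function(X) for clf in self.genre_clfs.values()])
        winner_confidence = genre_scores.max(axis=1)   # winner-take-all over genres
        return np.column_stack([spam_score, winner_confidence])

    def fit_meta(self, X_heldout, y_heldout):
        # Fit the meta-classifier on held-out data so the base models'
        # training labels are not reused.
        self.meta_clf.fit(self._meta_features(X_heldout), y_heldout)

    def final_score(self, X):
        return self.meta_clf.decision_function(self._meta_features(X))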

  26. Results (1-AUC)%

  27. Effect of genres. The improvements in classification performance are statistically significant (and quite prominent for TREC-05). For CEAS-08 the evaluation used noisy labels, so the actual level of improvement could be higher. Transferring a genre classifier from one corpus to another is surprisingly robust. Genre-membership information could be providing information about the reliability of the target class label.

  28. Conclusions. Class noise in email is biased by the type of content. For genres spanning the class boundary, noise is more prevalent for the class that is less frequent for the genre. Classifier confidence is a good substitute for inter-assessor disagreement. Genre-membership information can be a useful feature for improving classification performance.
