1 / 1

Goal: Learn to automatically File e-mails into folders Filter spam e-mail Motivation

LINGER – A Smart Personal Assistant for E-mail Classification James Clark, Irena Koprinska, and Josiah Poon School of Information Technologies, University of Sydney, Sydney, Australia, e-mail: {jclark, irena, josiah}@it.usyd.edu.au. Goal: Learn to automatically File e-mails into folders

Download Presentation

Goal: Learn to automatically File e-mails into folders Filter spam e-mail Motivation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LINGER – A Smart Personal Assistant for E-mail ClassificationJames Clark, Irena Koprinska, and Josiah Poon School of Information Technologies, University of Sydney,Sydney, Australia, e-mail: {jclark, irena, josiah}@it.usyd.edu.au • Goal: Learn to automatically • File e-mails into folders • Filter spam e-mail • Motivation • Information overload - we are spending more and more time filtering e-mails and organizing them into folders in order to facilitate retrieval when necessary • Weaknesses of the programmable automatic filtering provided by the modern e-mail software (rules to organize mail into folders or spam mail filtering based on keywords): - Most users do not create such rules as they find it difficult to use the software or simply avoid customizing it - Manually constructing robust rules is difficult as users are constantly creating, deleting, reorganizing their folders - The nature of the e-mails within the folder may well drift over time and the characteristics of the spam e-mail (e.g. topics, frequent terms) also change over time => the rules must be constantly tuned by the user that is time consuming and error-prone • Results • Performance Measures • accuracy (A), recall (R), precision (P) and F1measure • Stratified 10-fold cross validation Filing Into Folders Overall Performance • The simpler feature selector V is more effective than IG • U2 and U4 were harder to classify than U1, U3 and U5: - different classification styles: • U1, U3, U5 - based on the topic and sender, U2 - on the action performed (e.g. Read&Keep), U4 - on the topic, sender, action performed and also when e-mails needed to be acted upon (e.g. ThisWeek) • large ratio of the number of folders over the number of e-mails for U2 and U4 • Comparison with other classifiers • Effect of Normalization and Weighting Accuracy [%] for various normalization (e – e-mail, m- mailbox, g - global) and weighting (freq. – frequency, tf-idf and boolean) • Best results: mailbox level normalization, tf-idf and frequency weighting Spam Filtering • Cost-Sensitive Performance Measures • Blocking a legitimate message is  times more costly than non-blocking a spam message • Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as  errors/successes Overall Performance Performance on spam filtering for lemm corpora • 3 scenarios: =1, no cost (flagging spam e-mail), =9, moderately accurate filter (notifying senders about blocked e-mails) and =999, highly accurate filter (completely automatic scenario) • LingerIG – perfect results on both PU1 and LingSpam for all ; LingerV outperformed only by stumps and boosted trees Effect of Stemming and Stop Word List– do not help Performance on the 4 different versions of LingSpam Anti-Spam Filter Portability Across Corpora • a) and b) – low SP; many fp (non spam as spam) - Reason: different nature of legitimate e-mail in LingSpam (linguistics related) and U5Spam (more diverse) • - Features selected from LingSpam are too specific, not a good predictor for U5Spam (a) • b) – low SR as well; many tp (spam as non-spam) - Reason: U5Spam is considerably smaller than LingSpam a) Typical confusion matrices b) • c) and d) – good results -Not perfect feature selection but the NN is able to recover by training • More extensive experiments with diverse, non topic specific corpora, are needed to determine the portability of anti-spam filters across different users • LINGER Based on Text Categorization • Bags of wordsrepresentation – all unique words in the entire training corpus are identified • Feature selection chooses the most important words and reduces dimensionality – Inf. Gain (IG), Variance (V) • Feature representation – normalized weighting for every word, representing its importance in the document - Weightings: binary, term frequency, term frequency inverse document frequency (tf-idf) - Normalization at 3 levels (e-mail, mailbox, corpus) • Classifier – neural network (NN), why? - require considerable time for parameter selection and training (-) but can achieve very accurate results; successfully applied in many real world applications (+) - NN trained with backpropagation, 1 output for each class, 1 hidden layer of 20-40 neurons; early stopping based on validation set or a max. # epochs (10 000) Pre-processing for words extraction • E-mail fields used - body, sender (From, Reply-to), recipient (To, CC, Bcc) and Subject (attachments seen as part of body) - These fields are treated equally and a single bag of word representation is created for each e-mail - No stemming or stop wording were applied. • Discarding words that only appear once in a corpus • Removing words longer than 20 char from the body • # unique words in a corpus reduced from 9000 to 1000 • Corpora Filing into folders Spam e-mail filtering • 4 versions of PU1 and LingSpam depending on whether stemming and stop word list were used (bare, lemm,stop and lemm_stop)

More Related