Enron Corpus: A New Dataset for Email Classification

Enron Corpus: A New Dataset for Email Classification. By Bryan Klimt and Yiming Yang, CEAS 2004. Presented by Will Lee.

Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang

CEAS 2004

Presented by Will Lee

Introduction

  • Motivation

  • Related Works

  • The Enron Corpus

  • Methods

  • Evaluation

  • Thread Information

  • Conclusion

Motivation

  • Other corpora focus on newsgroups or personal email data

  • No common data set exists for evaluating the performance of email classification

    • Previous research uses different personal data sets

  • Difficult to find data on the actual use of email within a company

    • Obviously, companies do not like to share their internal emails

    • Privacy concerns for people working for the company

Related Works

  • Other corpora

    • 20 Newsgroups

      • http://people.csail.mit.edu/people/jrennie/20Newsgroups/

  • Related Papers

    • Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)

    • I. Androutsopoulos et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)

    • T. Payne, Learning Email Filtering Rules with Magi (Thesis, 1994)

20 Newsgroups

  • Collection of approximately 20,000 newsgroup documents, spread out evenly across 20 different newsgroups

  • Sample newsgroups:

    • comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.

  • Originally used in Ken Lang’s paper “NewsWeeder: Learning to Filter Netnews” (ICML 1995)

  • Built from newsgroup data, so probably not very useful for research in personal information management

Enron Dataset

  • 619,446 messages (200,399 after cleaning) by 158 users

  • Average 757 messages per user

  • Shows most users do use folders to organize emails

  • Can use folder information to evaluate effectiveness for folder classification

Enron Corpus’ Characteristics

  • Number of messages per user varies from a few messages to more than 10,000

  • The upper bound on the number of folders seems to correlate with log(number of messages)

  • The lower bound on the number of folders does not correlate with the number of messages (a user can have many messages but only a few folders)

  • Question: how can we use this kind of information?

Email Classification Features

  • Constructive text

    • Bag-of-words (BOW) approach; the most commonly used feature

    • Some fields are more important than others

    • Stemming and stop word removal are used, but their effectiveness is not proven

  • Categorical text

    • “to” and “from” fields

    • BOW, useful for classification, but not as useful as constructive text

  • Numeric data

    • Size of message, number of replies, number of words, etc.

    • Not very useful

  • Thread information

    • Indicates how messages relate to each other

    • Not fully exploited
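The four feature groups above can be pulled out of a raw message with a few lines of code. The following is only a minimal sketch using Python's standard `email` module; the sample message, the `extract_features` helper, and the grouping of header fields into the four categories are illustrative assumptions, not the feature extraction used in the paper.

```python
from email import message_from_string

# Hypothetical sample message, made up for illustration.
RAW = """\
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Re: Budget meeting
Message-ID: <2@example.com>
In-Reply-To: <1@example.com>
References: <1@example.com>

Works for me.
See you at noon.
"""

def extract_features(raw):
    """Group the fields of one message into the four feature types."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        # Free text: usually turned into bag-of-words vectors.
        "text": {"subject": msg.get("Subject", ""), "body": body},
        # Categorical text: sender and recipient fields.
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # Numeric data: simple size measures.
        "numeric": {"num_chars": len(body), "num_words": len(body.split())},
        # Thread information: headers that link a message to earlier ones.
        "thread": {
            "in_reply_to": msg.get("In-Reply-To"),
            "references": (msg.get("References") or "").split(),
        },
    }

features = extract_features(RAW)
```

Each group can then be vectorized separately, which makes it possible to weight some fields more heavily than others.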

Email Features (Example)

Numeric data

Categorical text

From: Mark Hills <[email protected]>

Subject: Re: When is the first lecture? When will the course page be updated?

Date: Thu, 26 Aug 2004 13:41:09 -0500

Lines: 11

Message-ID: <[email protected]>

References: <[email protected]>

In-Reply-To: <[email protected]>

Joshua Blatt wrote:

> When is the first lecture? When will the course page be updated?


> Thanks


> Josh

The first lecture was today, during the normally scheduled time.


Thread information

Contextual text

Classification Method

  • Vector space model with SVM

  • Each vector weight w_i is computed using the SMART “ltc” scheme (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:

    • l: new-tf = ln(tf) + 1.0 (log-scaled term frequency)

    • t: new-wt = new-tf * log(num-docs / coll-freq-of-term) (inverse document frequency)

    • c: divide each new-wt by sqrt(sum of (new-wts squared)) (cosine normalization)
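The three “ltc” steps above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the `ltc_vectors` helper is hypothetical, documents are assumed to be pre-tokenized lists of terms, and "coll-freq-of-term" is assumed to mean the number of documents that contain the term.

```python
import math
from collections import Counter

def ltc_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(docs)
    # Assumption: "coll-freq-of-term" = number of documents containing the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        weights = {}
        for term, tf in term_freq.items():
            new_tf = math.log(tf) + 1.0                             # l
            weights[term] = new_tf * math.log(num_docs / doc_freq[term])  # t
        # c: cosine normalization so that each vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in weights.items()}
        )
    return vectors

docs = [["meeting", "budget", "budget"], ["meeting", "lunch"]]
vectors = ltc_vectors(docs)
```

A term that occurs in every document ("meeting" here) gets weight 0 because log(num-docs / coll-freq-of-term) is log(1).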

Classification Method (Cont.)

  • Sort messages in chronological order, then split them into training and test sets

  • Run SVM on term weighted vectors of

    • From

    • Subject

    • Body

    • To, CC

    • All fields

  • Linear regression on all fields seems to give the best performance
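The chronological split in the first step can be sketched as below. This is only an illustration: the `chronological_split` helper, the message dictionaries, and the 50/50 default split are assumptions, since the slides do not state the split ratio.

```python
from datetime import datetime

def chronological_split(messages, train_fraction=0.5):
    """Sort by date, then train on the earlier part and test on the later part.

    train_fraction is an assumed parameter; the slides do not give the ratio.
    """
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical messages for one user, each with a date and a folder label.
messages = [
    {"date": datetime(2001, 3, 1), "folder": "projects"},
    {"date": datetime(2001, 1, 5), "folder": "personal"},
    {"date": datetime(2001, 4, 9), "folder": "projects"},
    {"date": datetime(2001, 2, 2), "folder": "personal"},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random keeps the evaluation realistic: the classifier only sees messages that arrived before the ones it is asked to file.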

Number of Messages vs. F1

  • The number of messages does not directly correlate with accuracy

  • Question: What about the case where the user has only one folder, which makes classification trivial?

Number of Folders vs. F1

  • There is a correlation between the number of folders and the F1 score.

  • Question: Is this trivial as well?

  • Some elements in the messages are not modeled, since the SVM has more messages to train on.

Thread Information

  • 200,399 messages, 101,786 threads, 71,696 threads with only one message

  • 61.63% of the messages in the corpus belong to a multi-message thread.

  • Average thread size is 4.1 messages

  • Average number of folders per thread is 1.37 (meaning most messages in a thread stay in one folder)

  • Question: Not clear how threads are detected. How can we use this information?

More Thread

  • D. Lewis et al., Threading Electronic Mail: A Preliminary Study (1997)

  • Lewis studied finding the parent message using a BOW, TF/IDF-weighted, vector space approach on constructive text
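The idea of finding a parent by text similarity can be illustrated with a small sketch. It is not Lewis' actual method or weighting: the helpers `tfidf_vectors`, `cosine`, and `guess_parent` are hypothetical, the TF/IDF formula is a plain tf * log(N / df), and the messages are assumed to be listed in chronological order so that only earlier messages are candidate parents.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Plain tf * log(N / df) weights for each tokenized message."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def guess_parent(bodies, child_index):
    """Index of the earlier message most similar to the child, or None."""
    vectors = tfidf_vectors([body.lower().split() for body in bodies])
    best, best_score = None, 0.0
    for i in range(child_index):  # only earlier messages can be parents
        score = cosine(vectors[i], vectors[child_index])
        if score > best_score:
            best, best_score = i, score
    return best

bodies = [
    "when is the first lecture",
    "lunch menu for friday",
    "the first lecture was today",
]
parent_index = guess_parent(bodies, 2)
```

In this toy example the reply shares terms only with the first message, so that message is picked as the parent; a message with no overlapping terms gets no parent at all.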






More Thread (Cont.)

  • Lewis’ work assumes that the thread information in the message header is incomplete.

  • This may not be the case.

  • The algorithm by Jamie Zawinski, widely used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively.

    • http://www.jwz.org/doc/threading.htm

  • Questions

    • How can we leverage the thread information in email messages more effectively?

    • Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
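When the References and In-Reply-To headers are present, threads can be grouped from the headers alone. The sketch below is a heavily simplified illustration of that idea and is not Zawinski's full algorithm (it does no subject-based grouping and no pruning of empty containers); the `group_threads` helper and the message dictionaries are hypothetical.

```python
def group_threads(messages):
    """Group message IDs into threads using header links only.

    messages: list of dicts with an "id" and optional "in_reply_to" (str)
    and "references" (list of IDs, oldest first).
    Returns {root_id: [message ids in input order]}.
    """
    parent = {}
    for m in messages:
        refs = list(m.get("references", []))
        reply_to = m.get("in_reply_to")
        if reply_to and reply_to not in refs:
            refs.append(reply_to)
        # Each ID in the chain is treated as the parent of the next one.
        chain = refs + [m["id"]]
        for older, newer in zip(chain, chain[1:]):
            if older != newer:
                parent.setdefault(newer, older)

    def root(message_id):
        seen = set()  # guards against cycles in malformed headers
        while message_id in parent and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = {}
    for m in messages:
        threads.setdefault(root(m["id"]), []).append(m["id"])
    return threads

messages = [
    {"id": "<1>"},
    {"id": "<2>", "in_reply_to": "<1>", "references": ["<1>"]},
    {"id": "<3>", "references": ["<1>", "<2>"]},
    {"id": "<4>"},
]
threads = group_threads(messages)
```

Statistics such as the number of single-message threads or the average number of folders per thread could then be computed over the resulting groups.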

Conclusion

  • Pros

    • Introduces a new corpus that can be useful in evaluating classification performance on a large collection of personal mail

    • Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company

  • Cons

    • Details on performing SVM and the linear weight for various fields are missing

    • Not clear how threads are detected