
Enron Corpus: A New Dataset for Email Classification. By Bryan Klimt and Yiming Yang, CEAS 2004. Presented by Will Lee.


Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang

CEAS 2004

Presented by Will Lee


Introduction

  • Motivation

  • Related Works

  • The Enron Corpus

  • Methods

  • Evaluation

  • Thread Information

  • Conclusion


Motivation

  • Other corpora focus on newsgroups or personal email data

  • Lack of a common data set for evaluating the performance of email classification

    • Previous research uses different personal data sets

  • Difficult to find data on actual email use within a company

    • Obviously, companies do not like to share their internal emails

    • Privacy concerns for people working for the company


Related Works

  • Other corpora

    • 20 Newsgroups

      • http://people.csail.mit.edu/people/jrennie/20Newsgroups/

  • Related Papers

    • Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)

    • I. Androutsopoulos, et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)

    • T. Payne, Learning Email Filtering Rules with Magi (Thesis 1994)


20 Newsgroups

  • Collection of approximately 20,000 newsgroup documents, spread out evenly across 20 different newsgroups

  • Sample newsgroups:

    • comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.

  • Originally used in Ken Lang’s paper “NewsWeeder: Learning to Filter Netnews” (ICML 1995)

  • Consists of newsgroup posts, so probably not very useful for research in personal information management


Enron Dataset

  • 619,446 messages (200,399 after cleaning) by 158 users

  • Average 757 messages per user

  • Shows most users do use folders to organize emails

  • Can use folder information to evaluate effectiveness for folder classification


Enron Corpus’ Characteristics

  • Number of messages per user varies from a few to 10K+ messages

  • Upper bound on the number of folders seems to correlate with log(# of messages)

  • Number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)

  • Question: how can we use this kind of information?


Email Classification Features

  • Constructive text

    • Bag-of-words (BOW) approach; the most commonly used feature

    • Some fields are more important than others

    • Stemming and stop-word removal are used, but their effectiveness is not proven

  • Categorical text

    • “to” and “from” fields

    • BOW, useful for classification, but not as useful as constructive text

  • Numeric data

    • Size of message, number of replies, number of words, etc.

    • Not very useful

  • Thread information

    • Indicates how messages relate to each other

    • Not fully exploited


Email Features (Example)

Callout labels on the slide: Numeric data, Categorical text, Thread information, Contextual text

From: Mark Hills <[email protected]>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <[email protected]>
References: <[email protected]>
In-Reply-To: <[email protected]>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark
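As a rough illustration of how the feature groups in the example message could be pulled apart, here is a minimal sketch using Python's standard `email` module. The addresses and message IDs are hypothetical placeholders (the originals are not preserved in the transcript), and the grouping into categorical, numeric, thread, and text features is an assumption based on the slide's labels, not the authors' code.

```python
from email import message_from_string

# Hypothetical stand-in for the example message; addresses and IDs are invented.
RAW = """\
From: Mark Hills <mark@example.edu>
Subject: Re: When is the first lecture?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <reply-1@example.edu>
References: <original-1@example.edu>
In-Reply-To: <original-1@example.edu>

The first lecture was today, during the normally scheduled time.
"""

def extract_features(raw):
    """Split one raw message into the four feature groups named on the slide."""
    msg = message_from_string(raw)
    return {
        # Categorical text: sender and recipient fields (None if absent).
        "categorical": {"from": msg["From"], "to": msg["To"]},
        # Numeric data: here only the Lines header, defaulting to 0.
        "numeric": {"lines": int(msg.get("Lines", 0))},
        # Thread information: headers that point at earlier messages.
        "thread": {
            "in_reply_to": msg["In-Reply-To"],
            "references": (msg["References"] or "").split(),
        },
        # Free text: subject and body.
        "text": {"subject": msg["Subject"], "body": msg.get_payload()},
    }

features = extract_features(RAW)
print(features["numeric"]["lines"])      # 11
print(features["thread"]["references"])  # ['<original-1@example.edu>']
```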


Classification Method

  • Vector space model with SVM

  • Each vector weight w_i is computed using the SMART “ltc” scheme (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:

    • l: new-tf = ln (tf) + 1.0

    • t: new-wt = new-tf * log (num-docs/coll-freq-of-term)

    • c: divide each new-wt by sqrt (sum of (new-wts squared))
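The three "ltc" steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes natural logarithms throughout and reads "coll-freq-of-term" as the number of documents containing the term; the function name `ltc_vectors` is made up for this example.

```python
import math
from collections import Counter

def ltc_vectors(docs):
    """Return one {term: weight} dict per tokenized document, using 'ltc' weights."""
    num_docs = len(docs)
    # Assumption: coll-freq-of-term = number of documents containing the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        # l: ln(tf) + 1.0, then t: multiply by log(num-docs / coll-freq-of-term).
        raw = {
            term: (math.log(count) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # c: divide each weight by the Euclidean norm of the document vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

docs = [["meeting", "budget", "budget"], ["meeting", "lunch"]]
vectors = ltc_vectors(docs)
```

In this toy input, "meeting" occurs in every document, so its idf factor is log(1) = 0 and all of the weight in each vector falls on the remaining term.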


Classification Method (Cont.)

  • Sort messages in chronological order, then split them into training and test sets

  • Run SVM on term weighted vectors of

    • From

    • Subject

    • Body

    • To, CC

    • All fields

  • Linear regression on all fields seems to have the best performance
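The chronological split described above could look like the following sketch. The split ratio is not given on the slides, so `train_fraction` is a hypothetical parameter, and the message dictionaries are invented sample data.

```python
from datetime import datetime

def chronological_split(messages, train_fraction=0.5):
    """Sort messages by date; older ones go to training, newer ones to testing."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)  # assumed ratio, not from the slides
    return ordered[:cut], ordered[cut:]

# Invented sample data: each message has a date and the folder it was filed in.
messages = [
    {"date": datetime(2001, 5, 3), "folder": "projects"},
    {"date": datetime(2000, 11, 20), "folder": "personal"},
    {"date": datetime(2001, 1, 15), "folder": "projects"},
    {"date": datetime(2001, 8, 9), "folder": "legal"},
]
train, test = chronological_split(messages)
```

Splitting by time rather than at random means the classifier is only ever tested on messages newer than anything it was trained on, which is closer to how a folder-filing assistant would actually be used.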



Number of Messages vs. F1

  • The number of messages does not directly correlate with accuracy

  • Question: What about the case where the user has only one folder, which makes classification trivial?
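To make the single-folder question concrete, here is a small sketch of a per-folder F1 computation. It uses the standard definition of F1 as the harmonic mean of precision and recall; how the paper averages F1 across folders and users is not stated on the slides, so that part is left out. The folder names and label lists are invented.

```python
def f1_score(true_folders, predicted_folders, folder):
    """F1 for one folder: harmonic mean of precision and recall."""
    pairs = list(zip(true_folders, predicted_folders))
    tp = sum(t == folder and p == folder for t, p in pairs)
    fp = sum(t != folder and p == folder for t, p in pairs)
    fn = sum(t == folder and p != folder for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A user with a single folder: any classifier that can only answer "inbox" is perfect.
single = f1_score(["inbox"] * 5, ["inbox"] * 5, "inbox")

# A user with two folders: one message from folder "a" is misfiled into "b".
mixed = f1_score(["a", "a", "b"], ["a", "b", "b"], "a")
print(single)  # 1.0
```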


Number of Folders vs. F1

  • There is a correlation between the number of folders and the F1 score.

  • Question: Is this trivial as well?

  • Some elements in the messages are not modeled, since the SVM has more messages to train on.


Thread Information

  • 200,399 messages, 101,786 threads, 71,696 threads with only one message

  • 61.63% of the messages in the corpus are in a thread.

  • Average thread size is 4.1 messages

  • Average number of folders per thread is 1.37 (meaning most messages in a thread stay in one folder)

  • Question: Not clear how threads are detected. How can we use this information?
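The thread figures above fit together if "in a thread" and "average thread size" are read as referring only to threads with more than one message. That reading is an assumption, not something the slides state; the sketch below simply checks the arithmetic under it.

```python
total_messages = 200_399
total_threads = 101_786
singleton_threads = 71_696
share_in_threads = 0.6163  # "61.63% of messages ... in a thread"

# Assumption: only threads with at least two messages count as "a thread" here.
multi_message_threads = total_threads - singleton_threads
threaded_messages = share_in_threads * total_messages
average_thread_size = threaded_messages / multi_message_threads

print(multi_message_threads)          # 30090
print(round(average_thread_size, 1))  # 4.1
```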


More Thread

  • D. Lewis, et al., Threading Electronic Mail: A Preliminary Study (1997)

  • Lewis studied finding a message’s parent using a BOW, TF-IDF-weighted vector space approach on constructive text

Slide figure (not preserved in the transcript): document weight, query weight, similarity
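A minimal sketch of the general idea follows: treat the reply as a query, weight terms with TF-IDF, and pick the earlier message with the highest cosine similarity as the likely parent. The exact weighting formulas Lewis used are not preserved in this transcript, so plain tf × log(N/df) and cosine similarity are assumed here, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Weight each term by tf * log(N / df) (an assumed, textbook TF-IDF variant)."""
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return [
        {t: count * math.log(n / doc_freq[t]) for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_likely_parent(reply_index, token_lists):
    """Return the index of the earlier message most similar to the reply."""
    vectors = tfidf_vectors(token_lists)
    earlier = range(reply_index)  # a parent must come before its reply
    return max(earlier, key=lambda i: cosine(vectors[reply_index], vectors[i]))

messages = [
    "when is the first lecture".split(),
    "budget report due friday".split(),
    "the first lecture was today".split(),
]
parent = most_likely_parent(2, messages)
print(parent)  # 0
```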


More Thread (Cont.)

  • Lewis’ work assumes that the thread information is incomplete in the message header.

  • This may not be the case.

  • The algorithm by Jamie Zawinski, used in the original Netscape 4.x (and maybe in recent Mozilla as well?), can group threaded messages effectively.

    • http://www.jwz.org/doc/threading.htm

  • Questions

    • How can we leverage the thread information in email messages more effectively?

    • Does this model extend to more recent forms of conversation, such as blogs and web forums?
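When the headers are complete, threads can be recovered from them directly. The sketch below is a heavily simplified illustration of header-based threading, not Zawinski's full algorithm: it links each message to the ID in its In-Reply-To header, falling back to the last References entry, and groups messages by the root they reach. The dictionary layout and key names are assumptions made for this example.

```python
def build_threads(messages):
    """Group messages into threads; `messages` maps Message-ID -> header dict."""
    parent = {}
    for msg_id, headers in messages.items():
        refs = headers.get("references", [])
        # Prefer In-Reply-To, else the most recent ID listed in References.
        candidate = headers.get("in_reply_to") or (refs[-1] if refs else None)
        if candidate in messages and candidate != msg_id:
            parent[msg_id] = candidate

    def root(msg_id):
        seen = set()  # guards against malformed headers that form a cycle
        while msg_id in parent and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    threads = {}
    for msg_id in messages:
        threads.setdefault(root(msg_id), []).append(msg_id)
    return threads

# Invented sample headers: <b> replies to <a>, <c> replies to <b>, <d> stands alone.
sample = {
    "<a>": {},
    "<b>": {"in_reply_to": "<a>"},
    "<c>": {"references": ["<a>", "<b>"]},
    "<d>": {},
}
threads = build_threads(sample)
print(threads)  # {'<a>': ['<a>', '<b>', '<c>'], '<d>': ['<d>']}
```

Messages whose parent is missing from the collection simply become thread roots here; the full algorithm handles that case with placeholder containers and also falls back to subject matching.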


Conclusion

  • Pros

    • Introduces a new corpus that can be useful in evaluating classification performance on a large collection of personal mail

    • Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company

  • Cons

    • Details on how the SVM was run and on the linear weights for the various fields are missing

    • Not clear how threads are detected

