Enron Corpus: A New Dataset for Email Classification


Enron Corpus: A New Dataset for Email Classification. By Bryan Klimt and Yiming Yang CEAS 2004 Presented by Will Lee. Introduction. Motivation Related Works The Enron Corpus Methods Evaluation Thread Information Conclusion. Motivation.


Enron Corpus: A New Dataset for Email Classification

By Bryan Klimt and Yiming Yang

CEAS 2004

Presented by Will Lee

Introduction
  • Motivation
  • Related Works
  • The Enron Corpus
  • Methods
  • Evaluation
  • Thread Information
  • Conclusion
Motivation
  • Other corpora focus on newsgroups or personal email data
  • No common data set exists for evaluating the performance of email classification
    • Previous research uses different personal data sets
  • It is difficult to find data on actual email use within a company
    • Obviously, companies do not like to share their internal emails
    • Privacy concerns for people working for the company
Related Works
  • Other corpora
    • 20 Newsgroups
      • http://people.csail.mit.edu/people/jrennie/20Newsgroups/
  • Related Papers
    • Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)
    • I. Androutsopoulos, et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)
    • T. Payne, Learning Email Filtering Rules with Magi (Thesis 1994)
20 Newsgroups
  • Collection of approximately 20,000 newsgroup documents, spread out evenly across 20 different newsgroups
  • Sample newsgroups:
    • comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.
  • Originally used in Ken Lang’s paper “NewsWeeder: Learning to Filter Netnews” (ICML 1995)
  • Consists of newsgroup data, so probably not very useful for research in personal information management
Enron Dataset
  • 619,446 messages (200,399 after cleaning) by 158 users
  • Average 757 messages per user
  • Shows most users do use folders to organize emails
  • Can use folder information to evaluate effectiveness for folder classification
Enron Corpus’ Characteristics
  • Number of messages per user varies from a few messages to 10K+ messages
  • The upper bound on the number of folders seems to correlate with log(# of messages)
  • The lower bound on the number of folders does not correlate with the number of messages (a user can have many messages but only a few folders)
  • Question: how can we use this kind of information?
Email Classification Features
  • Constructive text
    • Bag-of-words (BOW) approach; the most commonly used feature
    • Some fields are more important than others
    • Stemming and stop-word removal are used, but their effectiveness is not proven
  • Categorical text
    • “to” and “from” fields
    • BOW, useful for classification, but not as useful as constructive text
  • Numeric data
    • Size of message, number of replies, number of words, etc.
    • Not very useful
  • Thread information
    • Indicates how messages relate to each other
    • Not fully exploited
Email Features (Example)

Feature types labeled on the slide: numeric data, categorical text, thread information, contextual text

From: Mark Hills <mhills@cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafa$f3o$1@dcs-news1.cs.uiuc.edu>
References: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
> Thanks
> Josh

The first lecture was today, during the normally scheduled time.
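
As an illustration of how the feature types could be pulled out of the example message, here is a minimal Python sketch using the standard library's email package. The function name extract_features, the grouping keys, and the word-count feature are assumptions made for this example; the slides do not describe how features were actually extracted.

```python
from email import message_from_string
from email.utils import parseaddr

# The example message from the slide, with headers and body rejoined.
RAW = """From: Mark Hills <mhills@cs.uiuc.edu>
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID: <cglafa$f3o$1@dcs-news1.cs.uiuc.edu>
References: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>
In-Reply-To: <cgl09c$bll$1@dcs-news1.cs.uiuc.edu>

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
> Thanks
> Josh
The first lecture was today, during the normally scheduled time.
"""


def extract_features(raw):
    """Group one message's fields by feature type (hypothetical grouping)."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    return {
        # Categorical text: address fields such as "from" (and "to" if present).
        "categorical": {"from": parseaddr(msg["From"])[1]},
        # Numeric data: sizes and counts.
        "numeric": {"lines": int(msg["Lines"]), "num_words": len(body.split())},
        # Thread information: headers that link a message to its parent.
        "thread": {
            "message_id": msg["Message-ID"],
            "in_reply_to": msg["In-Reply-To"],
            "references": (msg["References"] or "").split(),
        },
        # Free text: subject and body, typically fed to a bag-of-words model.
        "text": {"subject": msg["Subject"], "body": body},
    }
```

Calling extract_features(RAW) returns a nested dictionary with one entry per feature type, which keeps the text fields separate from the fields that are only counts or links.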

Classification Method
  • Vector space model with SVM
  • Each vector weight w_i is computed using the SMART “ltc” weighting scheme (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:
    • l: new-tf = ln (tf) + 1.0
    • t: new-wt = new-tf * log (num-docs/coll-freq-of-term)
    • c: divide each new-wt by sqrt (sum of (new-wts squared))
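
The three “ltc” steps can be sketched in a few lines of Python. This is only an illustration: it assumes natural logarithms for both steps, reads coll-freq-of-term as the number of documents that contain the term, and uses a hypothetical function name ltc_weights. None of these details are confirmed by the slides.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    docs is a list of token lists. Assumption: coll-freq-of-term is taken
    to be the number of documents that contain the term.
    """
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {}
        for term, freq in tf.items():
            new_tf = math.log(freq) + 1.0                          # l: ln(tf) + 1.0
            raw[term] = new_tf * math.log(num_docs / doc_freq[term])  # t: times idf
        # c: divide each weight by the Euclidean norm of the whole vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return weighted
```

Under this reading, a term that occurs in every document gets weight 0, and every non-zero vector ends up with unit length, so dot products between vectors are cosine similarities.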
Classification Method (Cont.)
  • Sort messages in chronological order, then split them into training and test sets
  • Run SVM on term-weighted vectors of:
    • From
    • Subject
    • Body
    • To, CC
    • All fields
  • Linear regression on all fields seems to have the best performance
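
The chronological split in the first bullet can be sketched as follows. The train_fraction parameter and the dictionary layout of a message are assumptions for illustration; the slides do not state the split ratio that was used.

```python
from email.utils import parsedate_to_datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort messages by their Date header and split them in time order.

    messages is a list of dicts with a "date" key holding an RFC 2822 date
    string. train_fraction is a hypothetical parameter.
    """
    ordered = sorted(messages, key=lambda m: parsedate_to_datetime(m["date"]))
    cut = int(len(ordered) * train_fraction)
    # Earlier messages train the classifier; later ones are held out for testing.
    return ordered[:cut], ordered[cut:]
```

Splitting by time rather than at random means the classifier is never trained on messages that arrived after the ones it is tested on.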
Number of Messages vs. F1
  • The number of messages does not directly correlate with accuracy
  • Question: What about the case where the user has only one folder, which makes classification trivial?
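
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it per folder from true and predicted folder labels; the function name f1_per_folder is hypothetical, and the slides do not say how the per-folder scores were averaged.

```python
def f1_per_folder(true_folders, predicted_folders):
    """Return {folder: F1} given parallel lists of true and predicted labels."""
    pairs = list(zip(true_folders, predicted_folders))
    scores = {}
    for folder in set(true_folders) | set(predicted_folders):
        tp = sum(1 for t, p in pairs if t == folder and p == folder)
        fp = sum(1 for t, p in pairs if t != folder and p == folder)
        fn = sum(1 for t, p in pairs if t == folder and p != folder)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[folder] = 2 * precision * recall / total if total else 0.0
    return scores
```

This also makes the one-folder case concrete: if every message belongs to a single folder, a classifier that always predicts that folder scores a perfect F1 without learning anything.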
Number of Folders vs. F1
  • There is a correlation between the number of folders and the F1 score.
  • Question: Is this trivial as well?
  • Some elements in the messages are not modeled, since the SVM has more messages to train on.
Thread Information
  • 200,399 messages, 101,786 threads, 71,696 threads with only one message
  • 61.63% of the messages in the corpus are in a thread.
  • Average thread size is 4.1 messages
  • The average number of folders per thread is 1.37 (meaning most messages in a thread stay in one folder)
  • Question: Not clear how threads are detected. How can we use this information?
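
Since the thread detection method is not described, the following is only one plausible way such statistics could be computed: group messages by following In-Reply-To links with a union-find structure, then summarize the groups. The message layout, the function name thread_stats, and the choice to average over threads with more than one message are all assumptions.

```python
from collections import defaultdict


def thread_stats(messages):
    """messages: dicts with "id", "folder", and an optional "in_reply_to"."""
    parent = {m["id"]: m["id"] for m in messages}

    def find(x):
        # Follow parent links to the thread root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for m in messages:
        ref = m.get("in_reply_to")
        if ref in parent:  # skip replies whose parent is not in the collection
            parent[find(m["id"])] = find(ref)

    threads = defaultdict(list)
    for m in messages:
        threads[find(m["id"])].append(m)

    multi = [t for t in threads.values() if len(t) > 1]
    in_multi = sum(len(t) for t in multi)
    folder_counts = [len({m["folder"] for m in t}) for t in multi]
    return {
        "threads": len(threads),
        "single_message_threads": len(threads) - len(multi),
        "pct_messages_in_multi_threads": 100.0 * in_multi / len(messages),
        "avg_multi_thread_size": in_multi / len(multi) if multi else 0.0,
        "avg_folders_per_multi_thread": sum(folder_counts) / len(multi) if multi else 0.0,
    }
```

A value of avg_folders_per_multi_thread close to 1 would indicate that messages in the same thread tend to be filed together, which is what makes thread membership a candidate feature for folder classification.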
More Thread
  • D. Lewis, et al., Threading Electronic Mail: A Preliminary Study (1997)
  • Lewis studied finding a message’s parent using a BOW, TF-IDF-weighted vector space approach on constructive text
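
The general idea of picking a parent by text similarity can be sketched as below. This is a generic TF-IDF and cosine-similarity illustration, not a reproduction of Lewis's experiments; the function names and the plain tf * ln(N/df) weighting are assumptions.

```python
import math
from collections import Counter


def tfidf(token_lists):
    """Return one {term: tf * ln(N / df)} vector per token list."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    return [
        {t: f * math.log(n / df[t]) for t, f in Counter(toks).items()}
        for toks in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likely_parent(reply_tokens, candidate_token_lists):
    """Return the index of the candidate most similar to the reply.

    Requires at least one candidate.
    """
    vectors = tfidf(list(candidate_token_lists) + [reply_tokens])
    reply_vec = vectors[-1]
    scores = [cosine(reply_vec, v) for v in vectors[:-1]]
    return max(range(len(scores)), key=scores.__getitem__)
```

A text-only approach like this matters mostly when the In-Reply-To and References headers are missing or unreliable.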






More Thread (Cont.)
  • Lewis’ work assumes that the thread information is incomplete in the message header.
  • May not be the case.
  • The algorithm by Jamie Zawinski, widely used in the original Netscape 4.x (maybe in recent Mozilla as well?), can group threaded messages effectively.
    • http://www.jwz.org/doc/threading.htm
  • Questions
    • How can we leverage the thread information in email messages more effectively?
    • Does this model extend to more recent forms of conversation, such as blogs and web forums, as well?
Conclusion
  • Pros
    • Introduces a new corpus that can be useful in evaluating classification performance on a large collection of personal mail
    • Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company
  • Cons
    • Details on performing SVM and the linear weight for various fields are missing
    • Not clear how threads are detected