Email Classification




Email Classification

Results for Folder Classification on Enron Dataset

Overall Goals
  • To help users manage large volumes of email.
  • …by helping them to sort their email into folders.

Immediate Goals
  • To establish a credible test corpus
  • To create baseline results for email classification
  • To analyze possible future techniques
The “Enron” Corpus
  • Previous email classification experiments have used “toy” collections.
  • Enron emails are collected from actual business users.
  • Made public through legal proceedings.
The Enron Corpus
  • 158 users
  • 200,399 emails
  • Average of 757 emails per user
Enron Data Analysis
  • Most users do use folders to classify their email.
  • Some users with many emails still have few folders.
  • Users with more emails tend to have more email in each folder.
Representation
  • From
  • To, CC
  • Subject
  • Body
  • Date/Time?
  • Thread?
  • Attachments?
  • etc…?
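As an illustration of the field-based representation listed above, here is a minimal sketch that splits a raw message into From, To/CC, Subject, Body and Date using Python's standard `email` module. The function name `extract_fields`, the dictionary keys, and the sample message are assumptions made for this example; the slides do not say how the fields were actually extracted.

```python
from email import message_from_string
from email.utils import getaddresses

def extract_fields(raw_message):
    """Split one raw RFC 822 style message into the fields named on the slide.

    Hypothetical helper: the key names are chosen for this sketch only.
    """
    msg = message_from_string(raw_message)
    # To and CC are pooled, mirroring the single "To, CC" field on the slide.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    # Only plain single-part bodies are handled; attachments are ignored here.
    body = msg.get_payload() if not msg.is_multipart() else ""
    return {
        "from": msg.get("From", ""),
        "to_cc": [addr for _name, addr in recipients],
        "subject": msg.get("Subject", ""),
        "body": body.strip(),
        "date": msg.get("Date", ""),
    }

# Made-up sample message, not taken from the Enron corpus.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: Q3 forecast\n"
    "Date: Mon, 14 May 2001 09:30:00 -0700\n"
    "\n"
    "Please review the attached numbers.\n"
)
fields = extract_fields(raw)
print(fields["to_cc"])
```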
Approaches
  • Using a bag-of-words

[Diagram: email data → “bag of words” → SVM → classification decision]
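To make the bag-of-words input concrete, the sketch below pools the text of several fields and counts lowercase tokens. Which fields were pooled, and how the text was tokenized, is not stated in the slides; the function `bag_of_words`, the field names, and the tokenizer are assumptions for illustration. The resulting counts would be the feature vector handed to a single SVM per folder.

```python
import re
from collections import Counter

def bag_of_words(fields):
    """Pool all textual fields of one email and count tokens.

    Assumption: fields is a dict with optional 'from', 'to_cc' (a list),
    'subject' and 'body' entries; the tokenizer is a simple alphanumeric split.
    """
    text = " ".join([
        fields.get("from", ""),
        " ".join(fields.get("to_cc", [])),
        fields.get("subject", ""),
        fields.get("body", ""),
    ])
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Made-up example email.
bag = bag_of_words({
    "from": "alice@example.com",
    "subject": "Budget review",
    "body": "Please review the budget.",
})
print(bag.most_common(3))
```

Note that pooling discards which field a word came from, which is the information the per-field approach on the next slide keeps.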

Approaches
  • Using separate SVMs for each section

[Diagram: email data → SVMs → LLSF → classification decision]

Approach
  • Data was split in half, chronologically.
  • A “flat” (non-hierarchical) approach was used.
  • An SVM was trained for each folder for each user for each field.
  • The SVM for each folder was trained using all of the emails for that user.
  • Combination weights were found with a regression for each folder.
  • Thresholds were chosen to maximize the F1 score, using the “SCut” method.
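Two of the steps above can be sketched in a few lines: the chronological half split and SCut-style thresholding, where each folder gets its own score cutoff chosen to maximize F1 on held-out data. This is a minimal sketch under assumptions: the function names, the toy scores, and the choice to try every observed score as a candidate cutoff are illustrative, and the slides do not say which data the thresholds were tuned on.

```python
def chronological_split(emails):
    """emails: list of (timestamp, item) pairs. Returns (older half, newer half)."""
    ordered = sorted(emails, key=lambda e: e[0])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def f1_score(predicted, actual):
    """F1 for one folder, given parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def scut_threshold(scores, labels):
    """Try each observed score as a cutoff; keep the one with the best F1."""
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        predicted = [s >= candidate for s in scores]
        f1 = f1_score(predicted, labels)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1

# Toy data: timestamps with placeholder items, then scores for one folder.
train, test = chronological_split([(3, "c"), (1, "a"), (2, "b"), (4, "d")])
scores = [0.9, 0.8, 0.4, 0.3, 0.1]          # hypothetical classifier scores
labels = [True, True, False, True, False]   # True = email belongs to the folder
threshold, f1 = scut_threshold(scores, labels)
print(threshold, round(f1, 3))
```

Splitting by time rather than at random means the classifier is always tested on mail that arrived after its training mail, which is closer to how a folder-suggestion tool would be used.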
“Enron” Results Analysis
  • Obviously some data fields are more useful than others.
  • Unsurprisingly, the “To, CC” data is the least useful.
  • Body is the most useful field, followed closely by sender.
  • Using all fields works better than using any particular field alone.
  • Linearly combining fields works better than the bag-of-words approach.
  • Because the combined scores come from SVMs, the linear weights are not directly interpretable.
Enron Results Analysis
  • F1 classification score is unrelated to the number of emails a user has.
Enron Results Analysis
  • F1 score is somewhat correlated with the number of folders a user has.
  • Emails are much harder to classify for users with many folders.
Enron Thread Analysis
  • 200,399 messages
  • 101,786 threads
  • 30,091 non-trivial threads
  • 61.63% of messages are in non-trivial threads
  • Average of 4.1 messages/thread
  • Median of 2 messages/thread
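A quick arithmetic check relates these figures to each other. The slide does not say which set of threads the 4.1 average is taken over; the calculation below suggests it matches the non-trivial threads, while the average over all threads would be close to 2. The variable names are chosen for this sketch.

```python
messages = 200_399
threads = 101_786
nontrivial_threads = 30_091
share_in_nontrivial = 0.6163  # 61.63% of messages, from the slide

# Messages that sit in non-trivial threads, and their average per thread.
messages_in_nontrivial = share_in_nontrivial * messages
avg_nontrivial = messages_in_nontrivial / nontrivial_threads

# Average over every thread, for comparison.
avg_all = messages / threads

print(round(avg_nontrivial, 1), round(avg_all, 2))
```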
Enron Thread Analysis
  • The largest threads are potentially the most useful, but they are also the least common.
  • Threads are also redundant with other kinds of evidence. Since threads are detected by subject and sender, much of the thread information is redundant. Also, emails in the same thread tend to have similar bodies.
  • The largest thread in the Enron corpus is 1124 copies of the same message…all in the “Deleted Items” folder for a particular user!
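The slides say only that threads are detected by subject and sender, without giving the procedure. As a rough sketch of the subject half of that idea, the code below strips reply and forward prefixes and groups messages whose normalized subjects match. The names `normalize_subject` and `group_threads`, the prefix list, and the sample messages are assumptions; any sender or participant check is deliberately left out.

```python
import re
from collections import defaultdict

# Leading "Re:", "Fw:" and "Fwd:" markers, possibly repeated.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Drop reply/forward prefixes and case so replies share a key."""
    return _PREFIX.sub("", subject).strip().lower()

def group_threads(messages):
    """messages: list of dicts with a 'subject' key. Groups by normalized subject only."""
    threads = defaultdict(list)
    for message in messages:
        threads[normalize_subject(message["subject"])].append(message)
    return dict(threads)

# Made-up messages for illustration.
threads = group_threads([
    {"from": "alice@example.com", "subject": "Gas contracts"},
    {"from": "bob@example.com", "subject": "RE: Gas contracts"},
    {"from": "alice@example.com", "subject": "Fwd: re: Gas contracts"},
    {"from": "carol@example.com", "subject": "Lunch?"},
])
print({subject: len(items) for subject, items in threads.items()})
```

A grouping like this would also lump identical copies of one message into a single large thread, which is the situation the last bullet describes.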