
Vitor R. Carvalho & William W. Cohen Carnegie Mellon University

Learning to Extract Signature and Reply Lines from Email. 1st Conference on Email and Anti-Spam, CEAS 2004.


Presentation Transcript


  1. 1st Conference on Email and Anti-Spam, CEAS 2004. Learning to Extract Signature and Reply Lines from Email. Vitor R. Carvalho & William W. Cohen, Carnegie Mellon University

  2. Idea • Reply lines • Sig lines

  3. Motivation • Preprocessing for: • email information extraction (names, dates, times, etc.) • content-based email classifiers (“speech act”, topic, etc.) • Anonymization of email corpora • Automatic personal address management • Email text-to-speech systems

  4. Related work: • Sproat, Chen & Hu: “Emu: An e-mail preprocessor for text-to-speech”, “geometrical and linguistic analysis for e-mail signature” • Our work: • 3 tasks: • Sig detection (does the message have a signature?) • Sig line extraction (in which lines?) • Reply line extraction • Compare state-of-the-art learning algorithms • Supervised learning

  5. Data • Total: 33,013 lines (3,321 sig lines, 5,587 reply-to lines)

  6. Sig Detection Task • Each message is represented by features extracted from its last K lines • Example: if a URL pattern is detected in each of the last 3 lines, then the message representation contains the features url1, url2 and url3
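
The representation on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the slide names only the URL pattern, so the `email` and `phone` patterns, the regular expressions themselves, the default `k=10`, and the choice to number positions from the end of the message (1 = last line) are all assumptions.

```python
import re

# Hypothetical line patterns. Only "url" is named on the slide;
# "email" and "phone" are illustrative additions.
PATTERNS = {
    "url": re.compile(r"(https?://|www\.)\S+", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
}


def sig_detection_features(message, k=10):
    """Return the set of position-tagged features found in the last k lines.

    Assumption: position 1 is the last line, 2 the line before it, and so on.
    """
    last_lines = message.splitlines()[-k:]
    features = set()
    for pos, line in enumerate(reversed(last_lines), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                features.add(f"{name}{pos}")
    return features


msg = "\n".join([
    "Thanks,",
    "Jane",
    "http://example.org/~jane",
    "www.example.com",
    "blog: https://example.net",
])
print(sorted(sig_detection_features(msg, k=3)))  # ['url1', 'url2', 'url3']
```

The resulting feature set is what a message-level classifier would consume; the slides do not say which positional numbering the authors used.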

  7. Sig Detection Results • 5-fold cross-validation on 1203 labeled messages (617 positive, 586 negative) • Sproat et al. (1999): “SIG fields are rarely longer than ten lines”. • Typical mistakes: ASCII drawing only, only the nickname of the sender, or only a few quoted sentences.
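
As a rough illustration of the evaluation protocol, the sketch below splits 1203 labels (617 positive, 586 negative) into 5 folds. Stratifying by class and the fixed random seed are assumptions; the slides state only that 5-fold cross-validation was used. The function name `stratified_folds` is hypothetical.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign example indices to folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for j, i in enumerate(indices):
            folds[j % n_folds].append(i)
    return folds


labels = [1] * 617 + [0] * 586  # class counts taken from the slide
folds = stratified_folds(labels)
print([len(f) for f in folds])  # [242, 241, 240, 240, 240]

# Each fold serves once as the test set; the rest form the training set.
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    assert len(train) + len(test_fold) == len(labels)
```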

  8. Signature Extraction Task • Email message represented as a sequence of lines • Each line is a set of features (sequential classification)
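
A minimal sketch of the "sequence of lines, each a set of features" representation follows. The transcript does not include the feature list from the later slide, so every feature name and pattern here (`blank`, `sig_marker`, `quote_prefix`, `has_email`, `has_url`, `wrote_header`) is a hypothetical stand-in rather than the authors' feature set.

```python
import re

# Hypothetical per-line tests; names and regular expressions are illustrative.
LINE_TESTS = {
    "blank": lambda s: s.strip() == "",
    "sig_marker": lambda s: re.match(r"^\s*(--|__|\*\*)", s) is not None,
    "quote_prefix": lambda s: re.match(r"^\s*[>|]", s) is not None,
    "has_email": lambda s: re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", s) is not None,
    "has_url": lambda s: re.search(r"(https?://|www\.)\S+", s) is not None,
    "wrote_header": lambda s: re.search(r"\bwrote:\s*$", s) is not None,
}


def line_features(line):
    """Return the names of all tests that fire on a single line."""
    return {name for name, test in LINE_TESTS.items() if test(line)}


def message_to_sequence(message):
    """Represent a message as a list of per-line feature sets."""
    return [line_features(line) for line in message.splitlines()]


example = "Hello\n> quoted\n--\nJane j@x.org"
for feats in message_to_sequence(example):
    print(sorted(feats))
```

A line classifier or sequential learner would then label each element of the list as signature, reply, or other.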

  9. Signature Extraction Results (5-fold cross-validation)

  10. Reply Lines Extraction Results (5-fold cross-validation)

  11. Sig & Reply Extraction: Results (5-fold cross-validation)

  12. Last Lines • Effective method to extract signature and reply lines in email messages • Sequence-of-lines representation (+ features of neighboring lines) • Comparison of state-of-the-art learning algorithms • Implementation available in the Minorthird package (Cohen, 2004)
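
One plausible way to add neighbor-line features is to copy each neighbor's features into the current line's set under a position prefix. This is a sketch under assumptions: the `prev`/`next` prefix scheme, the `window` parameter, and the function name are hypothetical, and this is not the Minorthird API.

```python
def add_neighbor_features(sequence, window=1):
    """Augment each line's feature set with prefixed features of its neighbors.

    `sequence` is a list of sets of feature names, one set per line.
    """
    augmented = []
    for i, feats in enumerate(sequence):
        merged = set(feats)
        for offset in range(1, window + 1):
            if i - offset >= 0:
                merged |= {f"prev{offset}_{f}" for f in sequence[i - offset]}
            if i + offset < len(sequence):
                merged |= {f"next{offset}_{f}" for f in sequence[i + offset]}
        augmented.append(merged)
    return augmented


lines = [{"blank"}, {"sig_marker"}, {"has_email"}]
for feats in add_neighbor_features(lines):
    print(sorted(feats))
```

With this encoding a non-sequential classifier can still see local context, for example that the line above an email address was a signature marker.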

  13. Complete Set of Features for Line Extraction
