Explore the development and evaluation of a system for extracting email signature and reply lines using advanced learning algorithms. The project covers detection, extraction, and comparison of methodologies for improved email processing. Results and comparisons with existing techniques included.
1st Conference on Email and Anti-Spam, CEAS 2004
Learning to Extract Signature and Reply Lines from Email
Vitor R. Carvalho & William W. Cohen, Carnegie Mellon University
Idea [Figure: example email message with its reply lines and sig lines labeled]
Motivation
• Preprocessing for:
  • email information extraction (names, dates, times, etc.)
  • content-based email classifiers (“Speech Act”, topic, etc.)
• Anonymization of email corpora
• Automatic personal address management
• Email text-to-speech systems
Related work:
• Sproat, Chen & Hu: “Emu: An e-mail preprocessor for text-to-speech”; “geometrical and linguistic analysis for e-mail signature”
Our work:
• 3 tasks:
  • Sig detection (does the message have a signature?)
  • Sig line extraction (in which lines?)
  • Reply line extraction
• Compare state-of-the-art learning algorithms
• Supervised learning
Data Total: 33013 lines (3321 sig lines, 5587 reply-to lines)
Sig Detection Task • Features are extracted from the last K lines of the email message • Example: if a URL pattern is detected in each of the last 3 lines, then the message representation contains the features url1, url2 and url3
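The positional feature naming in the example above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name `last_k_features`, the `email` and `phone` patterns, the choice to skip blank lines, and the convention that position 1 is the final line are all assumptions. The slide itself names only the URL pattern.

```python
import re

# Hypothetical pattern set: only the URL pattern is mentioned on the slide.
PATTERNS = {
    "url": re.compile(r"(https?://|www\.)\S+", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"),
}


def last_k_features(message, k=10):
    """Build a set of features such as 'url1' from the last k non-blank lines.

    Assumption: position 1 is the final line, 2 the line before it, and so on.
    """
    lines = [ln for ln in message.splitlines() if ln.strip()]
    tail = lines[-k:] if k > 0 else []
    features = set()
    for pos, line in enumerate(reversed(tail), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                features.add(f"{name}{pos}")
    return features


if __name__ == "__main__":
    msg = "Hi\nSee you\n--\nJohn\nhttp://a.com\nwww.b.org\nhttps://c.net"
    print(sorted(last_k_features(msg, k=3)))
```

The resulting feature set could then be handed to any supervised message-level classifier. Which learner to use is left open here.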
Sig Detection Results
• 5-fold cross-validation on 1203 labeled messages (617 positive, 586 negative)
• Sproat et al. (1999): “SIG fields are rarely longer than ten lines”.
• Typical mistakes: signatures consisting of only an ASCII drawing, only the sender's nickname, or only a few quoted sentences.
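For readers unfamiliar with the evaluation protocol, the 5-fold split over 1203 messages can be sketched like this. It is a generic illustration of k-fold partitioning, not the authors' evaluation code. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle, which a real experiment would normally do first.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Fold sizes differ by at most one item. No shuffling is done here.
    """
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    # 1203 messages split into 5 folds
    for train, test in k_fold_indices(1203, k=5):
        print(len(train), len(test))
```

Each message appears in exactly one test fold, so every labeled message is used once for testing and four times for training.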
Signature Extraction Task • Email message represented as a sequence of lines • Each line is a set of features (sequential classification)
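The line-sequence representation, together with the neighbor-line features mentioned in the summary slide, might look like the sketch below. The specific per-line features (`blank`, `starts_with_gt`, `separator`, `url`, `email`), the `prevN_`/`nextN_` prefixes, and the function names are assumptions made for illustration. The slides do not list the actual feature set.

```python
import re


def line_features(line):
    """Return a set of hypothetical features describing a single line."""
    feats = set()
    stripped = line.strip()
    if not stripped:
        feats.add("blank")
    if stripped.startswith(">"):
        feats.add("starts_with_gt")  # common marker of quoted reply text
    if re.match(r"^[-_*=]{2,}\s*$", stripped):
        feats.add("separator")       # e.g. "--" before a signature
    if re.search(r"(https?://|www\.)\S+", line, re.IGNORECASE):
        feats.add("url")
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", line):
        feats.add("email")
    return feats


def sequence_features(message, window=1):
    """Represent a message as a list of feature sets, one per line.

    Each line also receives its neighbors' features, prefixed with
    'prevN_' or 'nextN_' for neighbors up to `window` lines away.
    """
    lines = message.splitlines()
    own = [line_features(ln) for ln in lines]
    result = []
    for i in range(len(lines)):
        feats = set(own[i])
        for offset in range(1, window + 1):
            if i - offset >= 0:
                feats |= {f"prev{offset}_{f}" for f in own[i - offset]}
            if i + offset < len(lines):
                feats |= {f"next{offset}_{f}" for f in own[i + offset]}
        result.append(feats)
    return result


if __name__ == "__main__":
    msg = "> quoted text\nThanks!\n--\nJane, jane@example.org"
    for feats in sequence_features(msg):
        print(sorted(feats))
```

Each per-line feature set would then be labeled (for example as signature, reply, or other) and passed to a sequential or per-line classifier.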
Last Lines
• Effective method to extract signature and reply lines in email messages
• Sequence-of-lines representation (+ features from neighboring lines)
• Comparison of state-of-the-art learning algorithms
• Implementation available in the Minorthird package (Cohen, 2004)