BuzzTrack: Topic Detection and Tracking in Email
IUI – Intelligent User Interfaces, January 2007
Gabor Cselle, Google, firstname.lastname@example.org
Keno Albrecht, ETH Zurich, email@example.com
Roger Wattenhofer, ETH Zurich, firstname.lastname@example.org
Email Overload • Email clients were not designed to handle the volume and variety of messages users deal with today: • Large volumes of email • Task management • Personal archiving or filing • Keeping context [Whittaker and Sidner, 1996]
Search vs. Inbox Browsing • Fast full-text search is today's solution to finding past emails. • But the flat inbox view of newly incoming emails hasn’t changed. • In our work, we focus on the problem of sensibly structuring emails in the inbox.
Today's Email Clients: The Three-Pane View • No sense of context: unrelated messages are shown together • Important emails may drop off the “first screen” • “Thread-based” tree views are unsophisticated and may not pull in all relevant messages.
BuzzTrack • Email client extension for Mozilla Thunderbird for displaying email grouped by topic.
Visualizations: Conversations • Gmail (Google) • Common conversation title • One entry per email, folds out on click
Automatic Foldering • Using machine learning techniques to automatically move emails into folders upon arrival • Low accuracy rates [Bekkerman et al., 2005] • Conceptual problem: users need to manually create folders and seed them with data.
People-Centered Email Clients • ContactMap [Whittaker et al., 2004] • Bifrost [Bälter and Sidner, 2002]
Task-based Email • Example: TaskMaster [Bellotti et al., 2003] • Figure labels: thrasks, thrask contents, item contents (emails, documents, etc.)
BuzzTrack • Mozilla Thunderbird extension that automatically groups related emails into topics. • Will be distributed through the website: www.buzztrack.net • Provides a view on the user’s inbox.
What’s a Topic? • Topics are groups of emails that relate to the same idea, action, event, task, or question. • Examples: • A conversation about buying a digital camera. • Referring a candidate for a job. • All emails belonging to same newsgroup.
Clustering Process • For every new incoming email: • Diagram components: Preprocessing, Clustering, Cluster store, Label generation, BuzzTrack View in Thunderbird
Preprocessing • Tokenization (remove HTML tags, style sheets, punctuation, and numbers) • Language detection • Stemming • For topic labelling: • Identify Parts-of-speech • Remember popular original word forms
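The tokenization step above can be sketched as follows. This is a minimal illustration using only regular expressions, not BuzzTrack's actual code: the function name `tokenize` and the regexes are assumptions, and language detection, stemming, and part-of-speech tagging are left out because the slides do not say which tools were used.

```python
import re
from typing import List

# Hypothetical helper patterns; not taken from the BuzzTrack source.
STYLE_RE = re.compile(r"<style\b.*?</style\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only: drops punctuation and numbers


def tokenize(body: str) -> List[str]:
    """Strip style sheets and HTML tags, then return lowercase word tokens."""
    text = STYLE_RE.sub(" ", body)   # remove embedded style sheets first
    text = TAG_RE.sub(" ", text)     # then remove any remaining HTML tags
    return [word.lower() for word in WORD_RE.findall(text)]
```

A stemmer and a language detector would be applied to the returned token list in later steps.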
Clustering • Single-link clustering: Newly incoming emails are compared to every email in existing topics: • Similarity value > threshold: assigned to topic • Similarity value <= threshold: email starts new topic
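The threshold rule above can be sketched as a small online single-link assignment. This is an illustration under assumptions, not the authors' implementation: `assign_to_topic` is a hypothetical name, the similarity function is passed in by the caller, and picking the best-scoring topic when several exceed the threshold is an assumption the slide does not state.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def assign_to_topic(
    email: T,
    topics: List[List[T]],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> int:
    """Add `email` to the most similar topic, or start a new one.

    Single-link: the similarity between an email and a topic is the
    maximum similarity to any email already in that topic.
    Returns the index of the topic the email ended up in.
    """
    best_index, best_score = -1, float("-inf")
    for index, topic in enumerate(topics):
        score = max(similarity(email, other) for other in topic)
        if score > best_score:
            best_index, best_score = index, score
    if best_score > threshold:
        topics[best_index].append(email)
        return best_index
    topics.append([email])  # similarity <= threshold: email starts a new topic
    return len(topics) - 1


def word_overlap(a: str, b: str) -> float:
    """Toy similarity for demonstration only: Jaccard overlap of word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```

In BuzzTrack the similarity would be the combined decision score rather than the toy `word_overlap` used here.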
Features - 1 • How do we generate similarity values between emails? • Via a linear combination of several similarity features. • Examples: • Text similarity (TFIDF Value, cosine similarity metric) • People similarities (comparing sets of people in the From / To / Cc lines of email headers) • Thread membership
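Two of the features above can be sketched in a few lines. The slides do not give the exact TF-IDF weighting or the people-similarity formula, so both are assumptions here: the code uses raw term frequency times `log(N / df) + 1`, and Jaccard overlap for sets of addresses. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: c * (math.log(n / doc_freq[t]) + 1.0) for t, c in term_freq.items()}
        )
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def people_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of the From / To / Cc address sets (an assumed formula)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Both functions return values in [0, 1] for non-negative weights, which makes them convenient inputs to a weighted sum.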
Features - 2 Other features for deriving similarities: • Subject similarity • Sender domain overlaps • Sender rank and percentage • % of email from sender that is answered • Time passed since last email in topic • People and reference count for email • Known people and reference % • Cluster size • Has attachment
Decision Score • Similarities are combined into a decision score for each email / cluster pair through a linear combination of feature values: $\mathrm{dec}_{i,j} = w_a \cdot \mathrm{sim}_a(m_i, C_j) + w_b \cdot \mathrm{sim}_b(m_i, C_j) + \dots$ • We tested two sets of weights $w_x$, both trained on a development set of emails: • Empirical • Linear SVM
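The linear combination above amounts to a weighted sum over named features. The sketch below is illustrative only: the feature names and weight values are made up, and the slides do not report the trained weights.

```python
from typing import Dict


def decision_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values for one email / cluster pair.

    `features` maps a feature name to sim_x(m_i, C_j); `weights` maps the
    same names to w_x. Every feature must have a weight.
    """
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical values, for illustration only.
example_features = {"text": 0.5, "people": 1.0}
example_weights = {"text": 2.0, "people": 0.5}
example_score = decision_score(example_features, example_weights)
```

The resulting score is then compared against a threshold to decide whether the email joins the cluster.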
Evaluation • How do we evaluate clustering quality? • We follow the Topic Detection and Tracking (TDT) competitions by NIST, which are aimed at clustering news articles. • Corpus:
Clustering Tasks • The clustering task is split into subtasks: • New Topic Detection (NTD): Given a stream of emails, which ones start new topics? • Topic Tracking (TT): Given a fixed topic, which newly incoming emails belong to it? • DET curves plot miss rate vs. false alarm rate for each possible threshold on the decision scores
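The points on a DET curve can be computed by sweeping the threshold over the decision scores. The sketch below assumes the topic tracking setting, where an email is accepted when its score exceeds the threshold: a miss is an on-topic email that is rejected, and a false alarm is an off-topic email that is accepted. The function name and the data layout are assumptions.

```python
from typing import List, Sequence, Tuple


def det_points(
    scores: Sequence[float],
    is_target: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, miss rate, false alarm rate) for each threshold.

    Requires at least one target and one non-target example.
    """
    targets = [s for s, t in zip(scores, is_target) if t]
    non_targets = [s for s, t in zip(scores, is_target) if not t]
    points = []
    for threshold in thresholds:
        miss = sum(s <= threshold for s in targets) / len(targets)
        false_alarm = sum(s > threshold for s in non_targets) / len(non_targets)
        points.append((threshold, miss, false_alarm))
    return points
```

Raising the threshold trades false alarms for misses, which is the trade-off the curve makes visible.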
Results NTD • TDT New Topic Detection Task • Miss: 3% • False alarm: 30%
Results TT • TDT Topic Tracking Task • Miss: 8% • False alarm: 2%
Comparison • Comparable quality to TDT for news articles [NIST 2004] • News has less metadata; email has worse text quality. • A wide body of work exists on improving clustering performance on news, but we haven’t tapped into that yet.
BuzzTrack View • Mozilla Thunderbird plugin that provides a useful view on inbox data “for free” • Topics contain email from the last 60 days • We’re interested in current email only • Reduces initial clustering time • Each email is shown in one topic
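The 60-day window above can be sketched as a simple filter applied before clustering. The `(message_id, received)` tuple layout and the function name are assumptions, as is the choice to include an email that is exactly 60 days old.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Email = Tuple[str, datetime]  # hypothetical layout: (message id, received time)


def recent_emails(emails: List[Email], now: datetime, window_days: int = 60) -> List[Email]:
    """Keep only emails received within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    return [email for email in emails if email[1] >= cutoff]
```

Filtering first keeps the initial clustering pass small, since older emails never enter the similarity comparisons.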
BuzzTrack Panes • Topic pane: Provides additional info • Starred topics • Email pane: Topics sorted by last incoming email
Future Work • Distribute plugin to Thunderbird users • Input on possible UI improvements • Input on clustering quality • Different clustering styles • People-based • Thread-based • We hope BuzzTrack will be a valuable tool for real-world users
Questions? Contact: Gabor Cselle, email@example.com Website: www.buzztrack.net