BuzzTrack Topic Detection and Tracking in Email

Presentation Transcript

  1. BuzzTrack: Topic Detection and Tracking in Email • IUI – Intelligent User Interfaces, January 2007 • Gabor Cselle (Google), Keno Albrecht (ETH Zurich), Roger Wattenhofer (ETH Zurich)

  2. Email Overload • Email clients were not designed to handle the volume and variety of messages users deal with today: • Large volumes of email • Task management • Personal archiving or filing • Keeping context [Whittaker and Sidner, 1996]

  3. Search vs. Inbox Browsing • Fast full-text search is today's solution to finding past emails. • But the flat inbox view of newly incoming emails hasn’t changed. • In our work, we focus on the problem of sensibly structuring emails in the inbox.

  4. Today's Email Clients: The Three-Pane View • No sense of context: unrelated messages are shown together. • Important emails may drop off the “first screen”. • “Thread-based” tree views are unsophisticated and may not pull in all relevant messages.

  5. BuzzTrack • Email client extension for Mozilla Thunderbird for displaying email grouped by topic.

  6. Related Work

  7. Visualizations: Conversations • Gmail (Google) • Common conversation title • One entry per email, folds out on click

  8. Automatic Foldering • Using machine learning techniques to automatically move emails into folders upon arrival • Low accuracy rates [Bekkerman et al., 2005] and conceptual problems: • Users need to manually create folders and seed them with data.

  9. People-Centered Email Clients • ContactMap [Whittaker et al., 2004] • Bifrost [Bälter and Sidner, 2002]

  10. Task-based Email • TaskMaster [Bellotti et al., 2003] • Example: TaskMaster view with thrasks, thrask contents, and item contents (emails, documents, etc.)

  11. BuzzTrack

  12. BuzzTrack • Mozilla Thunderbird extension to automatically group related emails into topics. • Will be distributed through website: • Provides a view on the user’s inbox.

  13. What’s a Topic? • Topics are groups of emails that relate to the same idea, action, event, task, or question. • Examples: • A conversation about buying a digital camera. • Referring a candidate for a job. • All emails belonging to the same newsgroup.

  14. Clustering Process • For every new incoming email, the slide's diagram shows these components: • Preprocessing • Clustering • Cluster store • Label generation • BuzzTrack View in Thunderbird

  15. Preprocessing • Tokenization (remove HTML tags, style sheets, punctuation, and numbers) • Language detection • Stemming • For topic labelling: • Identify parts of speech • Remember popular original word forms
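
The tokenization step on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions, not BuzzTrack's actual code: the function name `tokenize` and the regular expressions are hypothetical, and language detection, stemming, and part-of-speech tagging are left out because they would need external libraries.

```python
import re
from html import unescape

# Hypothetical patterns (assumptions for illustration, not from the slides).
STYLE_RE = re.compile(r"<(style|script)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only: drops digits and punctuation


def tokenize(body: str) -> list[str]:
    """Strip style sheets and HTML tags, then return lowercase letter-only tokens."""
    text = STYLE_RE.sub(" ", body)  # remove embedded style sheets and scripts
    text = TAG_RE.sub(" ", text)    # remove the remaining HTML tags
    text = unescape(text)           # decode entities such as &amp;
    return [word.lower() for word in WORD_RE.findall(text)]


print(tokenize("<p>Buy 2 cameras, today!</p>"))
```

Splitting on letters only means a contraction such as "don't" becomes two tokens; a real pipeline would probably treat that case more carefully.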

  16. Clustering • Single-link clustering: Newly incoming emails are compared to every email in existing topics: • Similarity value > threshold: assigned to topic • Similarity value <= threshold: email starts new topic
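
A minimal sketch of the incremental single-link rule described on this slide. The function `assign_to_topic` and its parameters are hypothetical names. Picking the best-scoring topic when several exceed the threshold is an assumption; the slide only states the threshold rule.

```python
def assign_to_topic(email, topics, similarity, threshold):
    """Add `email` to the most similar topic, or start a new one.

    `topics` is a list of non-empty lists of emails. Under single-link
    clustering, the score against a topic is the maximum similarity to
    any email already in it. Returns the index of the chosen topic.
    """
    best_index, best_score = None, threshold
    for index, topic in enumerate(topics):
        score = max(similarity(email, other) for other in topic)
        if score > best_score:  # strictly greater than the threshold, as on the slide
            best_index, best_score = index, score
    if best_index is None:
        topics.append([email])  # similarity <= threshold everywhere: new topic
        return len(topics) - 1
    topics[best_index].append(email)
    return best_index


def word_overlap(a, b):
    """Toy similarity for the demo: Jaccard overlap of whitespace-split words."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


topics = []
for subject in ["digital camera prices", "job candidate referral", "digital camera reviews"]:
    assign_to_topic(subject, topics, word_overlap, threshold=0.3)
print(topics)
```

In the demo, the first and third subjects share two of four distinct words, so they land in the same topic, while the second starts its own.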

  17. Features - 1 • How do we generate similarity values between emails? • Via a linear combination of several similarity features. • Examples: • Text similarity (TFIDF Value, cosine similarity metric) • People similarities (comparing sets of people in the From / To / Cc lines of email headers) • Thread membership
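
Two of the features named on this slide can be illustrated with a short sketch. The assumptions are: the smoothed IDF formula below is one common variant and not necessarily the one used in BuzzTrack, and Jaccard overlap is a stand-in for the people similarity, whose exact definition the slides do not give. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts).

    Assumed weighting: tf * (log((1 + n) / (1 + df)) + 1).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def people_similarity(a, b):
    """Jaccard overlap of two address sets (an assumed stand-in measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


docs = [["camera", "lens"], ["camera", "price"], ["job", "referral"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```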

  18. Features - 2 Other features for deriving similarities: • Subject similarity • Sender domain overlaps • Sender rank and percentage • % of email from sender that is answered • Time passed since last email in topic • People and reference count for email • Known people and reference % • Cluster size • Has attachment

  19. Decision Score • Similarities are combined into a decision score for each email / cluster pair through a linear combination of feature values: $\mathrm{dec}_{i,j} = w_a \cdot \mathrm{sim}_a(m_i, C_j) + w_b \cdot \mathrm{sim}_b(m_i, C_j) + \dots$ • We tested two sets of weights $w_x$, both trained on a development set of emails: • Empirical • Linear SVM
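
The linear combination on this slide amounts to a weighted sum, sketched below. The feature names and weight values are made up for illustration and are not the trained weights from the talk.

```python
def decision_score(similarities: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-feature similarities for one email/cluster pair.

    Raises KeyError if a feature has no weight, so a missing weight is
    noticed rather than silently treated as zero.
    """
    return sum(weights[name] * value for name, value in similarities.items())


# Hypothetical feature values and weights (illustration only).
sims = {"text": 0.5, "people": 1.0, "thread": 0.0}
weights = {"text": 0.6, "people": 0.3, "thread": 0.1}
print(decision_score(sims, weights))
```

The resulting score is what gets compared against the clustering threshold for each email / cluster pair.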

  20. Evaluation • How do we evaluate clustering quality? • Topic Detection and Tracking (TDT) competitions by NIST, aimed at clustering news articles. • Corpus:

  21. Clustering Tasks • The clustering task is split into subtasks: • New Topic Detection (NTD): Given a stream of emails, which ones start new topics? • Topic Tracking (TT): Given a fixed topic, which newly incoming emails belong to it? • DET curves plot miss rate vs. false alarm rate for each possible decision-score threshold
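
As a rough illustration of how the points on a DET curve arise, the sketch below sweeps a threshold over decision scores and records the miss and false alarm rates at each step. The function name and the toy data are hypothetical. Accepting an email when its score exceeds the threshold follows the rule on the clustering slide. A real DET plot would also draw these points on normal-deviate axes, which is omitted here.

```python
def det_points(scores, is_target):
    """Return (threshold, miss_rate, false_alarm_rate) for each distinct score.

    `scores[k]` is the decision score of trial k; `is_target[k]` says
    whether the trial truly belongs to the topic. A trial is accepted
    when its score is strictly above the threshold.
    """
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")
    points = []
    for threshold in sorted(set(scores)):
        accepted = [s > threshold for s in scores]
        misses = sum(1 for a, t in zip(accepted, is_target) if t and not a)
        false_alarms = sum(1 for a, t in zip(accepted, is_target) if a and not t)
        points.append((threshold, misses / n_target, false_alarms / n_nontarget))
    return points


# Toy data (illustration only): four trials, two of them true targets.
for point in det_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False]):
    print(point)
```

Raising the threshold trades false alarms for misses, which is why the results are reported as a miss rate / false alarm rate pair at a chosen operating point.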

  22. Results NTD • TDT New Topic Detection Task • Miss: 3% • False alarm: 30% • [Figure: DET curve; lower miss and false alarm rates are better]

  23. Results TT • TDT Topic Tracking Task • Miss: 8% • False alarm: 2% • [Figure: DET curve; lower miss and false alarm rates are better]

  24. Comparison • Comparable quality to TDT for news articles [NIST 2004] • News has less metadata; email has worse text quality. • A wide body of work exists on improving clustering performance on news; we haven’t tapped into that yet.

  25. BuzzTrack View • Mozilla Thunderbird plugin that provides a useful view on inbox data “for free” • Topics contain email from the last 60 days • We’re interested in current email only • Reduces initial clustering time • Each email is shown in one topic

  26. Demo 1: BuzzTrack

  27. BuzzTrack Panes • Topic pane: Provides additional info • Starred topics • Email pane: Topics sorted by last incoming email

  28. Future Work • Distribute plugin to Thunderbird users • Input on possible UI improvements • Input on clustering quality • Different clustering styles • People-based • Thread-based • We hope BuzzTrack will be a valuable tool for real-world users

  29. Questions? Contact: Gabor Cselle, Website: