
An Experimental Framework for Email Categorization and Management

An Experimental Framework for Email Categorization and Management. Kenrick Mock kenrick@uaa.alaska.edu. Project Overview. Motivation: Email Overload Potential solution: Automatic categorization and management techniques

Presentation Transcript


  1. An Experimental Framework for Email Categorization and Management Kenrick Mock kenrick@uaa.alaska.edu

  2. Project Overview • Motivation: Email Overload • Potential solution: Automatic categorization and management techniques • Problem: The potential solution is very experimental. Email use and user interaction are difficult to model, requiring a prototype that users can try on actual email • The purpose of this work is to present a Microsoft Outlook 2000™ add-in that: • Can be used as a first step toward more experimental research into automatic email management techniques • Helps manage the inbox via classification and relevancy-based search

  3. What’s the Problem with Email? • Too much • 6/26/2001 USA Today • “Workers polled this year by market researcher Gartner spent an average of 49 minutes a day on e-mail, 30% to 35% more time than they did a year ago. Ferris Research estimates management-level workers will spend four hours a day on e-mail by 2002.”

  4. Solutions? • Educate users • Don’t send so much mail, don’t subscribe to lists • Use technology in some way • Current efforts are toward some type of classification system that learns • [Diagram] Training: the system learns what email belongs to “Conferences” from the folder “Conferences” with emails regarding conferences • [Diagram] Classification: a new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”

  5. This Project • An architecture for exploring automatic email management techniques • Built on Outlook 2000 • Primary code in Visual Basic • Produces DLL add-in for Outlook • Visual C++ DLL component • Hashes strings to longs (logical operators not available in VB) • Referenced from VB • Not tested with Outlook 2002!

  6. Architectural Overview • [Diagram] Components shown: • Outlook: Outlook Object Model, Events • VB Add-In DLL • Outlook / Class Interface Glue • Message Class: AddTerms(), Display(), Get Vals, CompareMsg() • Folder Class: AddMsg(), GetMessages via Dictionary, CompareMsg() • C++ Helper DLL (Hash Strings)

  7. Add-In Interface : Messages • Message Class • Mail folders scanned on startup, class instance created for each mail item (except Trash, Sent Items). • Message text is tokenized and stoplisted using • Sender • Recipients • Subject • Text Body (possible to use more fields if desired) • Text tokens are hashed to 32-bit longs to save space and greatly speed up token comparison • Hash function by Bob Jenkins • 2 collisions on 87111 dictionary words • 10x faster to compare longs vs. strings via strcmp on Pentium II • CompareMsg function computes similarity between two email messages
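
The tokenize, stoplist, and hash steps on this slide can be sketched as follows. This is only an illustration, not the add-in's code: the slide does not say which of Bob Jenkins' hash functions the C++ helper uses, so the sketch assumes his one-at-a-time hash, and the stoplist and the names `STOPLIST`, `one_at_a_time`, and `hash_tokens` are hypothetical.

```python
import re

# Hypothetical stoplist; the slides do not give the add-in's actual list.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash (assumed here, not confirmed by the slides)."""
    h = 0
    for byte in key:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def hash_tokens(text: str) -> list:
    """Lowercase, split into alphanumeric tokens, drop stopwords, hash the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [one_at_a_time(t.encode("utf-8")) for t in tokens if t not in STOPLIST]


hashed = hash_tokens("Call for papers: the SIGIR conference")
```

Once tokens are 32-bit integers, comparing two tokens is a single integer comparison rather than a character-by-character string comparison, which is the speedup the slide refers to.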

  8. Add-In Interface : Folders • Folder Class • User-created mail folders are scanned on startup and a folder instance created for each mail folder (except Trash, Sent Items). • Messages that the user has placed in each folder are added to the folder’s classifier for training • CompareMsg function computes similarity between a new message and the classifier for the folder • i.e. can use to classify a new message into folders

  9. Classifier Implementation • CompareMsg • It is the goal of this project to experiment with different classifiers and algorithms as the implementation of CompareMsg to find out what works and what doesn’t • A simple classification scheme is implemented for now • Nearest Neighbor, common terms & frequencies • Other schemes that have been examined in the past: • TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM • What should the classifier do when new email arrives? • Some options • Move new email directly to classified folder • Annotate email with a category tag
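
A nearest-neighbor scheme based on common terms and frequencies might look roughly like the sketch below. The slides do not define the actual scoring used by CompareMsg, so the overlap score here (the sum, over shared terms, of the smaller of the two frequencies) is an assumption, and `compare_msg` and `classify` are hypothetical names.

```python
from collections import Counter


def compare_msg(a: Counter, b: Counter) -> int:
    """Assumed score: for each shared term, count the smaller of the two frequencies."""
    return sum((a & b).values())


def classify(new_msg: Counter, folders: dict):
    """Return the folder holding the single most similar message, or None if nothing overlaps."""
    best_folder, best_score = None, 0
    for name, messages in folders.items():
        for msg in messages:
            score = compare_msg(new_msg, msg)
            if score > best_score:
                best_folder, best_score = name, score
    return best_folder


folders = {
    "Conferences": [Counter("sigir conference paper deadline".split())],
    "Trash": [Counter("free psychic reading call now".split())],
}
new_msg = Counter("sigir paper deadline extended".split())
category = classify(new_msg, folders)
```

Because the decision rests on the closest individual message rather than on folder-wide statistics, no negative examples are needed, and the returned folder name can be applied as a category tag instead of moving the message.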

  10. Classifier Usage Challenges • In previous work, we built a proprietary rule induction and tf-idf classifier into Outlook and GroupWise that classified messages into categories. It was tested on managers and developers. • Problems we encountered were usage-driven: • The need for constant re-training to keep up with dynamically changing categories. • Classification errors are puzzling and instill distrust in users. • Insufficient data may be available as training examples. • It is difficult for a user to examine or manually edit a classifier.

  11. Challenge 1: Categories Change • Common for Categories to change over time; “Topic Drift” as in Newsgroups • Project ends or changes direction • Conversation slowly changes topics • General discussion might turn more technical • Problems for learning algorithms • Classifiers need to be re-trained; how well can they handle it? How fast is it? • Our users were willing to wait seconds, not minutes • Most classifiers are not incremental; require re-training using all positive/negative examples, not just new ones • Often too slow for many algorithms (e.g. rule induction) • Vector-based classifiers • Fast to re-train but may have problems with threshold calculations or new vocabulary not in the vector
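
To illustrate why vector-based classifiers are fast to re-train, here is a minimal sketch of a folder centroid kept as a running mean of term frequencies. It is not taken from the earlier Outlook/GroupWise system; the `Centroid` class is hypothetical, and unlike a fixed-length vector this dictionary simply grows when new vocabulary appears.

```python
from collections import Counter


class Centroid:
    """Running mean of term-frequency vectors for one folder (hypothetical helper)."""

    def __init__(self):
        self.n = 0       # number of messages folded in so far
        self.mean = {}   # term -> average frequency

    def add(self, msg: Counter) -> None:
        """Fold in one new message without revisiting earlier examples."""
        self.n += 1
        for term in set(self.mean) | set(msg):
            old = self.mean.get(term, 0.0)
            self.mean[term] = old + (msg.get(term, 0) - old) / self.n


c = Centroid()
c.add(Counter(["budget", "budget"]))
c.add(Counter(["schedule", "schedule"]))
```

Each update touches only the terms in the centroid and the new message, so it does not require the full set of positive and negative examples. The same arithmetic shows why a correction can be slow to take effect: after n messages, one new example moves each coordinate by only 1/(n+1) of the difference.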

  12. Challenge 2: Classifiers Make Errors, Destroy User Trust • Users tolerate few errors • Want immediate corrections so the same error won’t happen again • Vector classifier may require several examples before centroid shifts enough to include similar message • Rule classifiers need explicit retrain • Classification errors are inevitable • Classifier may over-generalize or be too specific • Errors could “break” users’ hard work setting up a folder • In some cases it’s more work to fix errors than the savings the tool is intended to provide! • Trust is easy to lose, users abandon the system

  13. Challenge 3: Insufficient Data Available • Many classifiers require a large amount of training data, e.g. statistical-based classifiers • May not have enough email available • Users expect system to work well given only 6-12 training examples • Effort to find more examples typically too high • One solution: Bootstrap using data in existing folders • What about negative examples? Can be problematic for some classification algorithms

  14. Challenge 4: Model Editing and Understanding • Some users want to manually fix or edit the classifier • These are naïve users, not programmers! • Easy to understand, modify • Rule-based classifiers • More difficult • Vector classifiers, may have many keywords • Very difficult • Neural Network • SVM

  15. Current Implementation • Publicly available source, binaries for open development purposes • Simple nearest-neighbor classifier for Folders • Speed, easy to train and classify • May help classify user-created folders that really encompass multiple sub-folders (e.g. “work” where there are many work projects) better than classification techniques that rely on global data • Individual term frequencies of sub-folder topics will be low • But message-to-message comparison may be high • Don’t need negative examples • Tag messages with category rather than move into a folder • Hopefully not too critical when misclassifications occur

  16. Current Implementation : User Interface Upon startup of Outlook : Scan Outlook folders, create classifiers and messages View inbox grouped by category

  17. Current Interface : New Email New email automatically classified into the best-matching folder (but not moved, only grouped)

  18. Current Interface : Related Email • Interface also supports finding other email similar to the current one • Iterate through all email message class objects invoking the comparison function • Simple term-frequency comparison of both emails for now • Linear time, but not too bad • 300 of the author’s messages scanned per second on 400 MHz Pentium II
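
The linear scan for related email could be sketched like this. The scoring is again an assumed term-frequency overlap rather than the add-in's actual comparison, and `term_freqs`, `related`, and the sample mailbox are hypothetical.

```python
from collections import Counter


def term_freqs(text: str) -> Counter:
    """Whitespace tokenization only; a real system would also apply a stoplist."""
    return Counter(text.lower().split())


def related(current: Counter, mailbox: dict, top_n: int = 3) -> list:
    """Compare the current message against every stored message, best matches first."""
    scored = []
    for msg_id, tf in mailbox.items():
        score = sum((current & tf).values())  # shared terms, smaller frequency counted
        if score > 0:
            scored.append((score, msg_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [msg_id for _, msg_id in scored[:top_n]]


mailbox = {
    "m1": term_freqs("project budget review"),
    "m2": term_freqs("lunch on friday"),
    "m3": term_freqs("budget review meeting notes"),
}
similar = related(term_freqs("budget review meeting"), mailbox)
```

Every stored message is visited once per query, which is the linear-time cost the slide mentions; an inverted index from term to messages would avoid scanning mail that shares no terms, at the price of extra memory.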

  19. Current Interface: Related Email Select a message, Click on button List of similar messages displayed, click to open

  20. Comments on Personal Use • No formal user studies performed yet • But, I’ve been using it…some anecdotes: • Nearest Neighbor classifier OK, could be better • Would be useful to index trash or sent-items • If not indexed, there is no folder to classify into when junk mail arrives so it gets put somewhere else • Temporary solution: Make a “Trash” folder with examples • But indexing trash could be a lot of messages… • Is grouping of incoming email useful? • Not really needed for frequent email reading • Useful when returning from a trip and need to triage the mail • Relevant email • Useful for finding uncoupled email threads • Sent-Items would be useful to index here

  21. Lots of Work To Do • Experiment with other classifiers • Need to see relation with users on training issues, speed, etc. not just classification accuracy • Latch onto more events • Better mail detection, drag & drop events • Clean up code implementation • Support persistence, speed issues on startup scan • Implementation issues • Compatibility with Outlook 2002, VB .NET • Other forms of visualization / categorization • E.g., color, thread information, graphical techniques • Extend to other forms of Outlook data • Calendaring, Notes, Files

  22. Try It Out • Source Code & Binaries available online • http://www.math.uaa.alaska.edu/~afkjm/emailaddin/ • Only tested with Windows 2000 & Outlook 2000 • Feel free to use or modify code as you see fit • Warning: Developer docs and code cleanup still need to be done! • But I’ll be glad to answer any questions!
