Research Problems & Topics (Email Domain)

Research Problems & Topics (Email Domain). (CS598-CXZ Advanced Topics in IR Presentation) Jan. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Automatic Email Replying.


Presentation Transcript


  1. Research Problems & Topics (Email Domain) (CS598-CXZ Advanced Topics in IR Presentation) Jan. 27, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. Automatic Email Replying • If this problem is solved well, information service and technical support staff will benefit greatly. The data involved are emails. The task is to automatically generate replies to incoming emails. • The major challenge is how to classify or summarize incoming emails correctly. Emails from different people have different writing styles, even when they are about the same problem. • Another interesting problem is that a user may send a series of emails about different stages of a problem. Taking the same user’s previous emails into consideration may help when generating an automatic reply to the current email.

  3. Reminder assistant on personal emails / newsgroups • Nowadays, email and newsgroups are among the most efficient ways to exchange information with other people. When we get an email or news post and are too busy to look at it immediately, we generally choose to check it later. However, we often forget about it after a relatively long time. • One nice way to handle this problem is to automatically classify all emails/news posts and use a user-specified profile to decide how urgent each one is. The user is then reminded of the most urgent emails/news posts at the proper time. • Users: everyone who uses email or newsgroups. • Data involved: personal emails, newsgroups. • Functions to be developed: text classification according to a user-specified profile, event reminders.

  4. Email Routing in Customer Support Systems • I thought of this problem because once, after I sent an email to userhelp, someone emailed me back asking me to re-send the email to userhelp because the problem was not supposed to be assigned to him. I then realized that there must be an automatic router in any customer support system that assigns emails to experts who can solve the corresponding problems. • This is essentially a text categorization problem in the email domain. • Both the customers and the companies/organizations that provide the service benefit from solving this problem. • The data involved are emails from customers who have different kinds of questions. • A good email router should provide several functions: • (1) To identify from the email text the major problem the customer encountered, and assign the problem to one or several categories. • (2) To extract information the customer provided to assist the human expert in fixing the problem, e.g., an error message the customer saw on screen, an IP address, etc. The information can also include the customer’s name, phone number, customer ID, etc. • (3) If the problem is common and the solution is already available, then the solution should be automatically sent back to the customer. I believe such routing systems have already been developed in many places, but how satisfactory they are is still an open question.
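Functions (1) and (2) of the router can be illustrated with a minimal sketch. The category names, keyword lists, and the `route_email` helper below are hypothetical and chosen only for illustration; a real router would more likely use a trained text classifier than fixed keywords.

```python
import re
from collections import Counter

# Hypothetical categories and keywords (assumptions, not from the slides).
CATEGORY_KEYWORDS = {
    "network": {"ip", "connection", "wifi", "vpn", "network"},
    "account": {"password", "login", "account", "locked"},
    "printing": {"printer", "print", "toner"},
}

# Loose pattern for dotted-quad IPv4 addresses (does not validate octet ranges).
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def route_email(text):
    """Pick the category with the most keyword hits and extract IP addresses."""
    tokens = Counter(tokenize(text))
    scores = {
        category: sum(tokens[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    category = best if scores[best] > 0 else "unassigned"
    return {"category": category, "ip_addresses": IP_PATTERN.findall(text)}
```

An email with no keyword hits is left as "unassigned" so that a human can route it, rather than being forced into the first category.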

  5. Dynamic Topic Clustering (Thread Discovery/Maintenance) • Current email handlers let users manage emails by their explicit attributes, such as time, sender, title, etc. People can also manually group emails that share some properties. In the real world, a very significant need is to retrieve emails on a certain topic, or about a certain issue. For example, one may want to trace back last year's email discussion of the idea “set up a new workshop” with some potential collaborators. • It is hard, however, to retrieve those emails by time, sender, title (they may vary a lot) or even keywords (keywords are sometimes difficult to define, and one may want to get ALL of the emails instead of the most relevant ones). If we can automatically group and manage emails that share a topic, or that latently talk about the same issue, we can easily find the emails about a specific event, and yes, all of them. This could be framed as a problem of “Dynamic Topic Clustering”. • One key issue, which distinguishes this problem from common text clustering (LSI, CTM, etc.), is that we have to maintain the clusters dynamically. When a new email arrives, we need to merge it into the existing groups (modifying their structure) or, under some circumstances, generate a new group. • Users: all email users, especially those handling large volumes of email (professors, customer service, etc.) • Data: emails, specifically the content of emails. • Functions: incremental grouping, browsing by topic group, retrieval by topic.
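The dynamic maintenance step described above can be sketched as single-pass incremental clustering. This is only one possible approach, not a method named in the slides: each cluster is represented by summed term counts, a new email joins the most similar cluster if the cosine similarity reaches a threshold, and otherwise starts a new cluster. The class name and the threshold value of 0.3 are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer; the threshold is an assumed parameter."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = []  # one Counter of summed term counts per cluster
        self.members = []    # list of email ids per cluster

    def add(self, email_id, text):
        """Assign a new email to the best existing cluster or open a new one."""
        vec = bag_of_words(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # merge into the existing group
            self.members[best].append(email_id)
            return best
        self.centroids.append(vec)  # generate a new group
        self.members.append([email_id])
        return len(self.centroids) - 1
```

This sketch never splits or merges existing clusters, so it covers only the simplest part of the maintenance problem the slide raises.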

  6. Discussion Topic Extraction • Challenge: To extract the major discussion topics, a list of FAQs, and the most active and knowledgeable contributors in a newsgroup or mailing list. • Users: People who use a technical newsgroup, a company that wants to gather a list of the most commonly occurring problems with a product, technical support departments that want to construct knowledge bases for better customer support. • Data: emails from the newsgroup or mailing list • Method: One may think that clustering followed by summarization of each cluster is the solution, but newsgroups and mailing lists pose more specific challenges and offer interesting ways to exploit IR techniques.

  7. Adaptive Email Classifier/Organizer • Email has become the major part of my daily information processing. A lot of appointments and events are arranged through email communication with other people. A lot of the other data I process, e.g., writing notes, searching the Web and sharing photos, also revolves around email. However, there is still no well-integrated environment for all of these information processing needs, probably due to the file/folder-based information management systems on our local computers. Ignoring this restriction, one of the problems I often face is how to organize email. Usually, I keep useful, informative or memorable emails and organize them into folders. However, the folder hierarchy grows as more and more new things arrive by email. After a while, the folder hierarchy becomes large and I usually have to re-arrange it: e.g., add more levels to get a more organized hierarchy, merge, split or rename some folders to clearly reflect their contents, and archive some less-used folders. • The challenge is that such a folder hierarchy changes often. In addition, folders may not always correspond to topics in the email content. Some folders may just correspond to particular friends or groups. A conventional content-based classifier may not work. • User: every individual user • Data: Email + Hierarchy + External data such as the Web • Function: The classifier gives some annotation for each email (note that the emails are still processed linearly or by thread, and are not categorized automatically). The classifier must adapt to frequent changes of the hierarchy and to the different characteristics of the folders in the hierarchy.

  8. Email Prioritization • Challenge: Assign estimated importance to emails • Users: Email users • Data: Incoming emails • Description: Some people, such as secretaries or government employees, receive a lot of emails daily, and it would help them if incoming emails were assigned importance values so that important emails can be dealt with earlier. One could judge importance based on multiple factors like sender, subject, data or deeper analysis of the email content. For example, if an email contains the sentence “Could you inform Mr. Somebody I will not be able to come to the meeting this afternoon?”, then it is obvious that the email must be processed before the afternoon and is likely to be important. This is somewhat related to the paper “Integration of Email and Task Lists” in CEAS 2004
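A minimal sketch of such an importance score, combining the sender with time-sensitive cues in the text. The cue list, the weights, and the `priority_score` function are all hypothetical choices made for illustration; the slide does not specify a scoring method.

```python
import re

# Hypothetical time-sensitive cue phrases (an assumption, not from the slides).
URGENT_CUES = re.compile(
    r"\b(today|tonight|this (?:morning|afternoon|evening)|asap|urgent|deadline)\b",
    re.IGNORECASE,
)


def priority_score(sender, subject, body, important_senders=frozenset()):
    """Return a heuristic importance score; higher means more important.

    important_senders is expected to contain lowercase addresses.
    The weights 2.0, 1.0 and 0.5 are arbitrary illustrative values.
    """
    score = 0.0
    if sender.lower() in important_senders:
        score += 2.0
    text = subject + " " + body
    score += 1.0 * len(URGENT_CUES.findall(text))  # one point per urgency cue
    if "?" in body:
        score += 0.5  # a direct question probably expects a reply
    return score
```

With these weights, the example sentence from the slide sent by a known important sender scores 3.5: 2.0 for the sender, 1.0 for "this afternoon", and 0.5 for the question mark.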

  9. Spam Detection • I think email spam detection is an interesting and important problem. It affects everyone and is obviously not very well solved yet. It is also quite challenging, since it is non-trivial to predict whether an email is spam or not. Feature selection is not straightforward, due to the potentially large number of terms or term combinations, and the possible variation and evolution of patterns. Also, it is not clear which classification methods are the most effective. • Users: everyone. • Data: emails. • Functions: classification.
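As one concrete instance of the classification function, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is a common baseline for spam filtering, but the slide does not name a method, so the choice and the class below are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Two-class multinomial naive Bayes with the labels "spam" and "ham"."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labeled email to the training counts."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def predict(self, text):
        """Return the label with the higher smoothed log-probability."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Smoothed class prior.
            log_prob = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for tok in tokenize(text):
                # Add-one smoothing, with one extra slot reserved for unseen words.
                log_prob += math.log(
                    (counts[tok] + 1) / (total_words + len(vocab) + 1)
                )
            scores[label] = log_prob
        return max(scores, key=scores.get)
```

Using single words as features sidesteps the feature selection question raised on the slide; term combinations and evolving spam patterns are not handled by this sketch.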

  10. Spam Detection • Most email servers would be happy to have a function that identifies spam emails. This is related to finding portions of data that appear in lots of emails. • The users are the server administrators, and every client user benefits as well. • The data involved in the problem is the email content, which could be either ASCII or binary sequences. The key function in the problem is similarity search over large data sets. This similarity search should be able to handle approximate matching and self-join similarity matching.
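Approximate matching with a self-join can be sketched using character shingles and Jaccard similarity. This is a minimal illustration under assumed parameters (shingle length 5, threshold 0.5); the all-pairs loop is quadratic, so a large-scale system would need an indexing scheme such as MinHash with locality-sensitive hashing, which is not shown here.

```python
def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of the text."""
    normalized = " ".join(text.lower().split())  # collapse whitespace
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(emails, threshold=0.5, k=5):
    """Self-join: return index pairs (i, j), i < j, of near-duplicate emails."""
    sets = [shingles(e, k) for e in emails]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Emails that appear in many near-duplicate pairs are candidates for bulk mail; deciding that they are actually spam would still require further evidence.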

  11. Automatic Email Forwarding/Recommendation • Sometimes we forward an interesting article to friends, or receive interesting articles forwarded by others. It would be useful to build an automated email forwarding system that forwards emails based on their content and on the interests of the friends in your mailing list. This requires extracting the information in the email and classifying both the emails and the friends into groups.

  12. Diary from Email (Email Summarization) • Some people have the habit of keeping a journal (web blog, live journal, etc.). But sometimes it just repeats what was already written in email. On the other hand, email provides a lot of information about daily activities. Here we want to build a system that extracts events from emails and composes a simple diary. Conveniently, every email has a date associated with its contents. • Everyone can use this tool to automatically create a diary from their emails. • This tool will provide data extraction (entity extraction such as events and dates) and composition of the diary. • There are a couple of challenges. First, how to efficiently extract events and dates. Second, how to organize the events associated with each date.
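The second challenge, organizing events by date, can be sketched under a strong simplifying assumption: each email's subject line stands in for the extracted event, and the email's own date is used as the event date. Real event extraction from the body is not attempted here, and `build_diary` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date


def build_diary(emails: list[tuple[date, str]]) -> list[str]:
    """Group (date, subject) pairs by day and emit one diary line per day.

    Assumption: the subject line is used as a stand-in for the event text.
    """
    by_day = defaultdict(list)
    for day, subject in emails:
        by_day[day].append(subject.strip())
    # Days are emitted in chronological order; events keep arrival order.
    return [
        f"{day.isoformat()}: " + "; ".join(by_day[day])
        for day in sorted(by_day)
    ]
```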

  13. Thread Finding • Email has become a major communication medium. People often rely on email to discuss work, and may send each other many emails in the course of one discussion. For example, suppose several people collaborate on a project. First, somebody proposes a method. Then, after a series of debates, they find a way to implement the method. New problems emerge and are solved as the work goes on, and new discussions start and end. Finally the whole work is finished. An interesting task is to find the topic threads in all of these emails: what was discussed at what time? This would help people summarize the work and draw lessons from its course. • Users: people who rely on email to discuss problems • Data: a series of emails. • Functions: topic finding and summarization. • Challenges: How to set the boundary between an email thread and a topic thread? How to summarize a group of emails? How to present the result to people?

  14. Email Linkage/Thread Analysis • Description: With the growing capacity of current email systems, more and more emails can be stored. A Ph.D. student may receive more than 2,000 emails per year, while a professor may receive more than 10,000 emails per year. These emails need to be better organized for efficient information retrieval. • Content-based linkage analysis of historical emails may be an interesting topic to investigate. • With the corresponding linkage, when the user receives a new email, he can easily get the related content from the other emails. The desired techniques should include natural language processing and probabilistic graphical models.

  15. Intelligent Email Organizer • User: Any faculty member, poor TA, or business person who is very busy, is always dealing with several tasks, and receives many emails every hour! • Data involved: The emails received, and also the input data (which can include people’s schedules) • Function: This problem was motivated by a problem that I have every day. Consider a grad student who is taking 3 courses, is TAing a class with many students, and at the same time is doing research with some other faculty. He has many tasks, and receives many emails related to each of them at different times of the day. This student decides to divide the hours in his week into slots for each of his responsibilities (probably 2-3 slots per responsibility). Although he wants to be accessible at any time to all the people who need him, he wants to focus only on what is in his schedule, and not be distracted by other tasks and emails. • So, what the system does is this: it takes the user’s schedule, the areas into which the user wants his tasks to be categorized, the relation between the areas and the available slots, and some other clues. Then, after receiving emails from different people, it categorizes them and shows them to the user only at the time that the user has allocated for that kind of task. It can also extract meetings, alert the user, and do many, many (I really mean it!) interesting and time-saving tasks!

  16. Email Mining • One real-world topic is how to mine interesting knowledge from emails. For example, as a consumer, you are usually interested in the products of certain websites. Therefore, you subscribe to their mailing lists and get a lot of information, including when their products are on sale and when they have new products. • If the system can automatically detect the pattern of a website's promotions and recommend a good time for you to purchase a product, you would save money. • For this topic, the user can be any newsletter subscriber. • The data are the newsletters received from a particular website. • The challenges include: • how to distinguish emails that contain sale information from the others, • how to extract the useful information (e.g. 50% off original price or 20% off sale price), and • how to analyze the extracted information and decide whether the user should purchase or not.
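The extraction and decision steps can be sketched with a regular expression for "N% off" phrases and a simple rule. The rule used here (recommend buying only when the current best discount is at least as large as every past one) is a hypothetical illustration, as are the function names.

```python
import re

# Matches phrases such as "50% off" or "20 % off" and captures the number.
DISCOUNT = re.compile(r"(\d{1,3})\s*%\s*off", re.IGNORECASE)


def extract_discounts(text):
    """Return all discount percentages mentioned in the text, in order."""
    return [int(m) for m in DISCOUNT.findall(text)]


def should_buy(current_text, past_discounts):
    """Recommend buying if the best current discount matches or beats history.

    past_discounts is a list of percentages seen in earlier newsletters.
    """
    found = extract_discounts(current_text)
    if not found:
        return False  # no sale information in this email
    return not past_discounts or max(found) >= max(past_discounts)
```

Note that this sketch does not tell "off original price" apart from "off sale price", which the slide lists as part of the extraction challenge.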

  17. Interactive Email Clustering/Filtering • Email may be able to serve as an excellent "push" form of information retrieval. Mailing lists are an example of this. For example, I want information about IR conferences, so I am on an IR mailing list. For many users this may be a better way to get subject-specific information than an RSS feed, because all the "pushed" information (email itself is also "pushed" information) comes in through one channel, the email client. However, not all information is created equal; when a user is very busy with other things, mailing list emails may be as much a nuisance as spam. • Perhaps a semi-supervised, interactive clustering/filtering system would allow email reading to be more focused for some users. Folder-based systems of email organization may not be sufficient because emails may belong to multiple classes. An email might serve as a "receipt" of purchase and also contain a "serial number". Lost or missing emails should be avoided as much as possible, because a serial number might be worth the price of the product. Perhaps, to minimize the chance of emails being misclassified, the user should always supervise the classification of emails. To minimize the user effort, the email client might suggest categories, and the user can manually add or remove class labels for an email. • A system like this might easily employ machine learning to make classifications. Manual rules might help to classify emails sent from people in the user's address book (e.g., emails from friends or family go into the personal category).

  18. Automatic Email Organization • The number of emails each person receives per day is continuously increasing. There is a need to organize email automatically in a reasonable and tractable way. • One approach is to connect emails that are related in some way, for example because they discuss the same problem or mention the same event. In a newsgroup, when a user wants to post a message, he decides whether to start a new thread or follow an existing one. When this is done sensibly, the messages in the newsgroup are much easier to keep track of and to examine. If we could do the same for emails arriving in a mailbox, users could manage email much more efficiently. • Users: people who communicate by email frequently. • Data: existing emails and newly arriving emails. • Challenges: determining whether an email is relevant to others in a specific aspect is a nontrivial task. High accuracy and an explainable mechanism are required.

  19. Possible Email Topic Areas • Spam Detection/Email Filtering/Email Forwarding/.. • Email Processing/Reply Support • Email prioritization • Email routing (to the right people, at the right time) • Automatic/semi-automatic reply • Thread Recognition and Thread-based management functions • Email Summarization/Mining (diary, subtopics, FAQ)

  20. Possible group projects • Improve CS TSG/Newsgroups • Better routing • Summarization (FAQ extraction) • User request prediction • “Illinois Smart Email Assistant (ISEA)?” • Beyond an email reader/composer (i.e., assistant) • Integration of email handling, web access, and personal information collection (web email is insufficient) • Advanced information management capabilities (thread-based management, summarization, task support, etc) • If ISEA can do just slightly better than pine, I would use ISEA instead of pine

  21. Assignment 2 (for Email Team) • Read the CEAS 2004 proceedings and search for email-related papers on the web • Everyone identifies one or two of the most interesting papers, which you would like to present • Send me your choices by next Tuesday (Feb. 1) • Need one volunteer to present an email paper on Feb. 8 • Possible choices: • 1. Exploring SVMs and Random Forests for Spam Detection, CEAS 2004 (http://www.ceas.cc/papers-2004/174.pdf) • 2. The Enron Corpus: A New Dataset for Email Classification Research, ECML 04 (http://krakow.lti.cs.cmu.edu/yiming/Publications/klimt-ecml04.pdf)
