
Contextual Search and Name Disambiguation in Email Using Graphs

Contextual Search and Name Disambiguation in Email Using Graphs. Einat Minkov, William W. Cohen, Andrew Y. Ng. Carnegie Mellon University and Stanford University. SIGIR 2006. INTRODUCTION. Beyond textual features, other information can be used when computing document similarity.


Presentation Transcript


  1. Contextual Search and Name Disambiguation in Email Using Graphs Einat Minkov, William W. Cohen, Andrew Y. Ng Carnegie Mellon University and Stanford University SIGIR 2006

  2. INTRODUCTION • Beyond textual features, other information can be used when computing document similarity • Ex. hyperlinks on the web, meta-data, and header information in e-mail • In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk

  3. INTRODUCTION • In a lazy graph walk, there is a fixed probability of halting the walk at each step • Two problems • Disambiguating personal names in email • Email threading

  4. EMAIL AS A GRAPH

  5. EMAIL AS A GRAPH • “Einat Minkov <einat@cs.cmu.edu>” • person node “Einat Minkov” • email-address node “einat@cs.cmu.edu” • other rules

  6. Edge weights • To walk away from a node x, one first picks an edge label l • We assume that the probability of picking the label l depends only on the type T(x) of the node x

  7. Edge weights • Once l is picked, y is chosen uniformly from the set of all y that can be reached from x over an edge labeled l
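The two-stage choice on the Edge weights slides (pick a label by node type, then pick a target uniformly) can be sketched in Python. This is a minimal sketch, not the authors' implementation: the toy graph, the label names (`sent-from`, `has-term`) and the label weights are all hypothetical, and renormalising over the labels that actually leave x is an assumption.

```python
from collections import defaultdict

# Hypothetical toy graph: (source, edge label, target) triples.
EDGES = [
    ("msg1", "sent-from", "einat"),
    ("msg1", "has-term", "graph"),
    ("msg1", "has-term", "walk"),
    ("msg2", "has-term", "graph"),
]
NODE_TYPE = {"msg1": "file", "msg2": "file", "einat": "person",
             "graph": "term", "walk": "term"}
# Assumed Pr(label | node type); the values are made up for illustration.
LABEL_WEIGHTS = {"file": {"sent-from": 0.5, "has-term": 0.5}}


def transition_probs(x):
    """Return {y: Pr(x -> y)} for a single step away from node x."""
    by_label = defaultdict(list)
    for src, label, dst in EDGES:
        if src == x:
            by_label[label].append(dst)
    weights = LABEL_WEIGHTS.get(NODE_TYPE[x], {})
    # Assumption: renormalise over the labels that actually leave x.
    total = sum(w for label, w in weights.items() if label in by_label)
    probs = defaultdict(float)
    for label, targets in by_label.items():
        if label not in weights:
            continue
        p_label = weights[label] / total
        for y in targets:
            # Uniform choice among all targets reachable over this label.
            probs[y] += p_label / len(targets)
    return dict(probs)


msg1_probs = transition_probs("msg1")
msg2_probs = transition_probs("msg2")
```

In this toy graph `msg1` splits its mass evenly between the two labels, so the single `sent-from` target receives half, and the two `has-term` targets share the other half.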

  8. Graph walks • In a lazy graph walk, there is some probability γ of staying at x • If V0 is some initial probability distribution over nodes, then the distribution after a k-step walk is proportional to

  9. Graph walks • In our framework, a query is an initial distribution Vq over nodes, plus a desired output type Tout • Ex. for the query “economic impact of recycling tires”, an appropriate Vq would be a distribution over the query terms, with Tout = file
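A query as described on the Graph walks slides (an initial distribution Vq plus an output type Tout) can be run through a lazy walk as in the Python sketch below. The update rule used here, in which a node keeps a fraction `gamma` of its mass and passes on the rest, is an assumption, since the transcript does not preserve the slide's formula. The node names, the transition probabilities and the handling of dead-end nodes are also hypothetical.

```python
def lazy_walk(v0, step_probs, gamma, k):
    """Propagate the distribution v0 for k lazy steps.

    step_probs[x] maps each neighbour y to Pr(x -> y) for a move away
    from x; with probability gamma the walker stays at x instead.
    """
    v = dict(v0)
    for _ in range(k):
        nxt = {}
        for x, mass in v.items():
            out = step_probs.get(x, {})
            # Assumption: a node with no outgoing edges keeps all its mass.
            stay = gamma if out else 1.0
            nxt[x] = nxt.get(x, 0.0) + stay * mass
            for y, p in out.items():
                nxt[y] = nxt.get(y, 0.0) + (1.0 - gamma) * p * mass
        v = nxt
    return v


def rank_output(v, node_type, t_out):
    """Keep only nodes of type t_out, sorted by decreasing probability."""
    hits = [(n, p) for n, p in v.items() if node_type[n] == t_out]
    return sorted(hits, key=lambda pair: -pair[1])


# Hypothetical graph: two query terms pointing at the files containing them.
STEP = {"tires": {"f1": 0.5, "f2": 0.5}, "recycling": {"f1": 1.0}}
TYPES = {"tires": "term", "recycling": "term", "f1": "file", "f2": "file"}
VQ = {"tires": 0.5, "recycling": 0.5}   # uniform Vq over the query terms

after_one = lazy_walk(VQ, STEP, gamma=0.5, k=1)
ranking = rank_output(after_one, TYPES, "file")
```

Filtering by type after the walk means term nodes can still carry probability mass during the walk without appearing in the ranked answer.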

  10. Relation to TF-IDF • Suppose we restrict ourselves to only two types, terms and files, and allow only in-file edges • A common term such as “the” will spread its probability mass in small fractions over many file nodes • An unusual term such as “aardvark” will spread its weight over only a few files • The effect will be similar to using an IDF weighting scheme
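The IDF-like effect can be checked with a small one-step walk in Python. This is a minimal sketch: the four-file corpus is made up for illustration, and the query distribution is assumed to be uniform over the query terms.

```python
def one_step_scores(query_terms, files):
    """Walk one step from the query term nodes to the file nodes."""
    scores = {name: 0.0 for name in files}
    start = 1.0 / len(query_terms)          # uniform Vq over query terms
    for term in query_terms:
        containing = [n for n, words in files.items() if term in words]
        for name in containing:
            # Each term splits its mass evenly over the files containing it.
            scores[name] += start / len(containing)
    return scores


# Hypothetical corpus: "the" is everywhere, "aardvark" is in one file only.
FILES = {
    "f1": {"the", "aardvark"},
    "f2": {"the", "cat"},
    "f3": {"the", "dog"},
    "f4": {"the", "bird"},
}
scores = one_step_scores(["the", "aardvark"], FILES)
```

Here `f1` ends up with 0.625 of the mass and every other file with 0.125: the rare term contributes 0.5 to its single file, while the common term contributes only 0.125 to each of four files.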

  11. LEARNING • Previous researchers have described schemes for adjusting the parameters using gradient descent-like methods • In this paper, we suggest an alternative approach of learning to re-order an initial ranking

  12. LEARNING • The reranking algorithm is provided with a training set containing n examples • Example i includes a ranked list of l_i nodes • Let w_ij be the j-th node for example i • A candidate node w_ij is represented through m features, which are computed by m feature functions f_1, ..., f_m

  13. LEARNING • The ranking function for node x is defined as: • where L(x) = log(p(x)) and ᾱ is a vector of real-valued parameters • Training minimizes the following loss function on the training data

  14. Corpora • Cspace corpus • contains email messages collected from a management course conducted at Carnegie Mellon University in 1997 • The Enron corpus • a collection of mail from the Enron corpus that has been made available for the research community

  15. Person Name Disambiguation • “Andrew” = “Andrew Y. Ng” or “Andrew McCallum”? • For the Cspace corpus, we collected 106 cases in which single-token names were mentioned in the body of a message but did not match any name from the header

  16. Person Name Disambiguation • For Enron, two datasets were generated automatically: we eliminate the collected person name from the email header • The names in this corpus include people that are in the email header, but cannot be matched because

  17. Results for person name disambiguation • Baseline method • The similarity score between the name term and a person name is calculated as the maximal Jaro similarity score between the term and any single token of the personal name (ranging from 0 to 1) • In addition, we incorporate a nickname dictionary, such that if the name term is a known nickname of the person name, the similarity score of that pair is set to 1
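The baseline scoring above can be sketched in Python. This is an illustration only: the Jaro function follows the standard textbook definition, while the nickname dictionary entry, the lower-casing and the whitespace tokenisation are assumptions that the slides do not specify.

```python
def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transposed.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


# Hypothetical nickname dictionary: (nickname, full first name) pairs.
NICKNAMES = {("dave", "david")}


def name_score(term, person_name):
    """Max Jaro score over the name's tokens, or 1 for a known nickname."""
    term = term.lower()
    tokens = person_name.lower().split()
    if any((term, tok) in NICKNAMES for tok in tokens):
        return 1.0
    return max((jaro(term, tok) for tok in tokens), default=0.0)


score_exact = name_score("Andrew", "Andrew Y. Ng")
score_nick = name_score("Dave", "David Smith")
```

Note that a string-similarity baseline of this kind gives “Andrew” the same score against both “Andrew Y. Ng” and “Andrew McCallum”, which is the ambiguity the graph walk is meant to resolve.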

  18. Results for person name disambiguation • Graph walk methods • Two choices of Vq are tried • A query distribution on the name term only • Equal weight to the name term node and the file in which it appears • Tout = person type • We use a uniform weighting of labels

  19. Reranking the output of a walk • Edge unigram features • For each edge label L, whether L was used in reaching x from Vq • Edge bigram features • Whether L1 and L2 were used (in that order) in reaching x from Vq • Top edge bigram features • Paths leading to a node originate from one or two nodes in Vq
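The edge unigram and edge bigram features can be illustrated with a short Python sketch. It assumes that the label sequence of every path from Vq to the node x is available, and that “in that order” means two labels on consecutive edges of a path. The label names are hypothetical, and the top edge bigram features are left out because the slide does not define them precisely enough.

```python
def edge_features(paths):
    """Collect edge unigram and bigram features for one candidate node.

    paths: list of label sequences, each one route from Vq to the node.
    """
    unigrams = set()
    bigrams = set()
    for labels in paths:
        unigrams.update(labels)
        # Assumption: a bigram is a pair of labels on consecutive edges.
        bigrams.update(zip(labels, labels[1:]))
    return unigrams, bigrams


# Hypothetical label sequences for two paths reaching the same node.
PATHS = [
    ["in-file", "sent-from"],
    ["in-file", "sent-to", "alias"],
]
unigrams, bigrams = edge_features(PATHS)
```

Each element of `unigrams` or `bigrams` corresponds to one binary feature that is switched on for the candidate node.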

  20. Person name disambiguation results: Recall at rank k

  21. Person name disambiguation results: Recall at rank k

  22. Person Name Disambiguation Results

  23. Threading • A thread is a conversation among 2 or more people carried out by exchange of messages • Threading problem • Retrieving other messages in an email thread given a single message from the thread • Given an email file as a query, produce a ranked list of related email files, where the immediate parent and child of the given file are considered to be “correct” answers.

  24. Threading • Several information types are available • Header - sender, recipients and date • Body - the textual content of an email • Reply lines - quoted lines from previous messages • Subject - the content of the subject line

  25. Threading • Baseline method • TF-IDF term weighting + cosine similarity • Graph walk methods • Vq assigns probability 1 to the file node corresponding to the original message, Tout = file
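The TF-IDF plus cosine baseline can be sketched in plain Python. The exact weighting variant is not given on the slide, so raw term frequency times log(N / df) is an assumption here, as are the three toy messages and the whitespace tokenisation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {name: list of tokens}. Returns {name: {term: weight}}."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # Assumed variant: raw term frequency times log(N / df).
        vectors[name] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_related(query_name, docs):
    """Rank all other messages by similarity to the query message."""
    vecs = tfidf_vectors(docs)
    others = [n for n in docs if n != query_name]
    return sorted(others, key=lambda n: cosine(vecs[query_name], vecs[n]),
                  reverse=True)


# Hypothetical messages: m2 replies to m1, m3 is unrelated.
DOCS = {
    "m1": "meeting moved to friday".split(),
    "m2": "re meeting moved to friday works".split(),
    "m3": "lunch menu for monday".split(),
}
ranking = rank_related("m1", DOCS)
```

In this toy corpus the reply `m2` shares four terms with `m1` and ranks first, while `m3` shares none and gets a similarity of zero.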

  26. Graph walk methods • weight-tuning method • we evaluate 10 randomly-chosen sets of weights and pick the one that performs best (in terms of MAP) on the CSpace training data • Reranking the output of walks • The features applied are edge unigram, edge bigram and top edge bigram
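The weight-tuning step (try a handful of random weight sets and keep the one with the best MAP) can be sketched in Python as follows. The ranking function, the label names and the evidence table are hypothetical stand-ins for a real graph walk, and normalising the random weights to sum to 1 is an assumption.

```python
import random


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of answers."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rank_fn, weights, queries):
    """queries: list of (query, set of relevant items) pairs."""
    aps = [average_precision(rank_fn(weights, q), rel) for q, rel in queries]
    return sum(aps) / len(aps)


def tune_weights(rank_fn, labels, queries, trials=10, seed=0):
    """Evaluate `trials` random weight sets and return the best by MAP."""
    rng = random.Random(seed)
    best_weights, best_map = None, -1.0
    for _ in range(trials):
        raw = {label: rng.random() for label in labels}
        norm = sum(raw.values())
        weights = {label: w / norm for label, w in raw.items()}
        score = mean_average_precision(rank_fn, weights, queries)
        if score > best_map:
            best_weights, best_map = weights, score
    return best_weights, best_map


# Hypothetical ranker: each candidate has per-label evidence scores.
EVIDENCE = {"a": {"subject": 1.0, "body": 0.0},
            "b": {"subject": 0.0, "body": 1.0}}


def toy_rank(weights, query):
    def score(c):
        return sum(weights[l] * e for l, e in EVIDENCE[c].items())
    return sorted(EVIDENCE, key=score, reverse=True)


best_weights, best_map = tune_weights(
    toy_rank, ["subject", "body"], [("q1", {"a"})])
```

Random search of this kind needs no gradients, only the ability to rerun the ranking under each candidate weight set.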

  27. Threading Results: MAP

  28. Threading results: Recall at rank k

  29. CONCLUSION • We have presented a scheme for representing a corpus of email messages with a graph of typed entities • This scheme provides good performance on two representative email-related tasks: disambiguating person names, and email threading.
