1 / 59

Identity Resolution in Email Collections

Identity Resolution in Email Collections. Tamer Elsayed and Douglas W. Oard. Department of Computer Science, UMIACS, and iSchool. CLIP Colloquium, UMD, Feb 2009. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.

elda
Download Presentation

Identity Resolution in Email Collections

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard Department of Computer Science, UMIACS, and iSchool CLIP Colloquium, UMD, Feb 2009

  2. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ National Archives Real Problem Clinton White House search request Tobacco Policy 32 million emails 80,000 hired 25 persons for 6 months … 200,000

  3. Identity Resolution in Email Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Sheila WHO?

  4. 55 Sheila’s !! weisman pardo glover rich jones breeden huckaby tweed mcintyre chadwick birmingham kahanek foraker tasman fisher petitt Dombo Robbins chang maynes nacey ferrarini dey macleod howard darling watson perlick advani hester kenner lewis walton whitman berggren osowski kelly jarnot kirby knudsen boehringer lutz glover wollam jortner neylon whanger nagel graves mclaughlin venville rappazzo miller swatek hollis Enron Collection Message-ID: <1494.1584620.JavaMail.evans@thyme> Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT) From: elizabeth.sager@enron.com To: sstack@reliant.com Subject: RE: Shhhh.... it's a SURPRISE ! X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER> X-To: 'SStack@reliant.com@ENRON' Hi Shari Hope all is well. Count me in for the group present. See ya next week if not earlier Rank Candidates Liza Elizabeth Sager 713-853-6349 -----Original Message----- From: SStack@reliant.com@ENRON Sent: Monday, July 30, 2001 2:24 PM To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com Cc: ntillett@reliant.com Subject: Shhhh.... it's a SURPRISE ! Please call me (713) 207-5233 Thanks! Shari

  5. Generative Model • Choose “person” c to mention p(c) • Choose appropriate “context” X to mention c p(X | c) • Choose a “mention” l p(l | X, c) GEconferencecall “sheila”

  6. Posterior Distribution 3-Step Solution (1) IdentityModeling (2) Context Reconstruction (3) Mention Resolution

  7. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work

  8. “Easy References” of Identity Message-ID: <1494.1584620.JavaMail.evans@thyme> Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT) From: elizabeth.sager@enron.com To: sstack@reliant.com Subject: RE: Shhhh.... it's a SURPRISE ! X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER> X-To: 'SStack@reliant.com@ENRON' Email Standards Hi Shari Hope all is well. Count me in for the group present. See ya next week if not earlier Liza Elizabeth Sager 713-853-6349 Email-Client Behavior User Regularities -----Original Message----- From: SStack@reliant.com@ENRON Sent: Monday, July 30, 2001 2:24 PM To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com Cc: ntillett@reliant.com Subject: Shhhh.... it's a SURPRISE ! Please call me (713) 207-5233 Thanks! Shari

  9. Representational Model sheila glover 1170 (User Name) sheila.glover@enron.com 932 (Main Headers) 216 (Signature) 19 (Salutation) 14 (Quoted Headers) sheila sheila glover sg 19 (Signature) Representational Model of Identity 77,240 “non-trivial” models

  10. identity c name type t m observed mention Computational Model of Identity

  11. Candidates Likelihood: p ( “sheila” | c) Identity Models Candidates

  12. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work

  13. Topical Context Conversational Context LocalContext Contextual Space

  14. Topical Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Date: Fri Dec 15 05:33:00 EST 2000 From: david.oxley@enron.com To: vince j kaminski <vince.kaminski@enron.com> Cc: sheila walton <sheila.walton@enron.com> Subject: Re: Grant Masson Great news. Lets get this moving along. Sheila, can you work out GE letter? Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone. GE Sheila call sheila.walton@enron.com GE Sheila call

  15. Topical Context Social Context Conversational Context LocalContext Contextual Space

  16. Social Context Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. kay.mann@enron.com Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST) From: rebecca.walker@enron.com To: kay.mann@enron.com Subject: ESA Option Execution Kay Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland. Thanks, Rebecca kay.mann@enron.com Sheila Tweed

  17. Formally • A context of an email is a probability distribution over emails • Probability estimated based on type of context • Contextual Space is a linear combination of 4 contexts

  18. local conversational Context Expansion people topical social time content Temporal similarity affects social and topical similarity

  19. Temporal Similarity • Decay over time • Gaussian and Linear functions • Time difference / Rank

  20. Social Similarity • Two sets of participants (email adresses) • Binary, Overlap, Jacaard, Both

  21. Temporal Effect Temporal Sim Pure Social Sim Normalize Social Context Social Sim

  22. email reply / forward Topical Similarity • Standard IR Similarity: BM25 • Email as a DOCUMENT? • Subject • Body (+Subject) • Root of thread • Concatenated path to root • Combined similarly with temporal similarity

  23. Topical Context Social Context Conversational Context LocalContext Contextual Space (emails)

  24. Contextual Space (mentions) “Sheila Tweed” “jsheila@enron.com” social social “Sheila Walton” “Sheila” topical topical “sheila” social “Sheila” topical conversational “sg”

  25. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion and Future Work

  26. Mention Resolution 1 Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Suzanne Adams <suzanne.adams@enron.com> Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. ? Sheila 2 3 Likelihood: p ( “sheila” | c) Candidates Goal: estimate p(c|m, X(m)) and rank accordingly

  27. [1] Context-Free Resolution Context-FreeResolution “Sheila Tweed” “jsheila@enron.com” social social “Sheila Walton” “Sheila” “Sheila” topical topical “sheila” social “Sheila” topical conversational “sg” X

  28. “Sheila Tweed” “jsheila@enron.com” “Sheila Walton” “sheila” “Sheila” “sg” [2] Contextual Resolution Context-FreeResolution social social “Sheila” topical social

  29. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion

  30. Test Collections Enron-all Enron-subset Sager Shapiro

  31. Evaluation Measures Commonly used in “known-item” retrieval • Success @1 (i.e., Precision @1) • One-best • MRR (Mean Reciprocal Rank) • Inverse of the harmonic mean of the ranks of true answer ri

  32. Comparison w/Literature ContextExpansion Lit.Best ContextExpansion Lit.Best Earlier expansion approach, reported in ACL 2008 Improved expansion 0.92 0.87

  33. Limitations • Resolving single mentions • All mention-queries are sampled from Enron to Enron emails • All mention-queries refer to Enron Employee • Small for train/test split Scalable Implementation for Resolving All Mentions New Test Collection

  34. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion

  35. Scalable Implementation Two Bottlenecks: • Context expansion of ALL emails For each email: ranked list of “Similar” emails • Resolution of ALL mentions Resolution of one mention depends on resolution of all other mentions in context

  36. Context Expansion of ALL Emails • Goal: For each email: ranked list of “Similar” emails • Need for BOTH social and topical contexts • Efficient implementation Abstract Problem:Computing Pairwise Similarity

  37. Trivial Solution • load each vector o(N) times • load each term o(dft2) times Goal scalable and efficient solutionfor large collections

  38. Better Solution Each term contributes only if appears in • Load weights for each term once • Each term contributes o(dft2) partial scores

  39. MapReduce Framework (a) Map (b) Shuffle (c) Reduce (k1, v1) [k2, v2] Shuffling group values by: [keys] [(k3, v3)] map (k2, [v2]) input reduce output map input reduce output map input reduce output map input handles low-level detailstransparently

  40. Decomposition Each term contributes only if appears in • Load weights for each term once • Each term contributes o(dft2) partial scores reduce map

  41. contextsim model temporal sim model topical : body/root/path ~~~~~~~~~~~-- ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ contextgraph doc rep. social : participants time window df-cut rankcut-off Expansion Using MapReduce • Using generic pairwise-similarity for both topical and social expansion

  42. “Sheila Tweed” “jsheila@enron.com” “Sheila Walton” “sheila” “Sheila” “sg” Context Mention-Graph map Context-FreeResolution map social social reduce “Sheila” topical topical map social topical map conversational

  43. Preprocessing Resolution System Using MapReduce Threads Emails Identity Models Packing Dict. Conv.Expansion LocalExpansion Topical Expansion Social Expansion Mention Recognition and Prior Computation Expansion Conv. Graph Local Graph Topical Graph Social Graph Prior Prior Prior Conv.Resolution LocalResolution Topical Resolution Social Resolution Prior Resolution Resolution Merging Context Resolutions Posterior Resolution

  44. Outline • Introduction and Approach Overview • Computational Model of Identity • Context Reconstruction • Mention Resolution • Evaluation on Existing Collections • Scalable MapReduce Implementation • New Test Collection • Conclusion

  45. New Test Collection • Random Sample from CMU-Enron collection • “Annotation + Search” interface available • Total annotators: 3 • Annotation time: ~50 hours. • Not only resolutions • Time, difficulty, confidence, evidence, and comments • Total mention-queries : 584 • 80% resolvable, 82% of them to Enron domain • Overall inter-annotator agreement: ~81 %

  46. Mention-Query Selection

  47. Distribution of Names Based on Resolution

  48. Distribution Based on Difficulty

  49. Distribution Based on Confidence

  50. Distribution of Time Spent

More Related