
The Enron and W3C Collections

Tamer Elsayed and Douglas W. Oard. University of Maryland. ICAIL 2007, DESI Workshop, June 4th, 2007.


Presentation Transcript


  1. The Enron and W3C Collections
     Tamer Elsayed and Douglas W. Oard
     University of Maryland
     ICAIL 2007, DESI Workshop, June 4th, 2007

  2. Variants of Email Search
     (figure labels: Searcher, Collection)

  3. Rich multimodal data
     • Emails
     • Phone calls
     • Databases
     The (Extended) Enron Collection

  4. The (Extended) Enron Collection
     • “Public” version of Enron collection (CMU)
       - 150 sets of rescued Outlook email folders
       - 517,431 emails, 52% duplicates, 133,581 unique addresses
       - Subset annotated with genre, speech act, mentioned calls, …
     • Extended Enron email collection (Aspen Systems)
       - Attachments, additional email (later release, redaction)
     • Phone calls from/to Enron traders (Snohomish PUD)
       - Transcribed subset from 52 DVDs of recorded audio
       - Recovered from scanned transcripts using OCR
       - 93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...
     • Relational databases (Aspen Systems)

  5. Cross-References
     (figure labels: Phone Calls, Email)

  6. Phone Call Transcripts
     Message-ID: <24-20010126-19435570-20020114-R>
     Message-Type: PhoneCall
     Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
     From: shari.stack@enron.com
     To: greg.wolfe@enron.com
     Parties: shari.stack@enron.com, greg.wolfe@enron.com
     Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King
     Subject-TimePos: 145, 313, 713, 775, 920, 1018
     InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
     InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
     Keywords: CDWR, email, email
     Keywords-TimePos: 55, 689, 1038
     X-From: Stack, Shari <>
     X-To: Wolfe, Greg <>
     X-Parties: Stack, Shari <>, Wolfe, Greg <>
     X-AudioFile: 24-20010126-19435570-20020114-R.wav
     X-TranscriptFile: 24-20010126-19435570-20020114-R.txt

     SHARI STACK: Hey.
     GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]
     SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]
     GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?
     SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting. [both laughing]
     SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.
     GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.
     SHARI: [laughs]
     GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -
     SHARI: [laughs]
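
The transcript header on slide 6 pairs each annotated list (Subject, InCallNames, Keywords) with a parallel "-TimePos" list of the same length. The following is a minimal Python sketch of how such a header might be read, assuming one "Key: value" field per line and a blank line before the dialog; neither the file layout nor the unit of the TimePos values is stated on the slide. The function names parse_headers and pair_with_positions are hypothetical, not part of any released tooling.

```python
def parse_headers(text):
    """Collect 'Key: value' lines up to the first blank line (assumed layout)."""
    headers = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line is assumed to separate header from dialog
        key, sep, value = line.partition(": ")
        if sep:
            headers[key.strip()] = value.strip()
    return headers


def pair_with_positions(headers, field):
    """Zip the comma-separated items of `field` with its '-TimePos' list."""
    items = [s.strip() for s in headers[field].split(",")]
    positions = [int(p) for p in headers[field + "-TimePos"].split(",")]
    if len(items) != len(positions):
        raise ValueError(f"{field}: {len(items)} items, {len(positions)} positions")
    return list(zip(items, positions))


# Sample fields copied from slide 6 (line breaks are an assumption).
SAMPLE = """Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038

SHARI STACK: Hey.
"""

headers = parse_headers(SAMPLE)
keyword_positions = pair_with_positions(headers, "Keywords")
```

Splitting on commas is enough for the fields shown on the slide; a field whose items themselves contain commas would need a different delimiter rule.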

  7. Typical Enron Email
     [Message Header]
     Message-ID: <1494.1584620.JavaMail.evans@thyme>
     Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
     From: elizabeth.sager@enron.com
     To: sstack@reliant.com
     Subject: RE: Shhhh.... it's a SURPRISE !
     X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
     X-To: 'SStack@reliant.com@ENRON'

     [Message Body]
     [Salutation]
     Hi Shari
     [Main Body]
     Hope all is well. Count me in for the group present.
     See ya next week if not earlier
     [Signature Block]
     Liza
     Elizabeth Sager
     713-853-6349

     [Quoted Text]
     [Quoted Header]
     -----Original Message-----
     From: SStack@reliant.com@ENRON
     Sent: Monday, July 30, 2001 2:24 PM
     To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
     Cc: ntillett@reliant.com
     Subject: Shhhh.... it's a SURPRISE !
     [Quoted Main Body]
     Please call me (713) 207-5233
     [Quoted Signature]
     Thanks!
     Shari
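
Slide 7 labels the parts of a message: header, new body text, and quoted text. As a rough illustration only, the Python sketch below separates those three parts using the standard library's email parser and the literal "-----Original Message-----" line as the quote boundary. This is a simple heuristic assumed for the example, not the segmentation method used by the authors, and the helper name split_quoted is hypothetical. The sample message is abridged from the slide.

```python
import email

QUOTE_MARKER = "-----Original Message-----"  # boundary seen on slide 7


def split_quoted(body):
    """Return (new_text, quoted_text); quoted_text is '' if no marker is found."""
    new_text, _, quoted = body.partition(QUOTE_MARKER)
    return new_text.strip(), quoted.strip()


RAW = """\
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !

Hi Shari

Hope all is well. Count me in for the group present.

Liza

-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233

Thanks!
Shari
"""

msg = email.message_from_string(RAW)       # header fields become msg[...]
new_text, quoted_text = split_quoted(msg.get_payload())
```

Finer-grained labels such as salutation and signature block would need additional rules or a trained classifier; the marker-based split only recovers the coarse new/quoted boundary.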

  8. Research Problems (Enron)
     • Threading
     • Email Classification
     • Social Network Analysis
     • Mention Resolution

  9. Who is that “Sheila”?
     Date: Wed Dec 20 08:57:00 EST 2000
     From: Kay Mann <kay.mann@enron.com>
     To: Suzanne Adams <suzanne.adams@enron.com>
     Subject: Re: GE Conference Call has be rescheduled

     Did Sheila want Scott to participate? Looks like the call will be too late for him.

     ? Sheila

  10. Rich Evidence about Identity
      • 82,084 addr-name
      • 3,151 addr-nickname
      • 19,708 addr-addr
      • 66,715 models
      (figure: names, nicknames, and addresses linked for one person, including "susan scott", "susan m scott", "sue", "suebob", "sscott", "sscott5", sscott5@enron.com, scott.susan@enron.com, m..scott@enron.com)
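
Slide 10 counts associations between addresses and names but does not say how they were extracted. One plausible source of addr-name evidence is the display name that accompanies an address in a header field such as the From line on slide 9. The Python sketch below counts such pairs with the standard library's parseaddr; it is an assumption-laden illustration, not the authors' extraction procedure, and addr_name_pairs is a hypothetical name.

```python
from collections import Counter
from email.utils import parseaddr


def addr_name_pairs(header_values):
    """Count (address, display name) pairs found in header values."""
    counts = Counter()
    for value in header_values:
        name, addr = parseaddr(value)
        if name and addr:  # skip bare addresses that carry no display name
            counts[(addr.lower(), name.lower())] += 1
    return counts


# Header values modeled on slide 9; the bare address is from slide 10.
sample_headers = [
    "Kay Mann <kay.mann@enron.com>",
    "Suzanne Adams <suzanne.adams@enron.com>",
    "Kay Mann <kay.mann@enron.com>",
    "sscott5@enron.com",
]
pairs = addr_name_pairs(sample_headers)
```

Lowercasing both sides merges trivially different spellings; nickname and addr-addr evidence would need other sources, such as salutations, signatures, or co-occurrence, which this sketch does not attempt.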

  11. Test Collection of Mention Resolution
      (table of test collections: Enron-all, Enron-subset, Sager, Shapiro)

  12. Evaluation
      • Task
        - Named mention → ranked list of people
      • Measures
        - Mean Reciprocal Rank
        - Success @ K
        - Success @ 1
        - Confidence-based scoring
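
The two rank-based measures named on slide 12 have standard definitions: the reciprocal rank of a query is 1/r, where r is the position of the correct person in the ranked list (0 if absent), and Success @ K is the fraction of queries whose correct person appears in the top K. A minimal Python sketch follows, assuming each mention has exactly one correct answer; the example rankings are made up for illustration and are not results from the talk.

```python
def reciprocal_rank(ranked, correct):
    """Return 1/rank of `correct` in `ranked`, or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """`runs` is a list of (ranked_list, correct_answer) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct answer is in the top k."""
    return sum(1 for r, c in runs if c in r[:k]) / len(runs)


# Hypothetical system output for four mentions (illustrative data only).
runs = [
    (["a", "b", "c"], "a"),       # correct at rank 1
    (["a", "b", "c"], "b"),       # correct at rank 2
    (["a", "b"], "z"),            # correct answer not returned
    (["x", "y", "z", "w"], "w"),  # correct at rank 4
]

mrr = mean_reciprocal_rank(runs)
s_at_1 = success_at_k(runs, 1)
s_at_2 = success_at_k(runs, 2)
```

Success @ 1 is simply the special case K = 1. Confidence-based scoring is not sketched here because the slide does not define it.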

  13. Limitations (Mention Resolution)
      • Small number of queries
      • Only resolved by Enron employees
        - Much easier
        - Most participants are outsiders
      • Measures focus only on accuracy

  14. Identity-Content Interplay
      (figure labels: Social Context, Search for People, Search for Content, Topical Context)

  15. W3C Collection
      • Set of mailing lists
        - Public, not private
        - Topically oriented
        - ~175,000 emails
      • Introduced at TREC 2005
        - 50 topics (x 2 years)
        - Relevance judgments available for ad-hoc retrieval

  16. Research Problems (W3C)
      • Expert Finding
        - Topic → ranked list of experts
      • Known-item Retrieval
        - Query → ranked list of emails
      • Discussion Search (i.e., ad-hoc retrieval)
        - Pro/con retrieval
        - Query → ranked list of emails

  17. Topic Type Analysis
      Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)

  18. Limitations (Pro/Con Retrieval)
      • Not private/personal communication
      • Mailing lists → receivers are hidden
      • Topical categories are unbalanced
      • Developed by researchers, NOT users

  19. Related Projects
      • Others working with CMU’s Enron emails
        - Berkeley, CMU, U Mass, SIAM Workshop
      • University of Southern California ISI/ICT
        - eArchivarius, Postel collection (Anton Leuski)
      • Georgia Tech Research Institute PERPOS
        - Presidential records (Bill Underwood)

  20. Conclusion
      • Two email test collections
        - Public
        - Hundreds of thousands of emails
        - Annotated emails and transcripts
        - Tasks and ground truth
      • Need for “real” user needs
      • Development of evaluation measures for utility

  21. For More Information
      • Joint Institute for Knowledge Discovery
      • http://www.umiacs.umd.edu/jikd

  22. Running System
