The Enron and W3C Collections. Tamer Elsayed and Douglas W. Oard, University of Maryland. ICAIL 2007, DESI Workshop, June 4th, 2007.

The Enron and W3C Collections

Tamer Elsayed and Douglas W. Oard

University of Maryland

ICAIL 2007, DESI Workshop, June 4th, 2007


Variants of Email Search

[Figure labels: Searcher, Collection]


Rich multimodal data

  • Emails

  • Phone calls

  • Databases

The (Extended) Enron Collection

  • “Public” version of Enron collection (CMU)

    • 150 sets of rescued Outlook email folders

    • 517,431 emails, 52% duplicates, 133,581 unique addresses

    • Subset annotated with genre, speech act, mentioned calls, …

  • Extended Enron email collection (Aspen Systems)

    • Attachments, additional email (later release, redaction)

  • Phone calls from/to Enron traders (Snohomish PUD)

    • Transcribed subset from 52 DVDs of recorded audio

    • Recovered from scanned transcripts using OCR

    • 93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...

  • Relational databases (Aspen Systems)


Cross-References

[Figure labels: Phone Calls, Email]


Phone Call Transcripts

Message-ID: <24-20010126-19435570-20020114-R>

Message-Type: PhoneCall

Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)

From: [email protected]

To: [email protected]

Parties: [email protected], [email protected]

Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King

Subject-TimePos: 145, 313, 713, 775, 920, 1018

InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari

InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067

Keywords: CDWR, email, email

Keywords-TimePos: 55, 689, 1038

X-From: Stack, Shari <>

X-To: Wolfe, Greg <>

X-Parties: Stack, Shari <>, Wolfe, Greg <>

X-AudioFile: 24-20010126-19435570-20020114-R.wav

X-TranscriptFile: 24-20010126-19435570-20020114-R.txt

SHARI STACK: Hey.

GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]

SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]

GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?

SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting.[both laughing]

SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.

GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.

SHARI: [laughs]

GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -

SHARI: [laughs]
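The record above follows an email-style header layout, so a standard header parser can read it. The following is a minimal sketch, assuming each list-valued field (Subject, Keywords) is comma-separated and lines up one-to-one with its "-TimePos" companion field. The helper names `split_list` and `timed_items` are hypothetical, the record is an abbreviated copy of the slide's example, and the unit of the TimePos values is not stated in the slides.

```python
from email import message_from_string

# Abbreviated copy of the phone-call record shown on the slide.
RECORD = """\
Message-ID: <24-20010126-19435570-20020114-R>
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail
Subject-TimePos: 145, 313, 713
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038

SHARI STACK: Hey.
"""


def split_list(value):
    """Split a comma-separated header value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def timed_items(msg, field):
    """Pair each item of `field` with the matching `<field>-TimePos` entry.

    Assumption: both fields are comma-separated and have equal length.
    """
    labels = split_list(msg.get(field, ""))
    positions = [int(p) for p in split_list(msg.get(field + "-TimePos", ""))]
    if len(labels) != len(positions):
        raise ValueError(f"{field}: {len(labels)} items, {len(positions)} positions")
    return list(zip(labels, positions))


msg = message_from_string(RECORD)
topics = timed_items(msg, "Subject")
keywords = timed_items(msg, "Keywords")
transcript = msg.get_payload()
```

Here `topics` holds (topic, position) pairs and `transcript` holds the text after the header block. A field whose items themselves contain commas would break this split, so the length check is kept as a guard.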


Typical Enron Email

Message-ID: <[email protected]>

Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)

From: [email protected]

To: [email protected]

Subject: RE: Shhhh.... it's a SURPRISE !

X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>

X-To: [email protected]@ENRON'

Message Header

Hi Shari

Salutation

Message Body

Main Body

Hope all is well.

Count me in for the group present.

See ya next week if not earlier

Liza

Elizabeth Sager

713-853-6349

Signature Block

-----Original Message-----

From: [email protected]@ENRON

Sent: Monday, July 30, 2001 2:24 PM

To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]

Cc: [email protected]

Subject:Shhhh.... it's a SURPRISE !

Quoted Header

Quoted Text

Quoted Main Body

Please call me (713) 207-5233

Thanks!

Shari

Quoted Signature
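The zones labeled above (new text versus quoted text) can be separated roughly at the "-----Original Message-----" marker. This is a minimal sketch, assuming that marker is the only quoting convention present. The function name `split_quoted` and the pattern are illustrative, not the authors' method, and salutation or signature detection is not attempted.

```python
import re

# Outlook-style separator line, as it appears in the example email.
ORIGINAL_MESSAGE = re.compile(
    r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.MULTILINE
)


def split_quoted(body):
    """Return (new_text, quoted_text), split at the first separator line."""
    match = ORIGINAL_MESSAGE.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.end():].strip()


# Shortened copy of the example body from the slide.
BODY = """Hi Shari

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Liza

-----Original Message-----
Sent: Monday, July 30, 2001 2:24 PM
Subject:Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233
Thanks!
Shari
"""

new_text, quoted_text = split_quoted(BODY)
```

Replies quoted with ">" prefixes or other client-specific markers would need extra patterns.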


Research Problems (Enron)

  • Threading

  • Email Classification

  • Social Network Analysis

  • Mention Resolution


Who is that “Sheila”?

Date: Wed Dec 20 08:57:00 EST 2000

From: Kay Mann <[email protected]>

To: Suzanne Adams <[email protected]>

Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call

will be too late for him.



Rich Evidence about Identity

  • addr-name: 82,084

  • addr-nickname: 3,151

  • addr-addr: 19,708

  • 66,715 models

[Figure labels: m scott; susan m scott; m..[email protected]; susan scott; suebob; sue; sscott; susan; susan scott; [email protected]; sscott5; susan; susan m scott; friday; com members; scott susan; [email protected]; susan m scott; susan scott]
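As an illustration of the addr-name evidence counted above, the sketch below tallies (address, name) pairs from header values. It assumes the pairs come from display names in address headers, which the slides do not confirm. The function name `addr_name_counts` and the example.com addresses are hypothetical.

```python
from collections import Counter
from email.utils import getaddresses

# Hypothetical header values; the names echo the slide, the addresses do not.
HEADERS = [
    "Susan Scott <sscott@example.com>",
    "Sue <sscott@example.com>",
    "Susan M Scott <susan.scott@example.com>, Susan Scott <sscott@example.com>",
]


def addr_name_counts(header_values):
    """Count how often each (address, display name) pair is observed."""
    counts = Counter()
    for name, addr in getaddresses(header_values):
        if name and addr:  # skip bare addresses with no display name
            counts[(addr.lower(), name.lower())] += 1
    return counts


counts = addr_name_counts(HEADERS)
```

Each distinct key in `counts` is one addr-name association. Nickname and addr-addr evidence would need other sources, such as salutations or signatures, and is not covered here.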


Test Collection of Mention Resolution

  • Test collections: Enron-all, Enron-subset, Sager, Shapiro


Evaluation

  • Task

    • named-mention → ranked list of people

  • Measures

    • Mean Reciprocal Rank

    • Success @ K

      • Success @ 1

    • Confidence-based scoring
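The first two measures can be sketched as follows, assuming each query has exactly one correct person and the system returns a ranked list of candidate identifiers. The function names and the toy identifiers are hypothetical. Confidence-based scoring is omitted because the slides do not define it.

```python
def reciprocal_rank(ranked, correct):
    """Return 1/rank of the correct candidate, or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, correct) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct candidate is in the top k."""
    hits = sum(1 for ranked, correct in runs if correct in ranked[:k])
    return hits / len(runs)


# Toy runs: correct answer at rank 1, rank 2, rank 4, and not returned.
RUNS = [
    (["sheila_a", "sheila_b", "sheila_c"], "sheila_a"),
    (["scott_x", "scott_y"], "scott_y"),
    (["kay_m", "kay_n", "kay_o", "kay_p"], "kay_p"),
    (["sue_1", "sue_2"], "sue_9"),
]

mrr = mean_reciprocal_rank(RUNS)   # (1 + 1/2 + 1/4 + 0) / 4
s_at_1 = success_at_k(RUNS, 1)     # 1 of 4 queries
s_at_3 = success_at_k(RUNS, 3)     # 2 of 4 queries
```

Success @ 1 rewards only a correct top answer, while Mean Reciprocal Rank also gives partial credit to correct answers further down the list.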


Limitations (Mention Resolution)

  • Small number of queries

  • Only resolved by Enron employees

    • Much easier

    • Most participants are outsiders

  • Measures focus only on accuracy


Identity-Content Interplay

[Figure labels: Social Context, Search for People, Search for Content, Topical Context]


W3C Collection

  • Set of mailing lists

    • Public, not private

    • Topically oriented

  • ~175,000 emails

  • Introduced at TREC 2005

  • 50 topics (x 2 years)

  • Relevance judgments available for ad-hoc retrieval


Research Problems (W3C)

  • Expert Finding

    • Topic → ranked list of experts

  • Known-item Retrieval

    • Query → ranked list of emails

  • Discussion Search (i.e., ad-hoc retrieval)

    • Pro/con retrieval

    • Query → ranked list of emails


Topic Type Analysis

Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)


Limitations (Pro/Con Retrieval)

  • Not private/personal communication

  • Mailing lists → receivers are hidden

  • Topical categories are unbalanced

  • Developed by researchers NOT users


Related Projects

  • Others working with CMU’s Enron emails

    • Berkeley, CMU, U Mass, SIAM Workshop

  • University of Southern California ISI/ICT

    • eArchivarius, Postel collection (Anton Leuski)

  • Georgia Tech Research Institute PERPOS

    • Presidential records (Bill Underwood)


Conclusion

  • Two email test collections

    • Public

    • Hundreds of thousands of emails

    • Annotated emails and transcripts

    • Tasks and ground truth

  • Need for “real” user needs

  • Development of evaluation measures for utility


For More Information

  • Joint Institute for Knowledge Discovery

    • http://www.umiacs.umd.edu/jikd


Running System

