
Content and the Scan Statistic for the Enron Data

Content and the Scan Statistic for the Enron Data. John M. Conroy Institute for Defense Analyses Center for Computing Sciences Bowie, MD. Citations and Coauthors.


Presentation Transcript


  1. Content and the Scan Statistic for the Enron Data John M. Conroy Institute for Defense Analyses Center for Computing Sciences Bowie, MD

  2. Citations and Coauthors • C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park, “Scan Statistics on Enron Graphs,” Computational and Mathematical Organization Theory, to appear. http://www.ams.jhu.edu/~priebe/sseg.html • J.M. Conroy, J.D. Schlesinger, J. Goldstein, and D.P. O'Leary, “Left-Brain/Right-Brain Multi-Document Summarization.” http://www.nlpir.nist.gov/projects/duc/pubs.html

  3. Outline • Enron Data • Review of Scan Statistic • Content Analysis • Content of Week 109 (Chatter Week) • Communication vs. Content for Week 109

  4. Enron Email Data • Email boxes of 184 user accounts, mostly executives. • 55,362 stored messages (many duplicates). • 125,409 transactions (from-to pairs) among the 184 user accounts. • 189 weeks, from 1998 through 2002.

  5. Review of Scan Statistic • Anomaly Detection. • E.g. • Introduction of new actors. • Increase in communication between a group of people (chatter).
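Slide 5 gives only the motivation. A minimal sketch of the statistic itself, assuming the formulation in the cited Priebe et al. paper: the locality statistic of a vertex is the number of edges in the subgraph induced by its closed k-neighborhood, and the scan statistic is the maximum of that count over all vertices. The function name, the undirected simplification, and the toy edge list are assumptions for illustration; the graphs in the talk are weekly email graphs.

```python
from collections import defaultdict


def scan_statistic(edges, k=1):
    """Return (largest locality statistic, vertex attaining it) for an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops in this simplified sketch
            adj[u].add(v)
            adj[v].add(u)

    best_size, best_vertex = -1, None
    for v in sorted(adj):
        # Closed k-neighborhood: every vertex within k hops of v, including v.
        hood, frontier = {v}, {v}
        for _ in range(k):
            frontier = {w for u in frontier for w in adj[u]} - hood
            hood |= frontier
        # Locality statistic: edges with both endpoints inside the neighborhood.
        size = sum(1 for u in hood for w in adj[u] if w in hood) // 2
        if size > best_size:
            best_size, best_vertex = size, v
    return best_size, best_vertex


# Toy graph (hypothetical): a triangle a-b-c with a pendant vertex d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(scan_statistic(toy_edges))
```

In the setting of the talk, one such value would be computed per week, and a week whose value is unusually large relative to recent weeks would be flagged as a candidate "chatter" week.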

  6. Geez, What Were They Saying? • Data for week 109 • 1092 transactions among 22 users. • 343 files. • 91 unique messages. • What were the subjects of discussion?

  7. Counts and Subject Lines
     Count  Subject
     9
     5      Analysis of Joskow / Hogan Papers
     4      FERC Request
     3      Data on Monthly Generation for SCE
     3      Draft Talking points about California Gas market
     3      EnronOnline question
     3      Presentations from GA Meeting on December 8
     2      California Price Issues
     2      Conectiv / Delmarva
     2      Davis, Hoecker and Richardson
     2      FYI-Edison wants Reregulation
     1      Additional Arguments for Enron's Gas Cap Response
     1      Calif. Performance Issues
     1      California Update--12.12.00
     1      Capacity Release Info for Enron's Gas Cap Response

  8. Clustering Based on Content • Find emails with similar content based on the terms that occur in them. • Term: space-delimited string of characters from {a,b,c,…,z}, after text is lower cased and all other characters and stop words are removed. • Need to restrict our attention to indicative terms (signature terms). • Terms that occur more often than expected.
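The term definition on slide 8 can be sketched as follows. This is an illustrative reading, not the code used for the talk: it assumes that characters outside a-z are deleted rather than replaced by spaces, and the stop-word list is a hypothetical placeholder because the slides do not say which list was used.

```python
import re

# Hypothetical stop-word list for illustration only; the actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def extract_terms(text, stop_words=STOP_WORDS):
    """Lower-case the text, keep only a-z and whitespace, split, drop stop words."""
    lowered = text.lower()
    # Assumption: every character that is neither a-z nor whitespace is deleted.
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    return [term for term in letters_only.split() if term not in stop_words]


print(extract_terms("Re: Analysis of Joskow / Hogan Papers"))
```

Under this reading, punctuation inside a token simply disappears, so a hyphenated word collapses into a single term.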

  9. Signature Terms Terms that occur more often than expected • Based on a 2×2 contingency table of relevance counts. • Log-likelihood; equivalent to mutual information. • Dunning 1993, Hovy & Lin 2000.

  10. Hypothesis Testing • $H_0$: $P(C \mid t_i) = p = P(C \mid \neg t_i)$ • $H_1$: $P(C \mid t_i) = p_1 \neq p_2 = P(C \mid \neg t_i)$ • ML estimates of $p$, $p_1$, and $p_2$

  11. Likelihood of H0 vs. H1 and Mutual Information
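The formula for slide 11 is not in the transcript. As a hedged reconstruction, the standard likelihood-ratio form from Dunning (1993) is written out below; the count names $o_{11}, o_{12}, o_{21}, o_{22}$ and the indicator variable $T$ are notation assumed here, not taken from the slide. Let $o_{11}$ and $o_{12}$ be the counts with the term present, inside and outside the class $C$, let $o_{21}$ and $o_{22}$ be the corresponding counts with the term absent, and let $N$ be their sum.

```latex
\begin{align*}
b(k; n, x) &= \binom{n}{k}\, x^{k} (1-x)^{n-k}, \\
L(H_0) &= b(o_{11};\, o_{11}+o_{12},\, p)\; b(o_{21};\, o_{21}+o_{22},\, p), \\
L(H_1) &= b(o_{11};\, o_{11}+o_{12},\, p_1)\; b(o_{21};\, o_{21}+o_{22},\, p_2), \\
p &= \frac{o_{11}+o_{21}}{N}, \qquad
p_1 = \frac{o_{11}}{o_{11}+o_{12}}, \qquad
p_2 = \frac{o_{21}}{o_{21}+o_{22}}, \\
-2\log\lambda &= -2\log\frac{L(H_0)}{L(H_1)} = 2N\, I(T; C).
\end{align*}
```

Here $I(T; C)$ is the empirical mutual information (in nats) between the term indicator $T$ and the class indicator $C$, which is the sense in which the log-likelihood statistic is equivalent to mutual information.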

  12. Example: Subject: Re: Analysis of Joskow / Hogan Papers Sounds very good. Might be useful to get a "reputable" economist to write a paper that 1) describes traditional means for defining, identifying and mitigating market power, 2) compares those with the "new" means folks are coming up with these days, and 3) comments on the "split" in the academic community over the issues. When Steve Kean and I discussed the notion initially, thought it might be a good idea to gently "pile on" to the public discussion with the goal of making clear 1) just how complex this issue is and 2) how important it will be to have a thorough analysis (say, about 12+ months worth?) before rushing to judgment on anything Joskow might allege in his paper. Thoughts? Best, Jeff James D Steffes 12/11/2000 10:02 AM To: Alan Comnes/PDX/ECT@ECT, Joe Hartsoe/Corp/Enron@ENRON, Richard Shapiro/NA/Enron@Enron cc: Jeff Dasovich/NA/Enron@Enron, Mary Hain/HOU/ECT@ECT, Susan J Mara/NA/Enron@Enron Subject: Re: Analysis of Joskow / Hogan Papers Having read the Hogan paper, I think that the "academic" community is somewhat divided on this issue. If we want to move forward on the issues Joskow addresses, I would recommend that EPSA be the vehicle. The entire marketer / generator community needs to counter. What do people think about seeking activity through EPSA, WPTF, and/or IEP of CA to push back on the studies and analysis especially after the Dec 13 Order? I don't think that the discussions will be ending very soon. Jim Alan Comnes@ECT 12/07/2000 03:07 AM To: James D Steffes/NA/Enron@ENRON cc: Jeff Dasovich/NA/Enron@Enron, Susan J Mara/NA/Enron@ENRON, Mary Hain/HOU/ECT Subject: Re: Analysis of Joskow / Hogan Papers The Joskow/Kahn paper raises two issues: price above cost and witholding. Enron obviously has concerns with the "price above cost" analysis. I drafted some specific concerns and put them into a draft to Enron's reponse to Hoeker Question 1. 
Although the detail was dropped in the final draft, the basic technical concerns were laid out there. To really rebut Joskow/Kahn would take considerable work. Jeff D's idea was to write a paper that raised issues and indicated how complicated a "correct" response would be. The Joskow/Kahn withholding section has recieved criticism from the ISO so I am not sure Enron needs to respond to that. I think my bottom line now is that the debate at FERC will soon be over or enter a new stage on the 13th. As far as how a response would help us in California, I think requires a discussion with Jeff. Alan From: James D Steffes@ENRON on 12/05/2000 07:22 PM CST To: Alan Comnes/PDX/ECT@ECT, Jeff Dasovich/NA/Enron@Enron, Susan J Mara/NA/Enron@ENRON, Mary Hain/HOU/ECT@ECT cc: Subject: Analysis of Joskow / Hogan Papers Alan -- Before we bring in Seabron Adamson to do some analysis, I'd like your read of the Joskow and Hogan papers. When we have our understanding straight, let's talk. Jim ----- Forwarded by James D Steffes/NA/Enron on 12/05/2000 07:20 PM ----- Jeff Dasovich Sent by: Jeff Dasovich 11/30/2000 11:49 AM To: skean@enron.com, Richard Shapiro/NA/Enron@Enron, James D Steffes/NA/Enron@Enron, Sandra McCubbin/NA/Enron@Enron, Paul Kaufman/PDX/ECT@ECT, Joe Hartsoe/Corp/Enron@ENRON, Sarah Novosel/Corp/Enron@ENRON, Mary Hain/HOU/ECT@ECT, Karen Denne/Corp/Enron@ENRON, mpalmer@enron.com, Susan J Mara/NA/Enron@ENRON, Alan Comnes/PDX/ECT@ECT cc: Subject: From Today's Electricity Daily FYI. In bizarre times, help can sometimes come from bizarre places. Granted, we're likely to disagree strongly with Hogan's continued obsession with Poolco, but the discussion in his paper regarding market power may be helpful---I've read the Joskow paper, but haven't yet had a chance to review the Hogan piece. Steve and I discussed the need to do a focused assessment of the Joskow/Kahn "analysis" (remember it's Ed Kahn, not Alfred Kahn). 
Seems that it would be very useful to fold into that analysis any useful stuff on market power included in the paper done by Hogan & Co. If, in the end, there ain't nothing useful, so be it. But seems like there's little downside to exploring it. Jim, my understanding is that Alan is already working with the fundamentals folks on the Portland desk to deconstruct the Joskow paper. Might want to include the Hogan paper in those discussions and might also be useful to pull Seabron Adamson into the thinking, too. Ultimately, may be preferable to have any assessment of Joskow and/or Hogan to come from economists, rather than directly from us. Best, Jeff ----- Forwarded by Jeff Dasovich/NA/Enron on 11/30/2000 11:38 AM ----- "Daniel Douglass" <Douglass@ArterHadden.com> 11/30/2000 11:29 AM To: <Barbara_Klemstine@apsc.com>, <dcazalet@apx.com>, <BillR@calpine.com>, <jackp@calpine.com>, <glwaas@calpx.com>, <Ken_Czarnecki@calpx.com>, <cabaker@duke-energy.com>, <gavaughn@duke-energy.com>, <rjhickok@duke-energy.com>, <gtbl@dynegy.com>, <KEWH@dynegy.com>, <jdasovic@enron.com>, <susan_j_mara@enron.com>, <curt.Hatton@gen.pge.com>, <foothill@lmi.net>, <camiessn@newwestenergy.com>, <jcgardin@newwestenergy.com>, <rsnichol@newwestenergy.com>, <Nam.Nguyen@powersrc.com>, <rllamkin@seiworldwide.com>, <Roger.Pelote@Williams.com> cc: Subject: From Today's Electricity Daily Has FERC Gone Far Enough in California? The Federal Energy Regulatory Commission isn't going far enough in its attempt to reform the California wholesale electric market, according to a paper by three prominent economists done for San Diego Gas and Electric. The paper by John D. Chandley, Scott M. Harvey, and William W. Hogan argues that FERC should first end the artificial separation that divides the California Power Exchange and the California Independent System Operator, rather than worrying about the governance of the two institutions. 
"The change in governance may help," says the paper - "Electricity Market Reform in California" - "but it is not likely to be decisive in the near term. Explicit guidance from the commission regarding the nature and trajectory of reforms will be essential if market reform is to be accomplished within an acceptable time frame." Hogan, of the Kennedy School of Government, has been writing since 1995 in opposition to California's market separation. Also, argues the paper, freeing the California utilities to engage in forward contracting is no panacea. "The expectation that merely allowing utilities to participate in forward contracting necessarily would be the solution to high prices is problematic and not supported by the commission's staff report," says the analysis, adding that "putting pressure on buyers to sign contracts in the present environment may make things worse." If the underlying problem in California is high cost and low capacity, requiring forward contracting could harm not only California but also the entire Western U.S. electric system. FERC's $150 so-called "soft cap" is a wild card that has the three economists scratching their heads. "It does not appear in the staff report and there is little critical analysis of their implications, other than the discussion of Commissioner [Curt] Hebert." If the intent of the soft cap is to move toward cost justification for bids above $150/MWh, then FERC is headed into an administrative morass "that would rival those under wellhead price controls in the natural gas industry." If, on the other hand, the soft cap is "truly soft" and would only require some paper work at FERC and the possibility of a refund if the price is eventually deemed not just and reasonable, "there might be little impact on consumer prices (particularly if the principal sources of those high prices are high costs and regional capacity shortages rather than the exercise of market power). 
Even so, the proposal might serve to deter entry and new investments, thus combining the worst of both worlds, high consumer prices and little or no new investment." FERC's proposed order in California also demonstrates confusion about just what constitutes market power. The paper cites the proposed order's lawyerly, obfuscatory conclusion that "while this record does not support findings of specific exercises of market power, and while we are not able to reach definite conclusions about the actions of individual sellers, there is clear evidence that in California market structure and rules provide the opportunity for sellers to exercise market power when supply is tight and can result in unjust and unreasonable rates under the [Federal Power Act]." The economists note, "In this regard, the debate is confused because we are dancing around the words where the truth may be hard to face." In the case of California, say the economists, there is no evidence of market power. Even the practice of generators avoiding the day-ahead market in favor of the real-time market "is a response to bad market design and pricing incentives (including price caps), but does not demonstrate the exercise of market power." Nor is bidding above marginal cost necessarily an exercise of market power, they add. "The distinction between direct marginal cost and opportunity cost is sometimes lost in the discussion. Hence, a competitive bidder whose direct cost of generation is $40 but who could sell the same energy outside California for $100 should bid no less than $100. This would not be an exercise of market power."

  13. Example Signature Terms • analysis, california, com, economists, enron, hogan, joskow, kahn, market, na, paper, power
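A minimal sketch of how such a term list could be produced from 2×2 counts, assuming the standard Dunning log-likelihood statistic. The function names, the example counts, and the cutoff are assumptions: 10.83 is the chi-square critical value for one degree of freedom at the 0.001 level, and the cutoff used in the talk is not stated.

```python
import math


def log_likelihood_ratio(o11, o12, o21, o22):
    """-2 log(lambda) for a 2x2 table; both row totals must be positive.

    o11/o12: term present, inside/outside the class.
    o21/o22: term absent, inside/outside the class.
    """
    def xlogy(k, x):
        return 0.0 if k == 0 else k * math.log(x)

    def log_l(k, n, x):
        # Binomial log-likelihood without the binomial coefficient,
        # which cancels in the ratio.
        return xlogy(k, x) + xlogy(n - k, 1.0 - x)

    n1, n2 = o11 + o12, o21 + o22
    p = (o11 + o21) / (n1 + n2)
    p1, p2 = o11 / n1, o21 / n2
    return 2.0 * (log_l(o11, n1, p1) + log_l(o21, n2, p2)
                  - log_l(o11, n1, p) - log_l(o21, n2, p))


def signature_terms(tables, cutoff=10.83):
    """Terms over-represented in the class whose statistic reaches the cutoff."""
    selected = []
    for term, (o11, o12, o21, o22) in tables.items():
        over_represented = o11 / (o11 + o12) > o21 / (o21 + o22)
        if over_represented and log_likelihood_ratio(o11, o12, o21, o22) >= cutoff:
            selected.append(term)
    return sorted(selected)


# Hypothetical counts, for illustration only.
example_tables = {"joskow": (30, 5, 10, 55), "the": (10, 10, 10, 10)}
print(signature_terms(example_tables))
```

The direction check matters because the statistic itself is two-sided: it is also large for terms that occur much less often than expected.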

  14. Simple Clustering For each message compute signature terms. Form an n×d matrix F with F(i,j) = number of times signature term i occurs in document j. [R,P]=corrcoef(F); P is the d×d matrix of p-values for R.
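For readers without MATLAB, the correlation step can be sketched in plain Python. This is an illustrative stand-in, not the `corrcoef` call itself: it returns the d×d correlation matrix between documents (the columns of F) and a helper for the t statistic on n − 2 degrees of freedom from which p-values are conventionally derived. Turning that statistic into a p-value needs a t-distribution CDF from a statistics library and is left out. The function names and the toy matrix are assumptions.

```python
import math


def column_correlations(F):
    """Pearson correlation between the columns (documents) of an n-by-d matrix F."""
    n, d = len(F), len(F[0])
    cols = [[F[i][j] for i in range(n)] for j in range(d)]
    means = [sum(c) / n for c in cols]
    centered = [[x - m for x in c] for c, m in zip(cols, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    # Entries stay NaN when a column has zero variance (correlation undefined).
    R = [[float("nan")] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            if norms[a] > 0 and norms[b] > 0:
                dot = sum(x * y for x, y in zip(centered[a], centered[b]))
                R[a][b] = dot / (norms[a] * norms[b])
    return R


def t_statistic(r, n):
    """t statistic with n - 2 degrees of freedom for a sample correlation r."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy counts (hypothetical): 3 signature terms (rows) by 3 documents (columns).
F = [[2, 4, 0],
     [1, 2, 3],
     [0, 0, 5]]
R = column_correlations(F)
print(R[0][1], R[0][2])
```

Documents 0 and 1 in the toy matrix use the same terms in the same proportions, so their correlation is 1, while document 2 is negatively correlated with both.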

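  15. Simple Clustering (cont.) • Consider the graph G obtained by thresholding P: documents i and j are joined by an edge when their p-value P(i,j) falls below a chosen significance threshold. • Take the clusters as the connected components of G. • Thus, two documents are connected if there is a significant overlap in their signature terms!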
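The clustering step can be sketched as connected components of the thresholded p-value matrix. This is a minimal illustration using a union-find structure; the function name, the default threshold of 0.05, and the toy matrix are assumptions, since the threshold used in the talk is not recoverable from the transcript.

```python
def clusters_from_pvalues(P, threshold=0.05):
    """Connected components of the graph with an edge (i, j) when P[i][j] < threshold."""
    d = len(P)
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(d):
        for j in range(i + 1, d):
            if P[i][j] < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(d):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy p-values (hypothetical) for four documents.
P = [[0.0, 0.01, 1.0, 1.0],
     [0.01, 0.0, 0.5, 1.0],
     [1.0, 0.5, 0.0, 0.001],
     [1.0, 1.0, 0.001, 0.0]]
print(clusters_from_pvalues(P))
```

Because only the p-value is thresholded, a significantly negative correlation would also create an edge; filtering on the sign of R first is a possible variation.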

  16. Connected Components of Email

  17. Connected Components

  18. Summarizing the Clusters • Single Msg. Summarization • Score the sentences: • Given signature terms. • Want first “few” great sentences. • Want the probability that a sentence is a summary sentence.

  19. Summarizing (cont.) • Multi-document Summarization • Use HMM scores to select candidate sentences (~2w). • Terms as sentence features • Terms: {t1, …, tm} R^m • Sentences: {s1, …, sn} R^n • Scaling: || a || = HMM score • Use Pivoted QR to select sentences.
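The selection step can be sketched as a greedy Gram-Schmidt formulation of column-pivoted QR: each sentence is a column of term counts scaled so that its norm equals its HMM score, the column with the largest remaining norm is picked, and its direction is removed from all other columns before the next pick. This is an illustration under those assumptions, not the implementation behind the talk; the HMM scores are taken as given inputs, and the function name and toy data are hypothetical.

```python
import math


def select_sentences(term_vectors, hmm_scores, k):
    """Pick up to k sentence indices by greedy column-pivoted orthogonalization."""
    # Scale each sentence's term vector so its Euclidean norm equals its HMM score.
    cols = []
    for vec, score in zip(term_vectors, hmm_scores):
        norm = math.sqrt(sum(x * x for x in vec))
        scale = score / norm if norm > 0 else 0.0
        cols.append([scale * x for x in vec])

    chosen = []
    for _ in range(min(k, len(cols))):
        norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
        pivot = max(range(len(cols)), key=lambda j: norms[j])
        if norms[pivot] <= 1e-12:
            break  # every remaining sentence is already covered
        chosen.append(pivot)
        q = [x / norms[pivot] for x in cols[pivot]]
        # Remove the component along the chosen sentence from every column.
        for j, c in enumerate(cols):
            proj = sum(a * b for a, b in zip(q, c))
            cols[j] = [a - proj * b for a, b in zip(c, q)]
    return chosen


# Toy data (hypothetical): sentences 0 and 1 use the same terms; sentence 2 differs.
vectors = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
scores = [3.0, 2.9, 1.0]
print(select_sentences(vectors, scores, k=2))
```

The toy run shows the point of the pivoting: sentence 1 has the second-highest score, but after sentence 0 is chosen it contributes nothing new, so the lower-scoring but non-redundant sentence 2 is selected instead.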

  20. Summaries of Clusters 100 Words Summary of Cluster 7; 4 msgs /data/Enron/maildir/lavorato-j/_sent_mail/11. Subject: * The power mark to market book will pay NewAlb a capacity payment of $4.87 ... for 5 years. We shaped this payment as follows:... * Enron will also pay NewAlb $2.00/MW hour for varialbe o&m. /data/Enron/maildir/lavorato-j/sent_items/225. Subject: The following points refer to the methodology that we are taking to rebook the New Albany Plant. Please send me a note immediately if you disagree.... Assume that NewAlb is a non mark to market entity and Enron is the mark to market entity. However, it is fully owned and operated by us for now.... * This will create an entity "NewAlb" that will return 9% assuming a book value of $336/kw on 12/31/2005 vs. 409 currently.

  21. Cluster 3: 6 Documents /data/Enron/maildir/dasovich-j/all_documents/4665. Subject: Re: To: Jeff Dasovich/NA/Enron@Enron... From: Jeff Dasovich on 12/13/2000 10:36 AM... Sent by: Jeff Dasovich... To: Richard Shapiro/NA/Enron@Enron... cc: ... Kahn's secretary has left messages that he's "very tied up" and continues to ... try to contact me. Suggests to me that they may be planning something and /data/Enron/maildir/dasovich-j/all_documents/4562. Subject: Re: Presentations from GA Meeting on December 8 From: Jeff Dasovich on 12/12/2000 10:57 AM... Sent by: Jeff Dasovich... To: Richard Shapiro/NA/Enron@Enron /data/Enron/maildir/dasovich-j/all_documents/4688. Subject: Re: Davis, Hoecker and Richardson From: Jeff Dasovich on 12/13/2000 12:50 PM... Sent by: Jeff Dasovich... Word on the street is that Davis, Hoecker and Richardson are meeting in D.C. /data/Enron/maildir/dasovich-j/all_documents/4565. Subject: Re: Presentations from GA Meeting on December 8 From: Jeff Dasovich on 12/12/2000 10:57 AM... Sent by: Jeff Dasovich... To: Richard Shapiro/NA/Enron@Enron >>

  22. Cluster 1: 35 documents 200 Words Summary of Email Cluster 1 /data/Enron/maildir/taylor-m/all_documents/3944. Subject: EnronOnline question To: Stacy E Dickson/HOU/ECT@ECT... cc: Mark Taylor/HOU/ECT@ECT /data/Enron/maildir/dasovich-j/all_documents/4693. Subject: FYI-Edison wants Reregulation "Stephanie-Newell" <stephanie-newell@reliantenergy.com>, "Sue Mara" ... <smara@enron.com>, "Tom Ross" <tross@mcnallytemple.com>, "Kate Castillo" ... <ccastillo@riobravo-gm.com>, "Bill Carlson" <wcarlson@wm.com>, "Bill Woods" ... <billw@calpine.com>, "Bob Escalante" <rescalante@riobravo-gm.com>, "Carolyn ... Baker" <cabaker@duke-energy.com>, "Cody Carter" <cody.carter@williams.com>, ... "Curt Hatton" <curt.hatton@gen.pge.com>, "Curtis Kebler" ... <curtis_l_kebler@reliantenergy.com>, "Dave Parquet" <dparque@ect.enron.com>, ... "Dean Gosselin" <dean_gosselin@fpl.com>, "Duane Nelsen" ... <eileenk@calpine.com>, "Eric Eisenman" <eric.eisenman@neg.pge.com>, "Frank ... DeRosa" <frank.derosa@gen.pge.com>, "Greg Blue" <gtbl@dynegy.com>, "Hap Boyd" ... <hap.boyd@enron.com>, "Jack Pigott" <jackp@calpine.com>, "Jeff Dasovich" ... <jdasovic@enron.com>, "Jim Willey" <elliottsa@earthlink.net>, "Joe Greco" ... <joe.greco@uaecorp.com>, "Joe Ronan" <joer@calpine.com>, "Jonathan Weisgall" ... <jweisgall@aol.com>, "Ken Hoffman" <khoffman@caithnessenergy.com>, "Kent ... McFadden" <marty_mcfadden@ogden-energy.com>, "Paula Soos" /data/Enron/maildir/dasovich-j/all_documents/4460. Subject: Re: Data on Monthly Generation for SCE McCubbin/NA/Enron@Enron, Tim Belden/HOU/ECT@ECT, Robert Badeer/HOU/ECT@ECT, ... Chris H Foster/HOU/ECT@ECT, Susan J Mara/NA/Enron@ENRON, Alan ... Comnes/PDX/ECT@ECT /data/Enron/maildir/kean-s/all_documents/2288. Subject: FW: NEW HARASSMENT Alan.Comnes@enron.com; Jeff.Dasovich@enron.com; >>

  23. Ada Lovelace • "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths."

  24. Reprocess the Data! • Remove any line with 2 or more @’s. • Re-compute signature terms. • Re-compute clusters. • Re-compute summaries! • Do over!
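The first reprocessing step is simple enough to state as code. A minimal sketch, with a hypothetical function name and sample message: any line containing two or more "@" characters, which is typical of address lists, is dropped before signature terms are recomputed.

```python
def strip_address_lines(message, max_at_signs=1):
    """Drop every line that contains more than max_at_signs '@' characters."""
    kept = [line for line in message.splitlines() if line.count("@") <= max_at_signs]
    return "\n".join(kept)


# Hypothetical message text for illustration.
sample = (
    "To: a@example.com, b@example.com\n"
    "Subject: FERC Request\n"
    "Contact me at jeff@example.com\n"
    "Drew is okay with this."
)
print(strip_address_lines(sample))
```

A line with a single address survives the filter, so body text that mentions one contact is kept while long To/cc lists are removed.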

  25. New t=109, Iter 1

  26. New t=109, Iter 2

  27. New t=109, Iter 8

  28. “Your mother is near! So, as fast as you can, Think of something to do! You will have to get rid of Thing One and Thing Two!” • Subject: Organizational Changes ---------------------- Forwarded by Richard Shapiro/NA/Enron on 12/08/2000= • Subject: Re: Analysis of Joskow / Hogan Papers Having read the Hogan paper, I think that the "academic" community is ... paper by three prominent economists done for San Diego Gas and Electric. The ... paper by John D. Chandley, Scott M. Harvey, and William W. Hogan argues that • Subject: Hogan-California Market Power FYI. Not sure if you had seen this. Hogan makes many of the arguments about • Subject: Re: Draft Talking points about California Gas market Given the way the numbers came out, I guess we don't need the talking points, • Subject: Re: FERC Request Drew is okay with this. I will email the list to FERC. • Subject: Update on FERC California Gas/Electric Matters into the California market last summer.... Various Enron units continue to receive informal data requests from FERC ... staff regarding current California gas/electric

  29. Related News Item January 13, 2001 Leading economists Paul Joskow and Edward Kahn conclude that high wholesale prices observed in summer 2000 [in California] cannot be explained as the natural outcome of ‘market fundamentals’ in competitive markets since there is a very significant gap between actual market prices and competitive benchmark prices. (Source: CATO Policy Analysis) http://cantwell.senate.gov/news/releases/2002_04_18_consumer.html

  30. Content vs. Contact Given F, the term-document matrix of signature terms and emails sent during a period, consider an induced dot product graph based on the correlation of the signature terms.

  31. A Content-Based Dot Product Graph Let P(Aij=1)~Rij, the correlation coefficient of documents i and j. Note that this dot-product graph is based on the correlation of two sparse vectors and is not low-dimensional!
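One way to read slide 31 is as a random graph in which each pair of documents is joined independently with probability given by their correlation. A minimal sketch under that reading; clipping negative correlations to zero, the fixed seed, and the function name are assumptions made for illustration and are not stated on the slide.

```python
import random


def sample_content_graph(R, seed=0):
    """Sample an undirected graph with P(edge i-j) = R[i][j] clipped to [0, 1]."""
    rng = random.Random(seed)
    d = len(R)
    edges = []
    for i in range(d):
        for j in range(i + 1, d):
            # Assumption: correlations outside [0, 1] are clipped to the nearest bound.
            prob = min(1.0, max(0.0, R[i][j]))
            if rng.random() < prob:
                edges.append((i, j))
    return edges


# Toy correlation matrix (hypothetical).
R = [[1.0, 1.0, -0.5],
     [1.0, 1.0, 0.0],
     [-0.5, 0.0, 1.0]]
print(sample_content_graph(R))
```

Comparing such a content graph with the observed communication graph for the same week is the kind of content-versus-contact comparison the following slides make.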

  32. Communication and Content

  33. Various Thresholds

  34. Content 109 vs. All Communication

  35. Rho Threshold Plot

  36. Content and Communication are not the Same Example: (due to Libby Beer) Alice & Bob exchange love letters and Carol & Dave exchange love letters DOES NOT imply Alice & Dave send love letters!

  37. Conclusions • The scan statistic on graphs rocks! • Summarization methods are useful in analyzing email, but exploiting the structure of email is integral. • Content is correlated with communication but shows about 18% of contacts. • Probability of communication correlates with message content correlation!

  38. Future Work • Content scan statistic would track changing user interests. • Augment the communication information. • Predict “love is in the air.” • Content scan statistic with nodes being documents! • E.g., emerging themes in research papers.
