Planning for the TREC 2008 Legal Track




Douglas Oard

Stephen Tomlinson

Jason Baron


Agenda

  • Track goals

  • Deciding on a document collection

  • “Beating Boolean”

  • Handling nasty OCR

  • Making the best use of the metadata

  • Ad hoc task design

  • Interactive task design

  • Relevance feedback task design

  • Other issues


Track Goals

  • Develop a reusable test collection

    • Documents, topics, evaluation measures

  • Foster formation of a research community

  • Establish baseline results


Choosing a Collection

  • FERC Enron (w/attachments, full headers)

    • Somewhat larger than CMU

    • Email is the real killer app for E-discovery

  • IIT CDIP version 1.0 (same as 2006/07)

    • We have 83 topics. Do we need more?

  • State Department Cables

    • Task model would be FOIA, not E-Discovery


  • TREC Topic Number: 1

  • Title: Marketers or Traders of Electricity on the Financial Market

  • Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.

  • Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.

  • Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.

  • Query Possibilities:

    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)

    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)

      • This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.

    • (marketer or marketers or EPMI) and (short or long)

      • As in have a long or short position in sales/purchases.

    • (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)

      • The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)

      • EOL was the forward market trading place. (36, p. 3)
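
To make the structure of these candidate queries concrete, here is a minimal sketch of how the second one (marketer terms AND electricity-unit terms) might be evaluated against message text. The function names and any sample messages are hypothetical, matching is plain case-insensitive word and phrase matching, and `watt*` is assumed to be a prefix wildcard; a real e-discovery search engine may treat these operators differently.

```python
import re

def term_matches(term: str, text: str) -> bool:
    """Case-insensitive match of a word, a phrase, or a prefix wildcard such as watt*."""
    if term.endswith("*"):
        # Assumption: a trailing * means "any word starting with this prefix".
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def any_of(terms, text: str) -> bool:
    """Boolean OR over a group of terms."""
    return any(term_matches(t, text) for t in terms)

# Term groups copied from the query on the slide.
MARKETER_TERMS = ["marketer", "marketers", "Enron Power Marketing", "EPMI",
                  "Enron Energy Services", "Enron Energy Marketing Corporation"]
ELECTRICITY_TERMS = ["MW", "KW", "watt*", "MwH", "KwH"]

def electricity_marketer_query(text: str) -> bool:
    """(marketer terms) AND (electricity unit terms)."""
    return any_of(MARKETER_TERMS, text) and any_of(ELECTRICITY_TERMS, text)
```

The other candidate queries have the same shape: swap the second term group for (short or long) or for the list of trading hubs.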


Identity Modeling in Enron

[Figure: identity-model graph linking email addresses to candidate names and nicknames, e.g. “susan m scott”, “susan scott”, “scott susan”, “m scott”, “sscott”, “sscott5”, “sue”, “suebob”. Counts shown: 82,084 addr-name, 3,151 addr-nickname, 19,708 addr-addr; 66,715 models.]


Enron Identity Test Collections

  • Test collections: Enron-all, Enron-subset, Sager, Shapiro


Example Document

[Figure: scanned page image, shown with its OCR text and metadata record]

OCR:

Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa

Benffrts Departmext Rieh>pwna, Yfe&ia

Ta: Dishlbutfon Data aday 90,1997.

From: Lisa Fislla

Sabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsU

During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng

artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a

msiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with my

Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .

I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on

whetlne you concur with my reeommendatioa

Metadata:

  • Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY

  • Organization Authors: PMUSA, PHILIP MORRIS USA

  • Person Authors: HALLE, L

  • Document Date: 19970530

  • Document Type: MEMO, MEMORANDUM

  • Bates Number: 2078039376/9377

  • Page Count: 2

  • Collection: Philip Morris


State Department Cables

791,857 records – 550,983 of which are full text




Handling Nasty OCR

  • Index pruning

  • Error estimation

  • Character n-grams

  • Duplicate detection

  • Expansion using a cleaner collection
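
As an illustration of the character n-gram idea, the sketch below compares words by the overlap of their character trigrams, so that an OCR-damaged token can still partially match its clean form. The function names and the choice of Jaccard overlap are assumptions made for this example, not something the slides specify.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets (0.0 when neither has any n-grams)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)

# "Newsbttsr" is the OCR rendering of "Newsletter" in the example document.
clean, ocr, unrelated = "Newsletter", "Newsbttsr", "Strategy"
print(ngram_similarity(clean, ocr) > ngram_similarity(clean, unrelated))
```

An exact word match would score the OCR token as a complete miss, whereas the trigram overlap is small but nonzero, which is the property that makes n-gram indexing attractive for noisy text.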


How to “Beat Boolean”

  • Work from reference Boolean?

    • Swap out low-ranked-in for high-ranked-out

  • Relax Boolean somehow?

    • Cover density, proximity perturbation, …
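
One possible reading of "swap out low-ranked-in for high-ranked-out" is sketched below: start from the reference Boolean result set, rank the whole collection with some ranked-retrieval system, and exchange the worst-ranked documents inside the Boolean set for the best-ranked documents outside it, keeping the set size fixed. The function name, the parameter `k`, and the stopping rule are assumptions for illustration only.

```python
def swap_low_in_for_high_out(boolean_set, ranking, k):
    """Swap up to k of the lowest-ranked Boolean hits for the highest-ranked non-hits.

    boolean_set: document ids returned by the reference Boolean query
    ranking:     document ids ordered best-first by a ranked-retrieval system
    """
    rank = {doc: i for i, doc in enumerate(ranking)}
    worst = len(ranking)  # documents missing from the ranking count as ranked last
    inside_worst_first = sorted(boolean_set, key=lambda d: rank.get(d, worst), reverse=True)
    outside_best_first = [d for d in ranking if d not in boolean_set]

    result = set(boolean_set)
    for drop, add in zip(inside_worst_first[:k], outside_best_first[:k]):
        if rank[add] >= rank.get(drop, worst):
            break  # the outside candidate no longer outranks the inside one
        result.remove(drop)
        result.add(add)
    return result
```

Because each swap removes one document and adds one, the returned set has the same size as the Boolean set, which keeps the comparison against the Boolean baseline at an equal cutoff.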


Using Metadata

  • Title (term match)

  • Author (social network)

  • Bates number (sequence)
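
To illustrate how Bates-number sequence could be used, the sketch below parses a Bates field such as the `2078039376/9377` shown in the example document and checks whether two documents are adjacent in the numbering. It assumes that the part after the slash replaces the trailing digits of the first number, which is consistent with that document's page count of 2 but is not confirmed by the slides; the function names are hypothetical.

```python
def bates_range(field: str) -> tuple:
    """Parse 'FIRST/SUFFIX' into (first_page, last_page) as integers.

    Assumption: SUFFIX overwrites the trailing digits of FIRST,
    so '2078039376/9377' covers pages 2078039376 through 2078039377.
    """
    first, _, suffix = field.strip().partition("/")
    if not suffix:
        return int(first), int(first)  # single-page document
    keep = max(0, len(first) - len(suffix))
    return int(first), int(first[:keep] + suffix)

def page_count(field: str) -> int:
    """Number of pages implied by the Bates range."""
    first, last = bates_range(field)
    return last - first + 1

def are_adjacent(earlier: str, later: str) -> bool:
    """True when the later document starts right after the earlier one ends."""
    return bates_range(earlier)[1] + 1 == bates_range(later)[0]
```

Adjacent documents were often filed together, so adjacency in the Bates sequence could serve as a weak signal that two documents are related.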


Ad Hoc Task Design

  • Evaluation measures

    • [email protected]?, [email protected]?, Index size?

    • Error bars / Statistical significance testing

    • Limits on post-hoc use of the collection?

    • What are “meaningful” differences?

  • Topic design

    • Negotiation transcript?

  • Inter-annotator agreement
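
The slides do not say which statistic would be used for inter-annotator agreement. As one standard option, the sketch below computes Cohen's kappa for two assessors who judged the same documents; the function name and the sample labels are illustrative only.

```python
def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: agreement between two assessors, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of documents given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each assessor labeled independently at their own rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# 1 = relevant, 0 = not relevant, for four documents judged by both assessors.
assessor_1 = [1, 1, 0, 0]
assessor_2 = [1, 0, 0, 0]
print(cohens_kappa(assessor_1, assessor_2))
```

Raw percent agreement can look high simply because most documents are non-relevant; kappa discounts the agreement that would be expected by chance.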


Interactive Track Design

  • Evaluation measure

    • Precision-oriented?

    • Recall-oriented?

    • Effect of assessor disagreement


Relevance Feedback Task

  • Evaluation measure

    • Residual recall at B_Residual?

  • Two-stage feedback?
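
A minimal sketch of residual recall follows, under the usual residual-collection assumption: documents whose judgments were already given to the system as feedback are removed from both the ranking and the relevant set before recall is measured at the cutoff. The slides do not define B_Residual precisely; here it is taken to be a rank cutoff on the residual ranking, and all names are hypothetical.

```python
def residual_recall_at(ranking, relevant, already_judged, b_residual: int) -> float:
    """Recall at cutoff b_residual, computed after removing already-judged documents.

    ranking:        document ids ordered best-first
    relevant:       ids of all documents judged relevant
    already_judged: ids whose judgments were supplied as feedback
    """
    judged = set(already_judged)
    residual_ranking = [d for d in ranking if d not in judged]
    residual_relevant = set(relevant) - judged
    if not residual_relevant:
        return 0.0  # nothing left to find
    retrieved = residual_ranking[:b_residual]
    hits = sum(1 for d in retrieved if d in residual_relevant)
    return hits / len(residual_relevant)
```

Removing the feedback documents first prevents a system from being credited for simply returning the relevant documents it was told about.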


Some Open Questions

  • Test collection reusability

    • Unbiased estimates? Tight error bars?

  • Why can’t we beat Boolean???

    • Different strategies? Detailed failure analysis?

  • Can we improve topic formulation?

    • Structured relevance feedback?

  • Is OCR masking effects we need to see?

    • Is it time for a new collection?

    • Must it be de-duped? Is metadata needed?

  • Does a change in scope (Δscope) invalidate the interactive task?

