Planning for the TREC 2008 Legal Track

Douglas Oard

Stephen Tomlinson

Jason Baron


Agenda

  • Track goals

  • Deciding on a document collection

  • “Beating Boolean”

  • Handling nasty OCR

  • Making the best use of the metadata

  • Ad hoc task design

  • Interactive task design

  • Relevance feedback task design

  • Other issues


Track Goals

  • Develop a reusable test collection

    • Documents, topics, evaluation measures

  • Foster formation of a research community

  • Establish baseline results


Choosing a Collection

  • FERC Enron (w/attachments, full headers)

    • Somewhat larger than CMU

    • Email is the real killer app for E-discovery

  • IIT CDIP version 1.0 (same as 2006/07)

    • We have 83 topics. Do we need more?

  • State Department Cables

    • Task model would be FOIA, not E-Discovery


  • TREC Topic Number: 1

  • Title: Marketers or Traders of Electricity on the Financial Market

  • Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.

  • Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.

  • Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.

  • Query Possibilities:

    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)

    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)

      • This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.

    • (marketer or marketers or EPMI) and (short or long)

      • As in having a long or short position in sales/purchases.

    • (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)

      • The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)

      • EOL was the forward market trading place. (36, p. 3)
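
The query possibilities above are all conjunctions of OR-groups, with quoted phrases and a trailing wildcard (watt*). A minimal Python sketch of how such a query might be matched against one document is given here. The tokenizer, the function names, and the two sample document texts are illustrative assumptions; they are not part of the track's actual Boolean system.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(term, tokens):
    """Match one query term: a prefix wildcard (watt*), a phrase, or a word."""
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return any(tok.startswith(prefix) for tok in tokens)
    words = tokenize(term)
    n = len(words)
    # A phrase matches when its words appear as consecutive tokens.
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def boolean_match(groups, text):
    """True if every OR-group has at least one matching term (AND of ORs)."""
    tokens = tokenize(text)
    return all(any(term_matches(t, tokens) for t in group) for group in groups)

# Second query possibility from the slide, written as AND-of-OR groups.
query = [
    ["marketer", "marketers", "Enron Power Marketing", "EPMI",
     "Enron Energy Services", "Enron Energy Marketing Corporation"],
    ["MW", "KW", "watt*", "MwH", "KwH"],
]

# Hypothetical document texts, invented for illustration only.
doc_power = "EPMI sold 50 MW for delivery next quarter."
doc_gas = "The marketer confirmed the natural gas contract."

print(boolean_match(query, doc_power))
print(boolean_match(query, doc_gas))
```

The second group is what separates the two samples: both mention a marketer or EPMI, but only the first contains an electricity unit.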


Identity Modeling in Enron

  • addr-name: 82,084

  • addr-nickname: 3,151

  • addr-addr: 19,708

  • 66,715 models

[Figure: example identity model for one person. It links three email addresses (obscured in the source) to the strings “susan m scott”, “susan scott”, “scott susan”, “m scott”, “susan”, “sue”, “suebob”, “sscott”, and “sscott5”, along with extracted noise strings such as “again”, “ciao”, “friday”, and “com members”.]



Enron Identity Test Collections

Test collections:

  • Enron-all

  • Enron-subset

  • Sager

  • Shapiro


Example Document

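[Slide panels: scanned image, OCR text, and metadata. The OCR text is reproduced first below, followed by the metadata.]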

Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa

Benffrts Departmext Rieh>pwna, Yfe&ia

Ta: Dishlbutfon Data aday 90,1997.

From: Lisa Fislla

Sabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsU

During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng

artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a

msiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with my

Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .

I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on

whetlne you concur with my reeommendatioa

Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY

Organization Authors: PMUSA, PHILIP MORRIS USA

Person Authors: HALLE, L

Document Date: 19970530

Document Type: MEMO, MEMORANDUM

Bates Number: 2078039376/9377

Page Count: 2

Collection: Philip Morris


State Department Cables

791,857 records – 550,983 of which are full text



Handling Nasty OCR

  • Index pruning

  • Error estimation

  • Character n-grams

  • Duplicate detection

  • Expansion using a cleaner collection
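
As one illustration of the character n-gram idea listed above, the sketch below compares a clean word with OCR-damaged strings taken from the example document. Jaccard overlap of character trigrams is an assumed scoring choice, and the function names are hypothetical; the slides do not say how n-grams would actually be used.

```python
def char_ngrams(word, n=3):
    """Return the set of overlapping character n-grams of a lowercased word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

clean = "newsletter"
# OCR strings copied from the example document on the earlier slide.
for candidate in ["Newsbttsr", "aawslener", "disanision"]:
    print(candidate, round(ngram_jaccard(clean, candidate), 3))
```

The two damaged renderings of "newsletter" still share some trigrams with the clean word, whereas an unrelated damaged word shares none. An exact term match would miss all three.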


How to “Beat Boolean”

  • Work from reference Boolean?

    • Swap out low-ranked-in for high-ranked-out

  • Relax Boolean somehow?

    • Cover density, proximity perturbation, …
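
One possible reading of "swap out low-ranked-in for high-ranked-out" is sketched below. It starts from the reference Boolean result set, scores every document with some ranked-retrieval method, then trades the k weakest Boolean hits for the k strongest non-hits so that the set size stays the same. The function name, the score dictionary, and the document ids are assumptions for illustration.

```python
def swap_boolean(boolean_set, scores, k):
    """Replace the k lowest-scored Boolean hits with the k highest-scored misses.

    boolean_set: ids returned by the reference Boolean query.
    scores: dict mapping document id to a ranked-retrieval score (assumed given).
    """
    # Boolean hits ordered from weakest to strongest score.
    inside = sorted(boolean_set, key=lambda d: scores.get(d, 0.0))
    # Documents outside the Boolean set, ordered from strongest to weakest.
    outside = sorted((d for d in scores if d not in boolean_set),
                     key=lambda d: scores[d], reverse=True)
    k = min(k, len(inside), len(outside))
    return (set(boolean_set) - set(inside[:k])) | set(outside[:k])

# Hypothetical ids and scores, invented for illustration only.
boolean_hits = {"d1", "d2", "d3"}
scores = {"d1": 0.9, "d2": 0.1, "d3": 0.5, "d4": 0.8, "d5": 0.2}
print(sorted(swap_boolean(boolean_hits, scores, k=1)))
```

Setting k to 0 reproduces the reference Boolean run, so k controls how far the submission departs from it.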


Using Metadata

  • Title (term match)

  • Author (social network)

  • Bates number (sequence)
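
To illustrate the Bates number sequence idea, the sketch below expands the example document's Bates field, 2078039376/9377, into a first and last page number. It assumes the part after the slash replaces the trailing digits of the first number. That convention is inferred from the example's page count of 2, not documented in the slides, and the function names are hypothetical.

```python
def parse_bates_range(bates):
    """Expand a Bates field such as '2078039376/9377' into (first, last).

    Assumption: the suffix after '/' replaces the trailing digits of the
    first number. A field without '/' is treated as a single page.
    """
    start, sep, suffix = bates.partition("/")
    if not sep:
        return int(start), int(start)
    end = start[:len(start) - len(suffix)] + suffix
    return int(start), int(end)

def follows(previous, current):
    """True if `current` starts on the page right after `previous` ends."""
    return parse_bates_range(current)[0] == parse_bates_range(previous)[1] + 1

first, last = parse_bates_range("2078039376/9377")
print(first, last, last - first + 1)

# "2078039378" is a hypothetical neighbouring document, not from the collection.
print(follows("2078039376/9377", "2078039378"))
```

Adjacent ranges of this kind could hint that two scanned documents were filed together, which is one way sequence information might be used.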


Ad Hoc Task Design

  • Evaluation measures

    • Two cutoff-based measures (their “…@…” names are obscured in the source)?, Index size?

    • Error bars / Statistical significance testing

    • Limits on post-hoc use of the collection?

    • What are “meaningful” differences?

  • Topic design

    • Negotiation transcript?

  • Inter-annotator agreement
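
Inter-annotator agreement is often summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not name a statistic, so kappa is an assumed choice here, and the judgment lists are invented for illustration.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two assessors' labels on the same documents."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty judgment lists")
    n = len(a)
    # Fraction of documents on which the two assessors agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each assessor's own label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgments (1 = relevant, 0 = not relevant).
assessor_1 = [1, 1, 0, 0]
assessor_2 = [1, 0, 1, 0]
print(cohen_kappa(assessor_1, assessor_1))
print(cohen_kappa(assessor_1, assessor_2))
```

In the second comparison the assessors agree on half of the documents, which is exactly what chance predicts for these label frequencies.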


Interactive Track Design

  • Evaluation measure

    • Precision-oriented?

    • Recall-oriented?

    • Effect of assessor disagreement


Relevance Feedback Task

  • Evaluation measure

    • Residual recall at B_Residual?

  • Two-stage feedback?
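
A minimal sketch of a residual recall computation follows, assuming the usual residual-collection convention: documents whose judgments were already supplied as feedback are removed from both the ranking and the relevant set before recall is measured at the cutoff B_Residual. The exact definition the track would adopt is not given in the slides, and all names and ids below are illustrative.

```python
def residual_recall(ranking, relevant, judged, b_residual):
    """Recall at cutoff b_residual after removing already-judged documents.

    ranking: document ids in ranked order.
    relevant: ids known to be relevant.
    judged: ids whose judgments were given to the system as feedback.
    """
    residual_ranking = [d for d in ranking if d not in judged]
    residual_relevant = set(relevant) - set(judged)
    if not residual_relevant:
        return 0.0
    retrieved = set(residual_ranking[:b_residual])
    return len(retrieved & residual_relevant) / len(residual_relevant)

# Hypothetical run, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d6"}
judged = {"d1", "d2"}
print(residual_recall(ranking, relevant, judged, b_residual=2))
```

Removing the judged documents keeps a system from getting credit for simply returning the relevant documents it was told about.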


Some Open Questions

  • Test collection reusability

    • Unbiased estimates? Tight error bars?

  • Why can’t we beat Boolean???

    • Different strategies? Detailed failure analysis?

  • Can we improve topic formulation?

    • Structured relevance feedback?

  • Is OCR masking effects we need to see?

    • Is it time for a new collection?

    • Must it be de-duped? Is metadata needed?

  • Does a change in scope (Δscope) invalidate the interactive task?

