Planning for the TREC 2008 Legal Track




Planning for the TREC 2008 Legal Track

Douglas Oard

Stephen Tomlinson

Jason Baron

Agenda
  • Track goals
  • Deciding on a document collection
  • “Beating Boolean”
  • Handling nasty OCR
  • Making the best use of the metadata
  • Ad hoc task design
  • Interactive task design
  • Relevance feedback task design
  • Other issues
Track Goals
  • Develop a reusable test collection
    • Documents, topics, evaluation measures
  • Foster formation of a research community
  • Establish baseline results
Choosing a Collection
  • FERC Enron (w/attachments, full headers)
    • Somewhat larger than CMU
    • Email is the real killer app for E-discovery
  • IIT CDIP version 1.0 (same as 2006/07)
    • We have 83 topics. Do we need more?
  • State Department Cables
    • Task model would be FOIA, not E-Discovery

TREC Topic Number: 1

  • Title: Marketers or Traders of Electricity on the Financial Market
  • Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.
  • Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.
  • Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.
  • Query Possibilities:
    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)
    • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
      • This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.
    • (marketer or marketers or EPMI) and (short or long)
      • As in have a long or short position in sales/purchases.
    • (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
      • The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
      • EOL was the forward market trading place. (36, p. 3)
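To make the query possibilities above concrete, here is a minimal Python sketch of how the query "(marketer or marketers or EPMI) and (short or long)" could be evaluated against tokenized text. The helper names (`tokens`, `any_term`, `matches_query`) and the sample texts are hypothetical and are not drawn from the collection. A real Boolean engine would also need phrase matching, wildcards such as watt*, and proximity operators.

```python
import re

def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def any_term(doc_tokens, terms):
    """OR group: true if at least one term occurs in the document."""
    return any(t.lower() in doc_tokens for t in terms)

def matches_query(text):
    """Evaluate: (marketer or marketers or EPMI) and (short or long)."""
    doc = tokens(text)
    return (any_term(doc, ["marketer", "marketers", "EPMI"])
            and any_term(doc, ["short", "long"]))

# Hypothetical sample texts, invented for illustration only.
docs = {
    "d1": "EPMI holds a long position at Palo Verde",
    "d2": "The marketer called about gas storage",
    "d3": "Short squeeze expected next week",
}
hits = [doc_id for doc_id, text in docs.items() if matches_query(text)]
```

Only the first sample text satisfies both OR groups, which illustrates why the AND clause narrows the result set so sharply.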
Identity Modeling in Enron
  • 66,715 models
  • 82,084 addr-name
  • 3,151 addr-nickname
  • 19,708 addr-addr
  • Example (figure): node labels include m scott, susan m scott, susan scott, scott susan, susan, sue, suebob, sscott, sscott5, again, ciao, friday, com members, and several email addresses (shown as [email protected])

Enron Identity Test Collections

  • Test collections
    • Enron-all
    • Enron-subset
    • Sager
    • Shapiro

Example Document

Scanned: [page image]

OCR:
Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa
Benffrts Departmext Rieh>pwna, Yfe&ia
Ta: Dishlbutfon Data aday 90,1997.
From: Lisa Fislla
Sabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsU
During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng
artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a
msiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with my
Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .
I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on
whetlne you concur with my reeommendatioa

Metadata:
Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY
Organization Authors: PMUSA, PHILIP MORRIS USA
Person Authors: HALLE, L
Document Date: 19970530
Document Type: MEMO, MEMORANDUM
Bates Number: 2078039376/9377
Page Count: 2
Collection: Philip Morris

State Department Cables

791,857 records – 550,983 of which are full text

Handling Nasty OCR
  • Index pruning
  • Error estimation
  • Character n-grams
  • Duplicate detection
  • Expansion using a cleaner collection
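As an illustration of the character n-gram idea, the following Python sketch compares an OCR-damaged word from the example document ("Newsbttsr") with its clean metadata counterpart ("NEWSLETTER") using the Dice coefficient over padded character trigrams. The function names, the padding character, and the choice of Dice similarity are assumptions made for illustration. The slides do not say which n-gram scheme the track participants used.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '_' at both ends."""
    padded = "_" * (n - 1) + word.lower() + "_" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both sets empty (possible only for empty input with n=1)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

# The OCR-damaged form still shares several trigrams with the clean word,
# whereas an unrelated word from the same title shares none.
ocr_score = dice("NEWSLETTER", "Newsbttsr")
unrelated_score = dice("NEWSLETTER", "STRATEGY")
```

Exact term matching would miss the damaged form entirely, while partial n-gram overlap still gives it a nonzero score.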
How to “Beat Boolean”
  • Work from reference Boolean?
    • Swap out low-ranked-in for high-ranked-out
  • Relax Boolean somehow?
    • Cover density, proximity perturbation, …
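One possible reading of "swap out low-ranked-in for high-ranked-out" is sketched below in Python: start from the reference Boolean result set, then exchange its weakest members for the strongest documents that the Boolean query missed, keeping the set size fixed. The function name `swap_boolean`, the score dictionary, and the stopping rule are all assumptions. The slides do not specify how the swap would be performed.

```python
def swap_boolean(boolean_set, scores, k):
    """Swap up to k low-scoring Boolean hits for high-scoring non-hits.

    boolean_set: document ids retrieved by the reference Boolean query.
    scores: dict of document id -> ranking score from some ranked retrieval run.
    The returned set always has the same size as boolean_set.
    """
    # Boolean hits, weakest first (unscored hits are treated as score 0.0).
    inside = sorted(boolean_set, key=lambda d: scores.get(d, 0.0))
    # Documents outside the Boolean set, strongest first.
    outside = sorted((d for d in scores if d not in boolean_set),
                     key=lambda d: scores[d], reverse=True)
    result = set(boolean_set)
    for weak, strong in list(zip(inside, outside))[:k]:
        if scores[strong] <= scores.get(weak, 0.0):
            break  # later pairs are no better, so no remaining swap helps
        result.remove(weak)
        result.add(strong)
    return result

# Hypothetical scores for illustration.
scores = {"a": 0.9, "b": 0.2, "c": 0.5, "x": 0.8, "y": 0.1}
swapped = swap_boolean({"a", "b", "c"}, scores, k=2)
```

Here only one swap takes place even though k is 2, because the second candidate outside the Boolean set scores lower than the document it would replace.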
Using Metadata
  • Title (term match)
  • Author (social network)
  • Bates number (sequence)
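The sequence information in a Bates number can be made explicit with a small parser. The Python sketch below uses the value from the example document, "2078039376/9377", and assumes that the part after the slash abbreviates the last page by its trailing digits. That reading is an assumption, although it agrees with the stated page count of 2. The function name `expand_bates` is hypothetical.

```python
def expand_bates(bates):
    """Expand 'START/SUFFIX' into (first_page, last_page) as integers.

    Assumes SUFFIX replaces the trailing digits of START. Converting to int
    drops leading zeros, which is acceptable for sequence arithmetic only.
    """
    start, _, suffix = bates.partition("/")
    if not suffix:
        return int(start), int(start)  # single-page document
    end = start[:len(start) - len(suffix)] + suffix
    return int(start), int(end)

first, last = expand_bates("2078039376/9377")
page_count = last - first + 1
```

With ranges in this form, documents whose page ranges are adjacent can be detected by simple integer comparison, which is one way the sequence could be used as a retrieval signal.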
Ad Hoc Task Design
  • Evaluation measures
  • Topic design
    • Negotiation transcript?
  • Inter-annotator agreement
Interactive Track Design
  • Evaluation measure
    • Precision-oriented?
    • Recall-oriented?
    • Effect of assessor disagreement
Relevance Feedback Task
  • Evaluation measure
    • Residual recall at B_Residual?
  • Two-stage feedback?
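The slide does not define "residual recall at B_Residual", so the Python sketch below rests on stated assumptions: the residual collection is what remains after removing documents already judged for feedback, B_Residual is the number of reference Boolean hits left in that residual collection, and the measure is recall over the top B_Residual documents of the residual ranking. All names and data are hypothetical.

```python
def residual_recall_at_b(ranking, relevant, judged, boolean_set):
    """Recall at depth B_Residual, computed on the residual collection.

    ranking: document ids in ranked order from the feedback run.
    relevant: all known relevant document ids.
    judged: ids already judged and used as feedback (removed before scoring).
    boolean_set: ids retrieved by the reference Boolean query.
    """
    residual_ranking = [d for d in ranking if d not in judged]
    residual_relevant = set(relevant) - set(judged)
    b_residual = len(set(boolean_set) - set(judged))
    if not residual_relevant:
        return None  # undefined when no relevant documents remain
    top = residual_ranking[:b_residual]
    return sum(d in residual_relevant for d in top) / len(residual_relevant)

# Hypothetical example: two judged documents are removed before scoring.
value = residual_recall_at_b(
    ranking=["d1", "d2", "d3", "d4", "d5", "d6"],
    relevant={"d1", "d3", "d5", "d6"},
    judged={"d1", "d2"},
    boolean_set={"d2", "d3", "d4"},
)
```

In this example B_Residual is 2, the top two residual documents contain one of the three remaining relevant documents, and the result is one third.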
Some Open Questions
  • Test collection reusability
    • Unbiased estimates? Tight error bars?
  • Why can’t we beat Boolean???
    • Different strategies? Detailed failure analysis?
  • Can we improve topic formulation?
    • Structured relevance feedback?
  • Is OCR masking effects we need to see?
    • Is it time for a new collection?
    • Must it be de-duped? Is metadata needed?
  • Does Δscope invalidate the interactive task?