Planning for the TREC 2008 Legal Track. Douglas Oard, Stephen Tomlinson, Jason Baron
Thursday’s Discussion • Deciding on a document collection • “Beating Boolean” • Handling nasty OCR • Making the best use of the metadata • Ad hoc task design • Interactive task design • Relevance feedback task design
Choosing a Collection • FERC Enron (w/attachments, full headers) • Email is high-interest for E-discovery practice! • IIT CDIP version 1.0 (same as 2006/07) • Same 83 topics, plus some new ones • State Department Cables • Task: Freedom of Information Act requests
Plans for 2008 • Some things stay the same: • Same collection • Same three tasks (Ad Hoc, RF, Interactive) • Some new things • Deep assessment (fewer new topics) • Additional ranking-sensitive eval measures
Handling Nasty OCR • Index pruning • Error estimation • Character n-grams • Duplicate detection • Expansion using a cleaner collection
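Character n-grams are one of the listed remedies because they degrade gracefully under OCR noise: a single misrecognized character corrupts only a few n-grams, so a clean query term still overlaps heavily with its corrupted form. A minimal sketch (function names are illustrative, not from the track):

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, with '#' boundary padding."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# A classic OCR error merges 'rn' into 'm', but most trigrams survive:
sim = ngram_similarity("government", "govemment")
```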
How to “Beat Boolean” • Work from the reference Boolean result set? • Swap low-ranked documents inside the Boolean set for highly ranked documents outside it • Relax the Boolean constraints somehow? • Cover density, proximity perturbation, …
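Cover density, mentioned above, is one way to relax a strict Boolean AND: instead of excluding documents that fail the conjunction, rank them by the shortest token window (the "cover") containing all query terms. A rough sketch with illustrative names, not the track's actual scoring:

```python
from itertools import product

def min_cover(positions_by_term):
    """Length of the smallest token window containing at least one
    occurrence of every query term, or None if any term is absent.
    Brute force over occurrence combinations -- fine for a sketch."""
    if not positions_by_term or any(not p for p in positions_by_term.values()):
        return None
    return min(max(combo) - min(combo) + 1
               for combo in product(*positions_by_term.values()))

def cover_score(positions_by_term):
    """Shorter covers score higher; a document missing a term scores 0
    (strict Boolean AND would have excluded it outright)."""
    span = min_cover(positions_by_term)
    return 0.0 if span is None else 1.0 / span

# 'fraud' at tokens 3 and 40, 'memo' at token 5: best cover is tokens 3..5
score = cover_score({"fraud": [3, 40], "memo": [5]})
```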
Using Metadata • Title (term match) • Author (social network) • Bates number (sequence)
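Bates numbers encode production sequence, so sorting on the numeric part, rather than on the raw string, recovers document order. A small illustrative sketch; the prefix-dash-digits stamp format is an assumption:

```python
import re

def bates_key(bates):
    """Split a Bates stamp like 'EDRM-000123' into (prefix, int sequence)
    so that numeric order wins over lexicographic order."""
    m = re.match(r"([A-Za-z]*)-?(\d+)$", bates)
    if not m:
        raise ValueError(f"unrecognized Bates number: {bates!r}")
    return (m.group(1), int(m.group(2)))

# Lexicographic sort would put 'EDRM-10' before 'EDRM-2'; this does not:
ordered = sorted(["EDRM-10", "EDRM-2"], key=bates_key)
```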
Ad Hoc Task Design • Evaluation measures • R@B?, P@R?, Index size? • Error bars / Statistical significance testing • Limits on post-hoc use of the collection? • What are “meaningful” differences? • Topic design • Negotiation transcript? • Inter-annotator agreement
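One reading of the R@B measure above: recall at depth B, where B is the hit count of the reference Boolean query for the topic. It falls directly out of a ranked list and the relevance judgments; a minimal sketch with illustrative names:

```python
def recall_at_b(ranked, relevant, b):
    """Fraction of all relevant documents found in the top B of the
    ranking, where B is the reference Boolean query's result-set size."""
    if not relevant:
        return 0.0
    return len(set(ranked[:b]) & set(relevant)) / len(relevant)

# 3 relevant docs, 2 of them in the top B=3 of the system's ranking:
score = recall_at_b(["d1", "d2", "d3", "d4"], {"d1", "d3", "d4"}, b=3)
```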
Interactive Task Design • Evaluation measure • Precision-oriented? • Recall-oriented? • Effect of assessor disagreement
Relevance Feedback Task • Evaluation measure • Residual recall at B_Residual? • Two-stage feedback?
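Residual-collection evaluation is the standard way to score relevance feedback without rewarding resubmission of already-judged documents: strip the feedback-round judgments from both the new ranking and the relevant set, then evaluate what remains. A sketch under that reading of "residual recall" (names are illustrative):

```python
def residual_ranking(ranked, judged):
    """Drop documents already shown to the assessor in the feedback round."""
    return [d for d in ranked if d not in judged]

def residual_recall_at(ranked, judged, relevant, depth):
    """Recall over the residual collection: only unjudged relevant
    documents count, and only the residual ranking is scored."""
    residual_rel = set(relevant) - set(judged)
    if not residual_rel:
        return 0.0
    top = set(residual_ranking(ranked, judged)[:depth])
    return len(top & residual_rel) / len(residual_rel)

# 'a' was judged in round one, so only 'c' and 'd' can still earn recall:
score = residual_recall_at(["a", "b", "c", "d"], {"a"}, {"a", "c", "d"}, depth=2)
```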