
Lessons Learned from Information Retrieval



Presentation Transcript


  1. Lessons Learned from Information Retrieval Chris Buckley Sabir Research chrisb@sabir.com

  2. Legal E-Discovery • Important, growing problem • Current solutions not fully understood by people using them • Imperative to find better solutions that scale • Evaluation required • How do we know we are doing better? • Can we prove a level of performance? Chris Buckley – ICAIL 07

  3. Lack of Shared Context • The basic problem of both search and e-discovery • Searcher does not necessarily know beforehand “vocabulary” or background of either author or intended audience of documents to be searched Chris Buckley – ICAIL 07

  4. Relevance Feedback • Human judges some documents as relevant, system finds others based on judgements • Only general technique to improve system knowledge of context proven successful • Works from small collections of the 1970s to large collections of the present (TREC HARD track) • Difficult to apply to discovery • Need to change entire discovery process Chris Buckley – ICAIL 07
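
Relevance feedback of the kind described on this slide is classically formulated as Rocchio-style query modification. The sketch below is illustrative only, not the speaker's system; it assumes queries and documents are simple term-weight dictionaries and uses the conventional alpha/beta/gamma mixing parameters.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Expand a term-weight query vector using judged documents (Rocchio sketch)."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:                  # pull the query toward relevant docs
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in nonrelevant_docs:               # push it away from non-relevant docs
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms driven to zero or below are conventionally dropped.
    return {t: w for t, w in new_query.items() if w > 0}
```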

  5. Toolbox of other techniques • Many other aids to search • Ontologies, linguistic analysis, semantic analysis, data mining, term relationships • Good techniques for IR uniformly: • Give big wins for some searches • Give mild losses for others • Need a set of techniques, a toolbox • In practice for IR research, the issue is not finding big wins, but avoiding the losses Chris Buckley – ICAIL 07

  6. Implications of toolbox • No expected silver bullet AI solution • Boolean search will not expand to accommodate combinations of solutions • Test collections are critical Chris Buckley – ICAIL 07

  7. Test Collection Importance • Needed to develop tools • Needed to develop decision procedures for when to use tools • Toolbox requirement means we need to distinguish a good overall system from one with a single good tool • All systems are able to show searches on which individual tools work well • A good system shows a performance gain on the entire set of searches. Chris Buckley – ICAIL 07

  8. Test Collection Composition • Large set of realistic documents • Set (at least 30) of topics or information needs • Set of judgements: what documents are responsive (or non-responsive) to each topic • Judgements are expensive and limit how test collection results can be interpreted Chris Buckley – ICAIL 07
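
For concreteness, TREC distributes the judgements of a test collection as a "qrels" file, one whitespace-separated line per (topic, document) pair: topic id, an unused iteration field, document id, and a relevance value. A minimal loader, assuming that format:

```python
from collections import defaultdict

def load_qrels(path):
    """Read TREC-style qrels into {topic_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:                  # topic, iteration, doc, relevance
                topic, _iteration, doc, rel = parts
                qrels[topic][doc] = int(rel)     # rel > 0 means responsive/relevant
    return qrels
```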

  9. Incomplete Judgements • Judgements are too time-consuming and expensive to be complete (judge every one) • Pool retrieved documents from a variety of systems • Feasible, but: • Known incomplete • We can’t even accurately estimate how incomplete Chris Buckley – ICAIL 07
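
A rough sketch of the pooling idea above, assuming each participating system supplies a per-topic ranked list of document ids and only the pooled documents are sent to the assessors:

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each system's ranking, per topic.

    `runs` maps a system name to {topic_id: ranked list of doc_ids}.
    Documents outside the pool are never judged, which is why the
    resulting judgements are known to be incomplete.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool
```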

  10. Inexact Judgements • Humans differ substantially on judgements • Standard TREC collections: • Topics include 1-3 paragraphs describing what makes a document relevant • Given the same pool of documents, two humans overlap on 70% of their relevant sets • 76% agreement on small TREC legal test Chris Buckley – ICAIL 07
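
One common way to quantify the assessor overlap quoted here is the size of the intersection of the two relevant sets divided by the size of their union; the sketch below assumes that definition (the exact measure used in the TREC studies may differ):

```python
def assessor_overlap(rel_a, rel_b):
    """Overlap of two assessors' relevant sets: |A ∩ B| / |A ∪ B|."""
    if not rel_a and not rel_b:
        return 1.0                      # trivially identical (both empty)
    return len(rel_a & rel_b) / len(rel_a | rel_b)
```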

  11. Implications of Judgements • No gold standard of perfect performance is even possible • Any system claiming better than 70% precision at 70% recall is working on a problem other than general search • Almost impossible to get useful absolute measures of performance Chris Buckley – ICAIL 07
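
The "precision at 70% recall" figure can be read off a ranked list against a set of judged-relevant documents: walk down the ranking until the target recall is reached and report precision at that rank. A minimal sketch of that calculation:

```python
def precision_at_recall(ranked_docs, relevant, target_recall=0.7):
    """Precision at the earliest rank where recall reaches `target_recall`."""
    if not relevant:
        return 0.0
    found = 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            found += 1
        if found / len(relevant) >= target_recall:
            return found / rank
    return 0.0                          # target recall never reached
```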

  12. Comparative Evaluation • Comparisons between systems on moderate size collections (several GBytes) are solid. • Comparative results on larger collections (500 GBytes) are showing strains • Believable but larger error margin • Active area of research • Overall goal for e-discovery has to be comparative evaluation Chris Buckley – ICAIL 07
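
Comparative evaluation rests on paired, per-topic comparisons between systems rather than absolute scores. A sketch under the assumption that per-topic effectiveness scores (e.g., average precision) have already been computed for each run:

```python
def compare_runs(scores_a, scores_b):
    """Paired per-topic comparison of two runs.

    `scores_a` and `scores_b` map topic_id -> effectiveness score.
    The per-topic wins/losses and the mean difference are what a
    comparative claim ("system A beats system B") is based on.
    """
    topics = sorted(set(scores_a) & set(scores_b))
    diffs = [scores_a[t] - scores_b[t] for t in topics]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    mean_diff = sum(diffs) / len(diffs) if diffs else 0.0
    return {"topics": len(topics), "wins": wins,
            "losses": losses, "mean_diff": mean_diff}
```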

  13. Sabir TREC Legal Results • Submitted 7 runs • Very basic approach (1995 technology) • 3 tools from my toolbox • 3 query variations • One of the top systems • All results basically the same • tools did not help on average Chris Buckley – ICAIL 07
