
Is Relevance Associated with Successful Use of Information Retrieval Systems?


Presentation Transcript


  1. Is Relevance Associated with Successful Use of Information Retrieval Systems? William Hersh Professor and Head Division of Medical Informatics & Outcomes Research Oregon Health & Science University hersh@ohsu.edu

  2. Goal of talk • Answer question of association of relevance-based evaluation measures with successful use of information retrieval (IR) systems • By describing two sets of experiments in different subject domains • Since focus of talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies

  3. For more information on these studies… • Hersh W et al., Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations, Information Processing & Management, 2001, 37: 383-402. • Hersh W et al., Further analysis of whether batch and user evaluations give the same results with a question-answering task, Proceedings of TREC-9, Gaithersburg, MD, 2000, 407-416. • Hersh W et al., Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions, Journal of the American Medical Informatics Association, 2002, 9: 283-293.

  4. Outline of talk • Information retrieval system evaluation • Text REtrieval Conference (TREC) • Medical IR • Methods and results of experiments • TREC Interactive Track • Medical searching • Implications

  5. Information retrieval system evaluation

  6. Evaluation of IR systems • Important not only to researchers but also users so we can • Understand how to build better systems • Determine better ways to teach those who use them • Cut through hype of those promoting them • There are a number of classifications of evaluation, each with a different focus

  7. Lancaster and Warner (Information Retrieval Today, 1993) • Effectiveness • e.g., cost, time, quality • Cost-effectiveness • e.g., per relevant citation, new citation, document • Cost-benefit • e.g., per benefit to user

  8. Hersh and Hickam (JAMA, 1998) • Was system used? • What was it used for? • Were users satisfied? • How well was system used? • Why did system not perform well? • Did system have an impact?

  9. Most research has focused on relevance-based measures • Measure quantities of relevant documents retrieved • Most common measures of IR evaluation in published research • Assumptions commonly applied in experimental settings • Documents are relevant or not to user information need • Relevance is fixed across individuals and time

  10. Recall and precision defined • Recall = number of relevant documents retrieved ÷ total number of relevant documents in the collection • Precision = number of relevant documents retrieved ÷ total number of documents retrieved
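These two measures can be computed directly from the retrieved set and the relevance judgments. The following is a minimal Python sketch; the function name and example document IDs are illustrative, not taken from the studies described here.

```python
def recall_precision(retrieved, relevant):
    """Compute recall and precision for a single search.

    retrieved: set of document IDs returned by the system
    relevant:  set of document IDs judged relevant to the information need
    """
    relevant_retrieved = retrieved & relevant
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: 10 documents retrieved, 4 of them relevant, 8 relevant overall
r, p = recall_precision(
    retrieved={f"d{i}" for i in range(10)},
    relevant={"d0", "d1", "d2", "d3", "d20", "d21", "d22", "d23"},
)
print(f"recall={r:.2f}, precision={p:.2f}")  # recall=0.50, precision=0.40
```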

  11. Some issues with relevance-based measures • Some IR systems return retrieval sets of vastly different sizes, which can be problematic for “point” measures • Sometimes it is unclear what a “retrieved document” is • Surrogate vs. actual document • Users often perform multiple searches on a topic, with changing needs over time • There are differing definitions of what is a “relevant document”

  12. What is a relevant document? • Relevance is intuitive yet hard to define (Saracevic, various) • Relevance is not necessarily fixed • Changes across people and time • Two broad views • Topical – document is on topic • Situational – document is useful to user in specific situation (aka, psychological relevance, Harter, JASIS, 1992)

  13. Other limitations of recall and precision • Magnitude of a "clinically significant" difference unknown • Serendipity – sometimes we learn from information not relevant to the need at hand • External validity of results – many experiments test using "batch" mode without real users; it is not clear that results translate to real searchers

  14. Alternatives to recall and precision • “Task-oriented” approaches that measure how well user performs information task with system • “Outcomes” approaches that determine whether system leads to better outcome or a surrogate for outcome • Qualitative approaches to assessing user’s cognitive state as they interact with system

  15. Text Retrieval Conference (TREC) • Organized by National Institute of Standards and Technology (NIST) • Annual cycle consisting of • Distribution of test collections and queries to participants • Determination of relevance judgments and results • Annual conference for participants at NIST (each fall) • TREC-1 began in 1992 and has continued annually • Web site: trec.nist.gov

  16. TREC goals • Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments • Provide forum for academic and industrial researchers to share results and experiences

  17. Organization of TREC • Began with two major tasks • Ad hoc retrieval – standard searching • Discontinued with TREC 2001 • Routing – identify new documents with queries developed for known relevant ones • In some ways, a variant of relevance feedback • Discontinued with TREC-7 • Has evolved to a number of tracks • Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.

  18. What has been learned in TREC? • Approaches that improve performance • e.g., passage retrieval, query expansion, 2-Poisson weighting • Approaches that may not improve performance • e.g., natural language processing, stop words, stemming • Do these kinds of experiments really matter? • Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc. • Results that question their findings from Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.

  19. The TREC Interactive Track • Developed out of interest in how real users might search using TREC queries, documents, etc. • TREC 6-8 (1997-1999) used instance recall task • TREC 9 (2000) and subsequent years used question-answering task • Now being folded into Web track

  20. TREC-8 Interactive Track • Task for searcher: retrieve instances of a topic in a query • Performance measured by instance recall • Proportion of all instances retrieved by user • Differs from document recall in that multiple documents on same topic count as one instance • Used • Financial Times collection (1991-1994) • Queries derived from ad hoc collection • Six 20-minute topics for each user • Balanced design: “experimental” vs. “control”
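To make the instance recall measure concrete, here is a minimal Python sketch. The per-document instance mapping, document IDs, and instance labels are hypothetical stand-ins for the assessors' judgments, not data from the track.

```python
def instance_recall(saved_docs, instances_per_doc, all_instances):
    """Proportion of a topic's instances covered by the documents a searcher saved.

    saved_docs:        document IDs the searcher saved for the topic
    instances_per_doc: mapping of document ID -> set of instance labels it covers
    all_instances:     set of all instances identified by the assessors
    """
    covered = set()
    for doc in saved_docs:
        covered |= instances_per_doc.get(doc, set())
    # Several documents reporting the same instance count only once
    return len(covered & all_instances) / len(all_instances)

# Hypothetical topic with 5 assessor-identified instances; searcher saves 3 documents
per_doc = {"FT1": {"i1", "i2"}, "FT2": {"i2"}, "FT3": {"i4"}}
print(instance_recall({"FT1", "FT2", "FT3"}, per_doc,
                      {"i1", "i2", "i3", "i4", "i5"}))  # 0.6
```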

  21. TREC-8 sample topic • Title • Hubble Telescope Achievements • Description • Identify positive accomplishments of the Hubble telescope since it was launched in 1991 • Instances • In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can

  22. TREC-9 Interactive Track • Same general experimental design with • A new task • Question-answering • A new collection • Newswire from TREC disks 1-5 • New topics • Eight questions

  23. Issues in medical IR • Searching priorities vary by setting • In busy clinical environment, users usually want quick, short answer • Outside clinical environment, users may be willing to explore in more detail • As in other scientific fields, researchers likely to want more exhaustive information • Clinical searching task has many similarities to Interactive Track design, so methods are comparable

  24. Some results of medical IR evaluations (Hersh, 2003) • In large bibliographic databases (e.g., MEDLINE), recall and precision comparable to those seen in other domains (e.g., 50%-50%, minimal overlap across searchers) • Bibliographic databases not amenable to busy clinical setting, i.e., not used often, information retrieved not preferred • Biggest challenges now in digital library realm, i.e., interoperability of disparate resources

  25. Methods and results Research question: Is relevance associated with successful use of information retrieval systems?

  26. TREC Interactive Track and our research question • Do the results of batch IR studies correspond to those obtained with real users? • i.e., do term weighting approaches that work better in batch studies also do better for real users? • Methodology • Identify a prior test collection that shows a large batch performance differential over some baseline • Use the Interactive Track to see if this difference is maintained with interactive searching and a new collection • Verify that previous batch difference is maintained with new collection

  27. TREC-8 experiments • Determine the best-performing measure • Use instance recall data from previous years as batch test collection with relevance defined as documents containing >1 instance • Perform user experiments • TREC-8 Interactive Track protocol • Verify optimal measure holds • Use TREC-8 instance recall data as batch test collection similar to first experiment

  28. IR system used for our TREC-8 (and 9) experiments • MG • Public domain IR research system • Described in Witten et al., Managing Gigabytes, 1999 • Experimental version implements all “modern” weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions, cf. Zobel and Moffat, SIGIR Forum, 1998 • Simple Web-based front end
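For readers unfamiliar with the two weighting schemes compared in these experiments, the sketch below contrasts a generic TFIDF term weight with an Okapi BM25-style weight. These are textbook formulations with conventional default parameters, not MG's exact Q-expression implementations (for those, see Zobel and Moffat).

```python
import math

def tfidf_weight(tf, df, N):
    """One common TF*IDF variant: raw term frequency times log inverse document frequency."""
    return tf * math.log(N / df)

def bm25_weight(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25-style term weight with saturation and document-length normalization."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term occurring 3 times in a 500-word document, appearing in 100 of 1,000,000 documents
print(tfidf_weight(3, 100, 1_000_000))
print(bm25_weight(3, 100, 1_000_000, doc_len=500, avg_doc_len=400))
```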

  29. Experiment 1 – Determine best “batch” performance Okapi term weighting performs much better than TFIDF.

  30. Experiment 2 – Did benefit occur with interactive task? • Methods • Two user populations • Professional librarians and graduate students • Using a simple natural language interface • MG system with Web front end • With two different term weighting schemes • TFIDF (baseline) vs. Okapi

  31. User interface

  32. Results showed benefit for better batch system (Okapi) +18%, BUT...

  33. All differences were due to one query

  34. Experiment 3 – Did batch results hold with TREC-8 data? Yes, but still with high variance and without statistical significance.

  35. TREC-9 Interactive Track experiments • Similar to approach used in TREC-8 • Determine the best-performing weighting measure • Use all previous TREC data, since no baseline • Perform user experiments • Follow protocol of track • Use MG • Verify optimal measure holds • Use TREC-9 relevance data as batch test collection, analogous to the first experiment

  36. Determine best “batch” performance Okapi+PN term weighting performs better than TFIDF.

  37. Interactive experiments – comparing systems Little difference across systems but note wide differences across questions.

  38. Do batch results hold with new data? Batch results show improved performance whereas user results do not.

  39. Further analysis (Turpin, SIGIR 2001) • Okapi searches definitely retrieve more relevant documents • Okapi+PN user searches have 62% better MAP • Okapi+PN user searches have 101% better Precision@5 documents • But • Users do 26% more cycles with TFIDF • Users get overall same results per experiments
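MAP (mean average precision) and Precision@5 are standard ranked-retrieval measures. A minimal Python sketch of both follows; the single-query ranking and relevance set are invented for illustration, and MAP is simply average precision taken over all queries.

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one query: mean precision at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked_docs, relevant, k=5):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

# Hypothetical ranking for one query; MAP would average this over all queries
ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]
rel = {"d3", "d1", "d2"}
print(average_precision(ranking, rel))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
print(precision_at_k(ranking, rel, 5))  # 3/5 = 0.6
```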

  40. Possible explanations for our TREC Interactive Track results • Batch searching results may not generalize • User data show wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures • Or we cannot detect that they do • Increase task, query, or system diversity • Increase statistical power

  41. Medical IR study design • Orientation to experiment and system • Brief training in searching and evidence-based medicine (EBM) • Collect data on factors of users • Subjects given questions and asked to search to find and justify answer • Statistical analysis to find associations among user factors and successful searching

  42. System used – OvidWeb MEDLINE

  43. Experimental design • Recruited • 45 senior medical students • 21 second (final) year nurse practitioner (NP) students • Large-group session • Demographic/experience questionnaire • Orientation to experiment, OvidWeb • Overview of basic MEDLINE and EBM skills

  44. Experimental design (cont.) • Searching sessions • Two hands-on sessions in library • For each of three questions, randomly selected from 20, measured: • Pre-search answer with certainty • Searching and answering with justification and certainty • Logging of system-user interactions • User interface questionnaire (QUIS)

  45. Searching questions • Derived from two sources • Medical Knowledge Self-Assessment Program (Internal Medicine board review) • Clinical questions collection of Paul Gorman • Worded to have answer of either • Yes with good evidence • Indeterminate evidence • No with good evidence • Answers graded by expert clinicians

  46. Assessment of recall and precision • Aimed to perform a “typical” recall and precision study and determine if they were associated with successful searching • Designated “end queries” to have terminal set for analysis • Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant • Also measured reliability of raters
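Reliability among several raters assigning items to multiple categories is often summarized with a chance-corrected agreement statistic. The sketch below uses Fleiss' kappa as one such statistic; it is not necessarily the statistic the study reported, and the example judgment counts are invented.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    ratings[i][j] = number of raters who put item i in category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Proportion of all assignments falling into each category
    p_j = [sum(item[j] for item in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item, averaged over items
    P_i = [(sum(c * c for c in item) - n_raters) / (n_raters * (n_raters - 1))
           for item in ratings]
    P_bar = sum(P_i) / n_items
    # Expected agreement by chance
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 3 physicians per record; categories: [definitely, possibly, not relevant]
judgments = [[3, 0, 0], [2, 1, 0], [0, 1, 2], [1, 1, 1], [0, 0, 3]]
print(round(fleiss_kappa(judgments), 3))  # ~0.271 for this invented example
```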

  47. Overall results • Prior to searching, rate of correctness (32.1%) about equal to chance for both groups • Rating of certainty low for both groups • With searching, medical students increased rate of correctness to 51.6% but NP students remained virtually unchanged at 34.7%

  48. Overall results Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.

  49. Recall and precision Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.

  50. Conclusions from results • Medical students improved ability to answer questions with searching, NP students did not • Spatial visualization ability may explain the difference • Answering questions required >30 minutes whether correct or incorrect • This content not amenable to clinical setting • Recall and precision had no relation to successful searching
