Evaluation of IR systems



  1. Evaluation of IR systems

  2. Lecture plan • Background • System-centred evaluation • User-centred evaluation

  3. The changing face of evaluation • Originally... • Batch IR systems • Small, textual collections • Queries formulated by searchers • Today... • Interactive IR systems • Large collections of different or mixed media • Queries formulated by end-users

  4. Elements of evaluation • When we evaluate, we need to establish: • Methodology • Criterion • Measure • Tool • Method of data analysis

  5. System-centred evaluation • (Comparative) evaluation of technical performance of IR system(s) • Methodology = non-interactive experiment • Criterion = relevance • Measure = effectiveness • Tool = test collection • Method of data analysis = recall / precision

  6. Relevance • Relevant = “having significant and demonstrable bearing on the matter at hand” • Underlying assumptions: • Objectivity • Topicality • Binary nature • Independence

  7. Effectiveness • Effectiveness = the ability of the IR system to retrieve relevant documents and suppress non-relevant documents

  8. Test collection • Components: • Document collection • Queries / requests • Relevance judgements
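
As a concrete illustration, the three components can be held in simple data structures. This is only a sketch under assumed naming; docs, queries and qrels are convenient labels, not a standard format.

```python
# A minimal, illustrative sketch of the three test-collection components;
# the IDs, texts and field names here are invented for the example.

docs = {
    "d1": "cystic fibrosis and pancreatic function ...",
    "d2": "boolean query formulation in bibliographic retrieval ...",
}

queries = {
    "q1": "effects of cystic fibrosis on the pancreas",
}

# Relevance judgements (often called "qrels"): for each query, the set of
# documents judged relevant. Relevance is binary here, matching the classic
# test-collection assumptions discussed above.
qrels = {
    "q1": {"d1"},
}
```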

  9. Test collection creation • Manual method: • Every document judged against every query by one of several judges • Pooling method: • Queries run against several IR systems first • Results pooled, and only the top-ranked proportion of the pool is judged
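
A sketch of the pooling step, assuming each participating system contributes one ranked list per query; the pool depth (100 by default below, 2 in the toy example) is an arbitrary illustrative choice.

```python
def build_pool(runs, depth=100):
    """Merge the top-`depth` documents from each system's ranking into a
    single pool of documents to be judged.

    runs: list of rankings, each a list of doc IDs ordered by score.
    Returns the set of doc IDs sent to the relevance judges.
    """
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

# Example: three systems' rankings for one query, pool depth 2
pool = build_pool([["d1", "d3", "d2"], ["d3", "d4"], ["d2", "d1"]], depth=2)
# pool == {"d1", "d2", "d3", "d4"}; only these documents are judged
```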

  10. Recall / precision [1] • [Figure: Venn diagram of the document collection showing the relevant documents, the retrieved documents, and their overlap (retrieved and relevant)]

  11. Recall / precision [2] • Recall = proportion of relevant documents that are retrieved, i.e. number of relevant documents retrieved / total number of relevant documents in the collection • Precision = proportion of retrieved documents that are relevant, i.e. number of relevant documents retrieved / total number of documents retrieved
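
The two definitions translate directly into code; a minimal set-based sketch (function and variable names are illustrative):

```python
def recall_precision(retrieved, relevant):
    """Set-based recall and precision for one query.

    retrieved: set of doc IDs returned by the system
    relevant:  set of doc IDs judged relevant for the query
    """
    hits = len(retrieved & relevant)               # relevant documents retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: 2 of the 4 relevant documents retrieved, in a result set of 5
r, p = recall_precision({"d1", "d2", "d3", "d4", "d5"}, {"d1", "d2", "d6", "d7"})
# r == 0.5, p == 0.4
```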

  12. How to use a test collection • For each system / system version • For each query in the test collection • Run query against system to obtain ranking • Use ranking and relevance judgements to calculate recall/precision (r/p) pairs at each recall point • Interpolate to standard recall points if necessary • Average r/p values across all queries in table / graph form • Produce r/p graph for all systems
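
A sketch of the per-query step for a ranked result list: record a recall/precision pair at each rank where a relevant document appears. Interpolation and averaging are sketched after the next two slides. The function name is illustrative and assumes binary relevance judgements.

```python
def rp_pairs(ranking, relevant):
    """Recall/precision pairs observed while moving down one ranking.

    ranking:  list of doc IDs in the order the system returned them
    relevant: set of doc IDs judged relevant for this query
    """
    pairs, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            pairs.append((hits / len(relevant),    # recall so far
                          hits / rank))            # precision at this rank
    return pairs

# Example: the 2 relevant documents appear at ranks 1 and 3
# rp_pairs(["d1", "d9", "d2", "d8"], {"d1", "d2"})
# -> [(0.5, 1.0), (1.0, 0.6666...)]
```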

  13. Interpolation • [Figure: recall/precision graph showing observed values and the interpolated values at the standard recall points]
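
The usual convention is that interpolated precision at a standard recall point is the maximum observed precision at any recall greater than or equal to that point; a minimal sketch continuing the previous example:

```python
def interpolate(pairs, points=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5,
                               0.6, 0.7, 0.8, 0.9, 1.0)):
    """Interpolate observed (recall, precision) pairs to standard recall points.

    Interpolated precision at recall r = max precision observed at any
    recall >= r (0.0 if no such observation exists).
    """
    return {
        r: max((p for rec, p in pairs if rec >= r), default=0.0)
        for r in points
    }

# Using the pairs from the previous sketch:
# interpolate([(0.5, 1.0), (1.0, 2/3)])
# -> {0.0: 1.0, 0.1: 1.0, ..., 0.5: 1.0, 0.6: 0.667, ..., 1.0: 0.667}
```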

  14. Averaging [1] • Precision at each standard recall point, per query and averaged across queries:

      Recall   Query 1   Query 2   Average
      0.1      0.8       0.6       0.7
      0.2      0.8       0.5       0.65
      0.3      0.6       0.4       0.5
      0.4      0.6       0.3       0.45
      0.5      0.4       0.25      0.325
      0.6      0.4       0.2       0.3
      0.7      0.3       0.15      0.225
      0.8      0.3       0.1       0.2
      0.9      0.2       0.05      0.125
      1.0      0.2       0.05      0.125
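
The averaging behind the table is simply the mean of the precision values across queries at each standard recall point, e.g. at recall 0.1: (0.8 + 0.6) / 2 = 0.7. A minimal sketch using the two queries from the table:

```python
def average_rp(per_query, points=(0.1, 0.2, 0.3, 0.4, 0.5,
                                  0.6, 0.7, 0.8, 0.9, 1.0)):
    """Average precision across queries at each standard recall point.

    per_query: list of {recall_point: precision} dicts, one per query
               (as produced by the interpolation sketch above).
    """
    return {r: sum(q[r] for q in per_query) / len(per_query)
            for r in points}

# The two queries from the table above:
q1 = {0.1: 0.8, 0.2: 0.8, 0.3: 0.6, 0.4: 0.6, 0.5: 0.4,
      0.6: 0.4, 0.7: 0.3, 0.8: 0.3, 0.9: 0.2, 1.0: 0.2}
q2 = {0.1: 0.6, 0.2: 0.5, 0.3: 0.4, 0.4: 0.3, 0.5: 0.25,
      0.6: 0.2, 0.7: 0.15, 0.8: 0.1, 0.9: 0.05, 1.0: 0.05}
# average_rp([q1, q2])[0.1] == 0.7
```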

  15. Averaging [2]

  16. Comparison of systems

  17. Examples of test collections [1] • TREC (Text REtrieval Conference) • Started in 1992, run by the National Institute of Standards and Technology (NIST) • Components • Huge document collection (several GB), taken from the Wall Street Journal, Financial Times, etc. • New documents, topics (i.e. requests, including description and narrative fields) and relevance judgements (performed by retired civil servants) each year

  18. Examples of test collections [2] • Participants • Industrial, commercial and academic • Must submit results of retrieval tasks to TREC conference each November • “Tracks” • Ad-hoc + routing (filtering) • Also: interactive, cross-lingual, Web, spoken document, short queries, …

  19. Examples of test collections [3] • CIS • 1239 documents about cystic fibrosis from NLM’s MEDLINE collection • Fields: author, title, source, major and minor subjects, abstracts, references and citations • 100 queries, developed by relevance judges

  20. Examples of test collections [4] • Unusual features: • 4 judges per document per query (3 experts, 1 medical bibliographer) • 3 levels of relevance (0-2) • Combined relevances on scale of 0-8

  21. Examples of test collections [5] • CACM • 3204 articles on computer science from CACM, 1958 - 1979 • Fields: author, date, word stems for titles and abstracts, categories, direct referencing, bibliographic coupling, number of co-citations for each pair of articles • 52 queries, each with 2 Boolean formulations

  22. Examples of test collections [6] • Unusual features: • Citation links to other documents, so often used for hypertext-type experiments

  23. User-centred evaluation • Evaluation of interface / interaction • Methodology = interactive experiment, ethnographic study, ... • Many different criteria, measures, tools and methods of data analysis • No standard user-centred methodology • Elements often borrowed from other areas, e.g. HCI, experimental psychology

  24. User-centred issues: layers model

  25. Test collection • Advantages • Cheap and easy for evaluator • Cross-system comparison possible • Limitations • Static requests / queries • Objective, topical relevance judgements made by domain experts • Does not evaluate interaction

  26. Different document types • Multi-media documents • Images • Topical relevance • Non-topical relevance • Speech • Recognition • Retrieval • Structured collections

  27. Interaction [1] • Data characteristics • Size of documents • Size of collection • System characteristics • Retrieval effectiveness • Functionality • Interface features

  28. Interaction [2] • User • Domain expertise • System expertise • Task • Subjects vs real users • Contextual • Social and environmental factors

  29. Strategy • System characteristics • Type of access (query-based, browsing, mixed) • Functional visibility • Search characteristics • Topic focus • Tactics and search strategy • User characteristics • Mental/cognitive models

  30. Tasks • Real • Simulated • Past real • Fictitious

  31. Learning • System • Dynamic weighting of terms/documents • Case-based retrieval • User modelling • User • Evolving information needs • Learning about domain/collection/system • Sociological view

  32. Measures [1] • From IR • Evaluation of results • Aspectual recall/precision • Pertinence • Utility

  33. Measures [2] • From information science/HCI • Evaluation of results • Task performance • Evaluation of process • Quantitative: time, number of errors • Qualitative: usability • Evaluation of overall quality of experience • User satisfaction

  34. Tools [1] • From information science/HCI • Before the session • Cognitive walkthroughs • Interviews/questionnaires • During the session • Observation • Think aloud protocols

  35. Tools [2] • After the session • Interviews/questionnaires • Focus groups

  36. Large-scale experiments • Interactive TREC • OKAPI

  37. User-centred evaluation [1] • What is to be evaluated? • e.g. IR system using new underlying model • Why do we want to evaluate? • e.g. functionality, usability • How will we evaluate? • e.g. effectiveness, efficiency, satisfaction

  38. User-centred evaluation [2] • Example evaluation measures:

      Criterion       Functionality       Usability
      Effectiveness   recall/precision    quality of solution
      Efficiency      retrieval time      task completion time
      Satisfaction    preference          confidence

  39. Experimental design process • Formulate research hypothesis • Formulate experimental hypotheses • Design experiment(s) • Conduct pilot test and experiment(s) • Analyse data • Evaluate experimental hypotheses

  40. Simple experimental design [1] • Controlled experiment in laboratory setting • One group of participants • Each participant performs one or more tasks • Pre-defined tasks vs “real” tasks

  41. Simple experimental design [2] • Example data gathered at task stages: • Stage 1: Formulate information need • Stage 2: Gather information • Task completion time • Information-seeking behaviour • Use of observation, recording, think-aloud protocols

  42. Simple experimental design [3] • Example data (continued): • Stage 3: Use information • Confidence • Use of questionnaires, interviews using Likert scales / semantic differentials • Stage 4: Assess information • Quality of solution • Independent assessment of task output

  43. Simple experimental design [4] • Analysis: • Mostly qualitative, with summary statistics • Common-sense interpretation of results • Use of pre-defined benchmarks

  44. Complex experimental design [1] • Other controlled experiments: • Within-subject, e.g. longitudinal study • Between-subject • Comparative study looking at effect of: • System type, e.g. variations in algorithm used • Task type • User characteristics, e.g. domain knowledge, general computer literacy, system knowledge • Comparison with control group

  45. Complex experimental design [2] • Other controlled experiments (continued): • Mixed within-subject / between-subject • Examine effect of interaction of variables • Analysis: • Quantitative: • Summary statistics • Significance testing • Qualitative
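
For the quantitative analysis, per-query effectiveness scores from two systems measured on the same queries can be compared with a paired significance test. A sketch using SciPy; the score lists are invented for illustration, and the choice of test (paired t-test vs. Wilcoxon signed-rank) depends on the assumptions one is willing to make about the score distribution.

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-query average precision for two systems,
# measured on the same set of queries (hence paired tests).
system_a = [0.42, 0.31, 0.55, 0.28, 0.61, 0.47, 0.39, 0.50]
system_b = [0.38, 0.29, 0.49, 0.30, 0.52, 0.41, 0.35, 0.44]

t_stat, t_p = ttest_rel(system_a, system_b)    # paired t-test
w_stat, w_p = wilcoxon(system_a, system_b)     # Wilcoxon signed-rank test

print(f"paired t-test:  p = {t_p:.3f}")
print(f"Wilcoxon test:  p = {w_p:.3f}")
```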

  46. Complex experimental design [3] • Operational / ethno-methodological experiments • Evaluation in a “semi-real” or “real” setting of the “acceptability” of the system • Analysis • Mostly qualitative

  47. Complex experimental design [4] • Case studies • Detailed evaluation using a single or small number of participant(s) • Possible to examine cognitive and affective issues • Analysis • Mostly qualitative

  48. Summary • System-centred evaluation • Uses test collection methodology, with recall and precision • Good for evaluating technical performance • User-centred evaluation • No standard methodology • Good for evaluating interface / interaction • Usually necessary to use a combination
