An Investigation of Evaluation Metrics for Analytic Question Answering



  1. An Investigation of Evaluation Metrics for Analytic Question Answering ARDA Metrics Challenge: PI Meeting Emile Morse October 7, 2004

  2. Outline • Motivation & Goals • Hypothesis-driven development of metrics • Design – collection, subjects, scenarios • Data Collection & Results • Summary and issues

  3. Motivation • Much progress in IR has been attributed to community evaluations using metrics of precision and recall with common tasks and common data. • There is no corresponding set of evaluation criteria for the interaction of users with these systems. • While system performance is crucial, the utility of the system to the user is equally critical. • The lack of such evaluation criteria prevents comparing systems on the basis of utility. Acquisition of new systems is therefore based on system performance alone, which frequently does NOT reflect how systems will work in the user's actual process.

  4. Goals of Workshop • To develop metrics for process and products that will reflect the interaction of users and information systems. • To develop the metrics based on: • Cognitive task analyses of intelligence analysts • Previous experience in AQUAINT and NIMD evaluations • Expert consultation • To deliver an evaluation package consisting of: • Process and product metrics • An evaluation methodology • A data set to use in the evaluation

  5. Hypothesis-driven development of metrics • Hypotheses – QA systems should … • Candidate metrics – What could we measure that would provide evidence to support or refute each hypothesis? • Collection methods: questionnaires, mood meter, system logs, report evaluation method, system surveillance tool, cognitive workload instrument, … • Measures – the implementation of a metric; depends on the specific collection method
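
The hypothesis-to-metric-to-measure chain can be pictured as a small data structure. The sketch below is illustrative only, using entries drawn from later slides; the Python representation and field names are assumptions, not part of the workshop deliverable.

```python
# Illustrative sketch: hypothesis -> candidate metrics -> collection methods.
# The entries come from the examples on later slides; the structure itself is
# an assumption about how an evaluation package might encode this chain.
HYPOTHESES = {
    "H1": {
        "statement": "Support gathering the same type of information "
                     "with a lower cognitive workload",
        "candidate_metrics": {
            "# queries/questions": ["system logs"],
            "cognitive workload": ["cognitive workload instrument"],
        },
    },
    "H7": {
        "statement": "Enable analysts to collect more data in less time",
        "candidate_metrics": {
            "growth of shoebox over time": ["system surveillance tool", "system logs"],
            "subjective assessment": ["post-session questionnaire"],
        },
    },
}

# A measure is then a concrete implementation of one (metric, collection method) pair.
```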

  6. Hypotheses

  7. Examples of candidate metrics • H1: Support gathering the same type of information with a lower cognitive workload • # queries/questions • % interactions where the analyst takes the initiative • Number of non-content interactions with the system (clarifications) • Cognitive workload measurement • H7: Enable analysts to collect more data in less time • Growth of shoebox over time • Subjective assessment • H12: Provide context and continuity for the user – coherence of dialogue • Similarity between queries – calculate shifts in dialog trails • Redundancy of documents – count how often a snippet is returned more than once • Subjective assessment
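
As a concrete illustration of the H12 candidate metrics, the sketch below computes dialog-trail shifts as term overlap between consecutive queries and snippet redundancy as the fraction of snippets returned more than once. The Jaccard-style similarity and the input format are assumptions for illustration, not the measures actually implemented in the workshop.

```python
# Hedged sketch of two H12 candidate metrics; the Jaccard-style similarity
# and the input format are assumptions, not the workshop's implementation.
from collections import Counter


def query_shift(prev_query: str, query: str) -> float:
    """1 - term overlap between consecutive queries; higher = larger topic shift."""
    a, b = set(prev_query.lower().split()), set(query.lower().split())
    return (1.0 - len(a & b) / len(a | b)) if (a | b) else 0.0


def dialog_trail_shifts(queries: list[str]) -> list[float]:
    """Shift score for each consecutive query pair in a session."""
    return [query_shift(q1, q2) for q1, q2 in zip(queries, queries[1:])]


def snippet_redundancy(snippets: list[str]) -> float:
    """Fraction of delivered snippets that appeared more than once."""
    counts = Counter(snippets)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(snippets) if snippets else 0.0
```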

  8. Top-level Design of the Workshop • Systems • Domain • Scenarios • Collection • Subjects • On-site team • Block design and on-site plan

  9. Design – Systems • HITIQA – Tomek Strzalkowski • Ferret – Sanda Harabagiu • GINKO – Stefano Bertolo • GNIST

  10. Design – Scenarios

  11. Design – Document Collection

  12. Design – Subjects • 8 reservists (7 Navy; 1 Army) • Age: 30-54 yrs (M=40.8) • Educational background: • 1 PhD; 4 Masters; 2 Bachelors; 1 HS • Military service: 2.5-31 yrs (M=18.3) • Analysis Experience: 0-23 yrs (M=10.8)

  13. Design – On-Site Team

  14. 2-day Block Schedule

  15. Data Collection Instruments • Questionnaires • Post-scenario (SCE) • Post-session (SES) • Post-system (SYS) • Cross-evaluation of reports • Cognitive workload • Glass Box • System Logs • Mood indicator • Status reports • Debriefing • Observer notes • Scenario difficulty assessment

  16. Questionnaires • Coverage for 14/15 hypotheses • Other question types: • All SCE questions relate to scenario content • 3 SYS questions on Readiness • 3 SYS questions on Training

  17. Questions for Hypothesis 7 Enable analysts to collect more data in less time • SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more] • SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot] • SYS Q23: Having the system at work would help me find information faster than I can currently find it. • SYS Q6: The system slows down my process of finding information.

  18. Additional Analysis for Questionnaire Data – Factor Analysis • Four factors emerged • Factor 1: most questions • Factor 2: time, navigation, training • Factor 3: novel information, new way of searching • Factor 4: skill in using the system improved • These factors distinguished the four systems: each system stood out most from the others (positively or negatively) on one factor • GNIST was associated with Factor 2: positive for navigation and training; negative for time.
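
A rough sketch of how such a factor analysis could be run on the questionnaire matrix (sessions × Likert items) follows; the synthetic data, the four-factor choice taken from the slide, and the use of scikit-learn's varimax-rotated FactorAnalysis are all assumptions rather than the analysis actually performed.

```python
# Hedged sketch: a 4-factor analysis of a sessions-x-questions Likert matrix.
# Synthetic data stands in for the real questionnaire responses.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(32, 25)).astype(float)  # 32 sessions x 25 items

fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
factor_scores = fa.fit_transform(responses)   # per-session scores on the 4 factors
loadings = fa.components_.T                   # item-by-factor loadings

# Items loading strongly on Factor 2 (e.g. time, navigation, training on the slide)
factor2_items = np.flatnonzero(np.abs(loadings[:, 1]) > 0.4)
```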

  19. Cross Evaluation Criteria Subjects rated the reports (including their own) on seven characteristics • Covers the important ground • Avoids the irrelevant materials • Avoids redundant information • Includes selective information • Is well organized • Reads clearly and easily • Overall rating
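
One way the cross-evaluation ratings could be aggregated is sketched below: a mean score per report on each criterion, plus a self-versus-peer comparison, since authors also rated their own reports. The column names and the 1-7 scale are assumptions about how the ratings were recorded.

```python
# Hedged sketch of aggregating cross-evaluation ratings; column names and the
# 1-7 scale are assumptions, not the workshop's actual data format.
import pandas as pd

ratings = pd.DataFrame(
    [
        ("S1", "S2", "well_organized", 6),
        ("S1", "S1", "well_organized", 7),
        ("S2", "S1", "reads_clearly", 5),
    ],
    columns=["rater", "author", "criterion", "score"],
)

# Mean rating per report (author) on each of the seven criteria
per_report = ratings.groupby(["author", "criterion"])["score"].mean().unstack()

# Do authors rate their own reports differently from how peers rate them?
self_vs_peer = (
    ratings.assign(is_self=ratings.rater == ratings.author)
           .groupby("is_self")["score"].mean()
)
```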

  20. Cross-evaluation Results

  21. NASA TLX -- Cognitive Load

  22. NASA TLX -- Cognitive Load
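
For reference, the standard NASA-TLX overall workload score is a weighted average of the six subscale ratings, with weights taken from the 15 pairwise comparisons. The sketch below shows that textbook computation, not the workshop's own scoring code, and the example values are illustrative only.

```python
# Standard NASA-TLX overall workload: six 0-100 subscale ratings weighted by
# how often each dimension was chosen in the 15 pairwise comparisons.
SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def tlx_overall(ratings: dict, weights: dict) -> float:
    assert sum(weights[s] for s in SUBSCALES) == 15, "weights come from 15 pairwise comparisons"
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15.0


# Example values for one analyst/scenario (illustrative only)
ratings = {"mental": 80, "physical": 10, "temporal": 65,
           "performance": 40, "effort": 70, "frustration": 55}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
print(tlx_overall(ratings, weights))  # ~67.3
```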

  23. Glass Box Data • Types of data captured: • Keystrokes • Mouse moves • Session start/stop times • Task times • Application focus time • Copy/paste events • Screen capture & audio track

  24. Glass Box Data – Allocation of Session Time
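
An allocation-of-session-time breakdown can be derived from Glass Box style application-focus events. The sketch below assumes a simplified event format of (timestamp in seconds, focused application); the real Glass Box capture is far richer than this.

```python
# Hedged sketch: share of session time spent in each application, derived from
# simplified (timestamp_seconds, focused_application) focus-change events.
from collections import defaultdict


def focus_time_shares(events):
    """events: chronological focus-change records ending with (t_end, None)."""
    totals = defaultdict(float)
    for (t0, app), (t1, _next_app) in zip(events, events[1:]):
        totals[app] += t1 - t0
    session_length = events[-1][0] - events[0][0]
    return {app: secs / session_length for app, secs in totals.items()}


shares = focus_time_shares([
    (0, "QA system"), (540, "word processor"), (900, "QA system"),
    (1500, "web browser"), (1800, None),
])
# -> {"QA system": 0.63, "word processor": 0.20, "web browser": 0.17} (approx.)
```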

  25. System log data • # queries/questions • ‘Good’ queries/questions • Total documents delivered • # unique documents delivered • % unique documents delivered • # documents copied from • # copies
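
The measures on this slide can be computed mechanically from per-session logs. The sketch below assumes a simple per-interaction record (query text, delivered document ids, documents copied from); that format is an assumption, and the "good" query count is omitted because it required human judgment.

```python
# Hedged sketch of the slide's system-log measures; the per-interaction record
# format is an assumption. "Good" queries are excluded (human-judged).
def system_log_measures(log: list[dict]) -> dict:
    delivered = [d for rec in log for d in rec["docs_delivered"]]
    unique = set(delivered)
    copied_from = {d for rec in log for d in rec["copied_from"]}
    return {
        "num_queries": len(log),
        "total_docs_delivered": len(delivered),
        "num_unique_docs_delivered": len(unique),
        "pct_unique_docs_delivered": (100.0 * len(unique) / len(delivered)) if delivered else 0.0,
        "num_docs_copied_from": len(copied_from),
        "num_copies": sum(len(rec["copied_from"]) for rec in log),
    }
```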

  26. # Questions vs # ‘Good’ Questions

  27. [Summary table: relative cost ratings, $ to $$$]

  28. What Next? • Query trails are being worked on by LCC, Rutgers, and others; available as part of the deliverable. • Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT. • Thinking about an alternative implementation of the mood indicator. • Possible ways forward: AQUAINT sponsors large-scale group evaluations using the metrics and methodology; each project team employs the metrics and methodology on its own; or something in between.

  29. Issues to be Addressed • What constitutes a replication of the method? the whole thing? a few hypotheses with all data methods? all hypotheses with a few data methods? • Costs associated with data collection methods • Is a comparison needed? • Baseline – if so, is Google the right one? Maybe the ‘best so far’ to keep the bar high. • Past results – can measure progress over time, but requires iterative application • ‘Currency’ of data and scenarios • Analysts are sensitive to staleness • What is the effect of updating on repeatability?

  30. Backups

  31. Report Cross-Evaluation Results

  32. Summary of Findings [table: relative cost ratings, $ to $$$]
