An Investigation of Evaluation Metrics for Analytic Question Answering

ARDA Metrics Challenge: PI Meeting

Emile Morse

October 7, 2004

Outline
  • Motivation & Goals
  • Hypothesis-driven development of metrics
  • Design – collection, subjects, scenarios
  • Data Collection & Results
  • Summary and issues
Motivation
  • Much progress in IR has been attributed to community evaluations using metrics of precision and recall with common tasks and common data.
  • There is no corresponding set of evaluation criteria for the interaction of users with the systems.
  • While system performance is important, the utility of the system to the user is critical.
  • The lack of such evaluation criteria prevents comparing systems on the basis of utility. Acquisition of new systems is therefore based on system performance alone, which frequently does NOT reflect how systems will work in the user’s actual process.
Goals of Workshop
  • To develop metrics for process and products that will reflect the interaction of users and information systems.
  • To develop the metrics based on:
    • Cognitive task analyses of intelligence analysts
    • Previous experience in AQUAINT and NIMD evaluations
    • Expert consultation
  • To deliver an evaluation package consisting of:
    • Process and product metrics
    • An evaluation methodology
    • A data set to use in the evaluation
Hypothesis-driven development of metrics

Hypotheses – QA systems should …

Candidate metrics – What could we measure that would provide evidence to support or refute this hypothesis?

  • Collection methods:
    • Questionnaires
    • Mood meter
    • System logs
    • Report evaluation method
    • System surveillance tool
    • Cognitive workload instrument

Measures – implementation of a metric; depends on the specific collection method (a small data-structure sketch of this chain follows).
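The chain above (hypothesis, candidate metrics, collection methods, measures) can be thought of as a simple data structure. The sketch below is only illustrative; the class and field names are assumptions, not the format of the workshop deliverable.

    # Illustrative representation of the hypothesis -> candidate metric -> collection
    # method chain. Class and field names are assumptions, not the deliverable format.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CandidateMetric:
        description: str                 # what could be measured
        collection_methods: List[str]    # e.g. "system logs", "questionnaires"

    @dataclass
    class Hypothesis:
        label: str                       # e.g. "H1"
        statement: str                   # "QA systems should ..."
        metrics: List[CandidateMetric] = field(default_factory=list)

    h1 = Hypothesis(
        label="H1",
        statement="Support gathering the same type of information with a lower cognitive workload",
        metrics=[
            CandidateMetric("# queries/questions", ["system logs"]),
            CandidateMetric("cognitive workload measurement", ["cognitive workload instrument"]),
        ],
    )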

Examples of candidate metrics
  • H1: Support gathering the same type of information with a lower cognitive workload
    • # queries/questions
    • % interactions where analyst takes initiative
    • Number of non-content interactions with system (clarifications)
    • Cognitive workload measurement
  • H7: Enable analysts to collect more data in less time
    • Growth of shoebox over time
    • Subjective assessment
  • H12: Provide context and continuity for the user (coherence of dialogue); the two log-based measures below are sketched after this list
    • Similarity between queries – calculate shifts in the dialog trail
    • Redundancy of documents – count how often a snippet is delivered more than once
    • Subjective assessment
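A minimal sketch, assuming the session log can be reduced to an ordered list of queries and a list of delivered snippets, of how the two log-based H12 measures above might be computed. The word-set Jaccard similarity and the 0.2 shift threshold are arbitrary illustrative choices, not the workshop's implementation.

    # Illustrative computation of two H12 candidate measures from a simplified log.
    from collections import Counter

    def _tokens(text):
        return set(text.lower().split())

    def query_shifts(queries, threshold=0.2):
        """Count consecutive query pairs whose word-set Jaccard similarity falls
        below `threshold`, i.e. apparent topic shifts in the dialog trail."""
        shifts = 0
        for prev, cur in zip(queries, queries[1:]):
            a, b = _tokens(prev), _tokens(cur)
            jaccard = len(a & b) / len(a | b) if (a | b) else 1.0
            if jaccard < threshold:
                shifts += 1
        return shifts

    def snippet_redundancy(delivered_snippets):
        """Count how many distinct snippets were delivered more than once."""
        counts = Counter(delivered_snippets)
        return sum(1 for c in counts.values() if c > 1)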
Top-level Design of the Workshop
  • Systems
  • Domain
  • Scenarios
  • Collection
  • Subjects
  • On-site team
  • Block design and on-site plan
Design – Systems

  • HITIQA – Tomek Strzalkowski
  • Ferret – Sanda Harabagiu
  • GINKO – Stefano Bertolo
  • GNIST

Design – Subjects
  • 8 reservists (7 Navy; 1 Army)
  • Age: 30-54 yrs (M=40.8)
  • Educational background:
    • 1 PhD; 4 Masters; 2 Bachelors; 1 HS
  • Military service: 2.5-31 yrs (M=18.3)
  • Analysis Experience: 0-23 yrs (M=10.8)
Data Collection Instruments
  • Questionnaires
    • Post-scenario (SCE)
    • Post-session (SES)
    • Post-system (SYS)
  • Cross-evaluation of reports
  • Cognitive workload
  • Glass Box
  • System Logs
  • Mood indicator
  • Status reports
  • Debriefing
  • Observer notes
  • Scenario difficulty assessment
Questionnaires
  • Coverage for 14/15 hypotheses
  • Other question types:
    • All SCE questions relate to scenario content
    • 3 SYS questions on Readiness
    • 3 SYS questions on Training
Questions for Hypothesis 7

Enable analysts to collect more data in less time

  • SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more]
  • SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot]
  • SYS Q23: Having the system at work would help me find information faster than I can currently find it.
  • SYS Q6: The system slows down my process of finding information. (This item is negatively worded; a scoring sketch for these four items follows.)
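A minimal sketch of how the four H7 items might be combined into one score per subject and system. It assumes a 1-to-5 response scale and that negatively keyed items (SYS Q6, and SES Q2 if its scale is anchored from "less" to "more") are reverse-coded; the actual scales and scoring used in the workshop may differ.

    # Illustrative composite score for Hypothesis 7. Assumes each item is recorded
    # on a 1..5 scale and that items where a HIGH raw value means "worse" (e.g. SYS Q6,
    # "the system slows down my process") are reverse-coded. Not the workshop's code.
    def composite_score(responses, reverse_keyed, scale_max=5):
        """responses: dict of item_id -> raw value (1..scale_max).
        reverse_keyed: item_ids whose raw value must be flipped before averaging."""
        adjusted = [
            (scale_max + 1 - value) if item in reverse_keyed else value
            for item, value in responses.items()
        ]
        return sum(adjusted) / len(adjusted)

    # Hypothetical responses for one subject on one system:
    print(composite_score(
        {"SES_Q2": 2, "SES_Q13": 4, "SYS_Q23": 4, "SYS_Q6": 2},
        reverse_keyed={"SES_Q2", "SYS_Q6"},
    ))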
Additional Analysis for Questionnaire Data – Factor Analysis
  • Four factors emerged
    • Factor 1: most questions
    • Factor 2: time, navigation, training
    • Factor 3: novel information, new way of searching
    • Factor 4: Skill in using the system improved
  • These factors distinguished among the four systems: each system stood out from the others, positively or negatively, on one factor (a computation sketch follows).
    • GNIST, for example, was most associated with Factor 2: positive for navigation and training, negative for time.
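A minimal sketch of the kind of analysis that could yield the four-factor structure described above, using scikit-learn's FactorAnalysis on a respondents-by-items matrix. The four-component choice mirrors the slide; the random stand-in data, the varimax rotation, and the library choice are illustrative assumptions rather than the workshop's actual analysis.

    # Illustrative factor analysis over questionnaire responses.
    # Random integers stand in for the real Likert data (respondents x items).
    import numpy as np
    from sklearn.decomposition import FactorAnalysis   # rotation= requires scikit-learn >= 0.24

    rng = np.random.default_rng(0)
    responses = rng.integers(1, 6, size=(32, 25)).astype(float)   # e.g. 8 subjects x 4 systems, 25 items

    fa = FactorAnalysis(n_components=4, rotation="varimax")
    scores = fa.fit_transform(responses)     # factor scores per respondent/system observation
    loadings = fa.components_.T              # item-by-factor loadings

    # Interpreting a factor means looking at its highest-loading items
    # (e.g. "time, navigation, training" items for Factor 2 in the workshop data).
    for k in range(loadings.shape[1]):
        top_items = np.argsort(-np.abs(loadings[:, k]))[:3]
        print(f"Factor {k + 1}: highest-loading items {top_items.tolist()}")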
Cross Evaluation Criteria

Subjects rated the reports (including their own) on seven characteristics; an aggregation sketch follows the list.

  • Covers the important ground
  • Avoids the irrelevant materials
  • Avoids redundant information
  • Includes selective information
  • Is well organized
  • Reads clearly and easily
  • Overall rating
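A minimal sketch of how these ratings might be averaged into per-report scores, with the option of excluding each author's rating of their own report. The tuple layout, the rating scale, and the idea of dropping self-ratings are assumptions for illustration, not the workshop's stated procedure.

    # Illustrative aggregation of cross-evaluation ratings.
    # Each rating: (rater, report_author, criterion, value).
    from collections import defaultdict

    def mean_report_scores(ratings, include_self=True):
        totals = defaultdict(list)
        for rater, author, criterion, value in ratings:
            if not include_self and rater == author:
                continue
            totals[(author, criterion)].append(value)
        return {key: sum(vals) / len(vals) for key, vals in totals.items()}

    ratings = [
        ("S1", "S2", "overall", 6),
        ("S3", "S2", "overall", 5),
        ("S2", "S2", "overall", 7),   # self-rating
    ]
    print(mean_report_scores(ratings, include_self=False))  # {('S2', 'overall'): 5.5}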
Glass Box Data
  • Types of data captured (a minimal record schema is sketched below):
    • Keystrokes
    • Mouse moves
    • Session start/stop times
    • Task times
    • Application focus time
    • Copy/paste events
    • Screen capture & audio track
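A minimal sketch of the kind of time-stamped record such a capture stream might reduce to, plus one derived measure (application focus time). The field names, event types, and logic are assumptions for illustration, not the actual Glass Box format.

    # Illustrative event record for Glass-Box-style instrumentation.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class CaptureEvent:
        timestamp: datetime
        event_type: str               # e.g. "keystroke", "mouse_move", "copy", "paste", "focus_change"
        application: str              # application that had focus when the event occurred
        payload: Optional[str] = None # e.g. the key pressed or the text pasted

    def focus_time(events, application):
        """Sum the seconds the given application held focus, based on focus_change events."""
        total, focused_since = 0.0, None
        for ev in sorted(events, key=lambda e: e.timestamp):
            if ev.event_type == "focus_change":
                if focused_since is not None:
                    total += (ev.timestamp - focused_since).total_seconds()
                    focused_since = None
                if ev.application == application:
                    focused_since = ev.timestamp
        return total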
System Log Data
  • # queries/questions
  • ‘Good’ queries/questions
  • Total documents delivered
  • # unique documents delivered
  • % unique documents delivered
  • # documents copied from
  • # copies (a sketch deriving these counts from a simplified log follows this list)
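A minimal sketch, assuming a simplified event log of query, delivery, and copy events, of how several of the counts above could be derived. The log format is an assumption; each system's real log schema differs.

    # Illustrative derivation of several system-log measures from a simplified log.
    # Assumed log format: dicts with an "event" key ("query", "deliver", "copy")
    # plus "text" or "doc_id" fields.
    def log_measures(log):
        queries = [e["text"] for e in log if e["event"] == "query"]
        delivered = [e["doc_id"] for e in log if e["event"] == "deliver"]
        copied_docs = {e["doc_id"] for e in log if e["event"] == "copy"}
        unique_delivered = set(delivered)
        return {
            "num_queries": len(queries),
            "total_documents_delivered": len(delivered),
            "unique_documents_delivered": len(unique_delivered),
            "pct_unique_delivered": 100.0 * len(unique_delivered) / len(delivered) if delivered else 0.0,
            "documents_copied_from": len(copied_docs),
            "num_copies": sum(1 for e in log if e["event"] == "copy"),
        }

    example_log = [
        {"event": "query", "text": "biological weapons programs"},
        {"event": "deliver", "doc_id": "d1"}, {"event": "deliver", "doc_id": "d1"},
        {"event": "deliver", "doc_id": "d2"},
        {"event": "copy", "doc_id": "d2"}, {"event": "copy", "doc_id": "d2"},
    ]
    print(log_measures(example_log))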
What Next?
  • Query trails are being worked on by LCC, Rutgers and others; available as part of deliverable.
  • Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT.
  • Thinking about alternative implementation of mood indicator.

Possible models for applying the evaluation package:

  • AQUAINT sponsors large-scale group evaluations using the metrics and methodology
  • Each project team employs the metrics and methodology on its own
  • Something in between

Issues to be Addressed
  • What constitutes a replication of the method? The whole package? A few hypotheses with all data-collection methods? All hypotheses with a few of the methods?
  • Costs associated with data collection methods
  • Is a comparison needed?
    • Baseline – if so, is Google the right one? Maybe the ‘best so far’ to keep the bar high.
    • Past results – can measure progress over time, but requires iterative application
  • ‘Currency’ of data and scenarios
    • Analysts are sensitive to staleness
    • What is the effect of updating on repeatability?