
Evaluating Answer Validation in multi-stream Question Answering


Presentation Transcript


  1. The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008) Tokyo, 16 December 2008 Evaluating Answer Validation in multi-stream Question Answering Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo UNED NLP & IR group nlp.uned.es

  2. Content • Context and motivation • Question Answering at CLEF • Answer Validation Exercise at CLEF • Evaluating the validation of answers • Evaluating the selection of answers • Correct selection • Correct rejection • Analysis and discussion • Conclusion

  3. Evolution of the CLEF-QA Track

  4. Evolution of Results 2003 - 2006 (Spanish): overall best result <60%; definition questions best result >80% (NOT an IR approach)

  5. Pipeline Upper Bounds: 0.8 x 0.8 x 1.0 = 0.64. Pipeline: Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer. Use Answer Validation to break the pipeline when there is not enough evidence.
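The upper-bound arithmetic on the slide, written out as a sketch (the 0.8 and 1.0 figures are the slide's illustrative per-component accuracies, not measured values):

\[
\mathrm{accuracy}_{\mathrm{pipeline}} \le \prod_i \mathrm{accuracy}_i = 0.8 \times 0.8 \times 1.0 = 0.64
\]

That is, even with a perfect final ranking step, two 80%-accurate upstream components cap the whole pipeline at 64%, which is why the slides propose breaking the pipeline with Answer Validation.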

  6. Results in CLEF-QA 2006 (Spanish): best single system 52.5%; a perfect combination of systems (best with ORGANIZATION, best with PERSON, best with TIME) would reach 81%.

  7. Evaluation Framework. Collaborative architectures: a Question is sent to several QA systems (QA sys1, QA sys2, QA sys3, ..., QA sysn); their candidate answers feed an Answer Validation & Selection module that returns the final Answer. Different systems answer different types of questions better • Specialisation • Collaboration

  8. Collaborative architectures. How to select the correct answer? • Redundancy • Voting • Confidence score • Performance history. Why not a deeper analysis?
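For illustration only (not in the original slides): a minimal Python sketch of the two simplest strategies listed above, assuming each stream returns an (answer, confidence) pair; the data and function names are hypothetical.

from collections import Counter

def select_by_voting(candidates):
    # Redundancy/voting: pick the answer string proposed by most streams.
    votes = Counter(answer for answer, _ in candidates)
    return votes.most_common(1)[0][0]

def select_by_confidence(candidates):
    # Confidence score: pick the answer whose stream reported the highest score.
    return max(candidates, key=lambda pair: pair[1])[0]

# Hypothetical candidate answers from three QA streams for one question.
candidates = [("Rome", 0.9), ("Rome", 0.4), ("Milan", 0.7)]
print(select_by_voting(candidates))      # Rome
print(select_by_confidence(candidates))  # Rome

Neither strategy looks at the supporting text; that deeper analysis is what Answer Validation adds.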

  9. Answer Validation Exercise (AVE). Objective: validate the correctness of the answers given by real QA systems, i.e. the participants at CLEF QA.

  10. Answer Validation Exercise (AVE). Answer Validation: given the Question and a Candidate answer with its Supporting Text produced by Question Answering, decide whether the Answer is correct, or not correct / not enough evidence. In AVE 2006 this was cast as Textual Entailment over a Hypothesis built by Automatic Hypothesis Generation from question and answer; in AVE 2007 - 2008 the question, answer and supporting text are validated directly.
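A toy illustration of automatic hypothesis generation (my own example, not from the slides), combining the question's focus with the candidate answer so that entailment can then be checked against the supporting text:

# Hypothetical, deliberately naive hypothesis generation for "What is X?" questions.
def make_hypothesis(question: str, answer: str) -> str:
    focus = question.removeprefix("What is ").rstrip("?").strip()
    return f"{focus} {answer.strip()}"

print(make_hypothesis("What is Zanussi?", "was an Italian producer of home appliances"))
# -> "Zanussi was an Italian producer of home appliances"

A textual entailment component would then test whether the supporting text entails this hypothesis.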

  11. Techniques in AVE 2007 (table from the AVE 2007 overview).

  12. Questions from the Question Answering Track are shared with the Answer Validation Exercise. The QA systems' answers and supporting texts receive human judgements (R, W, X, U), which are mapped to (YES, NO) and compared against the AVE systems' validations (YES, NO). The evaluation is thus linked to the main QA task and reuses its human assessments, yielding both QA Track results and AVE Track results.
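A hedged sketch of the mapping step in Python (the assumption, not stated on the slide, is that only answers judged R count as correct):

# Hypothetical mapping from CLEF QA human judgements to AVE gold labels.
# Assumption: R (right) -> YES; W (wrong), X (inexact), U (unsupported) -> NO.
JUDGEMENT_TO_GOLD = {"R": "YES", "W": "NO", "X": "NO", "U": "NO"}

def gold_label(judgement: str) -> str:
    # Turn one human judgement into the gold validation label.
    return JUDGEMENT_TO_GOLD[judgement]

print(gold_label("R"))  # YES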

  13. Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion

  14. Proposed evaluation of Answer Validation & Selection: the participant systems in a CLEF QA campaign play the role of the answer streams (QA sys1, QA sys2, QA sys3, ..., QA sysn); for each Question their candidate answers are passed to the Answer Validation & Selection module under evaluation, which returns the final Answer.

  15. Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
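For illustration only (not part of the slides): a minimal Python sketch that reads a collection entry in the format above; the element and attribute names follow the snippet, while the file name collection.xml is hypothetical.

import xml.etree.ElementTree as ET

# Iterate over the <q> entries of a (hypothetical) AVE collection file.
tree = ET.parse("collection.xml")
for q in tree.getroot().iter("q"):
    question = q.find("q_str").text.strip()
    for a in q.iter("a"):
        answer = a.find("a_str").text.strip()
        support = a.find("t_str")
        print(q.get("id"), a.get("id"), question, "->", answer,
              "| doc:", support.get("doc"))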

  16. Evaluating the Validation. Validation: decide whether each candidate answer is correct or not • YES | NO • Collections are not balanced • Approach: detect if there is enough evidence to accept an answer • Measures: precision, recall and F over correct answers • Baseline system: accept all answers
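A sketch of those measures (notation mine): with YES the set of answers a system validates and CORRECT the set of answers judged correct by the human assessors,

\[
P = \frac{|YES \cap CORRECT|}{|YES|}, \qquad
R = \frac{|YES \cap CORRECT|}{|CORRECT|}, \qquad
F = \frac{2 \cdot P \cdot R}{P + R}.
\]

The accept-all baseline reaches R = 1 by construction, with P equal to the proportion of correct answers in the (unbalanced) collection.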

  17. Evaluating the Validation

  18. Evaluating the Selection • Quantify the potential gain of Answer Validation in Question Answering • Compare AV systems with QA systems • Develop measures more comparable to QA accuracy

  19. Evaluating the selection Given a question with several candidate answers Two options: • Selection • Select an answer ≡ try to answer the question • Correct selection: answer was correct • Incorrect selection: answer was incorrect • Rejection • Reject all candidate answers ≡ leave question unanswered • Correct rejection: All candidate answers were incorrect • Incorrect rejection: Not all candidate answers were incorrect

  20. Evaluating the Selection Not comparable to qa_accuracy

  21. Evaluating the Selection

  22. Evaluating the Selection. Rewards rejection (collections are not balanced). Interpretation for QA: all questions correctly rejected by AV will be answered correctly.

  23. Evaluating the Selection. Interpretation for QA: questions correctly rejected by AV will be answered correctly in the qa_accuracy proportion.
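Under the two interpretations stated on slides 22 and 23, the selection measures can be sketched as follows (names and notation are mine, not the paper's; n is the number of questions, n_cs the correct selections, n_cr the correct rejections, and qa_accuracy the accuracy of the underlying QA systems):

\[
\text{slide 22 (optimistic):}\quad \frac{n_{cs} + n_{cr}}{n}
\qquad\qquad
\text{slide 23 (estimated):}\quad \frac{n_{cs} + n_{cr} \cdot qa\_accuracy}{n}
\]

The first counts every correct rejection as a correctly answered question; the second assumes a correctly rejected question would be re-answered correctly only with probability qa_accuracy.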

  24. Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion

  25. Analysis and discussion (AVE 2007 English). Validation and Selection results: QA_acc is correlated to R; the "estimated" measure adjusts it.

  26. Multi-stream QA performance (AVE 2007 English)

  27. Analysis and discussion (AVE 2007 Spanish). Validation and Selection results: comparing AV & QA systems.

  28. Conclusion • Evaluation framework for Answer Validation & Selection systems • Measures that reward not only Correct Selection but also Correct Rejection • Promote improvement of QA systems • Allow comparison between AV and QA systems • In what conditions multi-stream QA performs better • Room for improvement just by using multi-stream QA • Potential gain that AV systems can provide to QA

  29. Thanks! http://nlp.uned.es/clef-qa/ave http://www.clef-campaign.org Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)

  30. The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008) Tokyo, 16 December 2008 Evaluating Answer Validation in multi-stream Question Answering Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo UNED NLP & IR group nlp.uned.es
