
CLEF 2009, Corfu Question Answering Track Overview


Presentation Transcript


  1. CLEF 2009, Corfu Question Answering Track Overview A. Peñas P. Forner R. Sutcliffe Á. Rodrigo C. Forascu I. Alegria D. Giampiccolo N. Moreau P. Osenova D. Santos L.M. Cabral J. Turmo P.R. Comas S. Rosset O. Galibert N. Moreau D. Mostefa P. Rosso D. Buscaldi

  2. QA Tasks & Time

  3. 2009 campaign • ResPubliQA: QA on European Legislation • GikiCLEF: QA requiring geographical reasoning on Wikipedia • QAST: QA on Speech Transcriptions of European Parliament Plenary sessions

  4. QA 2009 campaign

  5. ResPubliQA 2009: QA on European Legislation • Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru • Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent • Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

  6. Evolution of the task

  7. Objectives • Move towards a domain of potential users • Compare systems working in different languages • Compare QA Tech. with pure IR • Introduce more types of questions • Introduce Answer Validation Tech.

  8. Collection • Subset of JRC-Acquis (10,700 docs x lang) • Parallel at document level • EU treaties, EU legislation, agreements and resolutions • Economy, health, law, food, … • Between 1950 and 2006 • XML-TEI.2 encoding • Unfortunately, not parallel at the paragraph level -> extra work (see the sketch below)
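
Since the corpus is not aligned at paragraph level, each group had to extract and index paragraphs from the XML-TEI.2 files themselves. Below is a minimal sketch of that step in Python, assuming (hypothetically) that paragraphs appear as <p> elements in the TEI body; the actual JRC-Acquis markup and identifiers may differ.

    # Minimal sketch: pulling paragraphs out of a JRC-Acquis XML-TEI.2 file.
    # Assumption: paragraphs are <p> elements; real markup/attributes may differ.
    import xml.etree.ElementTree as ET

    def extract_paragraphs(tei_path):
        """Return (paragraph_id, text) pairs from one TEI document."""
        tree = ET.parse(tei_path)
        paragraphs = []
        for i, elem in enumerate(tree.iter()):
            tag = elem.tag.split('}')[-1]      # drop any XML namespace prefix
            if tag == 'p':
                text = ''.join(elem.itertext()).strip()
                if text:
                    paragraphs.append((elem.get('n', f'p{i}'), text))
        return paragraphs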

  9. 500 questions • REASON • Why did a commission expert conduct an inspection visit to Uruguay? • PURPOSE/OBJECTIVE • What is the overall objective of the eco-label? • PROCEDURE • How are stable conditions in the natural rubber trade achieved? • In general, any question that can be answered in a paragraph

  10. 500 questions • Also • FACTOID • In how many languages is the Official Journal of the Community published? • DEFINITION • What is meant by “whole milk”? • No NIL questions

  11. Translation of questions

  12. Selection of the final pool of 500 questions out of the 600 produced

  13. Systems response No Answer ≠ Wrong Answer • Decide whether to give an answer or not • [ YES | NO ] • A classification problem • Machine learning, provers, etc. • Textual entailment • Provide the paragraph (ID + text) that answers the question • Aim: leaving a question unanswered is worth more than giving a wrong answer
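
Because an unanswered question is worth more than a wrong answer, the answer/no-answer step can be treated as a binary decision over the best candidate paragraph. The sketch below shows one simple thresholding scheme; the validation score and threshold are illustrative placeholders for whatever a real system uses (ML classifier probability, prover output, textual-entailment score).

    # Illustrative answer / no-answer decision; not taken from any participant system.
    def respond(question, candidate, validation_score, threshold=0.5):
        """Return the paragraph if the validator is confident enough, else NoA."""
        if validation_score >= threshold:
            return {"answered": "YES",
                    "paragraph_id": candidate["id"],
                    "paragraph_text": candidate["text"]}
        # NoA: withhold the answer but keep the candidate, so assessors can later
        # distinguish NoA R (candidate was correct) from NoA W (it was not).
        return {"answered": "NO", "candidate": candidate}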

  14. Assessments • R: the question is answered correctly • W: the question is answered incorrectly • NoA: the question is not answered • NoA R: NoA, but the candidate answer was correct • NoA W: NoA, and the candidate answer was incorrect • NoA Empty: NoA and no candidate answer was given • Evaluation measure: c@1 • Extension of the traditional accuracy (proportion of questions correctly answered) • Considering unanswered questions

  15. Evaluation measure • n: number of questions • nR: number of correctly answered questions • nU: number of unanswered questions
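
The c@1 formula itself was shown as a figure on the slide; reconstructed from these definitions and the boundary cases on the next slide, it is

    c@1 = \frac{1}{n}\left( n_R + n_U \cdot \frac{n_R}{n} \right)

i.e., each unanswered question is credited with the accuracy the system achieves on the questions it does answer.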

  16. Evaluation measure • If nU = 0 then c@1 = nR/n (plain accuracy) • If nR = 0 then c@1 = 0 • If nU = n then c@1 = 0 • Leaving a question unanswered adds value only if it avoids returning a wrong answer • The added value is the performance shown on the answered questions: accuracy
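
A minimal Python implementation of c@1, checking the three boundary cases from this slide (the function name and example numbers are mine):

    def c_at_1(n, n_r, n_u):
        """c@1 = (n_R + n_U * n_R / n) / n for n questions,
        with n_R answered correctly and n_U left unanswered."""
        return (n_r + n_u * (n_r / n)) / n

    # Boundary cases from the slide:
    assert c_at_1(100, 60, 0) == 60 / 100    # nU = 0  -> plain accuracy
    assert c_at_1(100, 0, 30) == 0.0         # nR = 0  -> 0
    assert c_at_1(100, 0, 100) == 0.0        # nU = n  -> 0 (nR must then be 0)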

  17. List of Participants

  18. Value of reducing wrong answers

  19. Detecting wrong answers • While maintaining the number of correct answers, the candidate answer was not correct for 83% of unanswered questions • A very good step towards improving the systems

  20. Many systems fall below the IR baselines • IR is important, but not enough • The task is feasible • A perfect combination would be 50% better than the best system

  21. Comparison across languages • Same questions • Same documents • Same baseline systems • Strict comparison only affected by the variable of language • But it is feasible to detect the most promising approaches across languages

  22. Comparison across languages: systems above the baselines • Icia: Boolean retrieval + intensive NLP + ML-based validation & very good knowledge of the collection (Eurovoc terms…) • Baseline: Okapi BM25 tuned for paragraph retrieval
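
The baseline in every language is Okapi BM25 tuned for paragraph retrieval. A generic BM25 paragraph scorer is sketched below; the track's actual tokenization and tuned k1/b values are not given in the slides, so the defaults here are only illustrative.

    import math
    from collections import Counter

    def bm25_score(query_terms, paragraph_terms, doc_freq, n_paragraphs,
                   avg_len, k1=1.2, b=0.75):
        """Generic Okapi BM25 score of one paragraph for a bag-of-words query.
        k1 and b are illustrative defaults, not the track's tuned values."""
        tf = Counter(paragraph_terms)
        length_norm = k1 * (1 - b + b * len(paragraph_terms) / avg_len)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            df = doc_freq.get(term, 0)
            idf = math.log((n_paragraphs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        return score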

  23. Comparison across languages: systems above the baselines • nlel092: n-gram-based retrieval, combining evidence from several languages • Baseline: Okapi BM25 tuned for paragraph retrieval

  24. Comparison across languages: systems above the baselines • Uned: Okapi BM25 + NER + paragraph validation + n-gram-based re-ranking • Baseline: Okapi BM25 tuned for paragraph retrieval

  25. Comparison across languages: systems above the baselines • nlel091: n-gram-based paragraph retrieval • Baseline: Okapi BM25 tuned for paragraph retrieval

  26. Comparison across languages: systems above the baselines • Loga: Lucene + deep NLP + logic + ML-based validation • Baseline: Okapi BM25 tuned for paragraph retrieval

  27. Conclusion • Compare systems working in different languages • Compare QA Tech. with pure IR • Pay more attention to paragraph retrieval • Old issue, late-90s state of the art (English) • Pure IR performance: 0.38 - 0.58 • Largest difference with respect to IR baselines: 0.44 – 0.68 • Intensive NLP • ML-based answer validation • Introduce more types of questions • Some types difficult to distinguish • Any question that can be answered in a paragraph • Analysis of results by question type (in progress)

  28. Conclusion • Introduce Answer Validation Tech. • Evaluation measure: c@1 • Value of reducing wrong answers • Detecting wrong answers is feasible • Feasible task • 90% of questions have been answered • Room for improvement: best systems around 60% • Even with fewer participants we have • More comparison • More analysis • More learning • ResPubliQA proposal for 2010 • SC and breakout session

  29. Interest in ResPubliQA 2010 • But we need more • You already have a gold standard of 500 questions & answers to play with…
