NLG Shared Tasks: Let's try it and see what happens

Geneval proposes to evaluate NLG systems with similar input/output functionality using various evaluation techniques, including human task-based assessments and automatic metrics. The goal is to explore correlations between different evaluation techniques and improve the understanding of NLG evaluation. This effort aims to reduce barriers to entry in NLG research and encourage interaction among researchers.


Presentation Transcript


  1. NLG Shared Tasks: Let's try it and see what happens. Ehud Reiter (Univ of Aberdeen), http://www.csd.abdn.ac.uk/~ereiter

  2. Contents • General Comments • Geneval proposal

  3. Good points of Shared Task • Compare different approaches • Encourage people to interact more • Reduce NLG “barriers to entry” • Better understanding of evaluation

  4. Bad Points • May narrow focus of community • IR ignored web search because of TREC? • May encourage incremental research instead of new ideas

  5. My opinion • Let's give it a try • But I suspect one-off exercises are better than a series • Many people think MUC, DUC, etc were very useful initially but became less scientifically exciting over time

  6. Practical Issues • Domain/task? • Need something that several (6?) groups are interested in • Evaluation technique • Avoid techniques that are biased • E.g., some automatic metrics may favour statistical systems

  7. Geneval • Proposal to evaluate NLG evaluation • Core idea is to evaluate in many ways a set of systems with similar input/output functionality, and see how well different evaluation techniques correlate • Anja Belz and Ehud Reiter • Hope to submit to EPSRC (roughly similar to NSF in the US) soon

  8. NLG Evaluation • Many types • Task-based, human ratings, BLEU-like metrics, etc • Little consensus on best technique • I.e., which technique is most appropriate in a given context • Poorly understood

  9. Some open questions • How well do different types correlate? • E.g., does BLEU predict human ratings? • Are there biases? • E.g., are statistical NLG systems over/under rated by some techniques? • What is the best design? • Number of subjects, subject expertise, number (and quality) of reference texts, etc
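As a concrete illustration of the correlation question on slide 9 (not part of the original presentation), the sketch below compares per-system automatic scores against mean human ratings using Pearson and Spearman correlation. All system names and scores are invented for illustration only.

```python
# Hypothetical sketch: does an automatic metric (e.g. BLEU) predict human ratings?
# System names and all scores below are made up for illustration.
from scipy.stats import pearsonr, spearmanr

systems       = ["sys_A", "sys_B", "sys_C", "sys_D", "sys_E"]
bleu_scores   = [0.42, 0.35, 0.51, 0.28, 0.47]   # one automatic score per system
human_ratings = [3.8,  3.1,  4.2,  2.9,  3.6]    # mean human rating per system

for name, b, h in zip(systems, bleu_scores, human_ratings):
    print(f"{name}: BLEU={b:.2f}, human={h:.1f}")

r, p_r = pearsonr(bleu_scores, human_ratings)        # linear correlation
rho, p_rho = spearmanr(bleu_scores, human_ratings)   # rank correlation
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}), Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```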

  10. Belz and Reiter (2006) • Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics • Found OK (not wonderful) correlation, but also some biases • Geneval: do this on a much larger scale • More domains, more systems, more evaluation techniques (including new ones), etc

  11. Geneval: Possible Domains • Weather forecasts (not wind statements) • Use SumTime corpus • Referring expressions • Use Prodigy-Grec or Tuna corpus • Medical summaries • Use Babytalk corpus • Statistical summaries • Use Atlas corpus

  12. Geneval: Evaluation techniques • Human task-based • E.g., referential success • Human ratings • Likert vs preference; expert vs non-expert • Automatic metrics based on reference texts • BLEU, ROUGE, METEOR, etc • Automatic metrics without reference texts • MT T and X scores, length
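To make the reference-based metrics on slide 12 concrete, here is a minimal sketch (not from the presentation) that computes corpus-level BLEU with NLTK. The forecast-like texts are toy examples, not SumTime data.

```python
# Minimal sketch of a reference-based automatic metric: corpus BLEU via NLTK.
# The tokenised texts below are toy examples for illustration only.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference texts (tokenised) per generated text.
references = [
    [["southwesterly", "winds", "easing", "by", "evening"]],
    [["gale", "force", "winds", "expected", "later"]],
]
hypotheses = [
    ["southwesterly", "winds", "easing", "this", "evening"],
    ["gale", "force", "winds", "later"],
]

smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short texts
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"corpus BLEU = {score:.3f}")
```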

  13. Geneval: new techniques • Would also like to explore and develop new evaluation techniques • Post-edit based human evaluations? • Automatic metrics which look at semantic features? • Open to suggestions for other ideas!
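One way a post-edit based evaluation could be operationalised (a hedged sketch, not a technique proposed in the slides) is to measure how much a human editor has to change a system's output: the word-level edit distance between the generated text and its post-edited version, normalised by the length of the edited text. The example texts below are invented.

```python
# Sketch of a possible post-edit based measure: word-level edit distance between
# a system output and its human post-edited version, normalised by edited length.
# Lower values mean less editing was needed. Example texts are invented.
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

system_output = "winds increasing to gale force by late evening".split()
post_edited   = "winds increasing to gale force 8 by evening".split()

post_edit_rate = edit_distance(system_output, post_edited) / len(post_edited)
print(f"post-edit rate = {post_edit_rate:.2f}")
```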

  14. Would like systems contributed • Study would be better if other people would contribute systems • We supply data sets and corpora, and carry out evaluations • So you can focus 100% on your great new algorithmic ideas!

  15. Geneval from a STEC perspective • Sort of like a STEC??? • If people contribute systems based on our data sets and corpora • But results will be anonymised • Only the developer of system X knows how well X did • One-off exercises, not repeated • Multiple evaluation techniques • Hope data sets will reduce barriers to entry

  16. Geneval • Please let Anja or me know if • You have general comments, and/or • You have a suggestion for an additional evaluation technique • You might be interested in contributing a system
