NLG Shared Tasks: Let's try it and see what happens

Geneval proposes to evaluate NLG systems with similar input/output functionality using various evaluation techniques, including human task-based assessments and automatic metrics. The goal is to explore correlations between different evaluation techniques and improve the understanding of NLG evaluation. This effort aims to reduce barriers to entry in NLG research and encourage interaction among researchers.


Presentation Transcript


  1. NLG Shared Tasks: Let's try it and see what happens. Ehud Reiter (Univ of Aberdeen), http://www.csd.abdn.ac.uk/~ereiter

  2. Contents • General Comments • Geneval proposal

  3. Good points of Shared Task • Compare different approaches • Encourage people to interact more • Reduce NLG “barriers to entry” • Better understanding of evaluation

  4. Bad Points • May narrow focus of community • IR ignored web search because of TREC? • May encourage incremental research instead of new ideas

  5. My opinion • Let's give it a try • But I suspect one-off exercises are better than a series • Many people think MUC, DUC, etc were very useful initially but became less scientifically exciting over time

  6. Practical Issues • Domain/task? • Need something that several (6?) groups are interested in • Evaluation technique • Avoid techniques that are biased • E.g., some automatic metrics may favour statistical systems

  7. Geneval • Proposal to evaluate NLG evaluation • Core idea is to evaluate in many ways a set of systems with similar input/output functionality, and see how well different evaluation techniques correlate • Anja Belz and Ehud Reiter • Hope to submit to EPSRC (roughly similar to NSF in the US) soon

  8. NLG Evaluation • Many types • Task-based, human ratings, BLEU-like metrics, etc • Little consensus on best technique • I.e., which technique is most appropriate in a given context • Poorly understood

  9. Some open questions • How well do different types correlate? • E.g., does BLEU predict human ratings? • Are there biases? • E.g., are statistical NLG systems over/under rated by some techniques? • What is the best design? • Number of subjects, subject expertise, number (and quality) of reference texts, etc
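As a concrete illustration of the correlation question on slide 9 (not part of the original presentation), the sketch below compares per-system automatic scores against mean human ratings using Pearson and Spearman correlation. All system names and scores are invented for illustration only.

```python
# Hypothetical sketch: does an automatic metric (e.g. BLEU) predict human ratings?
# System names and all scores below are made up for illustration.
from scipy.stats import pearsonr, spearmanr

systems       = ["sys_A", "sys_B", "sys_C", "sys_D", "sys_E"]
bleu_scores   = [0.42, 0.35, 0.51, 0.28, 0.47]   # one automatic score per system
human_ratings = [3.8,  3.1,  4.2,  2.9,  3.6]    # mean human rating per system

for name, b, h in zip(systems, bleu_scores, human_ratings):
    print(f"{name}: BLEU={b:.2f}, human={h:.1f}")

r, p_r = pearsonr(bleu_scores, human_ratings)        # linear correlation
rho, p_rho = spearmanr(bleu_scores, human_ratings)   # rank correlation
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}), Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```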

  10. Belz and Reiter (2006) • Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics • Found OK (not wonderful) correlation, but also some biases • Geneval: do this on a much larger scale • More domains, more systems, more evaluation techniques (including new ones), etc

  11. Geneval: Possible Domains • Weather forecasts (not wind statements) • Use SumTime corpus • Referring expressions • Use Prodigy-Grec or Tuna corpus • Medical summaries • Use Babytalk corpus • Statistical summaries • Use Atlas corpus

  12. Geneval: Evaluation techniques • Human task-based • E.g., referential success • Human ratings • Likert vs preference; expert vs non-expert • Automatic metrics based on reference texts • BLEU, ROUGE, METEOR, etc • Automatic metrics without reference texts • MT T and X scores, length
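To make the reference-based metrics on slide 12 concrete, here is a minimal sketch (not from the presentation) that computes corpus-level BLEU with NLTK. The forecast-like texts are toy examples, not SumTime data.

```python
# Minimal sketch of a reference-based automatic metric: corpus BLEU via NLTK.
# The tokenised texts below are toy examples for illustration only.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference texts (tokenised) per generated text.
references = [
    [["southwesterly", "winds", "easing", "by", "evening"]],
    [["gale", "force", "winds", "expected", "later"]],
]
hypotheses = [
    ["southwesterly", "winds", "easing", "this", "evening"],
    ["gale", "force", "winds", "later"],
]

smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short texts
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"corpus BLEU = {score:.3f}")
```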

  13. Geneval: new techniques • Would also like to explore and develop new evaluation techniques • Post-edit based human evaluations? • Automatic metrics which look at semantic features? • Open to suggestions for other ideas!
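One way a post-edit based evaluation could be operationalised (a hedged sketch, not a technique proposed in the slides) is to measure how much a human editor has to change a system's output: the word-level edit distance between the generated text and its post-edited version, normalised by the length of the edited text. The example texts below are invented.

```python
# Sketch of a possible post-edit based measure: word-level edit distance between
# a system output and its human post-edited version, normalised by edited length.
# Lower values mean less editing was needed. Example texts are invented.
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

system_output = "winds increasing to gale force by late evening".split()
post_edited   = "winds increasing to gale force 8 by evening".split()

post_edit_rate = edit_distance(system_output, post_edited) / len(post_edited)
print(f"post-edit rate = {post_edit_rate:.2f}")
```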

  14. Would like systems contributed • Study would be better if other people would contribute systems • We supply data sets and corpora, and carry out evaluations • So you can focus 100% on your great new algorithmic ideas!

  15. Geneval from a STEC perspective • Sort of like a STEC??? • If people contribute systems based on our data sets and corpora • But results will be anonymised • Only the developer of system X knows how well X did • One-off exercises, not repeated • Multiple evaluation techniques • Hope data sets will reduce barriers to entry

  16. Geneval • Please let Anja or me know if • You have general comments, and/or • You have a suggestion for an additional evaluation technique • You might be interested in contributing a system
