
Evaluation: State-of-the-art and future actions


Presentation Transcript


  1. Evaluation: State-of-the-art and future actions Bente Maegaard, CST, University of Copenhagen, bente@cst.dk

  2. Evaluation at LREC • More than 150 papers were submitted to the Evaluation track, both Written and Spoken. • This is a significant rise compared to previous years. • Evaluation as a field is attracting increasing interest. • Many papers discuss evaluation methodology; the field is still under development, and some methodological questions remain open. • An example: MT • Automatic evaluation • Evaluation in Context (task-based, function-based) Bente Maegaard, LREC 2006

  3. Evaluation, Written track: papers per topic • Parsing evaluation: 6 • Semantics, sense: 6 • Evaluation methodologies: 7 • Time annotation: 9 • MT: 13 • Annotation, alignment, morphology: 15 • Lexica, tools: 21 • QA, IR, IE, summarisation, authoring: 25 • Total: 102 • Note: these figures may include papers that were originally in other tracks. Bente Maegaard, LREC 2006

  4. Discussion: MT evaluation • MT evaluation since 1965 • Van Slype: adequacy, fluency, fidelity • Human evaluation: expensive, time-consuming, problems with counting errors; is it objective? • Formalising human evaluation, adding e.g. grammaticality • Another measure: cost of post-editing, which is objective • Automatic evaluation: Papineni et al. 2001: BLEU, with various modifications (see the sketch below). Expensive to establish the reference translations; after that, cheap and fast. • However, research shows that this automatic method does not correlate well with human evaluation, nor with the cost of post-editing. • Automatic statistical evaluation can probably be used to evaluate MT for gisting, but it cannot be used for MT for publishing. Bente Maegaard, LREC 2006
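As a rough illustration of what this kind of automatic evaluation measures, here is a minimal, self-contained Python sketch of BLEU-style scoring (modified n-gram precision with a brevity penalty, after Papineni et al.). The function names and the toy sentences are invented for this example; real BLEU implementations add smoothing and corpus-level aggregation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # Count all n-grams of length n in the token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, references, max_n=4):
        # Sentence-level BLEU: modified n-gram precision plus a brevity penalty.
        cand = candidate.split()
        refs = [r.split() for r in references]
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            precisions.append(clipped / max(sum(cand_counts.values()), 1))
        if min(precisions) == 0:
            return 0.0  # no smoothing in this sketch
        # Brevity penalty: penalise candidates shorter than the closest reference.
        ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    # Toy example (bigram BLEU so the score is non-zero for such short sentences).
    print(bleu("the cat sat on the mat", ["the cat is on the mat"], max_n=2))

The single number hides exactly the problem raised above: it rewards surface overlap with the chosen references, so a good translation phrased differently can score poorly, which is one reason its correlation with human judgements and with post-editing cost is limited.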

  5. Digression: Metrics that do not work • Why is it so difficult to evaluate MT? • Because there is more than one correct answer. • And because answers may be more or less correct. • Measures like WER are not relevant for MT. • Methods relying on the translation having a specific number of words break down when the translation is not the same length as the reference (illustrated below). Bente Maegaard, LREC 2006
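To make the WER point concrete, below is a minimal word error rate sketch in Python, computed as word-level edit distance against a single reference. The sentences are made up for illustration; they simply show that an acceptable alternative translation can receive a very high error rate because it does not match the one reference word for word.

    def wer(hypothesis, reference):
        # Word error rate: word-level Levenshtein distance / reference length.
        hyp, ref = hypothesis.split(), reference.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution or match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    reference = "the meeting was postponed until next week"
    print(wer("the meeting was postponed until next week", reference))         # 0.0
    print(wer("they put off the meeting until the following week", reference)) # high

The design choice that matters here is the single fixed reference: WER charges every divergence from it as an error, which is reasonable for speech recognition transcripts but, as the slide argues, not for translation, where several differently worded outputs can all be correct.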

  6. Generic Contextual Quality Model (GCQM) • Popescu-Belis et al., LREC 2006 • Building on the same thinking as the FEMTI taxonomy • One can only evaluate a system in the context in which it will be used. • Quality workshop 27/5: task-based, function-based evaluation (Huang, Take). • Karen Spärck Jones: 'the set-up' • So the understanding that a system can only be reasonably evaluated with respect to a specific task is now accepted. • Domain-specific vs. general-purpose MT Bente Maegaard, LREC 2006

  7. What do we need? When? • What? • In the field of MT evaluation we need more experiments in order to establish a methodology. • The French CESTA campaign (Hamon et al., LREC 2006) is a good example. • So we need international cooperation for the infrastructure, but in the first instance this cooperation should lead to reliable metrics for MT evaluation. Later on it can be used for actually measuring MT systems' performance. • (Of course not only MT!) • When? • As soon as possible. • Start with methodology, for each application • Move on to doing evaluation • Goal: by 2011 we can reliably evaluate MT, and other applications! Bente Maegaard, LREC 2006
