
Evaluation: State-of-the-art and future actions


Presentation Transcript


  1. Evaluation: State-of-the-art and future actions Bente Maegaard, CST, University of Copenhagen, bente@cst.dk

  2. Evaluation at LREC • More than 150 papers were submitted to the Evaluation track, both Written and Spoken. • This is a significant rise compared to previous years. • Evaluation as a field is attracting increasing interest. • Many papers discuss evaluation methodology; the field is still under development, and some methodological questions remain open. • An example: MT • Automatic evaluation • Evaluation in Context (task-based, function-based) Bente Maegaard, LREC 2006

  3. Evaluation, Written track: papers per topic • Parsing evaluation: 6 • Semantics, sense: 6 • Evaluation methodologies: 7 • Time annotation: 9 • MT: 13 • Annotation, alignment, morphology: 15 • Lexica, tools: 21 • QA, IR, IE, summarisation, authoring: 25 • Total: 102 • Note: these figures may include papers that were originally in other tracks. Bente Maegaard, LREC 2006

  4. Discussion: MT evaluation • MT evaluation since 1965 • Van Slype: adequacy, fluency, fidelity • Human evaluation: expensive, time-consuming, problems with counting errors; is it objective? • Formalising human evaluation, adding e.g. grammaticality • Another measure: cost of post-editing, which is objective • Automatic evaluation: Papineni et al. 2001: BLEU, with various modifications (see the sketch below). Expensive to establish the reference translations; after that, cheap and fast. • However, research shows that this automatic method does not correlate well with human evaluation, nor with the cost of post-editing. • Automatic statistical evaluation can probably be used to evaluate MT for gisting, but it cannot be used for MT for publishing. Bente Maegaard, LREC 2006
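As a rough illustration of what this kind of automatic evaluation measures, here is a minimal, self-contained Python sketch of BLEU-style scoring (modified n-gram precision with a brevity penalty, after Papineni et al.). The function names and the toy sentences are invented for this example; real BLEU implementations add smoothing and corpus-level aggregation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # Count all n-grams of length n in the token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, references, max_n=4):
        # Sentence-level BLEU: modified n-gram precision plus a brevity penalty.
        cand = candidate.split()
        refs = [r.split() for r in references]
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            precisions.append(clipped / max(sum(cand_counts.values()), 1))
        if min(precisions) == 0:
            return 0.0  # no smoothing in this sketch
        # Brevity penalty: penalise candidates shorter than the closest reference.
        ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    # Toy example (bigram BLEU so the score is non-zero for such short sentences).
    print(bleu("the cat sat on the mat", ["the cat is on the mat"], max_n=2))

The single number hides exactly the problem raised above: it rewards surface overlap with the chosen references, so a good translation phrased differently can score poorly, which is one reason its correlation with human judgements and with post-editing cost is limited.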

  5. Digression: Metrics that do not work • Why is it so difficult to evaluate MT? • Because there is more than one correct answer. • And because answers may be more or less correct. • Measures like WER are not relevant for MT. • Methods relying on the translation having a specific number of words break down when the translation is not the same length as the reference (illustrated below). Bente Maegaard, LREC 2006
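To make the WER point concrete, below is a minimal word error rate sketch in Python, computed as word-level edit distance against a single reference. The sentences are made up for illustration; they simply show that an acceptable alternative translation can receive a very high error rate because it does not match the one reference word for word.

    def wer(hypothesis, reference):
        # Word error rate: word-level Levenshtein distance / reference length.
        hyp, ref = hypothesis.split(), reference.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution or match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    reference = "the meeting was postponed until next week"
    print(wer("the meeting was postponed until next week", reference))         # 0.0
    print(wer("they put off the meeting until the following week", reference)) # high

The design choice that matters here is the single fixed reference: WER charges every divergence from it as an error, which is reasonable for speech recognition transcripts but, as the slide argues, not for translation, where several differently worded outputs can all be correct.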

  6. Generic Contextual Quality Model (GCQM) • Popescu-Belis et al., LREC 2006 • Building on the same thinking as the FEMTI taxonomy • One can only evaluate a system in the context in which it will be used. • Quality workshop 27/5: task-based, function-based evaluation (Huang, Take). • Karen Spärck Jones: 'the set-up' • So the understanding that a system can only be reasonably evaluated with respect to a specific task is now accepted. • Domain-specific vs. general-purpose MT Bente Maegaard, LREC 2006

  7. What do we need? When? • What? • In the field of MT evaluation we need more experiments in order to establish a methodology. • The French CESTA campaign (Hamon et al., LREC 2006) is a good example. • So we need international cooperation for the infrastructure, but in the first instance this cooperation should lead to reliable metrics for MT evaluation. Later on it can be used for actually measuring MT systems' performance. • (Of course not only MT!) • When? • As soon as possible. • Start with methodology, for each application • Move on to doing evaluation • Goal: by 2011 we can reliably evaluate MT, and other applications! Bente Maegaard, LREC 2006
