
METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation


Presentation Transcript


1. METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation
Alon Lavie and Abhaya Agarwal, Language Technologies Institute, Carnegie Mellon University

2. METEOR and M-BLEU: METEOR
• Originally developed in 2005 as an automatic metric designed for higher correlation with human judgments at the sentence level
• Main ingredients:
  • Extended (flexible) matching between translation and reference
  • Unigram precision and recall, combined into a parameterized F-measure
  • Reordering penalty
• Parameters can be tuned to optimize correlation with human judgments
• Our previous work established improved correlation with human judgments (compared with BLEU and other metrics)
  • Not biased against "non-statistical" MT systems
  • Only metric that correctly ranked the NIST MT Eval-06 Arabic systems
• Used as the primary metric in the DARPA TRANSTAC and ET-07 evaluations, and as one of several metrics in NIST MT Eval, IWSLT, and WMT
• Main innovation in the latest version (used for WMT-08): ranking-based parameter optimization

3. METEOR and M-BLEU: METEOR – Flexible Word Matching
• Words in the reference translation and the MT hypothesis are matched using a series of modules with increasingly loose matching criteria:
  • Exact match
  • Porter stemmer
  • WordNet-based synonymy
• A word-to-word alignment for the sentence pair is computed from these word-level matches (a sketch of the matching cascade follows)
• Finding the optimal alignment is NP-hard in general; a fast approximate search is used
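The matching cascade can be sketched roughly as follows. This is an illustrative approximation using NLTK's Porter stemmer and WordNet, not the official METEOR implementation; the function names are made up for the example.

    # Illustrative sketch of METEOR-style cascaded word matching (not the
    # official METEOR code). Assumes NLTK with the WordNet data installed.
    from nltk.stem.porter import PorterStemmer
    from nltk.corpus import wordnet

    stemmer = PorterStemmer()

    def wordnet_synonyms(word):
        # All WordNet lemma names appearing in any synset of `word`.
        return {lemma.name().lower()
                for synset in wordnet.synsets(word)
                for lemma in synset.lemmas()}

    def match_type(hyp_word, ref_word):
        # Apply the matching modules in order of increasingly loose criteria.
        h, r = hyp_word.lower(), ref_word.lower()
        if h == r:
            return "exact"
        if stemmer.stem(h) == stemmer.stem(r):
            return "stem"
        if r in wordnet_synonyms(h) or h in wordnet_synonyms(r):
            return "synonym"
        return None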

4. METEOR and M-BLEU: Alignment Example
The Sri Lanka prime minister criticizes the leader of the country
President of Sri Lanka criticized by the country's prime minister

5.-7. METEOR and M-BLEU: Alignment Example (animation frames)
The same sentence pair with matched words highlighted incrementally: exact matches such as "Sri Lanka" and "prime minister" first, then the stem match "criticizes" ~ "criticized", building up to the complete word-to-word alignment.

8. METEOR and M-BLEU: METEOR – Score Computation
• Weighted combination of the unigram precision and recall
• A fragmentation penalty to address fluency
  • A "chunk" is a monotonic sequence of aligned words
• Final score combines the F-mean with the fragmentation penalty (see the sketch below)
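A minimal sketch of the score computation, following the published METEOR formulation: Fmean = P·R / (α·P + (1−α)·R), Penalty = γ·(#chunks / #matches)^β, Score = Fmean·(1 − Penalty). The parameter defaults shown are the commonly cited original values, not the WMT-08 tuned values.

    # Minimal sketch of the METEOR score computation (published formulation);
    # parameter defaults are illustrative, not the WMT-08 tuned values.
    def meteor_score(matches, hyp_len, ref_len, chunks,
                     alpha=0.9, beta=3.0, gamma=0.5):
        if matches == 0:
            return 0.0
        precision = matches / hyp_len                     # unigram precision
        recall = matches / ref_len                        # unigram recall
        # Parameterized F-mean: P*R / (alpha*P + (1 - alpha)*R)
        fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
        # Fragmentation penalty: gamma * (#chunks / #matches)^beta
        penalty = gamma * (chunks / matches) ** beta
        return fmean * (1 - penalty)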

9. METEOR and M-BLEU: METEOR Parameter Tuning
• The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgements
• Since the ranges of the parameters are bounded, we perform an exhaustive search (sketched below)
• The current official release of METEOR was tuned to obtain good correlations with adequacy and fluency human judgements
• For WMT-08, we re-tuned to optimize correlation with the human ranking data released from last year's WMT shared task
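A sketch of what the exhaustive search looks like. The parameter ranges and step size are illustrative assumptions; `objective` stands in for whatever correlation with human judgements is being maximized on the development data.

    # Exhaustive grid search over the three free parameters (illustrative).
    import itertools
    import numpy as np

    def tune_parameters(objective, step=0.05):
        # objective(alpha, beta, gamma) -> correlation with human judgements
        alphas = np.arange(0.0, 1.0 + 1e-9, step)
        betas = np.arange(0.0, 3.0 + 1e-9, step)
        gammas = np.arange(0.0, 1.0 + 1e-9, step)
        best_params, best_corr = None, float("-inf")
        for alpha, beta, gamma in itertools.product(alphas, betas, gammas):
            corr = objective(alpha, beta, gamma)
            if corr > best_corr:
                best_params, best_corr = (alpha, beta, gamma), corr
        return best_params, best_corr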

10. METEOR and M-BLEU: Computing Ranking Correlation
• Convert binary judgements into full rankings:
  • Reject equal (tied) judgements
  • Build a directed graph with nodes representing individual hypotheses and edges representing binary judgements
  • Topologically sort the graph
• Compute the correlation for one source sentence with N hypotheses, then average across all source sentences (see the sketch below)
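A sketch of this procedure for a single source sentence; the data structures and function names are illustrative, and it assumes the pairwise judgements form a DAG once ties are discarded (Python 3.9+ and SciPy).

    # Convert pairwise judgements into a full ranking, then correlate a metric
    # against it (illustrative sketch, per source sentence).
    from graphlib import TopologicalSorter
    from scipy.stats import spearmanr

    def human_ranking(hypotheses, pairwise_judgments):
        # pairwise_judgments: iterable of (better_hyp, worse_hyp); ties dropped.
        graph = {h: set() for h in hypotheses}
        for better, worse in pairwise_judgments:
            graph[worse].add(better)           # "worse" depends on "better"
        order = list(TopologicalSorter(graph).static_order())
        return {h: rank for rank, h in enumerate(order)}   # rank 0 = best

    def segment_correlation(metric_scores, pairwise_judgments):
        # Spearman correlation between metric scores and the induced ranking;
        # averaged across all source sentences elsewhere.
        hyps = list(metric_scores)
        ranks = human_ranking(hyps, pairwise_judgments)
        rho, _ = spearmanr([metric_scores[h] for h in hyps],
                           [-ranks[h] for h in hyps])      # higher = better
        return rho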

11. METEOR and M-BLEU: Results
• 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation)
• Results on the WMT 2008 data (% of correct binary judgments)

12. METEOR and M-BLEU: Flexible Matching for BLEU, TER
• The flexible matching in METEOR can be used to extend any metric that is based on word overlap between the translation and the reference(s):
  • Compute the alignment between reference and hypothesis using the METEOR matcher
  • Create a "targeted" reference by substituting words in the reference with their matched equivalents from the translation hypothesis (see the sketch below)
  • Compute any metric (BLEU, TER, etc.) with the new targeted references
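A sketch of the targeted-reference step. The alignment format (reference position to hypothesis position) and the whitespace tokenization are illustrative simplifications.

    # M-BLEU-style "targeted" reference construction (illustrative sketch).
    def targeted_reference(reference, hypothesis, alignment):
        # Replace each aligned reference word with its hypothesis-side match,
        # so stem/synonym matches become exact matches for BLEU, TER, etc.
        ref_tokens = reference.split()
        hyp_tokens = hypothesis.split()
        for ref_idx, hyp_idx in alignment.items():
            ref_tokens[ref_idx] = hyp_tokens[hyp_idx]
        return " ".join(ref_tokens)

    # Example: with "criticized" (reference) aligned to "criticizes"
    # (hypothesis), the targeted reference contains "criticizes", so BLEU
    # now credits the surrounding n-grams.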

13. METEOR and M-BLEU: M-BLEU – Results
• Average BLEU and M-BLEU scores on the WMT 2008 data
• No consistent gains in correlation at the segment level on the WMT 2007 data
• Similar mixed patterns on the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])

14. METEOR and M-BLEU: Discussion and Future Work
• METEOR has consistently demonstrated improved levels of correlation with human judgements in multiple evaluations in recent years
  • Simple and relatively fast to compute
  • Some minor issues with using it in MERT are being resolved
  • Use it to evaluate your system, even if you tune to BLEU!
• The performance of MT metrics varies quite a lot across languages, genres and years
  • Partly due to the lack of a good methodology for evaluating the metrics themselves
• More sophisticated paraphrase detection methods (multi-word correspondences, such as compounds in German) would be useful for some languages

15. METEOR and M-BLEU: Thank You! Questions?
