
METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation


Presentation Transcript


1. METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation
Alon Lavie and Abhaya Agarwal, Language Technologies Institute, Carnegie Mellon University

2. METEOR and M-BLEU: METEOR
• Originally developed in 2005 as an automatic metric designed for higher correlation with human judgments at the sentence level
• Main ingredients:
  • Extended (flexible) matching between translation and reference
  • Unigram precision and recall, combined into a parameterized F-measure
  • Reordering penalty
• Parameters can be tuned to optimize correlation with human judgments
• Our previous work established improved correlation with human judgments (compared with BLEU and other metrics)
  • Not biased against "non-statistical" MT systems
  • Only metric that correctly ranked the NIST MT Eval-06 Arabic systems
• Used as the primary metric in the DARPA TRANSTAC and ET-07 evaluations, and as one of several metrics in NIST MT Eval, IWSLT, and WMT
• Main innovation in the latest version (used for WMT-08): ranking-based parameter optimization

3. METEOR and M-BLEU: METEOR – Flexible Word Matching
• Words in the reference translation and the MT hypothesis are matched using a series of modules with increasingly loose matching criteria:
  • Exact match
  • Porter stemmer
  • WordNet-based synonymy
• A word-to-word alignment for the sentence pair is computed from these word-level matches (a sketch of the matching cascade follows)
• Finding the optimal alignment is NP-hard in general; a fast approximate search is used
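The matching cascade can be sketched roughly as follows. This is an illustrative approximation using NLTK's Porter stemmer and WordNet, not the official METEOR implementation; the function names are made up for the example.

    # Illustrative sketch of METEOR-style cascaded word matching (not the
    # official METEOR code). Assumes NLTK with the WordNet data installed.
    from nltk.stem.porter import PorterStemmer
    from nltk.corpus import wordnet

    stemmer = PorterStemmer()

    def wordnet_synonyms(word):
        # All WordNet lemma names appearing in any synset of `word`.
        return {lemma.name().lower()
                for synset in wordnet.synsets(word)
                for lemma in synset.lemmas()}

    def match_type(hyp_word, ref_word):
        # Apply the matching modules in order of increasingly loose criteria.
        h, r = hyp_word.lower(), ref_word.lower()
        if h == r:
            return "exact"
        if stemmer.stem(h) == stemmer.stem(r):
            return "stem"
        if r in wordnet_synonyms(h) or h in wordnet_synonyms(r):
            return "synonym"
        return None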

4. METEOR and M-BLEU: Alignment Example
The Sri Lanka prime minister criticizes the leader of the country
President of Sri Lanka criticized by the country's prime minister

5.-7. METEOR and M-BLEU: Alignment Example (animation frames)
The same sentence pair with matched words highlighted incrementally: exact matches such as "Sri Lanka" and "prime minister" first, then the stem match "criticizes" ~ "criticized", building up to the complete word-to-word alignment.

8. METEOR and M-BLEU: METEOR – Score Computation
• Weighted combination of the unigram precision and recall
• A fragmentation penalty to address fluency
  • A "chunk" is a monotonic sequence of aligned words
• Final score combines the F-mean with the fragmentation penalty (see the sketch below)
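A minimal sketch of the score computation, following the published METEOR formulation: Fmean = P·R / (α·P + (1−α)·R), Penalty = γ·(#chunks / #matches)^β, Score = Fmean·(1 − Penalty). The parameter defaults shown are the commonly cited original values, not the WMT-08 tuned values.

    # Minimal sketch of the METEOR score computation (published formulation);
    # parameter defaults are illustrative, not the WMT-08 tuned values.
    def meteor_score(matches, hyp_len, ref_len, chunks,
                     alpha=0.9, beta=3.0, gamma=0.5):
        if matches == 0:
            return 0.0
        precision = matches / hyp_len                     # unigram precision
        recall = matches / ref_len                        # unigram recall
        # Parameterized F-mean: P*R / (alpha*P + (1 - alpha)*R)
        fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
        # Fragmentation penalty: gamma * (#chunks / #matches)^beta
        penalty = gamma * (chunks / matches) ** beta
        return fmean * (1 - penalty)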

9. METEOR and M-BLEU: METEOR Parameter Tuning
• The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgements
• Since the ranges of the parameters are bounded, we perform an exhaustive search (sketched below)
• The current official release of METEOR was tuned to obtain good correlations with adequacy and fluency human judgements
• For WMT-08, we re-tuned to optimize correlation with the human ranking data released from last year's WMT shared task
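A sketch of what the exhaustive search looks like. The parameter ranges and step size are illustrative assumptions; `objective` stands in for whatever correlation with human judgements is being maximized on the development data.

    # Exhaustive grid search over the three free parameters (illustrative).
    import itertools
    import numpy as np

    def tune_parameters(objective, step=0.05):
        # objective(alpha, beta, gamma) -> correlation with human judgements
        alphas = np.arange(0.0, 1.0 + 1e-9, step)
        betas = np.arange(0.0, 3.0 + 1e-9, step)
        gammas = np.arange(0.0, 1.0 + 1e-9, step)
        best_params, best_corr = None, float("-inf")
        for alpha, beta, gamma in itertools.product(alphas, betas, gammas):
            corr = objective(alpha, beta, gamma)
            if corr > best_corr:
                best_params, best_corr = (alpha, beta, gamma), corr
        return best_params, best_corr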

10. METEOR and M-BLEU: Computing Ranking Correlation
• Convert binary judgements into full rankings:
  • Reject equal (tied) judgements
  • Build a directed graph with nodes representing individual hypotheses and edges representing binary judgements
  • Topologically sort the graph
• Compute the correlation for one source sentence with N hypotheses, then average across all source sentences (see the sketch below)
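A sketch of this procedure for a single source sentence; the data structures and function names are illustrative, and it assumes the pairwise judgements form a DAG once ties are discarded (Python 3.9+ and SciPy).

    # Convert pairwise judgements into a full ranking, then correlate a metric
    # against it (illustrative sketch, per source sentence).
    from graphlib import TopologicalSorter
    from scipy.stats import spearmanr

    def human_ranking(hypotheses, pairwise_judgments):
        # pairwise_judgments: iterable of (better_hyp, worse_hyp); ties dropped.
        graph = {h: set() for h in hypotheses}
        for better, worse in pairwise_judgments:
            graph[worse].add(better)           # "worse" depends on "better"
        order = list(TopologicalSorter(graph).static_order())
        return {h: rank for rank, h in enumerate(order)}   # rank 0 = best

    def segment_correlation(metric_scores, pairwise_judgments):
        # Spearman correlation between metric scores and the induced ranking;
        # averaged across all source sentences elsewhere.
        hyps = list(metric_scores)
        ranks = human_ranking(hyps, pairwise_judgments)
        rho, _ = spearmanr([metric_scores[h] for h in hyps],
                           [-ranks[h] for h in hyps])      # higher = better
        return rho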

11. METEOR and M-BLEU: Results
• 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation)
• Results on the WMT 2008 data (% of correct binary judgments)

12. METEOR and M-BLEU: Flexible Matching for BLEU, TER
• The flexible matching in METEOR can be used to extend any metric that is based on word overlap between the translation and the reference(s):
  • Compute the alignment between reference and hypothesis using the METEOR matcher
  • Create a "targeted" reference by substituting words in the reference with their matched equivalents from the translation hypothesis (see the sketch below)
  • Compute any metric (BLEU, TER, etc.) with the new targeted references
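A sketch of the targeted-reference step. The alignment format (reference position to hypothesis position) and the whitespace tokenization are illustrative simplifications.

    # M-BLEU-style "targeted" reference construction (illustrative sketch).
    def targeted_reference(reference, hypothesis, alignment):
        # Replace each aligned reference word with its hypothesis-side match,
        # so stem/synonym matches become exact matches for BLEU, TER, etc.
        ref_tokens = reference.split()
        hyp_tokens = hypothesis.split()
        for ref_idx, hyp_idx in alignment.items():
            ref_tokens[ref_idx] = hyp_tokens[hyp_idx]
        return " ".join(ref_tokens)

    # Example: with "criticized" (reference) aligned to "criticizes"
    # (hypothesis), the targeted reference contains "criticizes", so BLEU
    # now credits the surrounding n-grams.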

13. METEOR and M-BLEU: M-BLEU – Results
• Average BLEU and M-BLEU scores on the WMT 2008 data
• No consistent gains in correlation at the segment level on the WMT 2007 data
• Similar mixed patterns on the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])

14. METEOR and M-BLEU: Discussion and Future Work
• METEOR has consistently demonstrated improved levels of correlation with human judgements in multiple evaluations in recent years
  • Simple and relatively fast to compute
  • Some minor issues with using it in MERT are being resolved
  • Use it to evaluate your system, even if you tune to BLEU!
• The performance of MT metrics varies quite a lot across languages, genres and years
  • Partly due to the lack of a good methodology for evaluating the metrics themselves
• More sophisticated paraphrase detection methods (multi-word correspondences, such as compounds in German) would be useful for some languages

15. METEOR and M-BLEU: Thank You! Questions?
