Meteor &amp; M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output

1 / 1

# Meteor &amp; M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output - PowerPoint PPT Presentation

Objective. MT Evaluation has become integral to the development of SMT systems which can be tuned towards the evaluation metrics directly. To be tuned against, a metric should be fast and easy to compute

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Meteor &amp; M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output' - ulla-mckay

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Objective

• MT Evaluation has become integral to the development of SMT systems which can be tuned towards the evaluation metrics directly.
• To be tuned against, a metric should be fast and easy to compute
• In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute.

Computing Ranking Correlation

• Convert binary judgements into full rankings
• Build a directed graph with nodes representing individual hypothesis and edges representing binary judgements
• Topologically sort the graph
• For one source sentence with N hypothesis

Introduction to METEOR

• Computes a one to one word alignment between reference and hypothesis using a matching module
• Uses unigram precision and unigram recall along with a fragmentation penalty as score components
• In case of multiple references, the best scoring reference-hypothesis pair is chosen
• Average across all source sentences

Matching Module

Results

• Matcher uses various word mapping modules to identify all possible word matches
• Exact : Match words with same surface forms
• porter_stem: Match words with same root
• wn_synonymy: Match words based on WordNet synsets
• Compute the alignment having fewest number of crossing edges
• 3-fold Cross validation results on the WMT 2007 data (Average Spearman Correlation)‏
• Results on the WMT 2008 data( % of correct binary judgements)‏

Matcher Example

The Sri Lanka prime minister criticizes the leader of the country

Flexible Matching for BLEU

President of Sri Lankacriticized by prime minister of the country

• The flexible matching in METEOR can be used to extend any metric that needs word to word matching
• Compute the alignment between reference and hypothesis using Meteor matcher
• Re-write the reference by replacing matched words with the words from hypothesis.
• Compute BLEU with the new references

Score Computation

• Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed
• P and R are combined into a parametrized harmonic mean
• To account for the differences in the order of the unigrams matched, a fragmentation penalty is computed as the ratio of the consecutive segments matched to the total number of unigrams matched
• Fragmentation penalty and final score are computed as follows.

Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output

Abhaya Agarwal & Alon Lavie

Language Technologies Institute

Carnegie Mellon University

Parameter Tuning

• Earlier metric was tuned to get good correlations with adequacy and fluency style human judgments
• Re-tuned to optimize correlation with human ranking data from last year's WMT shared task

Results

• Average BLEU and M-BLEU scores on WMT 2008 data‏

Parameter Tuning

• No consistent gains across languages seen in correlations at the segment level on WMT 2007 data
• Similar mixed patterns seen in WMT 2008 data as well. (as reported in [Callison-Burch et al 2008])‏
• The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgements.
• Since the ranges of the parameters are bounded, exhaustive search is use