Meteor & M-BLEU: Evaluation Metrics for High Correlation with Human Rankings of MT Output

Objective

  • MT evaluation has become integral to the development of SMT systems, which can be tuned directly towards an evaluation metric.
  • To be tuned against, a metric should be fast and easy to compute.
  • In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute.

Computing Ranking Correlation

  • Convert binary judgments into full rankings (see the sketch after this list)
      • Build a directed graph with nodes representing individual hypotheses and edges representing binary judgments
      • Topologically sort the graph
    • For one source sentence with N hypotheses
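
A minimal sketch of this conversion, using the Python 3.9+ standard-library graphlib; the system names and judgments are illustrative, and real WMT judgment sets would additionally need handling for ties and for contradictory (cyclic) judgments:

```python
# Turn pairwise human judgments into a full ranking via topological sort.
from graphlib import TopologicalSorter

# Each pair (winner, loser) is one binary judgment: "winner" was ranked
# above "loser" for the same source sentence. Names are illustrative.
judgments = [("sys_a", "sys_b"), ("sys_a", "sys_c"), ("sys_b", "sys_c")]

# Directed graph: each node maps to the set of nodes judged better than it.
graph = {}
for winner, loser in judgments:
    graph.setdefault(winner, set())
    graph.setdefault(loser, set()).add(winner)

# static_order() lists every node after all of its predecessors, i.e. a
# full ranking consistent with every binary judgment (assuming no cycles).
ranking = list(TopologicalSorter(graph).static_order())
print(ranking)  # ['sys_a', 'sys_b', 'sys_c']
```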

Introduction to METEOR

  • Computes a one-to-one word alignment between reference and hypothesis using a matching module
  • Uses unigram precision and unigram recall along with a fragmentation penalty as score components
  • In case of multiple references, the best-scoring reference-hypothesis pair is chosen
  • The final score is averaged across all source sentences

Matching Module

  • The matcher uses several word-mapping modules to identify all possible word matches:
    • exact: match words with the same surface form
    • porter_stem: match words with the same stem (Porter stemmer)
    • wn_synonymy: match words based on WordNet synsets
  • Compute the alignment with the fewest crossing edges (see the sketch below)
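
A minimal sketch of the crossing-edge criterion; the alignment representation ((hypothesis index, reference index) pairs) and the function names are assumptions for illustration, not the actual Meteor matcher implementation:

```python
from itertools import combinations

def crossing_edges(alignment):
    """Count pairs of alignment links that cross each other."""
    return sum(
        1
        for (h1, r1), (h2, r2) in combinations(alignment, 2)
        if (h1 - h2) * (r1 - r2) < 0  # links cross when the two orders disagree
    )

def best_alignment(candidates):
    """Among candidate alignments (each built from exact, stem, and synonym
    matches), keep the one with the fewest crossing edges."""
    return min(candidates, key=crossing_edges)

monotone = [(0, 0), (1, 1), (2, 2)]
crossed = [(0, 2), (1, 1), (2, 0)]
print(crossing_edges(monotone), crossing_edges(crossed))  # 0 3
print(best_alignment([crossed, monotone]))  # [(0, 0), (1, 1), (2, 2)]
```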

Results

  • 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation)
  • Results on the WMT 2008 data (% of correct binary judgments)

Matcher Example

The Sri Lanka prime minister criticizes the leader of the country

President of Sri Lanka criticized by prime minister of the country

Flexible Matching for BLEU

  • The flexible matching in METEOR can be used to extend any metric that requires word-to-word matching (see the sketch after this list):
    • Compute the alignment between reference and hypothesis using the Meteor matcher
    • Re-write the reference by replacing matched words with the corresponding words from the hypothesis
    • Compute BLEU with the new references
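
A minimal sketch of this re-writing step, using the example sentences above; the alignment is hand-written here for illustration (in practice it comes from the Meteor matcher), and BLEU is computed with NLTK's sentence_bleu rather than the official scorer:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = "the Sri Lanka prime minister criticizes the leader of the country".split()
hypothesis = "President of Sri Lanka criticized by prime minister of the country".split()

# Hand-written (hypothesis_index, reference_index) links, including the
# stem match "criticized" -> "criticizes".
alignment = [(2, 1), (3, 2), (4, 5), (6, 3), (7, 4), (8, 8), (9, 9), (10, 10)]

# Re-write the reference: replace each matched reference word with the
# hypothesis word it aligns to, so stem/synonym matches become exact.
rewritten = list(reference)
for hyp_i, ref_i in alignment:
    rewritten[ref_i] = hypothesis[hyp_i]
# rewritten: "the Sri Lanka prime minister criticized the leader of the country"

# Bigram BLEU, to keep this toy example non-zero; the second score is
# higher because the stem match now counts as an exact unigram match.
print(sentence_bleu([reference], hypothesis, weights=(0.5, 0.5)))
print(sentence_bleu([rewritten], hypothesis, weights=(0.5, 0.5)))
```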

Score Computation

  • Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed
  • P and R are combined into a parametrized harmonic mean
  • To account for differences in the order of the matched unigrams, a fragmentation penalty is computed from the ratio of the number of matched chunks (consecutive segments) to the total number of matched unigrams
  • The fragmentation penalty and final score are computed as follows.
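
A sketch of the score computation, assuming the standard Meteor formulation; the parameter values below are placeholders, not the tuned values reported in the paper:

```python
def meteor_score(matches, hyp_len, ref_len, chunks,
                 alpha=0.8, beta=2.5, gamma=0.4):
    """Standard Meteor scoring; alpha, beta, gamma are placeholders."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len  # unigram precision P
    recall = matches / ref_len     # unigram recall R
    # Parametrized harmonic mean of P and R.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: fewer, longer chunks mean a smaller penalty.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# 5 unigrams matched in 3 chunks, hypothesis of 7 words, reference of 8:
print(meteor_score(matches=5, hyp_len=7, ref_len=8, chunks=3))
```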

Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output

Abhaya Agarwal & Alon Lavie

Language Technologies Institute

Carnegie Mellon University

Parameter Tuning

  • The earlier metric was tuned for good correlation with adequacy- and fluency-style human judgments
    • Re-tuned to optimize correlation with human ranking data from the previous year's WMT shared task

Results

  • Average BLEU and M-BLEU scores on the WMT 2008 data

Parameter Tuning

  • No consistent gains across languages are seen in segment-level correlations on the WMT 2007 data
  • Similar mixed patterns are seen in the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])
  • The three free parameters in the metric are tuned to obtain maximum correlation with human judgments.
  • Since the ranges of the parameters are bounded, exhaustive search is used (see the sketch below).
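
A minimal sketch of this search; the grid resolutions are illustrative, and correlation() stands in for computing the correlation with human judgments on development data:

```python
from itertools import product

def tune(correlation):
    """Exhaustively search the bounded parameter ranges for the
    (alpha, beta, gamma) triple with the highest correlation."""
    alphas = [i / 20 for i in range(21)]  # alpha in [0, 1]
    betas = [i / 10 for i in range(41)]   # beta  in [0, 4]
    gammas = [i / 20 for i in range(21)]  # gamma in [0, 1]
    return max(
        ((correlation(a, b, g), (a, b, g))
         for a, b, g in product(alphas, betas, gammas)),
        key=lambda t: t[0],
    )

# Toy stand-in objective, just to make the sketch runnable:
best_corr, best_params = tune(
    lambda a, b, g: -((a - 0.8) ** 2 + (b - 2.5) ** 2 + (g - 0.4) ** 2))
print(best_corr, best_params)  # -0.0 (0.8, 2.5, 0.4)
```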