
Measuring Confidence Intervals for MT Evaluation Metrics

Ying Zhang (Joy)

Stephan Vogel

Language Technologies Institute

School of Computer Science

Carnegie Mellon University


Outline

  • Automatic Machine Translation Evaluation

    • BLEU

    • Modified BLEU

    • NIST MTEval

  • Confidence Intervals based on Bootstrap Percentile

    • Algorithm

    • Comparing two MT systems

    • Implementation

  • Discussions

    • How much testing data is needed?

    • How many reference translations are needed?

    • How many bootstrap samples are needed?

Automatic Machine Translation Evaluation

  • Subjective MT evaluations

    • Fluency and Adequacy scored by human judges

    • Very expensive in time and money

  • Objective automatic MT evaluations

    • Inspired by the Word Error Rate (WER) metric used in ASR research

    • Measuring the “closeness” between the MT hypothesis and human reference translations

    • Precision: n-gram precision

    • Recall:

      • Against the best matched reference

      • Approximated by brevity penalty

    • Cheap, fast

    • Highly correlated with subjective evaluations

    • MT research has greatly benefited from automatic evaluations

    • Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM

BLEU Metrics

  • Proposed by IBM’s SMT group (Papineni et al., 2002)

  • Widely used in MT evaluations

    • DARPA TIDES MT evaluation

    • IWSLT evaluation

    • TC-Star

  • BLEU Metric:

    • Pn: Modified n-gram precision

    • Geometric mean of p1, p2, …, pN

    • BP: Brevity penalty

    • Usually, N=4 and wn=1/N.

c: length of the MT hypothesis; r: effective reference length
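For reference, the formula these bullets describe (Papineni et al., 2002):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```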

BLEU Metric

  • Example:

    • MT Hypothesis: the gunman was shot dead by police .

    • Reference 1: The gunman was shot to death by the police .

    • Reference 2: The gunman was shot to death by the police .

    • Reference 3: Police killed the gunman .

    • Reference 4: The gunman was shot dead by the police .

  • Precision: p1=1.0(8/8) p2=0.86(6/7) p3=0.67(4/6) p4=0.6 (3/5)

  • Brevity Penalty: c=8, r=9, BP=0.8825

  • Final Score: BLEU = 0.8825 × (1.0 × 6/7 × 4/6 × 3/5)^(1/4) ≈ 0.675

  • Usually n-gram precision and BP are calculated on the test set level
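As a quick check of the arithmetic above, a minimal Python sketch (the variable names are ours):

```python
import math

# (matched, total) n-gram counts for n = 1..4, from the example above
precisions = [(8, 8), (6, 7), (4, 6), (3, 5)]
c, r = 8, 9  # hypothesis length and effective reference length

bp = 1.0 if c > r else math.exp(1 - r / c)                  # brevity penalty
log_mean = sum(math.log(m / t) for m, t in precisions) / 4  # geometric mean of p1..p4
print(f"BP = {bp:.4f}, BLEU = {bp * math.exp(log_mean):.4f}")
# BP = 0.8825, BLEU = 0.6753
```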

Modified BLEU Metric

  • BLEU weights long n-grams heavily: with a geometric mean, a low 3- or 4-gram precision pulls the whole score down

  • Modified BLEU Metric (Zhang, 2004)

    • Arithmetic mean of the n-gram precision

    • More balanced contribution from different n-grams
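A sketch of the change in our notation, assuming the same BP and pn as in BLEU (only the mean changes):

```latex
\text{M-BLEU} = \mathrm{BP} \cdot \sum_{n=1}^{N} w_n\, p_n
```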

NIST MTEval Metric

  • Motivation

    • “Weight more heavily those n-grams that are more informative” (NIST 2002)

    • Uses an arithmetic mean of the information-weighted n-gram scores (in contrast to BLEU’s geometric mean)

  • Pros: more sensitive than BLEU

  • Cons:

    • Info gain for 2-gram and up is not meaningful

      • 80% of the score comes from unigram matches

      • Most matched 5-grams have an info gain of 0!

    • Score increases when the testing set size increases
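The information weight in question, as defined in the NIST 2002 report, with counts taken over the reference corpus:

```latex
\mathrm{Info}(w_1 \dots w_n) = \log_2 \left( \frac{\mathrm{count}(w_1 \dots w_{n-1})}{\mathrm{count}(w_1 \dots w_n)} \right)
```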

Questions Regarding MT Evaluation Metrics

  • Do they rank the MT systems in the same way as human judges?

    • IBM showed a strong correlation between BLEU and human judgments

  • How reliable are the automatic evaluation scores?

  • How sensitive is a metric?

    • Sensitivity: the metric should be able to distinguish between systems of similar performance

  • Is the metric consistent?

    • Consistency: the difference between systems is not affected by the selection of testing/reference data

  • How many reference translations are needed?

  • How much testing data is sufficient for evaluation?

  • If we can measure the confidence interval of the evaluation scores, we can answer the above questions

Outline

  • Overview of Automatic Machine Translation Evaluation

    • BLEU

    • Modified BLEU

    • NIST MTEval

  • Confidence Intervals based on Bootstrap Percentile

    • Algorithm

    • Comparing two MT systems

    • Implementation

  • Discussions

    • How much testing data is needed?

    • How many reference translations are needed?

    • How many bootstrap samples are needed?

Measuring the Confidence Intervals

  • One BLEU/M-BLEU/NIST score per test set

  • How accurate is this score?

  • To measure a confidence interval, a population of scores is required

  • Building a test set with multiple human reference translations is expensive

  • Solution: bootstrapping (Efron & Tibshirani 1986)

    • Introduced in 1979 as a computer-based method for estimating the standard error of a statistical estimate

    • Resampling: creating an artificial population by sampling with replacement

    • Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics

A Schematic of the Bootstrapping Process

[Figure: schematic of the bootstrapping process; resampled test sets T1 … TB are drawn from the original test set, and scoring each yields Score1 … ScoreB alongside the original Score0.]

An Efficient Implementation

  • Translate and evaluate 2,000 test sets?

    • No Way!

  • Resample the n-gram precision information for the sentences

    • Most MT systems translate each test sentence independently of the others;

    • MT evaluation metrics are computed from information collected for each test sentence

    • E.g. for BLEU/M-BLEU and NIST, each sentence caches its reference lengths, the closest reference length, and, per n-gram order, what appear to be the hypothesis n-gram count, the match count, and the summed information gain:

      RefLen:        17 20 19 24
      ClosestRefLen: 17
      1-gram: 15 10 89.34
      2-gram: 14  4  9.04
      3-gram: 13  3  3.65
      4-gram: 12  2  2.43

    • Similar for human judgment and other MT metrics

  • Approximation for NIST information gain

  • Scripts available at:http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
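A minimal sketch of this idea in Python (the data layout and all names are ours, not the released scripts’): resample sentences with replacement and recompute the corpus score from cached counts alone, here stored as hypothetical (matched, total) pairs.

```python
import math
import random

# Hypothetical cached statistics, one entry per test sentence, mirroring
# the layout above: hypothesis length, closest reference length, and
# (matched, total) n-gram counts for n = 1..4.
sentence_stats = [
    {"hyp_len": 15, "ref_len": 17, "ngrams": [(10, 15), (4, 14), (3, 13), (2, 12)]},
    {"hyp_len": 22, "ref_len": 20, "ngrams": [(18, 22), (11, 21), (7, 20), (4, 19)]},
    # ... one entry per test sentence ...
]

def corpus_bleu(stats):
    """Corpus-level BLEU from summed per-sentence counts; no re-translation."""
    c = sum(s["hyp_len"] for s in stats)
    r = sum(s["ref_len"] for s in stats)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(
        math.log(sum(s["ngrams"][n][0] for s in stats)
                 / sum(s["ngrams"][n][1] for s in stats))
        for n in range(4)
    ) / 4
    return bp * math.exp(log_mean)

# One bootstrap replicate: resample sentences with replacement.
replicate = random.choices(sentence_stats, k=len(sentence_stats))
print(corpus_bleu(replicate))
```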

Algorithm

Original test suite T0 with N segments and R reference translations.

Represent the i-th segment of T0 as an (R+1)-tuple: T0[i] = <si, ri1, ri2, …, riR>

for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
        s = random(1, N);    // draw a segment index uniformly, with replacement
        Tb[i] = T0[s];
    }
    calculate BLEU/M-BLEU/NIST for Tb;
}

Sort the B BLEU/M-BLEU/NIST scores

Output the scores ranked at the 2.5th and 97.5th percentiles (the 95% confidence interval)
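The same loop as runnable Python (a sketch; score_fn stands in for BLEU/M-BLEU/NIST computed over a list of segments, e.g. the corpus_bleu from the previous slide):

```python
import random

def bootstrap_ci(segments, score_fn, B=2000, alpha=0.05):
    """Bootstrap percentile confidence interval for a corpus-level MT metric."""
    n = len(segments)
    scores = sorted(
        score_fn([segments[random.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    lower = scores[int(B * alpha / 2)]            # 2.5th percentile
    upper = scores[int(B * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lower, upper
```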

Confidence Intervals

  • 7 Chinese-English MT systems from June 2002 TIDES evaluation

  • Observations:

    • Relative confidence interval: NIST < M-BLEU < BLEU

    • NIST scores have more discriminative power than BLEU

    • The strong impact of long n-grams makes the BLEU score less stable (or, put another way, introduces more noise)

Are Two MT Systems Different?

  • Comparing two MT systems’ performance

    • Using the same resampling method as for a single system, applied to the score difference

    • E.g. Diff(Sys1-Sys2): Median = -1.7355, interval [-1.9056, -1.5453]

    • If the confidence interval of the difference contains 0, the two systems are not significantly different

  • M-BLEU and NIST have more discriminative power than BLEU

  • Automatic metrics have pretty high correlations with the human ranking

  • Human judges like system E (Syntactic system) more than B (Statistical system), but automatic metrics do not
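A sketch of the paired version (our code): both systems are scored on the same resampled segments, so the difference reflects system quality rather than test-set choice.

```python
import random

def paired_bootstrap_diff(stats_a, stats_b, score_fn, B=2000):
    """95% percentile interval for the score difference of two systems."""
    n = len(stats_a)  # both systems decoded the same n test segments
    diffs = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(score_fn([stats_a[i] for i in idx])
                     - score_fn([stats_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * B)], diffs[int(0.975 * B) - 1]

# If the returned interval contains 0, the two systems are not
# significantly different at the 95% level.
```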

Outline

  • Overview of Automatic Machine Translation Evaluation

    • BLEU

    • Modified BLEU

    • NIST MTEval

  • Confidence Intervals based on Bootstrap Percentile

    • Algorithm

    • Comparing two MT systems

    • Implementation

  • Discussions

    • How much testing data is needed?

    • How many reference translations are needed?

    • How many bootstrap samples are needed?

    • Non-parametric interval or normal/t-intervals?

How Much Testing Data is Needed?

[Figure: scores and confidence intervals as a function of test-set size; see the observations on the next slide.]

How Much Testing Data is Needed?

  • NIST scores increase steadily with the growing test set size

  • The distance between the scores of the different systems remains stable when using 40% or more of the test set

  • The confidence intervals become narrower for larger test set

  • Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified: the interval width shrinks like 1/√N, and 1/√2 ≈ 0.71)

* System A (bootstrap size B = 2000)

Effects of Using Multiple References

  • Single reference from one translator may favor some systems

  • Increasing the number of references narrows down the relative confidence interval

How Many Reference Translations are Sufficient?

  • Confidence intervals become narrower with more reference translations

  • Test data needed for the same interval width: [100%] (1 ref) ≈ [80-90%] (2 refs) ≈ [70-80%] (3 refs) ≈ [60-70%] (4 refs)

  • One additional reference translation compensates for 10~15% of testing data

* System A (bootstrap size B = 2000)

Do We Really Need Multiple References?

  • Parallel multiple reference

  • Single reference from multiple translators*

    • Reduced bias from different translators

    • Yields the same confidence interval/reliability as the parallel multiple reference

    • Costs only half of the effort compared to building a parallel multiple reference set

*Originally proposed in IBM’s BLEU report

Single Reference from Multiple Translators

  • Reduced bias by mixing from different translators

  • Yields the same confidence intervals

Bootstrap-t Interval vs. Normal/t Interval

  • Normal interval: assuming $\hat{\theta} \sim N(\theta, \widehat{se}^2)$, the 95% interval is $\hat{\theta} \pm 1.96\,\widehat{se}$

  • Student’s t-interval (when n is small): $\hat{\theta} \pm t^{(\alpha)}_{n-1} \cdot \widehat{se}$, assuming $(\hat{\theta} - \theta)/\widehat{se} \sim t_{n-1}$

  • Bootstrap-t interval

    • For each bootstrap sample b, calculate $Z^{*}(b) = \dfrac{\hat{\theta}^{*}(b) - \hat{\theta}}{\widehat{se}^{*}(b)}$

    • The $\alpha$-th percentile is estimated by the value $\hat{t}^{(\alpha)}$ such that $\#\{Z^{*}(b) \le \hat{t}^{(\alpha)}\}/B = \alpha$

    • The bootstrap-t interval is $(\hat{\theta} - \hat{t}^{(1-\alpha)} \cdot \widehat{se},\ \hat{\theta} - \hat{t}^{(\alpha)} \cdot \widehat{se})$

    • E.g. if B = 1000, the 50th and the 950th largest values of $Z^{*}$ estimate $\hat{t}^{(0.05)}$ and $\hat{t}^{(0.95)}$
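A sketch contrasting the two interval styles on a given list of bootstrap scores (our code; the percentile interval is the non-parametric one used in the algorithm above):

```python
import statistics

def normal_vs_percentile(boot_scores, z=1.96):
    """95% normal-approximation vs. percentile intervals from bootstrap scores."""
    mean = statistics.mean(boot_scores)
    se = statistics.stdev(boot_scores)  # bootstrap estimate of the standard error
    s = sorted(boot_scores)
    b = len(s)
    return ((mean - z * se, mean + z * se),              # assumes normality
            (s[int(0.025 * b)], s[int(0.975 * b) - 1]))  # assumes nothing
```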

Bootstrap-t interval vs. Normal/t interval (Cont.)

  • The bootstrap-t interval assumes no particular distribution, but

    • It can give erratic results

    • It can be heavily influenced by a few outlying data points

  • When B is large, the bootstrap sample scores are quite close to a normal distribution

  • Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B=500):

    • STDEV=0.27 for bootstrap-t interval

    • STDEV=0.14 for normal/student-t interval

The Number of Bootstrap Replications B

  • The ideal bootstrap estimate of the confidence interval takes B → ∞

  • Computational time increases linearly with B

  • The greater B, the smaller the standard deviation of the estimated confidence intervals. E.g. for BLEU’s relative confidence interval

    • STDEV = 0.60 when B=100; STDEV = 0.27 when B=500

  • Two rules of thumb:

    • Even a small B, say B = 100, is usually informative

    • B>1000 gives quite satisfactory results
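One way to see this empirically (a toy sketch of our own, not from the slides): repeat the bootstrap several times for a given B and look at how much the interval width varies.

```python
import random
import statistics

def width(data, B):
    """Width of the 95% bootstrap percentile interval of the mean."""
    n = len(data)
    scores = sorted(statistics.mean(random.choices(data, k=n)) for _ in range(B))
    return scores[int(0.975 * B) - 1] - scores[int(0.025 * B)]

data = [random.gauss(0.30, 0.05) for _ in range(500)]  # toy per-sentence scores
for B in (100, 500, 1000):
    spread = statistics.stdev(width(data, B) for _ in range(20))
    print(B, round(spread, 4))  # the spread of the width shrinks as B grows
```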

Conclusions

  • Using the bootstrap method to measure confidence intervals for MT evaluation metrics

  • Using confidence intervals to study the characteristics of an MT evaluation metric

    • Correlation with human judgments

    • Sensitivity

    • Consistency

  • Modified BLEU is a better metric than BLEU

  • Single reference from multiple translators is as good as parallel multiple references and costs only half the effort

References

  • Efron, B. and Tibshirani, R.: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.

  • Och, F. J.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL 2003, Sapporo, Japan.

  • Bisani, M. and Ney, H.: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP 2004, Montreal, Canada, Vol. 1, pp. 409-412.

  • Leusch, G., Ueffing, N. and Ney, H.: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.

  • Melamed, I. D., Green, R. and Turian, J. P.: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.

  • King, M., Popescu-Belis, A. and Hovy, E.: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.

  • Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.

  • NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf

  • Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.

  • Zhang, Y., Vogel, S. and Waibel, A.: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.

Questions and Comments?

N-gram Contributions to NIST Score

[Figure: contribution of each n-gram order to the overall NIST score; as noted earlier, roughly 80% comes from unigram matches.]
