
Measuring Confidence Intervals for MT Evaluation Metrics


Presentation Transcript


  1. Measuring Confidence Intervals for MT Evaluation Metrics. Ying Zhang (Joy) and Stephan Vogel, Language Technologies Institute, School of Computer Science, Carnegie Mellon University

  2. Outline
  • Automatic Machine Translation Evaluation
    • BLEU
    • Modified BLEU
    • NIST MTEval
  • Confidence Intervals Based on the Bootstrap Percentile
    • Algorithm
    • Comparing two MT systems
    • Implementation
  • Discussions
    • How much testing data is needed?
    • How many reference translations are needed?
    • How many bootstrap samples are needed?

  3. Automatic Machine Translation Evaluation
  • Subjective MT evaluation
    • Fluency and adequacy scored by human judges
    • Very expensive in time and money
  • Objective automatic MT evaluation
    • Inspired by the Word Error Rate metric used in ASR research
    • Measures the "closeness" between the MT hypothesis and human reference translations
      • Precision: n-gram precision
      • Recall: measured against the best-matched reference; approximated by the brevity penalty
    • Cheap and fast
    • Highly correlated with subjective evaluations
    • MT research has greatly benefited from automatic evaluation
    • Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM

  4. BLEU Metric
  • Proposed by IBM's SMT group (Papineni et al., 2002)
  • Widely used in MT evaluations
    • DARPA TIDES MT evaluation
    • IWSLT evaluation
    • TC-STAR
  • BLEU metric (a sketch of the score computation follows below):
    • BLEU = BP * exp( sum over n = 1..N of w_n * log p_n )
    • p_n: modified n-gram precision
    • The exponentiated sum is the geometric mean of p_1, p_2, ..., p_N
    • BP: brevity penalty, BP = 1 if c > r, otherwise exp(1 - r/c)
    • Usually N = 4 and w_n = 1/N
    • c: length of the MT hypothesis; r: effective reference length
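
A minimal sketch of the combination just described, assuming the modified n-gram match counts and the candidate/reference lengths have already been computed; the function and variable names are illustrative, not taken from the original evaluation scripts.

    import math

    def bleu(matches, totals, c, r, N=4):
        """Combine modified n-gram precisions with the brevity penalty.

        matches[n] / totals[n] is the modified n-gram precision p_n
        (clipped n-gram matches over n-grams in the hypothesis);
        c is the hypothesis length, r the effective reference length.
        """
        # Geometric mean of p_1..p_N with uniform weights w_n = 1/N
        log_p = sum(math.log(matches[n] / totals[n]) for n in range(1, N + 1)) / N
        # Brevity penalty: no penalty if the hypothesis is longer than r
        bp = 1.0 if c > r else math.exp(1.0 - r / c)
        return bp * math.exp(log_p)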

  5. BLEU Metric
  • Example:
    • MT hypothesis: the gunman was shot dead by police .
    • Reference 1: The gunman was shot to death by the police .
    • Reference 2: The gunman was shot to death by the police .
    • Reference 3: Police killed the gunman .
    • Reference 4: The gunman was shot dead by the police .
    • Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
    • Brevity penalty: c = 8, r = 9, BP = 0.8825
    • Final score: 0.8825 * (1.0 * (6/7) * (4/6) * 0.6)^(1/4) ≈ 0.675
  • Usually the n-gram precisions and BP are computed at the test-set level, not per sentence
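
Plugging the example's counts into the sketch above reproduces the score; the counts are read off the slide (8/8, 6/7, 4/6, 3/5 matched n-grams) and the variable names are again purely illustrative.

    matches = {1: 8, 2: 6, 3: 4, 4: 3}
    totals = {1: 8, 2: 7, 3: 6, 4: 5}
    print(round(bleu(matches, totals, c=8, r=9), 4))  # about 0.675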

  6. Modified BLEU Metric
  • BLEU weighs long n-grams heavily because of the geometric mean
    • The geometric mean is dominated by the smallest precisions, which are those of the long n-grams
  • Modified BLEU metric (Zhang, 2004)
    • Uses the arithmetic mean of the n-gram precisions instead of the geometric mean
    • Gives a more balanced contribution from the different n-gram orders
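
A small sketch of the difference, assuming M-BLEU simply replaces the geometric mean with an arithmetic mean over the same modified precisions; the exact M-BLEU definition may differ in details such as the brevity penalty, so the numbers are only indicative.

    import math

    precisions = [1.0, 6 / 7, 4 / 6, 3 / 5]   # p_1..p_4 from the example above
    bp = math.exp(1 - 9 / 8)                   # same brevity penalty

    geometric = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    arithmetic = sum(precisions) / len(precisions)

    print(round(bp * geometric, 3))   # BLEU-style score, about 0.675
    print(round(bp * arithmetic, 3))  # arithmetic-mean score, about 0.689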

  7. NIST MTEval Metric
  • Motivation: "weight more heavily those n-grams that are more informative" (NIST 2002)
  • Combines information-weighted n-gram scores, averaged arithmetically over the n-gram orders rather than with BLEU's geometric mean
  • Pros: more sensitive than BLEU
  • Cons:
    • The information gain for 2-grams and up is not very meaningful
    • About 80% of the score comes from unigram matches
    • Most matched 5-grams have an information gain of 0
    • The score increases when the test set size increases
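
For reference, the information weight defined in the NIST 2002 report, which is the quantity the cons above refer to; the counts are taken over the reference corpus:

    \mathrm{Info}(w_1 \ldots w_n) \;=\; \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}

When an n-gram occurs every time its (n-1)-gram prefix occurs, the ratio is 1 and the information gain is 0, which is why most matched 5-grams contribute nothing to the score.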

  8. Questions Regarding MT Evaluation Metrics
  • Do they rank the MT systems in the same way as human judges?
    • IBM showed a strong correlation between BLEU and human judgments
  • How reliable are the automatic evaluation scores?
  • How sensitive is a metric?
    • Sensitivity: the metric should be able to distinguish between systems of similar performance
  • Is the metric consistent?
    • Consistency: the difference between systems should not be affected by the selection of testing/reference data
  • How many reference translations are needed?
  • How much testing data is sufficient for evaluation?
  • If we can measure the confidence interval of the evaluation scores, we can answer the above questions

  9. Outline
  • Overview of Automatic Machine Translation Evaluation
    • BLEU
    • Modified BLEU
    • NIST MTEval
  • Confidence Intervals Based on the Bootstrap Percentile
    • Algorithm
    • Comparing two MT systems
    • Implementation
  • Discussions
    • How much testing data is needed?
    • How many reference translations are needed?
    • How many bootstrap samples are needed?

  10. Measuring the Confidence Intervals
  • One BLEU/M-BLEU/NIST score per test set: how accurate is this score?
  • Measuring a confidence interval requires a population of test sets
    • Building test sets with multiple human reference translations is expensive
  • Solution: bootstrapping (Efron, 1986)
    • Introduced in 1979 as a computer-based method for estimating the standard errors of a statistical estimate
    • Resampling: creating an artificial population by sampling with replacement
    • Proposed by Franz Och (2003) for measuring confidence intervals of automatic MT evaluation metrics

  11. A Schematic of the Bootstrapping Process
  [schematic figure: the original test set (score: Score0) is resampled with replacement into bootstrap test sets, each of which is scored]

  12. An Efficient Implementation
  • Translate and evaluate 2,000 test sets? No way!
  • Instead, resample the n-gram precision information collected for each sentence (see the sketch below)
    • Most MT systems are context independent at the sentence level
    • MT evaluation metrics are based on information collected for each test sentence
    • E.g., for BLEU/M-BLEU and NIST, store per sentence:
      RefLen: 17 20 19 24   ClosestRefLen: 17
      1-gram: 15 10 89.34   2-gram: 14 4 9.04   3-gram: 13 3 3.65   4-gram: 12 2 2.43
  • The same approach works for human judgments and other MT metrics
  • The NIST information gain is approximated
  • Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
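
A minimal sketch of this idea, assuming each sentence's sufficient statistics (matched and total n-gram counts plus length information) are cached once, so a bootstrap test set can be scored by summing cached counts instead of re-running the decoder or the metric; the class and function names are illustrative, not taken from the released scripts.

    from dataclasses import dataclass

    @dataclass
    class SentStats:
        hyp_len: int            # length of the MT hypothesis for this sentence
        closest_ref_len: int    # length of the closest reference translation
        matches: dict           # n -> clipped n-gram matches against the references
        totals: dict            # n -> number of n-grams in the hypothesis

    def score_sample(stats, indices, N=4):
        """Score one (resampled) test set by summing cached per-sentence counts."""
        c = sum(stats[i].hyp_len for i in indices)
        r = sum(stats[i].closest_ref_len for i in indices)
        matches = {n: sum(stats[i].matches[n] for i in indices) for n in range(1, N + 1)}
        totals = {n: sum(stats[i].totals[n] for i in indices) for n in range(1, N + 1)}
        return bleu(matches, totals, c, r, N)   # reuses the earlier BLEU sketch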

  13. Algorithm
  • Original test suite T0 has N segments and R reference translations
  • Represent the i-th segment of T0 as a tuple: T0[i] = <s_i, r_i1, r_i2, ..., r_iR>

    for (b = 1; b <= B; b++) {
        for (i = 1; i <= N; i++) {
            s = random(1, N);
            Tb[i] = T0[s];
        }
        calculate BLEU/M-BLEU/NIST for Tb;
    }

  • Sort the B BLEU/M-BLEU/NIST scores
  • Output the scores ranked at the 2.5th and 97.5th percentiles (a runnable sketch follows below)
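
The same algorithm in runnable form, building on the sketches above and under the same assumptions about cached per-sentence statistics:

    import random

    def bootstrap_scores(stats, B=2000, seed=0):
        """Score B bootstrap test sets drawn by sampling segments with replacement."""
        rng = random.Random(seed)
        n = len(stats)
        return [score_sample(stats, [rng.randrange(n) for _ in range(n)]) for _ in range(B)]

    def percentile_interval(scores, alpha=0.05):
        """Bootstrap percentile interval from the sorted bootstrap scores."""
        ranked = sorted(scores)
        B = len(ranked)
        lower = ranked[int(B * alpha / 2)]             # ~2.5th percentile
        upper = ranked[int(B * (1 - alpha / 2)) - 1]   # ~97.5th percentile
        return lower, upper

    # scores = bootstrap_scores(per_sentence_stats, B=2000)
    # low, high = percentile_interval(scores)          # 95% confidence interval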

  14. Confidence Intervals
  • 7 Chinese-English MT systems from the June 2002 TIDES evaluation
  • Observations:
    • Relative confidence interval widths: NIST < M-BLEU < BLEU
    • NIST scores have more discriminative power than BLEU scores
    • The strong impact of long n-grams makes the BLEU score less stable (or: introduces more noise)

  15. Are Two MT Systems Different?
  • Compare two MT systems' performance with the same bootstrap method used for a single system, applied to the score difference (see the sketch below)
    • E.g., Diff(Sys1 - Sys2): median = -1.7355, interval [-1.9056, -1.5453]
    • If the confidence interval overlaps 0, the two systems are not significantly different
  • M-BLEU and NIST have more discriminative power than BLEU
  • The automatic metrics correlate fairly well with the human ranking
    • Human judges prefer system E (a syntactic system) over system B (a statistical system), but the automatic metrics do not
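
A hedged sketch of the paired comparison: resample the same segment indices for both systems and look at the distribution of score differences. It assumes per-sentence statistics are cached for each system as in the earlier sketches.

    import random

    def bootstrap_difference(stats_sys1, stats_sys2, B=2000, seed=0):
        """Percentile interval for the score difference between two systems.

        Both systems are scored on the same resampled segments, so the
        difference reflects system quality rather than test-set composition.
        """
        rng = random.Random(seed)
        n = len(stats_sys1)
        diffs = []
        for _ in range(B):
            idx = [rng.randrange(n) for _ in range(n)]
            diffs.append(score_sample(stats_sys1, idx) - score_sample(stats_sys2, idx))
        return percentile_interval(diffs)   # an interval containing 0 means not significant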

  16. Outline
  • Overview of Automatic Machine Translation Evaluation
    • BLEU
    • Modified BLEU
    • NIST MTEval
  • Confidence Intervals Based on the Bootstrap Percentile
    • Algorithm
    • Comparing two MT systems
    • Implementation
  • Discussions
    • How much testing data is needed?
    • How many reference translations are needed?
    • How many bootstrap samples are needed?
    • Non-parametric interval or normal/t-intervals?

  17. How much testing data is needed

  18. How much testing data is needed
  • NIST scores increase steadily with growing test set size
  • The distance between the scores of the different systems remains stable when 40% or more of the test set is used
  • The confidence intervals become narrower for larger test sets
  • Rule of thumb: doubling the test set size narrows the confidence interval by about 30%
    • This matches the theory: the interval width scales roughly as 1 / sqrt(N), and 1 / sqrt(2) is about 0.71
  (System A, bootstrap size B = 2000)

  19. Effects of Using Multiple References
  • A single reference from one translator may favor some systems
  • Increasing the number of references narrows the relative confidence interval

  20. How Many Reference Translations are Sufficient?
  • Confidence intervals become narrower with more reference translations
    • Relative interval width: 100% with 1 reference, 80-90% with 2, 70-80% with 3, 60-70% with 4
  • One additional reference translation compensates for roughly 10-15% of the testing data
  (System A, bootstrap size B = 2000)

  21. Do We Really Need Multiple References?
  • Parallel multiple references: every test sentence is translated by several translators
  • Single reference from multiple translators*: each test sentence gets one reference, but the translators are mixed across the test set
    • Reduces the bias coming from any single translator
    • Yields the same confidence interval/reliability as parallel multiple references
    • Costs only half of the effort compared to building a parallel multiple-reference set
  * Originally proposed in IBM's BLEU report

  22. Single Reference from Multiple Translators
  • Reduces bias by mixing references from different translators
  • Yields the same confidence intervals as parallel multiple references

  23. Bootstrap-t Interval vs. Normal/t Interval
  (theta_hat: the observed score; se_hat: its estimated standard error)
  • Normal interval: theta_hat +/- z_(alpha) * se_hat, assuming theta_hat is approximately normally distributed
  • Student's t-interval (used when n is small): theta_hat +/- t_(alpha, n-1) * se_hat, assuming (theta_hat - theta) / se_hat follows a t distribution with n-1 degrees of freedom
  • Bootstrap-t interval:
    • For each bootstrap sample b, calculate t*(b) = (theta_hat*(b) - theta_hat) / se_hat*(b)
    • The alpha-th percentile of t*(b) is estimated by the value t_hat_(alpha) such that #{ t*(b) <= t_hat_(alpha) } / B = alpha
    • The bootstrap-t interval is ( theta_hat - t_hat_(1-alpha) * se_hat, theta_hat - t_hat_(alpha) * se_hat )
    • E.g., if B = 1000, the 50th largest and the 950th largest values give the bootstrap-t interval

  24. Bootstrap-t Interval vs. Normal/t Interval (Cont.)
  • The bootstrap-t interval assumes no particular distribution, but
    • It can give erratic results
    • It can be heavily influenced by a few outlying data points
  • When B is large, the bootstrap sample scores are quite close to a normal distribution
  • Assuming a normal distribution gives more stable intervals (sketched below); e.g., for the BLEU relative confidence interval (B = 500):
    • STDEV = 0.27 for the bootstrap-t interval
    • STDEV = 0.14 for the normal/Student's t interval
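
A small sketch contrasting the two ways of turning bootstrap scores into an interval, assuming the scores come from bootstrap_scores above; 1.96 is the usual two-sided 95% normal quantile, and the normal variant is one common way of exploiting the near-normality of the bootstrap distribution.

    import statistics

    def normal_interval(scores, z=1.96):
        """Interval assuming the bootstrap scores are roughly normally distributed."""
        mean = statistics.mean(scores)
        se = statistics.stdev(scores)
        return mean - z * se, mean + z * se

    # scores = bootstrap_scores(per_sentence_stats, B=500)
    # percentile_interval(scores)   # non-parametric percentile interval
    # normal_interval(scores)       # normal-approximation interval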

  25. The Number of Bootstrap Replications B
  • The ideal bootstrap estimate of the confidence interval lets B go to infinity
  • Computational time increases linearly with B
  • The greater B, the smaller the standard deviation of the estimated confidence intervals; e.g., for BLEU's relative confidence interval:
    • STDEV = 0.60 when B = 100; STDEV = 0.27 when B = 500
  • Two rules of thumb:
    • Even a small B, say B = 100, is usually informative
    • B >= 1000 gives quite satisfactory results

  26. Conclusions
  • Bootstrapping can be used to measure confidence intervals for MT evaluation metrics
  • Confidence intervals can be used to study the characteristics of an MT evaluation metric:
    • Correlation with human judgments
    • Sensitivity
    • Consistency
  • Modified BLEU is a better metric than BLEU
  • A single reference from multiple translators is as good as parallel multiple references and costs only half the effort

  27. References
  • B. Efron and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
  • F. Och: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
  • M. Bisani and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
  • G. Leusch, N. Ueffing and H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
  • I. Dan Melamed, Ryan Green and Joseph P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
  • M. King, A. Popescu-Belis and E. Hovy: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
  • S. Nießen, F. J. Och, G. Leusch and H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
  • NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
  • Kishore Papineni, Salim Roukos et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th Annual Meeting of the ACL.
  • Ying Zhang, Stephan Vogel and Alex Waibel: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.

  28. Questions and Comments?

  29. N-gram Contributions to NIST Score
