
### Measuring Confidence Intervals for MT Evaluation Metrics

Ying Zhang (Joy)

Stephan Vogel

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

Outline

- Automatic Machine Translation Evaluation
- BLEU
- Modified BLEU
- NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
- Algorithm
- Comparing two MT systems
- Implementation
- Discussions
- How much testing data is needed?
- How many reference translations are needed?
- How many bootstrap samples are needed?


Automatic Machine Translation Evaluation

- Subjective MT evaluations
- Fluency and Adequacy scored by human judges
- Very expensive in time and money
- Objective automatic MT evaluations
- Inspired by the Word Error Rate metric used by ASR research
- Measuring the “closeness” between the MT hypothesis and human reference translations
- Precision: n-gram precision
- Recall:
- Against the best matched reference
- Approximated by brevity penalty
- Cheap, fast
- Highly correlated with subjective evaluations
- MT research has greatly benefited from automatic evaluations
- Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM


BLEU Metrics

- Proposed by IBM’s SMT group (Papineni et al, 2002)
- Widely used in MT evaluations
- DARPA TIDES MT evaluation
- IWSLT evaluation
- TC-Star
- BLEU metric: $\mathrm{BLEU} = BP \cdot \exp\big(\sum_{n=1}^{N} w_n \log p_n\big)$
- $p_n$: modified n-gram precision
- Geometric mean of $p_1, p_2, \ldots, p_N$
- $BP$: brevity penalty, $BP = \min(1,\ e^{1 - r/c})$
- Usually, $N = 4$ and $w_n = 1/N$

$c$: length of the MT hypothesis; $r$: effective reference length


BLEU Metric

- Example:
- MT Hypothesis: the gunman was shot dead by police .
- Reference 1: The gunman was shot to death by the police .
- Reference 2: The gunman was shot to death by the police .
- Reference 3: Police killed the gunman .
- Reference 4: The gunman was shot dead by the police .
- Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
- Brevity penalty: c = 8, r = 9, BP = e^(1 - 9/8) = 0.8825
- Final score: BLEU = 0.8825 × (1.0 × 6/7 × 4/6 × 3/5)^(1/4) ≈ 0.675
- Usually n-gram precision and BP are calculated on the test set level
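The arithmetic above can be checked with a few lines of Python (a minimal sketch of the standard BLEU formula, not the official evaluation script):

```python
import math

# n-gram precisions and lengths from the example above
precisions = [8 / 8, 6 / 7, 4 / 6, 3 / 5]  # p1..p4
c, r = 8, 9  # hypothesis length, effective reference length

# brevity penalty: 1 if the hypothesis is longer than the reference, else e^(1 - r/c)
bp = 1.0 if c > r else math.exp(1 - r / c)

# BLEU = BP * geometric mean of p1..p4 (uniform weights w_n = 1/4)
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(f"BP = {bp:.4f}, BLEU = {bleu:.4f}")  # BP = 0.8825, BLEU = 0.6753
```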


Modified BLEU Metric

- BLEU focuses heavily on long n-grams because of the geometric mean
- Modified BLEU Metric (Zhang, 2004)
- Arithmetic mean of the n-gram precision
- More balanced contribution from different n-grams
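To see why the arithmetic mean balances the n-gram orders, compare how the two means react when only the 4-gram precision collapses (a toy comparison with an unweighted mean; the exact M-BLEU weighting in Zhang (2004) may differ):

```python
import math

def geo_mean(ps):
    """BLEU-style combination of n-gram precisions."""
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def arith_mean(ps):
    """M-BLEU-style combination (unweighted sketch)."""
    return sum(ps) / len(ps)

base = [1.0, 6 / 7, 4 / 6, 3 / 5]  # precisions from the earlier example
weak4 = [1.0, 6 / 7, 4 / 6, 0.1]   # same, but with a very low 4-gram precision

# relative drop in the combined score caused by the weak 4-gram precision
drop_geo = geo_mean(weak4) / geo_mean(base)
drop_arith = arith_mean(weak4) / arith_mean(base)
```

The geometric mean is dragged down far more by the single weak order, which is the imbalance the modified metric avoids.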


NIST MTEval Metric

- Motivation
- “Weight more heavily those n-grams that are more informative” (NIST 2002)
- Uses an arithmetic mean of the info-weighted n-gram scores (rather than BLEU's geometric mean)
- Pros: more sensitive than BLEU
- Cons:
- Info gain for 2-gram and up is not meaningful
- 80% of the score comes from unigram matches
- Most matched 5-grams have info gain 0 !
- Score increases when the testing set size increases
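The information-gain weighting can be illustrated with invented corpus counts (the numbers below are hypothetical; the NIST 2002 report gives the exact definition):

```python
import math

# hypothetical reference-corpus statistics (not from the slides)
total_words = 10000
count = {("the",): 120, ("gunman",): 2, ("the", "gunman"): 2}

def info(ngram):
    """Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) );
    for unigrams the numerator is the total reference word count."""
    prefix = total_words if len(ngram) == 1 else count[ngram[:-1]]
    return math.log2(prefix / count[ngram])

common = info(("the",))           # frequent word: low information gain
rare = info(("gunman",))          # rare word: high information gain
bigram = info(("the", "gunman"))  # conditional gain of the 2-gram
```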


Questions Regarding MT Evaluation Metrics

- Do they rank the MT systems in the same way as human judges?
- IBM showed a strong correlation between BLEU and human judgments
- How reliable are the automatic evaluation scores?
- How sensitive is a metric?
- Sensitivity: the metric should be able to distinguish between systems of similar performance
- Is the metric consistent?
- Consistency: the difference between systems is not affected by the selection of testing/reference data
- How many reference translations are needed?
- How much testing data is sufficient for evaluation?
- If we can measure the confidence interval of the evaluation scores, we can answer the above questions


Outline

- Overview of Automatic Machine Translation Evaluation
- BLEU
- Modified BLEU
- NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
- Algorithm
- Comparing two MT systems
- Implementation
- Discussions
- How much testing data is needed?
- How many reference translations are needed?
- How many bootstrap samples are needed?


Measuring the Confidence Intervals

- One BLEU/M-BLEU/NIST score per test set
- How accurate is this score?
- To measure the confidence interval a population is required
- Building a test set with multiple human reference translations is expensive
- Solution: bootstrapping (Efron 1986)
- Introduced in 1979 as a computer-based method for estimating the standard errors of a statistical estimation
- Resampling: creating an artificial population by sampling with replacement
- Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
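Sampling with replacement, the core operation of resampling, can be sketched in a few lines (a generic illustration, not tied to the released scripts):

```python
import random

rng = random.Random(0)       # fixed seed so the sketch is reproducible
test_set = list(range(100))  # stand-ins for the sentences of a test set

# one bootstrap replicate: draw N sentences *with replacement*, so some
# sentences appear several times while others are left out entirely
replicate = [rng.choice(test_set) for _ in test_set]

duplicates = len(replicate) - len(set(replicate))
```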


A Schematic of the Bootstrapping Process

[Figure: schematic of the bootstrap resampling process, starting from the score Score0 on the original test set]


An Efficient Implementation

- Translate and evaluate 2,000 test sets?
- No Way!
- Resample the n-gram precision information for the sentences
- Most MT systems are context-independent at the sentence level
- MT evaluation metrics are based on information collected for each test sentence
- E.g. for BLEU/M-BLEU and NIST

```
RefLen:        17 20 19 24
ClosestRefLen: 17
1-gram: 15 10 89.34
2-gram: 14  4  9.04
3-gram: 13  3  3.65
4-gram: 12  2  2.43
```

- Similar for human judgment and other MT metrics
- Approximation for NIST information gain
- Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
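The caching idea can be sketched as follows: each sentence is reduced to its sufficient statistics once, and any resampled test set is then scored by summing those statistics, with no retranslation (the field layout and numbers are invented for illustration):

```python
import math

# hypothetical per-sentence cache:
# (closest_ref_len, hyp_len, [(matched, total) n-gram counts for n = 1..4])
cache = [
    (17, 15, [(10, 15), (4, 14), (3, 13), (2, 12)]),
    (20, 18, [(13, 18), (8, 17), (5, 16), (3, 15)]),
    (12, 13, [(9, 13), (5, 12), (3, 11), (1, 10)]),
]

def corpus_bleu(sentences):
    """BLEU over any (re)sample of cached sentence statistics."""
    r = sum(s[0] for s in sentences)  # summed effective reference length
    c = sum(s[1] for s in sentences)  # summed hypothesis length
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_p = [
        math.log(sum(s[2][n][0] for s in sentences)
                 / sum(s[2][n][1] for s in sentences))
        for n in range(4)
    ]
    return bp * math.exp(sum(log_p) / 4)

score = corpus_bleu(cache)
```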


Algorithm

Original test suite T0 with N segments and R reference translations.

Represent the i-th segment of T0 as an n-tuple: T0[i] = <s_i, r_i1, r_i2, ..., r_iR>

```
for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
        s = random(1, N);        // draw a segment index with replacement
        Tb[i] = T0[s];
    }
    calculate BLEU/M-BLEU/NIST for Tb;
}
sort the B BLEU/M-BLEU/NIST scores;
output the scores ranked at the 2.5th and 97.5th percentiles;
```
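The loop above translates almost line for line into Python; here is a self-contained sketch that uses a toy corpus score (the mean of per-sentence scores) in place of a full BLEU computation:

```python
import random

def bootstrap_percentile(items, score_fn, B=1000, alpha=0.05, seed=1):
    """Resample the test set B times and return the alpha/2 and
    1 - alpha/2 percentiles of the scores (the percentile interval)."""
    rng = random.Random(seed)
    n = len(items)
    scores = sorted(
        score_fn([items[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    return scores[int(B * alpha / 2)], scores[int(B * (1 - alpha / 2)) - 1]

# invented per-sentence scores; the score function here is just their mean
sentence_scores = [0.2, 0.5, 0.4, 0.7, 0.3, 0.6, 0.5, 0.4, 0.8, 0.1]
lo, hi = bootstrap_percentile(sentence_scores, lambda s: sum(s) / len(s))
```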


Confidence Intervals

- 7 Chinese-English MT systems from June 2002 TIDES evaluation
- Observations:
- Relative confidence interval width: NIST < M-BLEU < BLEU
- NIST scores have more discriminative power than BLEU
- The strong impact of long n-grams makes the BLEU score less stable (or, put differently, introduces more noise)


Are Two MT Systems Different?

- Comparing two MT systems’ performance
- Using the same bootstrap method as for a single system
- E.g. Diff(Sys1-Sys2): Median = -1.7355, interval [-1.9056, -1.5453]
- If the confidence interval of the difference contains 0, the two systems are not significantly different

- M-BLEU and NIST have more discriminative power than BLEU
- The automatic metrics correlate highly with the human ranking
- Human judges prefer system E (a syntactic system) to system B (a statistical system), but the automatic metrics do not
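For system comparison the same machinery applies to the score difference, drawing identical sentence indices for both systems so the comparison stays paired (scores below are invented):

```python
import random

def diff_interval(scores_a, scores_b, B=1000, alpha=0.05, seed=2):
    """Bootstrap percentile interval for the mean score difference
    of two systems evaluated on the same test sentences."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = sorted(
        sum(scores_a[i] - scores_b[i]
            for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(B)
    )
    return diffs[int(B * alpha / 2)], diffs[int(B * (1 - alpha / 2)) - 1]

sys1 = [0.6, 0.7, 0.8, 0.5, 0.9, 0.6, 0.7, 0.8, 0.6, 0.7]
sys2 = [0.4, 0.4, 0.6, 0.4, 0.6, 0.5, 0.4, 0.7, 0.3, 0.6]
lo, hi = diff_interval(sys1, sys2)
# here the interval lies entirely above 0: sys1 is significantly better
```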


Outline

- Overview of Automatic Machine Translation Evaluation
- BLEU
- Modified BLEU
- NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
- Algorithm
- Comparing two MT systems
- Implementation
- Discussions
- How much testing data is needed?
- How many reference translations are needed?
- How many bootstrap samples are needed?
- Non-parametric interval or normal/t-intervals?


How Much Testing Data is Needed?

- NIST scores increase steadily with the growing test set size
- The distance between the scores of the different systems remains stable when using 40% or more of the test set
- The confidence intervals become narrower for larger test sets
- Rule of thumb: doubling the testing data size narrows the confidence interval by 30% (theoretically justified)

* System A (bootstrap size B=2000)
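The 30% rule of thumb follows from the usual square-root behavior of a mean's standard error, assuming the corpus score behaves roughly like a mean of per-sentence contributions:

$$w(n) \propto \frac{1}{\sqrt{n}} \quad\Rightarrow\quad \frac{w(2n)}{w(n)} = \frac{1}{\sqrt{2}} \approx 0.71,$$

so doubling the test set narrows the interval width by roughly 29-30%.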


Effects of Using Multiple References

- Single reference from one translator may favor some systems
- Increasing the number of references narrows down the relative confidence interval


How Many Reference Translations are Sufficient?

- Confidence intervals become narrower with more reference translations
- Relative interval width: 100% (1 ref) → 80~90% (2 refs) → 70~80% (3 refs) → 60~70% (4 refs)
- One additional reference translation compensates for 10~15% of testing data

* System A (bootstrap size B=2000)


Do We Really Need Multiple References?

- Parallel multiple reference
- Single reference from multiple translators*
- Reduced bias from different translators
- Yields the same confidence interval/reliability as the parallel multiple reference
- Costs only half of the effort compared to building a parallel multiple reference set

*Originally proposed in IBM’s BLEU report


Single Reference from Multiple Translators

- Reduced bias by mixing from different translators
- Yields the same confidence intervals


Bootstrap-t Interval vs. Normal/t Interval

- Normal distribution: assuming $\hat{\theta} \sim N(\theta, \widehat{se}^2)$, the interval is $\hat{\theta} \pm z^{(1-\alpha)} \cdot \widehat{se}$
- Student's t-interval (when n is small): $\hat{\theta} \pm t_{n-1}^{(1-\alpha)} \cdot \widehat{se}$
- Bootstrap-t interval
- For each bootstrap sample $b$, calculate $Z^*(b) = (\hat{\theta}^*(b) - \hat{\theta}) / \widehat{se}^*(b)$
- The $\alpha$-th percentile is estimated by the value $\hat{t}^{(\alpha)}$ such that $\#\{ Z^*(b) \le \hat{t}^{(\alpha)} \} / B = \alpha$
- The bootstrap-t interval is $(\hat{\theta} - \hat{t}^{(1-\alpha)} \cdot \widehat{se},\ \hat{\theta} - \hat{t}^{(\alpha)} \cdot \widehat{se})$
- E.g. if B=1000, the 50th and the 950th largest $Z^*$ values give the bootstrap-t interval
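The procedure can be sketched for a simple statistic such as a mean (a generic illustration of the bootstrap-t recipe, not the paper's implementation):

```python
import math
import random
import statistics

def bootstrap_t_interval(data, B=1000, alpha=0.05, seed=3):
    """Bootstrap-t interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    theta = sum(data) / n
    se = statistics.stdev(data) / math.sqrt(n)
    zs = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        se_b = statistics.stdev(sample) / math.sqrt(n)
        if se_b > 0:  # skip degenerate resamples
            zs.append((sum(sample) / n - theta) / se_b)
    zs.sort()
    t_lo = zs[int(len(zs) * alpha / 2)]
    t_hi = zs[int(len(zs) * (1 - alpha / 2)) - 1]
    # note the reversal: the upper Z percentile gives the LOWER endpoint
    return theta - t_hi * se, theta - t_lo * se

scores = [0.2, 0.5, 0.4, 0.7, 0.3, 0.6, 0.5, 0.4, 0.8, 0.1]
lo, hi = bootstrap_t_interval(scores)
```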


Bootstrap-t interval vs. Normal/t interval (Cont.)

- The bootstrap-t interval assumes no underlying distribution, but:
- It can give erratic results
- It can be heavily influenced by a few outlying data points
- When B is large, the bootstrap sample scores are close to normally distributed
- Assuming a normal distribution gives more reliable intervals; e.g. for the BLEU relative confidence interval (B=500):
- STDEV = 0.27 for the bootstrap-t interval
- STDEV = 0.14 for the normal/Student's t interval


The Number of Bootstrap Replications B

- The ideal bootstrap estimate of the confidence interval takes B → ∞
- Computational time increases linearly with B
- The larger B, the smaller the standard deviation of the estimated confidence interval; e.g. for BLEU's relative confidence interval:
- STDEV = 0.60 when B=100; STDEV = 0.27 when B=500
- Two rules of thumb:
- Even a small B, say B=100 is usually informative
- B>1000 gives quite satisfactory results


Conclusions

- Using the bootstrapping method to measure the confidence intervals for MT evaluation metrics
- Using confidence intervals to study the characteristics of an MT evaluation metric
- Correlation with human judgments
- Sensitivity
- Consistency
- Modified BLEU is a better metric than BLEU
- Single reference from multiple translators is as good as parallel multiple references and costs only half the effort


References

- Efron, B. and Tibshirani, R.: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
- Och, F.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL 2003, Sapporo, Japan.
- Bisani, M. and Ney, H.: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP 2004, Montreal, Canada, Vol. 1, pp. 409-412.
- Leusch, G., Ueffing, N. and Ney, H.: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
- Melamed, I. D., Green, R. and Turian, J. P.: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
- King, M., Popescu-Belis, A. and Hovy, E.: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
- Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
- NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
- Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL, Philadelphia, PA.
- Zhang, Y., Vogel, S. and Waibel, A.: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.
