
Statistical Comparison of Two Learning Algorithms


Presentation Transcript


  1. Statistical Comparison of Two Learning Algorithms
     Presented by: Payam Refaeilzadeh

  2. Overview
  • How can we tell if one algorithm can learn better than another?
  • Design an experiment to measure the accuracy of the two algorithms.
  • Run multiple trials.
  • Compare the samples, not just their means: do a statistically sound test of the two samples.
  • Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?

  3. Statistical Hypothesis Testing
  • Statistical hypothesis: a statement about the parameters of one or more populations
  • Hypothesis testing: a procedure for deciding whether to accept or reject the hypothesis
    • Identify the parameter of interest
    • State a null hypothesis, H0
    • Specify an alternate hypothesis, H1
    • Choose a significance level α
    • State an appropriate test statistic

  4. Statistical Hypothesis Testing (cont.)
  • Null hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise
    • Usually specifies an exact value for a parameter
    • Example: H0: µ = 30 kg
  • Alternate hypothesis (H1): accepted if the null hypothesis is rejected
  • Test statistic: a particular statistic calculated from the measurements of a random sample / experiment
    • The test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
    • That distribution can be used to test the significance of the calculated test statistic.
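
As a concrete illustration of these steps, here is a minimal sketch that tests the slide's example hypothesis H0: µ = 30 kg. The sample values, the use of Python, and the scipy call are illustrative choices of this transcript's editor, not part of the original slides.

```python
# A minimal sketch of testing H0: mu = 30 kg on a small, made-up sample.
import numpy as np
from scipy import stats

weights = np.array([29.1, 31.4, 30.8, 28.7, 32.0, 30.5, 29.9, 31.2])  # hypothetical measurements (kg)
alpha = 0.05                                          # chosen significance level

# One-sample t-test: the statistic follows a t distribution with n-1 degrees of freedom.
t0, p_value = stats.ttest_1samp(weights, popmean=30.0)

if p_value < alpha:
    print(f"t0 = {t0:.3f}, p = {p_value:.3f}: reject H0 (mu = 30 kg)")
else:
    print(f"t0 = {t0:.3f}, p = {p_value:.3f}: fail to reject H0")
```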

  5. Error in Hypothesis Testing
  • A Type I error occurs when H0 is rejected but it is in fact true
    • P(Type I error) = α, the significance level
  • A Type II error occurs when we fail to reject H0 but it is in fact false
    • P(Type II error) = β
  • Power = 1 − β = probability of correctly rejecting H0
    • Power is the ability to distinguish between the two populations
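
The relationship between α, β, and power can be checked with a small simulation: when H0 is true, the fraction of rejections should be close to α; when the populations truly differ, the fraction of rejections estimates the power. This is only a sketch with made-up normal populations, not something taken from the slides.

```python
# Monte Carlo estimate of Type I error (H0 true) and power (H0 false) for a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 2000

# Type I error: both samples come from the same population, so any rejection is a false alarm.
type1 = np.mean([
    stats.ttest_ind(rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)).pvalue < alpha
    for _ in range(trials)
])

# Power: the second population mean is shifted by 0.5, so rejections are correct decisions.
power = np.mean([
    stats.ttest_ind(rng.normal(0.0, 1.0, n), rng.normal(0.5, 1.0, n)).pvalue < alpha
    for _ in range(trials)
])

print(f"estimated Type I error ~ {type1:.3f} (should be close to alpha = {alpha})")
print(f"estimated power ~ {power:.3f}")
```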

  6. Paired t-Test
  • Collect the data in pairs
    • Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
  • Suppose n paired measurements have been made
  • Assume:
    • The measurements are independent
    • The measurements for each algorithm follow a normal distribution
  • The test statistic t0 will then follow a t-distribution with n − 1 degrees of freedom

  7. Paired t-Test (cont.)
  Assume: X1 follows N(µ1, σ1), X2 follows N(µ2, σ2)
  Let: µD = µ1 − µ2 and Di = X1i − X2i for i = 1, 2, ..., n
  Null hypothesis: H0: µD = Δ0
  Test statistic: t0 = (D̄ − Δ0) / (SD / √n), where D̄ and SD are the sample mean and sample standard deviation of the Di
  Rejection criteria:
  • H1: µD ≠ Δ0 → reject H0 if |t0| > tα/2,n−1
  • H1: µD > Δ0 → reject H0 if t0 > tα,n−1
  • H1: µD < Δ0 → reject H0 if t0 < −tα,n−1
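
The statistic and rejection rule above translate directly into a few lines of Python. The paired accuracy values below are hypothetical; only the computation itself follows the slide.

```python
# Sketch of the paired t statistic t0 = (Dbar - delta0) / (S_D / sqrt(n)) and its two-sided threshold.
import numpy as np
from scipy import stats

acc_1 = np.array([0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.85, 0.80, 0.79])  # algorithm 1 (hypothetical)
acc_2 = np.array([0.78, 0.77, 0.82, 0.79, 0.75, 0.80, 0.81, 0.82, 0.78, 0.77])  # algorithm 2 (hypothetical)

d = acc_1 - acc_2                 # D_i = X_1i - X_2i
n = len(d)
delta0 = 0.0                      # H0: mu_D = 0, i.e. no difference between the algorithms

t0 = (d.mean() - delta0) / (d.std(ddof=1) / np.sqrt(n))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # two-sided threshold t_{alpha/2, n-1}
print(f"t0 = {t0:.3f}; reject H0 if |t0| > {t_crit:.3f}")
```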

  8. Cross-Validated t-test
  • Perform a paired t-test on the 10 paired accuracies obtained from 10-fold cross-validation
  • Advantages:
    • Large training set size
    • Most powerful of the tests considered (Dietterich, 1998)
  • Disadvantages:
    • The accuracy results are not independent (the training folds overlap)
    • Somewhat elevated probability of Type I error (Dietterich, 1998)
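
A sketch of this test is given below. It assumes scikit-learn, its built-in breast-cancer dataset, and GaussianNB / DecisionTreeClassifier as stand-ins for the two learning algorithms; none of these choices come from the slides.

```python
# 10-fold cross-validated paired t-test: both algorithms see exactly the same folds.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

acc_a, acc_b = [], []
for train_idx, test_idx in cv.split(X, y):
    # Train both algorithms on the same fold and test on the same held-out fold,
    # so the two accuracy measurements are paired.
    a = GaussianNB().fit(X[train_idx], y[train_idx])
    b = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc_a.append(a.score(X[test_idx], y[test_idx]))
    acc_b.append(b.score(X[test_idx], y[test_idx]))

# Paired t-test on the 10 fold accuracies.  The training sets overlap, which is why
# Dietterich (1998) reports a somewhat elevated Type I error for this test.
t0, p = stats.ttest_rel(acc_a, acc_b)
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```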

  9. 5x2 Cross-Validated t-test
  • Run 2-fold cross-validation 5 times
  • Use the accuracy difference from the first fold of the first replication in the numerator of the statistic
  • Use the results from all 10 folds to estimate the variance
  • Advantage:
    • Lowest Type I error of the tests considered (Dietterich, 1998)
  • Disadvantage:
    • Not as powerful as the 10-fold cross-validated t-test (Dietterich, 1998)
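
Below is a hedged sketch of Dietterich's 5x2cv paired t-test, reusing the same stand-in classifiers and dataset as in the previous example. The shape of the statistic (one fold difference in the numerator, the variance pooled over all five replications, compared against a t distribution with 5 degrees of freedom) follows Dietterich (1998); the classifiers and data are only placeholders.

```python
# Dietterich's 5x2cv paired t-test (sketch).
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def fold_diff(train_idx, test_idx):
    """Accuracy difference between the two algorithms on one fold."""
    a = GaussianNB().fit(X[train_idx], y[train_idx])
    b = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    return a.score(X[test_idx], y[test_idx]) - b.score(X[test_idx], y[test_idx])

variances, p_1_1 = [], None
for i in range(5):                                    # five replications of 2-fold CV
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    (tr1, te1), (tr2, te2) = cv.split(X, y)
    p1, p2 = fold_diff(tr1, te1), fold_diff(tr2, te2)
    p_bar = (p1 + p2) / 2
    variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
    if i == 0:
        p_1_1 = p1                                    # difference from the first fold of replication 1

t = p_1_1 / np.sqrt(np.mean(variances))               # numerator: one difference; denominator: all folds
p_value = 2 * stats.t.sf(abs(t), df=5)                # referred to a t distribution with 5 df
print(f"t = {t:.3f}, p = {p_value:.3f}")
```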

  10. Re-sampled t-test
  • Randomly divide the data into training and test sets (usually 2/3 – 1/3)
  • Run multiple trials (usually 30)
  • Perform a paired t-test on the trial accuracies
  • This test has a very high probability of Type I error and should never be used.
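
Purely to make the warning concrete, here is what the re-sampled t-test looks like in code, again with the same hypothetical classifiers and dataset as above. It illustrates the procedure being criticized; it is not a recommendation.

```python
# Re-sampled t-test (NOT recommended): 30 random 2/3 - 1/3 splits, then a t-test on the differences.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

diffs = []
for trial in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=trial)
    a = GaussianNB().fit(X_tr, y_tr)
    b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    diffs.append(a.score(X_te, y_te) - b.score(X_te, y_te))

# The 30 trials reuse heavily overlapping data, so they are far from independent;
# this is why the Type I error is very high and why the slide says to avoid this test.
t0, p = stats.ttest_1samp(diffs, popmean=0.0)
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```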

  11. Calibrated Tests
  • Bouckaert (ICML 2003):
    • It is very difficult to estimate the true degrees of freedom, because the independence assumptions are violated
    • Instead of correcting the test statistic itself, calibrate the degrees of freedom of its reference distribution
    • Recommendation: use 10 times repeated 10-fold cross-validation with 10 degrees of freedom
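
The sketch below shows only the shape of this recommendation: all 100 differences from 10 times repeated 10-fold cross-validation are used, but the statistic is referred to a t distribution with 10 calibrated degrees of freedom rather than the nominal 100 − 1. The exact variance estimate used by Bouckaert (2003) is not reproduced here, and the classifiers and dataset are the same placeholders as in the earlier examples.

```python
# 10x10 cross-validation with degrees of freedom calibrated to 10 (simplified sketch).
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

diffs = []
for rep in range(10):                                 # 10 repetitions of 10-fold CV
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    for tr, te in cv.split(X, y):
        a = GaussianNB().fit(X[tr], y[tr])
        b = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
        diffs.append(a.score(X[te], y[te]) - b.score(X[te], y[te]))

diffs = np.array(diffs)
t0 = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
p_value = 2 * stats.t.sf(abs(t0), df=10)              # calibrated degrees of freedom, per the slide
print(f"t0 = {t0:.3f}, p (10 df) = {p_value:.3f}")
```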

  12. References
  • R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In Proceedings of ICML 2003, pp. 51–58, 2003.
  • T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
  • D. C. Montgomery et al. Engineering Statistics, 2nd edition. Wiley, 2001.
