Evaluation in Machine Learning

Pádraig Cunningham


Outline

  • Student’s t-test

  • Test for paired data

  • Cross Validation

  • McNemar’s Test

  • ROC Analysis

  • Other Statistical Tests for Evaluation


William Sealy Gosset

The t-statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules.

Wikipedia


Student’s t-Test

  • Scores by two rugby teams:

    • Is B better than A?


What does the t-statistic mean?

  • For a given t-statistic you can look up the confidence

  • i.e. if there were no real difference between the teams, a difference at least this large would arise by chance 31.7% of the time (according to this test).

[Figure: t = -0.485, probability = 31.7%]


Student’s t-Test

  • More data and/or a clearer difference will give statistical significance


Student’s t-Test (paired)

  • Scores paired, i.e. against same team

    • With paired data statistical significance can be determined with fewer observations

    • We can say with 95% confidence that B are better than A


Student’s t-test: Formulae

  • Two samples, A and B

    • x̄_A is the average in A, s²_A is the variance in A

    • Test for paired data, 1 Sample (D is difference in pairs)
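The formulae themselves did not survive in this transcript, so the following is only a minimal sketch of the two statistics named above, using their standard textbook forms. The unpaired version uses the unequal-variance (Welch) form, which gives the same t value as the pooled form when both samples are the same size; which form the slide showed is an assumption. The function names and the scores are hypothetical.

```python
import math
import statistics


def two_sample_t(a, b):
    """Unpaired t-statistic for two samples (unequal-variance form)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def paired_t(a, b):
    """Paired t-statistic: a one-sample test on the differences D = a_i - b_i."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


# Hypothetical scores for teams A and B against the same five opponents.
scores_a = [10, 20, 15, 30, 25]
scores_b = [13, 24, 18, 35, 28]

t_unpaired = two_sample_t(scores_a, scores_b)
t_paired = paired_t(scores_a, scores_b)
print(f"unpaired t = {t_unpaired:.3f}, paired t = {t_paired:.3f}")
```

With these made-up scores the paired statistic is much larger in magnitude than the unpaired one, because pairing removes the variation between opponents.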


Paired t-Test example

  • t-Test can be used for comparing errors in regression systems.

  • It can also be used for comparing classifiers if multiple test sets are available

    • Also with cross validation (more later)

  • = 5.2


Evaluation in Machine Learning

  • Supervised Learning

    • Typical Question:

      • Which is better, Classifier A or Classifier B?

    • Evaluate Generalisation Accuracy

    • Hold back some training data to use for testing

      • Use performance on Test data as a proxy for performance on unseen data (i.e. generalisation).

[Figure: Training Data divided into Train and Test sets]


Problems with ‘Hold-out’ Validation

  • Imagine 200 samples are available for training:

    • 50:50 split underestimates generalisation accuracy (only 100 samples used for training)

    • 80:20 split gives an estimate based on a small test sample (40)

      • Different hold-out sets give different results

[Figure: Accuracy plotted against # Samples, with 100, 160 and 200 samples marked]
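The two splits discussed above can be sketched as follows. This is an illustration only: `hold_out_split` is a hypothetical helper, and the 200 integers stand in for real training samples.

```python
import random


def hold_out_split(samples, test_fraction, seed=0):
    """Shuffle the samples and hold back a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


samples = list(range(200))  # stand-in for 200 labelled examples

train_50, test_50 = hold_out_split(samples, test_fraction=0.5)
train_80, test_80 = hold_out_split(samples, test_fraction=0.2)

print(len(train_50), len(test_50))  # 50:50 split
print(len(train_80), len(test_80))  # 80:20 split
```

Changing `seed` selects a different hold-out set, which is one way to see how much an accuracy estimate depends on the particular split.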


k-Fold Cross Validation

  • Having your cake and eating it too…

  • Divide data into k folds

    • For each fold in turn

      • Use that fold for testing and

      • Use the remainder of the data for training
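A minimal sketch of the fold construction described above, assuming the data have already been shuffled. The helper name `k_fold_splits` is hypothetical, and it works on sample indices rather than on the samples themselves.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Deal the indices round-robin into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # Training set: every index that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_splits(n_samples=200, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Every sample is used for testing exactly once and for training k - 1 times.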


Comparing Two Classifiers

Note: tuning is explicit

  • Divide dataset into k folds (say 10)

  • For each of the k folds

    • Create training and test sets T & S

    • Divide T into sets T1 and T2

    • For each of the classifiers

      • Use T2 to tune parameters on a model trained with T1

      • Use these ‘good’ parameters to train a model with T

      • Measure Accuracy on S

      • Record 0-1 loss results for each classifier

  • Assess significance of results (e.g. McNemar’s test).

(Salzberg, 1997)
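The procedure above can be sketched as follows. Everything concrete here is an assumption made for illustration: the two "classifiers" are a 1-nearest-neighbour model and a k-nearest-neighbour model whose k is tuned, the data are synthetic, and T is simply split in half to form T1 and T2.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def zero_one_losses(train, test, k):
    """0-1 loss (1 = misclassified) of k-NN for every example in test."""
    return [int(knn_predict(train, x, k) != y) for x, y in test]


def tune_k(t1, t2, candidates):
    """Pick the k that makes fewest errors on T2 for a model built from T1."""
    return min(candidates, key=lambda k: sum(zero_one_losses(t1, t2, k)))


def compare(data, candidates_a, candidates_b, n_folds=10):
    """Return per-example 0-1 losses for two classifiers over all folds."""
    losses_a, losses_b = [], []
    folds = [data[i::n_folds] for i in range(n_folds)]
    for i, s in enumerate(folds):  # S: test set for this fold
        t = [p for j, f in enumerate(folds) if j != i for p in f]  # T: training set
        t1, t2 = t[: len(t) // 2], t[len(t) // 2 :]  # split T for tuning
        for candidates, losses in ((candidates_a, losses_a), (candidates_b, losses_b)):
            k = tune_k(t1, t2, candidates)  # tune on T2 with a model from T1
            losses.extend(zero_one_losses(t, s, k))  # retrain on all of T, test on S
    return losses_a, losses_b


# Synthetic data: label is 1 when x > 0.5, flipped 10% of the time.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

losses_a, losses_b = compare(data, candidates_a=[1], candidates_b=[3, 5, 7, 9])
n01 = sum(a == 1 and b == 0 for a, b in zip(losses_a, losses_b))
n10 = sum(a == 0 and b == 1 for a, b in zip(losses_a, losses_b))
print("errors A:", sum(losses_a), "errors B:", sum(losses_b), "n01:", n01, "n10:", n10)
```

The test set S is never seen during tuning, so the recorded 0-1 losses are not biased by the parameter search. The counts of disagreements (misclassified by one classifier but not the other) are what a significance test such as McNemar's then works from.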


McNemar’s test

  • Which is better C1 or C2?

  • Which is better C2 or C3?

  • McNemar’s test captures this notion:

    • n01 number misclassified by 1st but not 2nd classifier

    • n10 number misclassified by 2nd but not 1st classifier

MNS score for C2 vs. C1 = 1/2

MNS score for C2 vs. C1 = 1/6

For the test to be applicable, (n01 + n10) > 10

A score > 3.84 is required for statistical significance at the 95% level
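The formula for the score is missing from this transcript. A minimal sketch, assuming the commonly used continuity-corrected form (|n01 - n10| - 1)² / (n01 + n10), which is compared against the chi-squared distribution with one degree of freedom. The function name and the counts are hypothetical.

```python
def mcnemar_score(n01, n10):
    """McNemar statistic with continuity correction (assumed form)."""
    if n01 + n10 == 0:
        raise ValueError("the two classifiers never disagree")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)


# Hypothetical counts: 20 examples misclassified only by the 1st classifier,
# 5 misclassified only by the 2nd.
n01, n10 = 20, 5
score = mcnemar_score(n01, n10)
applicable = (n01 + n10) > 10
significant = applicable and score > 3.84  # 95% level, 1 degree of freedom

print(f"score = {score:.2f}, applicable = {applicable}, significant = {significant}")
```

Under this form, counts of 2 and 0 give a score of 1/2, far below the 3.84 threshold; only the examples on which the classifiers disagree contribute to the score.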


McNemar’s test example


Other Tests

Dietterich’s 5x2cv paired t-test (Dietterich, 1998)

  • 5 repetitions of 2-fold cross validation

    • 2-fold means no overlap in training data

    • This gives 10 pairs of error estimates from which a t statistic can be derived

    • Advantage: flexible on choice of loss function

    • Disadvantage: training sets comprise only 50% of the data
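A minimal sketch of how the t statistic is derived from the 10 error estimates, following the definition in Dietterich (1998): the numerator is the difference in error on the first fold of the first replication, and the denominator is built from the per-replication variance estimates. The function name and the differences below are hypothetical.

```python
import math


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic.

    diffs holds 5 pairs (p1, p2): the difference in error rate between the
    two classifiers on fold 1 and fold 2 of each replication of 2-fold CV.
    """
    if len(diffs) != 5:
        raise ValueError("expected 5 replications of 2-fold cross validation")
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    # Numerator: difference from the first fold of the first replication.
    return diffs[0][0] / math.sqrt(sum(variances) / 5)


# Hypothetical differences in error rate (classifier A minus classifier B).
diffs = [(0.04, 0.02), (0.03, 0.05), (0.02, 0.04), (0.05, 0.03), (0.04, 0.02)]
t = five_by_two_cv_t(diffs)
print(f"t = {t:.3f}")  # compared against the t distribution with 5 degrees of freedom
```

Because the error rates can be computed under any loss function, the statistic is not tied to 0-1 loss.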

Demšar’s comparisons over multiple datasets (Demšar, 2006)

  • Comparisons between classifiers done on multiple datasets

    • This gives a table of results

    • Averaging across datasets is dodgy

    • Demšar’s Test

      • Wilcoxon Signed Ranks Test to compare a pair of classifiers

      • Friedman’s Test to combine these scores for multiple classifiers

      • Counts of wins, losses and ties

This methodology could become the standard
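A minimal sketch of two of the ingredients above for a pair of classifiers evaluated on several datasets: the win/loss/tie counts and the Wilcoxon signed-ranks statistic T = min(R+, R-). As a simplification, datasets with a zero difference are dropped before ranking (Demšar instead splits their ranks between the two sums). The accuracies, given in whole percent, are hypothetical.

```python
def wins_losses_ties(acc_a, acc_b):
    """Count datasets where A beats B, loses to B, or ties."""
    wins = sum(a > b for a, b in zip(acc_a, acc_b))
    losses = sum(a < b for a, b in zip(acc_a, acc_b))
    return wins, losses, len(acc_a) - wins - losses


def wilcoxon_t(acc_a, acc_b):
    """Wilcoxon signed-ranks statistic T = min(R+, R-), zero differences dropped."""
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    ordered = sorted(diffs, key=abs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Tied absolute differences share the average of ranks i+1 .. j.
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    r_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    r_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return min(r_plus, r_minus)


# Hypothetical accuracies (%) of classifiers A and B on six datasets.
acc_a = [80, 75, 90, 66, 85, 70]
acc_b = [78, 79, 85, 66, 84, 73]

print(wins_losses_ties(acc_a, acc_b))
print(wilcoxon_t(acc_a, acc_b))
```

The statistic T is then compared against a table of critical values for the number of datasets with non-zero differences; a small T indicates a consistent difference between the classifiers.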


Loss Functions

How you keep the score…

  • Regression

    • Quadratic Loss Function

      • Minimize Mean Squared Error

      • Big errors matter more

  • Classification

    • Misclassification Rate

      • aka: 0-1 Loss Function

    • Many alternatives are possible and appropriate in different circumstances, e.g. F measure.
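The three ways of keeping score mentioned above can be sketched as below. The F measure shown is the balanced form (harmonic mean of precision and recall), which is an assumption about the variant intended; the function names and data are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Quadratic loss: squaring makes big errors matter more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_rate(y_true, y_pred):
    """Average 0-1 loss: the fraction of examples labelled wrongly."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Balanced F measure: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical regression targets and predictions.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Hypothetical binary labels and predictions.
labels = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
error_rate = misclassification_rate(labels, predicted)
f1 = f_measure(labels, predicted)

print(mse, error_rate, f1)
```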


Loss Functions: ROC Curves

Ranking Classifiers

  • Many (binary) classifiers return a numeric score between 0 and 1.

  • Classifier bias can be controlled by adjusting a threshold.

  • For a given test set the ROC curve shows classifier performance over a range of thresholds/biases.
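A minimal sketch of how the points on a ROC curve can be computed: each distinct score is used in turn as the threshold, and the false positive rate and true positive rate are recorded. The function name, scores and labels are hypothetical, and the test set is assumed to contain at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for each threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier scores and true labels for a six-example test set.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1), trading more true positives for more false positives.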


References

Salzberg, S., (1997) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, 1, 317–327.

Dietterich, T.G., (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10, 1895–1924.

Demšar, J., (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7, 1–30.

