
Evaluation in Machine Learning




Pádraig Cunningham

- Student’s t-test
- Test for paired data
- Cross Validation
- McNemar’s Test
- ROC Analysis
- Other Statistical Tests for Evaluation

The t-statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules.

Wikipedia

- Scores by two rugby teams:
- Is B better than A?

- For a given t-statistic you can look up the corresponding significance level
- i.e. if there were no real difference between the teams, a difference at least this large would arise by chance 31.7% of the time (according to this test).

[Figure: t-distribution with t = -0.485 marked; 31.7%]

- More data and/or clearer difference will give statistical significance

- Scores paired, i.e. against same team
- With paired data statistical significance can be determined with fewer observations
- We can say with 95% confidence that B are better than A

- Two samples, A and B
- $\bar{x}_A$ is the average in A, $s_A^2$ is the variance in A
- Test for paired data, 1 sample ($D$ is the difference in pairs)
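The formulas themselves did not survive on this slide, so the following is only a sketch of the two statistics under stated assumptions. The two-sample version uses the unequal-variance (Welch) form, which may differ from the form originally shown, and the scores are made-up values, not the rugby scores from the slides. The function names `two_sample_t` and `paired_t` are illustrative.

```python
import math
import statistics

def two_sample_t(a, b):
    """t-statistic for two independent samples.

    Assumption: unequal-variance (Welch) form,
    t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b).
    """
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

def paired_t(a, b):
    """t-statistic for paired data: a one-sample test on the differences D."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Hypothetical scores for teams A and B against the same five opponents.
scores_a = [10, 20, 15, 30, 25]
scores_b = [13, 22, 19, 33, 27]

t_unpaired = two_sample_t(scores_b, scores_a)
t_paired = paired_t(scores_b, scores_a)
```

On this made-up data the paired statistic is much larger than the unpaired one, because pairing removes the opponent-to-opponent variation from the comparison.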

- t-Test can be used for comparing errors in regression systems.
- It can also be used for comparing classifiers if multiple test sets are available
- Also with cross validation (more later)

- = 5.2

[Figure: data U divided into Train and Test sets]

- Supervised Learning
- Typical Question:
- Which is better, Classifier A or Classifier B?

- Evaluate Generalisation Accuracy
- Hold back some training data to use for testing
- Use performance on the test data as a proxy for performance on unseen data (i.e. generalisation).


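[Figure: Accuracy plotted against # Samples of training data; 100, 160 and 200 samples marked]

- Imagine 200 samples are available for training:
- A 50:50 split underestimates generalisation accuracy (only 100 samples are used for training).
- An 80:20 split gives an estimate based on a small test sample (40).
- Different hold-out sets give different results.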
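As a minimal sketch of the hold-out arithmetic for 200 samples, the helper below draws a random split. The name `holdout_split` and the seeds are illustrative assumptions, not part of the original slides.

```python
import random

def holdout_split(n_samples, test_fraction, seed):
    """Return (train_indices, test_indices) for one random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 80:20 split of 200 samples: 160 for training, 40 for testing.
train_idx, test_idx = holdout_split(200, 0.2, seed=0)

# A different seed holds back a different 40 samples, so an accuracy
# estimate computed on it will generally differ as well.
_, other_test_idx = holdout_split(200, 0.2, seed=1)
```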

- Having your cake and eating it too…
- Divide data into k folds
- For each fold in turn
- Use that fold for testing and
- Use the remainder of the data for training
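The fold procedure above can be sketched as an index generator. This is a minimal illustration rather than a reference implementation; `k_fold_splits` is a hypothetical name, and shuffling before splitting is an added assumption.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # Training data is everything outside the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

splits = list(k_fold_splits(200, 10))
```

With 200 samples and 10 folds, every model is trained on 180 samples, and each sample is used for testing exactly once.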


Tuning is explicit

- Divide dataset into k folds (say 10)
- For each of the k folds
- Create training and test sets T & S
- Divide T into sets T1 and T2
- For each of the classifiers
- Use T2 to tune parameters on a model trained with T1
- Use these ‘good’ parameters to train a model with T
- Measure Accuracy on S
- Record 0-1 loss results for each classifier

- Assess significance of results (e.g. McNemar’s test).

(Salzberg, 1997)
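The procedure above can be sketched as follows for a single classifier. This is an illustration under stated assumptions: the toy 1-D k-nearest-neighbour learner (`knn_trainer`), the 50:50 division of T into T1 and T2, and the synthetic data are all hypothetical choices made only to keep the example self-contained.

```python
import random

def knn_trainer(train, k):
    """Toy 1-D k-nearest-neighbour learner (hypothetical stand-in classifier)."""
    def predict(x):
        nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
        votes = [label for _, label in nearest]
        return max(set(votes), key=votes.count)
    return predict

def accuracy(model, sample):
    return sum(model(x) == y for x, y in sample) / len(sample)

def tuned_cv_losses(data, train_fn, param_grid, k=10, seed=0):
    """Return the 0-1 loss of every sample, each predicted while held out in S."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    losses = [None] * len(data)
    for i, fold in enumerate(folds):
        # T is the training set for this fold; the fold itself is the test set S.
        T = [data[j] for m, other in enumerate(folds) if m != i for j in other]
        half = len(T) // 2  # assumption: T is divided 50:50 into T1 and T2
        T1, T2 = T[:half], T[half:]
        # Tune on T2 using models trained on T1 only.
        best = max(param_grid, key=lambda p: accuracy(train_fn(T1, p), T2))
        # Retrain on all of T with the chosen parameter, then score on S.
        model = train_fn(T, best)
        for j in fold:
            x, y = data[j]
            losses[j] = int(model(x) != y)
    return losses

# Synthetic, well-separated two-class data (illustrative only).
data = [(i * 0.1, 0) for i in range(20)] + [(10 + i * 0.1, 1) for i in range(20)]
losses_knn = tuned_cv_losses(data, knn_trainer, param_grid=[1, 3, 5])
```

Running the same function for a second classifier with the same seed gives per-sample 0-1 losses on identical folds, which is the input a paired test such as McNemar's needs.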

- Which is better C1 or C2?
- Which is better C2 or C3?
- McNemar’s test captures this notion:
- n01 number misclassified by 1st but not 2nd classifier
- n10 number misclassified by 2nd but not 1st classifier

MNS score for C2 vs. C1 = 1/2

MNS score for C2 vs. C1 = 1/6

For the test to be applicable, (n01 + n10) > 10

A score > 3.84 is required for statistical significance at the 95% level
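The score formula is missing from the extracted slide. The sketch below assumes the usual continuity-corrected McNemar statistic, (|n01 - n10| - 1)² / (n01 + n10), which reproduces the scores of 1/2 and 1/6 quoted above for disagreement counts such as (2, 0) and (4, 2). Those counts are illustrative, not taken from the slides, and the function names are hypothetical.

```python
def disagreement_counts(losses_1, losses_2):
    """Count n01 and n10 from per-sample 0-1 losses of two classifiers."""
    n01 = sum(1 for a, b in zip(losses_1, losses_2) if a == 1 and b == 0)
    n10 = sum(1 for a, b in zip(losses_1, losses_2) if a == 0 and b == 1)
    return n01, n10

def mcnemar_score(n01, n10):
    """Continuity-corrected McNemar statistic (assumed form)."""
    if n01 + n10 == 0:
        raise ValueError("the classifiers never disagree")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

def is_significant(n01, n10, critical_value=3.84):
    """Apply the 95% threshold, but only when the test is applicable."""
    return (n01 + n10) > 10 and mcnemar_score(n01, n10) > critical_value

score_a = mcnemar_score(2, 0)
score_b = mcnemar_score(4, 2)
```

Neither of these small examples meets the applicability condition, so neither difference could be called significant.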

Dietterich’s 5x2cv paired t-test (Dietterich, 1998)

- 5 repetitions of 2-fold cross validation
- 2-fold no overlap in training data
- This gives 10 pairs of error estimates from which a t statistic can be derived
- Advantage: flexible on choice of loss function
- Disadvantage: training sets comprise only 50% of the data
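A sketch of how the t statistic can be derived from the 10 error estimates, following the formula in Dietterich (1998): the numerator is the error difference from the first fold of the first repetition, and the denominator pools the per-repetition variances. The input values are made up, and `five_by_two_cv_t` is an illustrative name.

```python
import math

def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic.

    diffs: five (p1, p2) pairs, where p1 and p2 are the differences in error
    between the two classifiers on the two folds of one repetition.
    """
    if len(diffs) != 5:
        raise ValueError("expected 5 repetitions of 2-fold cross validation")
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / 5)

# Hypothetical error differences from 5 repetitions of 2-fold CV.
example_diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03), (0.02, 0.02), (0.04, 0.02)]
t_5x2 = five_by_two_cv_t(example_diffs)
```

The statistic is compared against a t distribution with 5 degrees of freedom, where the two-sided 95% critical value is about 2.571.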

Demšar’s comparisons over multiple datasets (Demšar, 2006)

- Comparisons between classifiers done on multiple datasets
- a table of results
- Averaging across datasets is dodgy
- Demšar’s Test
- Wilcoxon Signed Ranks Test to compare a pair of classifiers
- Friedman’s Test to combine these scores for multiple classifiers
- Counts of wins, losses and ties

This methodology could become the standard
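As a rough sketch of the pairwise step, the Wilcoxon signed ranks statistic can be computed from the accuracies of two classifiers over several datasets. Two assumptions are made here: zero differences are dropped (other conventions split them between the two rank sums), and the accuracies are made-up integer percentages. The function name is illustrative.

```python
def wilcoxon_signed_ranks(scores_a, scores_b):
    """Return (T, N): the smaller rank sum and the number of non-zero differences."""
    # Assumption: datasets on which the classifiers tie are dropped.
    diffs = [b - a for a, b in zip(scores_a, scores_b) if b != a]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Tied absolute differences share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for m in range(i, j):
            ranks[m] = average_rank
        i = j
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(r_plus, r_minus), len(ordered)

# Hypothetical accuracies (%) of two classifiers on six datasets.
acc_a = [80, 75, 90, 60, 85, 70]
acc_b = [84, 74, 93, 66, 85, 72]
T, N = wilcoxon_signed_ranks(acc_a, acc_b)
```

T is then compared with the tabulated critical value for N. The smaller T is, the stronger the evidence that one classifier is consistently better.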

How you keep the score…

- Regression
- Quadratic Loss Function
- Minimize Mean Squared Error
- Big errors matter more

- Classification
- Misclassification Rate
- aka: 0-1 Loss Function

- Many alternatives are possible and appropriate in different circumstances, e.g. F measure.
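A minimal sketch of the two loss functions named above, with made-up predictions. The function names are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Quadratic loss averaged over the samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def misclassification_rate(y_true, y_pred):
    """0-1 loss averaged over the samples."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Two regressors with the same total absolute error (4) on hypothetical targets.
targets = [0, 0, 0, 0]
mse_spread = mean_squared_error(targets, [1, 1, 1, 1])   # four errors of size 1
mse_single = mean_squared_error(targets, [0, 0, 0, 4])   # one error of size 4

error_rate = misclassification_rate(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```

The single large error gives a mean squared error four times that of the four small errors, which is the sense in which big errors matter more.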


Ranking Classifiers

- Many (binary) classifiers return a numeric score between 0 and 1.
- Classifier bias can be controlled by adjusting a threshold.
- For a given test set the ROC curve shows classifier performance over a range of thresholds/biases.
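The threshold sweep behind a ROC curve can be sketched as below. The labels and scores are made up, and `roc_points` is an illustrative name. An example is predicted positive when its score is at or above the threshold.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points for a test set.

    labels: 1 for positive, 0 for negative; scores: classifier outputs in [0, 1].
    The threshold is lowered step by step through the observed scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical test set: three positives and three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
```

Each point corresponds to one setting of the classifier's bias. The curve runs from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.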

Salzberg, S., (1997) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, 1, 317–327.

Dietterich, T.G., (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10, 1895–1924.

Demšar, J., (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7, 1–30.