Evaluation in Machine Learning

Evaluation in Machine Learning Pádraig Cunningham

Outline • Student’s t-test • Test for paired data • Cross Validation • McNemar’s Test • ROC Analysis • Other Statistical Tests for Evaluation

William Sealy Gosset • The t-statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules. (Source: Wikipedia)

Student’s t-Test • Scores by two rugby teams: • Is B better than A?

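What does the t-statistic mean? • For a given t-statistic you can look up the corresponding probability (p-value) • e.g. t = -0.485 gives 31.7%, i.e. if there were no real difference between the teams, a difference at least this large would still arise by chance 31.7% of the time (according to this test).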

Student’s t-Test • More data and/or clearer difference will give statistical significance

Student’s t-Test (paired) • Scores paired, i.e. against same team • With paired data statistical significance can be determined with fewer observations • We can say with 95% confidence that B are better than A

Student’s t-test: Formulae • Two samples, A and B • $\bar{x}_A$ is the average in A, $s_A^2$ is the variance in A • Test for paired data, 1 Sample (D is difference in pairs)
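The formulae on this slide did not survive in the transcript, so the following is only a sketch of the standard textbook forms, not necessarily the exact ones shown. It assumes the unpooled two-sample statistic, t = (mean(A) - mean(B)) / sqrt(var(A)/n_A + var(B)/n_B), and the one-sample statistic on the pairwise differences, t = mean(D) / (sd(D) / sqrt(n)). The function names and the scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance (divides by n - 1)


def two_sample_t(a, b):
    """Unpooled two-sample t statistic for independent samples a and b."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def paired_t(a, b):
    """One-sample t statistic on the pairwise differences D = a_i - b_i."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (sqrt(variance(d)) / sqrt(len(d)))


# Made-up scores for two teams against the same four opponents.
team_a = [10, 12, 9, 14]
team_b = [12, 15, 10, 18]

print(two_sample_t(team_a, team_b))  # treats the scores as unpaired
print(paired_t(team_a, team_b))      # uses the pairing
```

On these made-up scores the paired statistic is considerably larger in magnitude than the unpaired one, because pairing removes the opponent-to-opponent variation from the comparison.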

Paired t-Test example • t-Test can be used for comparing errors in regression systems. • It can also be used for comparing classifiers if multiple test sets are available • Also with cross validation (more later) • = 5.2

Evaluation in Machine Learning • Supervised Learning • Typical Question: • Which is better, Classifier A or Classifier B? • Evaluate Generalisation Accuracy • Hold back some training data to use for testing • Use performance on Test data as a proxy for performance on unseen data (i.e. generalisation). (Figure labels: U, Train, Test, Training Data.)

Problems with ‘Hold-out’ Validation • Imagine 200 samples are available for training: • 50:50 split underestimates generalisation accuracy • 80:20 split gives an estimate based on a small test sample (40) • Different hold-out sets give different results (Figure: accuracy against # samples, with 100, 160 and 200 marked.)
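A minimal sketch of a random hold-out split, to make the 80:20 arithmetic concrete. The function name `hold_out_split` and its parameters are hypothetical, and the split works on sample indices only.

```python
import random


def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a different seed gives a different hold-out set
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = hold_out_split(200, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 160 40
```

Changing `seed` changes which 40 samples are held out, which is one way to see how much the accuracy estimate depends on the particular split.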

k-Fold Cross Validation • Having your cake and eating it too… • Divide data into k folds • For each fold in turn • Use that fold for testing and • Use the remainder of the data for training
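The fold rotation described above can be sketched in a few lines. This is an illustrative index generator, not a library API: the name `k_fold_indices` is made up, and it assumes the data have already been shuffled, since it cuts contiguous folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]                  # this fold is used for testing
        train = indices[:start] + indices[start + size:]    # the remainder is used for training
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("test:", test, "train:", train)
```

Every sample appears in exactly one test fold, so all of the data is used for testing while each model is still trained on most of it.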

Comparing Two Classifiers (tuning is explicit) • Divide dataset into k folds (say 10) • For each of the k folds • Create training and test sets T & S • Divide T into sets T1 and T2 • For each of the classifiers • Use T2 to tune parameters on a model trained with T1 • Use these ‘good’ parameters to train a model with T • Measure Accuracy on S • Record 0-1 loss results for each classifier • Assess significance of results (e.g. McNemar’s test). (Salzberg, 1997)
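The procedure above can be sketched as follows. Everything here is an assumption made for illustration: the two toy learners, the 2:1 division of T into T1 and T2, the interleaved fold assignment, and the synthetic one-dimensional data. It is not Salzberg's code, only one way to lay out the steps.

```python
from statistics import mean


def train_threshold(offset, X, y):
    """Toy 1-D learner: threshold midway between the class means, shifted by offset."""
    mean0 = mean([x for x, t in zip(X, y) if t == 0])
    mean1 = mean([x for x, t in zip(X, y) if t == 1])
    threshold = (mean0 + mean1) / 2 + offset
    return lambda x: int(x > threshold)


def train_majority(_param, X, y):
    """Toy baseline: always predict the most frequent training label."""
    label = max(sorted(set(y)), key=y.count)
    return lambda x: label


def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


def compare_classifiers(X, y, learners, k=10):
    """learners maps a name to (train_fn, candidate_params).

    Returns, per learner, the 0-1 loss on every sample when it was in the test set S.
    """
    n = len(y)
    losses = {name: [0] * n for name in learners}
    for fold in range(k):
        s_idx = [i for i in range(n) if i % k == fold]   # test set S
        t_idx = [i for i in range(n) if i % k != fold]   # training set T
        cut = 2 * len(t_idx) // 3
        t1_idx, t2_idx = t_idx[:cut], t_idx[cut:]        # T1 fits, T2 tunes
        pick = lambda idx: ([X[i] for i in idx], [y[i] for i in idx])
        for name, (train_fn, params) in learners.items():
            def tuning_accuracy(param):
                model = train_fn(param, *pick(t1_idx))
                return accuracy(model, *pick(t2_idx))
            best = max(params, key=tuning_accuracy)      # 'good' parameters
            model = train_fn(best, *pick(t_idx))         # retrain on all of T
            for i in s_idx:
                losses[name][i] = int(model(X[i]) != y[i])
    return losses


# Synthetic data: a shuffled-looking permutation of 0..39, class 1 when x >= 20.
X = [(7 * i) % 40 for i in range(40)]
y = [int(x >= 20) for x in X]
learners = {
    "threshold": (train_threshold, [0.0, -5.0, 5.0]),
    "majority": (train_majority, [None]),
}
losses = compare_classifiers(X, y, learners, k=10)
print({name: sum(v) for name, v in losses.items()})  # total errors per classifier
```

The per-sample 0-1 losses for the two classifiers line up index by index, which is the form a paired test such as McNemar's needs. The test set S is never touched during tuning.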

McNemar’s test • Which is better C1 or C2? • Which is better C2 or C3? • McNemar’s test captures this notion: • n01 number misclassified by 1st but not 2nd classifier • n10 number misclassified by 2nd but not 1st classifier • MNS score = (|n01 − n10| − 1)² / (n01 + n10) • MNS score for C2 vs C1 = 1/2 • MNS score for C2 vs C1 = 1/6 • For the test to be applicable, (n01 + n10) > 10 • MNS score > 3.84 required for statistical significance at 95%
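A small sketch of the computation, assuming the usual continuity-corrected form of the statistic, (|n01 - n10| - 1)^2 / (n01 + n10), which is consistent with the 1/2 and 1/6 scores quoted on the slide. The function name and the predictions are made up for illustration.

```python
def mcnemar_statistic(truth, pred1, pred2):
    """Return (statistic, n01, n10) for two classifiers' predictions on one test set."""
    # n01: misclassified by the 1st classifier but not the 2nd.
    n01 = sum(p1 != t and p2 == t for t, p1, p2 in zip(truth, pred1, pred2))
    # n10: misclassified by the 2nd classifier but not the 1st.
    n10 = sum(p1 == t and p2 != t for t, p1, p2 in zip(truth, pred1, pred2))
    if n01 + n10 == 0:
        raise ValueError("the classifiers never disagree; the test does not apply")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10), n01, n10


# Made-up predictions on 30 test samples, all of true class 1.
truth = [1] * 30
pred1 = [0] * 15 + [1] * 15            # 1st classifier wrong on the first 15
pred2 = [1] * 15 + [0] * 3 + [1] * 12  # 2nd classifier wrong on the next 3

stat, n01, n10 = mcnemar_statistic(truth, pred1, pred2)
print(n01, n10, stat)
print("applicable:", n01 + n10 > 10, "significant at 95%:", stat > 3.84)
```

Only the samples on which the two classifiers disagree enter the statistic. Samples that both get right or both get wrong carry no information about which one is better.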

McNemar’s test example

Other Tests • Dietterich’s 5x2cv paired t-test (Dietterich, 1998) • 5 repetitions of 2-fold cross validation • 2-fold → no overlap in training data • This gives 10 pairs of error estimates from which a t statistic can be derived • + flexible on choice of loss function • − training sets comprise 50% of data

Other Tests • Demšar’s comparisons over multiple datasets (Demšar, 2006) • Comparisons between classifiers done on multiple datasets → a table of results • Averaging across datasets is dodgy • Demšar’s Test • Wilcoxon Signed Ranks Test to compare a pair of classifiers • Friedman’s Test to combine these scores for multiple classifiers • Counts of wins, losses and ties • This methodology could become the standard
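For the 5x2cv paired t-test, a sketch of how the t statistic is derived from the 10 pairs of error estimates, following the definition in Dietterich (1998): the numerator is the error difference from the first fold of the first repetition, the denominator is the square root of the mean of the five per-repetition variance estimates, and the result is compared against a t distribution with 5 degrees of freedom. The function name and the error differences are made up for illustration.

```python
from math import sqrt


def five_by_two_cv_t(diffs):
    """diffs: five (p1, p2) pairs, one per repetition of 2-fold cross validation.

    p1 and p2 are the differences in error between the two classifiers
    on the first and second fold of that repetition.
    """
    if len(diffs) != 5:
        raise ValueError("expected exactly 5 repetitions")
    variances = []
    for p1, p2 in diffs:
        p_mean = (p1 + p2) / 2
        variances.append((p1 - p_mean) ** 2 + (p2 - p_mean) ** 2)
    denominator = sqrt(sum(variances) / 5)
    if denominator == 0:
        raise ValueError("zero variance across folds; statistic undefined")
    return diffs[0][0] / denominator


# Made-up error differences (classifier A error minus classifier B error).
diffs = [(0.04, 0.02), (0.03, 0.05), (0.02, 0.04), (0.05, 0.03), (0.04, 0.02)]
t = five_by_two_cv_t(diffs)
print(t)
```

With 5 degrees of freedom, the two-sided critical value at the 95% level is about 2.571, so a statistic of larger magnitude would indicate a significant difference under this test.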

Loss Functions How you keep the score… • Regression • Quadratic Loss Function • Minimize Mean Squared Error • Big errors matter more • Classification • Misclassification Rate • aka: 0-1 Loss Function • Many alternatives are possible and appropriate in different circumstances, e.g. F measure.
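The three scores named above can be written out directly. This sketch assumes the F measure means the balanced F1 form, 2TP / (2TP + FP + FN); the function names and the small inputs are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Quadratic loss averaged over samples: big errors matter more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_rate(y_true, y_pred):
    """0-1 loss averaged over samples."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_measure(y_true, y_pred, positive=1):
    """Balanced F measure for the class labelled `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        raise ValueError("no positive samples or predictions; F1 undefined")
    return 2 * tp / (2 * tp + fp + fn)


print(mean_squared_error([1, 2, 3], [1, 2, 6]))            # 3.0
print(misclassification_rate([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
print(f1_measure([1, 0, 1, 1], [1, 1, 0, 1]))
```

In the regression example a single error of 3 contributes 9 to the sum, which is the sense in which quadratic loss penalises big errors more heavily.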

Loss Functions: ROC Curves Ranking Classifiers • Many (binary) classifiers return a numeric score between 0 and 1. • Classifier bias can be controlled by adjusting a threshold. • For a given test set the ROC curve shows classifier performance over a range of thresholds/biases.
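The threshold sweep behind a ROC curve can be sketched as below. It assumes labels are 0 or 1 and that a sample is called positive when its score is at or above the threshold; the function name and the scores are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
        fp = sum(s >= threshold and l == 0 for s, l in zip(scores, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up classifier scores for six test samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

Each point corresponds to one choice of threshold, so moving along the curve is the same as adjusting the classifier's bias towards or away from the positive class.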

References
Salzberg, S. (1997) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, 1, 317–327.
Dietterich, T.G. (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10, 1895–1924.
Demšar, J. (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7, 1–30.
