# Learning Algorithm Evaluation

Algorithm evaluation: Outline
• Why? Overfitting
• How? Train/Test vs Cross-validation
• What? Evaluation measures
• Who wins? Statistical significance
Introduction
• A model should perform well on unseen data drawn from the same distribution
Classification accuracy
• A basic performance measure
• Success: instance’s class is predicted correctly
• Error: instance’s class is predicted incorrectly
• Error rate: #errors / #instances
• Accuracy: #successes / #instances
• Quiz (worked out in the sketch below)
• 50 examples, 10 classified incorrectly
• Accuracy? Error rate?
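
As a quick check of the quiz arithmetic, a minimal sketch in plain Python (the code is illustrative, not part of the original slides):

```python
# Quiz: 50 examples, 10 classified incorrectly
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                # 10 / 50 = 0.20
accuracy = (n_instances - n_errors) / n_instances  # 40 / 50 = 0.80

print(f"accuracy = {accuracy:.0%}, error rate = {error_rate:.0%}")
```
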
Evaluation

Rule #1

Never evaluate on training data!

Train and Test

Step 1: Randomly split data into training and test set (e.g. 2/3-1/3)

a.k.a. holdout set

Train and Test

Step 2: Train model on training data

Train and Test

Step 3: Evaluate model on test data
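
A minimal sketch of Steps 1 to 3 (scikit-learn and one of its built-in datasets are assumptions; the slides themselves do not prescribe a tool):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: random 2/3 - 1/3 split into training and holdout (test) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# Step 2: train the model on the training data only
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 3: evaluate the model on the held-out test data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```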

Train and Test

Quiz: Can I retry with other parameter settings?

Evaluation

Rule #1

Never evaluate on training data!

Rule #2

Never train on test data!

(that includes parameter setting or feature selection)

Train and Test

Step 4: Optimize parameters on separate validation set


Test data leakage
• Never use test data to create the classifier
• Can be tricky: e.g. social network
• Proper procedure uses three sets (sketched after this list)
• training set: train models
• validation set: optimize algorithm parameters
• test set: evaluate final model
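
A sketch of the three-set procedure, assuming scikit-learn; the dataset, split sizes, and the max_depth grid are illustrative choices only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Split off the test set first; it is not touched until the final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Optimize an algorithm parameter on the validation set only.
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 5, None):
    model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Evaluate the final model once on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```
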
Making the most of the data
• Once evaluation is complete, all the data can be used to build the final classifier
• Trade-off: performance ↔ evaluation accuracy
• More training data, better model (but returns diminish)
• More test data, more accurate error estimate
Train and Test

Step 5: Build final model on ALL data (more data, better model)

k-fold Cross-validation
• Split the data (stratified) into k folds
• Use (k-1) folds for training, 1 fold for testing
• Repeat k times
• Average the k results (see the sketch below)

[Figure: the original data is split into 3 folds; in each of the 3 runs, one fold serves as the test set and the remaining folds form the training set]
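
A sketch of stratified k-fold cross-validation (k = 3 to match the figure; scikit-learn is an assumption):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
k = 3
scores = []

# Split the data (stratified) into k folds; each fold is the test set exactly once.
for train_idx, test_idx in StratifiedKFold(n_splits=k, shuffle=True, random_state=0).split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average the k results.
print(f"{k}-fold CV accuracy: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
```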

Cross-validation
• Standard method:
• Stratified ten-fold cross-validation
• Why 10? Enough to reduce sampling bias
• Experimentally determined
Leave-One-Out Cross-validation
• A particular form of cross-validation:
• #folds = #instances
• n instances, build classifier n times
• Makes best use of the data, no sampling bias
• Computationally expensive

[Figure: with 100 instances there are 100 folds; each run holds out a single instance as the test set and trains on the other 99]
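
Leave-one-out is ordinary cross-validation with one fold per instance; a brief sketch, again assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# n instances -> n folds: the classifier is built n times, once per held-out instance.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```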

ROC Analysis
• Stands for “Receiver Operating Characteristic”
• From signal processing: tradeoff between hit rate and false alarm rate over noisy channel
• Compute FPR, TPR and plot them in ROC space
• Every classifier is a point in ROC space
• For probabilistic algorithms
• Collect many points by varying prediction threshold
• Or, make cost sensitive and vary costs (see below)
Confusion Matrix

|             | actual +            | actual −            |
|-------------|---------------------|---------------------|
| predicted + | TP (true positive)  | FP (false positive) |
| predicted − | FN (false negative) | TN (true negative)  |
| total       | TP + FN             | FP + TN             |

TP rate (sensitivity) = TP / (TP + FN)

FP rate (fall-out) = FP / (FP + TN)
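
A small sketch computing both rates from a set of predictions (scikit-learn's confusion_matrix is assumed; the labels and predictions are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tp_rate = tp / (tp + fn)  # sensitivity
fp_rate = fp / (fp + tn)  # fall-out
print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")
```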

ROC space

[Figure: ROC space (FP rate on the x-axis, TP rate on the y-axis) with each classifier plotted as a single point, e.g. OneR, J48, and J48 with fitted parameters]

ROC curves

Change the prediction threshold t: predict + when P(+) > t; each value of t yields one (FP rate, TP rate) point, and sweeping t traces the curve

Area Under Curve (AUC), e.g. AUC = 0.75 for the example curve

ROC curves
• Alternative method (easier, but less intuitive)
• Rank probabilities
• Start the curve in (0,0) and move down the ranked probability list
• If the instance is positive, move up; if negative, move right (see the sketch below)
• Jagged curve: one set of test data
• Smooth curve: use cross-validation
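
The ranking construction above can be written out directly; a minimal NumPy sketch with made-up probabilities and labels:

```python
import numpy as np

# Predicted probabilities of the positive class and the true labels (illustrative).
probs  = np.array([0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    0,    1,    1,    0,    0,    1,    0,    0])

# Rank instances by predicted probability, highest first.
order = np.argsort(-probs)
n_pos = labels.sum()
n_neg = len(labels) - n_pos

# Start in (0, 0); move up by 1/n_pos for a positive, right by 1/n_neg for a negative.
tpr, fpr = [0.0], [0.0]
for i in order:
    if labels[i] == 1:
        tpr.append(tpr[-1] + 1 / n_pos)
        fpr.append(fpr[-1])
    else:
        tpr.append(tpr[-1])
        fpr.append(fpr[-1] + 1 / n_neg)

# Area under the (jagged) curve via the trapezoid rule.
print("AUC =", round(np.trapz(tpr, fpr), 3))
```
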
ROC curves: Method selection
• Overall: use method with largest Area Under ROC curve (AUROC)
• If you aim to cover just 40% of true positives in a sample: use method A
• If you aim to cover a larger sample of true positives: use method B
• In between: choose between A and B with appropriate probabilities

ROC Space and Costs

[Figure: iso-cost lines in ROC space under equal costs vs. skewed costs]
Different Costs
• In practice, FP and FN errors incur different costs
• Examples:
• Medical diagnostic tests: does X have leukemia?
• Loan decisions: approve mortgage for X?
• Promotional mailing: will X buy the product?
• Add a cost matrix to the evaluation that weights TP, FP, FN, TN differently (sketched below)
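
A sketch of cost-sensitive evaluation: weight the confusion-matrix counts by per-cell costs (the cost values below are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Illustrative cost matrix: correct decisions cost 0,
# a false positive costs 1, a missed positive (FN) costs 10.
cost_fp, cost_fn = 1.0, 10.0
total_cost = fp * cost_fp + fn * cost_fn
print("average cost per instance:", total_cost / len(y_true))
```
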
Comparing data mining schemes
• Which of two learning algorithms performs better?
• Note: this is domain dependent!
• Obvious way: compare 10-fold CV estimates
• Problem: variance in estimate
• Variance can be reduced using repeated CV
• However, we still don’t know whether results are reliable
Significance tests
• Significance tests tell us how confident we can be that there really is a difference
• Null hypothesis: there is no “real” difference
• Alternative hypothesis: there is a difference
• A significance test measures how much evidence there is in favor of rejecting the null hypothesis
• E.g. 10 cross-validation scores: B better than A?

[Figure: distributions P(perf) of per-fold scores for Algorithm A and Algorithm B along the performance axis, with the individual scores and the two means (mean A, mean B) marked]

Paired t-test


• Student’s t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
• Use a paired t-test when the individual samples are paired
• i.e., they use the same randomization
• Same CV folds are used for both algorithms


William Gosset

Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England

Worked as a chemist in the Guinness brewery in Dublin from 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".

Performing the test


• Fix a significance level α
• A significant difference at the α% level implies a (100−α)% chance that there really is a difference
• Scientific work: 5% or smaller (> 95% certainty)
• Divide α by two (two-tailed test)
• Look up the z-value corresponding to α/2
• If t ≤ −z or t ≥ z: the difference is significant
• i.e., the null hypothesis can be rejected

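A sketch of the whole procedure using SciPy's paired t-test (scipy.stats.ttest_rel); the per-fold scores are invented, and SciPy reports the p-value directly rather than requiring a z-table lookup:

```python
from scipy.stats import ttest_rel

# Accuracy per fold for algorithms A and B on the SAME 10 CV folds (illustrative numbers).
scores_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
scores_b = [0.84, 0.82, 0.86, 0.83, 0.81, 0.85, 0.84, 0.83, 0.82, 0.84]

t_stat, p_value = ttest_rel(scores_b, scores_a)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"significant difference (t = {t_stat:.2f}, p = {p_value:.4f}): reject the null hypothesis")
else:
    print(f"no significant difference (t = {t_stat:.2f}, p = {p_value:.4f})")
```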