Learning Algorithm Evaluation
Algorithm evaluation: Outline
  • Why?
    • Overfitting
  • How?
    • Train/Test vs Cross-validation
  • What?
    • Evaluation measures
  • Who wins?
    • Statistical significance
Introduction
  • A model should perform well on unseen data drawn from the same distribution
Classification accuracy
  • Performance measures:
    • Success: instance’s class is predicted correctly
    • Error: instance’s class is predicted incorrectly
    • Error rate: #errors/#instances
    • Accuracy: #successes/#instances
  • Quiz
    • 50 examples, 10 classified incorrectly
      • Accuracy? Error rate?
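
As a minimal sketch of the quiz arithmetic (plain Python; my own illustration, not part of the original slides):

```python
# 50 examples, 10 classified incorrectly.
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances   # 10/50 = 0.2
accuracy = 1 - error_rate             # 40/50 = 0.8

print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```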
Evaluation

Rule #1

Never evaluate on training data!

Train and Test

Step 1: Randomly split the data into a training set and a test set (e.g. 2/3-1/3); the held-out test set is also known as the holdout set.

Train and Test

Step 2: Train model on training data

Train and Test

Step 3: Evaluate model on test data
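
A minimal sketch of Steps 1-3 in Python with scikit-learn (my own illustration; the slides do not prescribe a toolkit, and `make_classification` plus a decision tree merely stand in for a real dataset and learner):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# Step 1: random 2/3 - 1/3 split; the held-out third is the test (holdout) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Step 2: train the model on the training data only.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate on the test data, which the model has never seen.
print("test accuracy:", model.score(X_test, y_test))
```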

Train and Test

Quiz: Can I retry with other parameter settings?

Evaluation

Rule #1

Never evaluate on training data!

Rule #2

Never train on test data!

(that includes parameter setting or feature selection)

Train and Test

Step 4: Optimize parameters on separate validation set

Test data leakage
  • Never use test data to create the classifier
    • Can be subtle: e.g. with social network data, test instances may be linked to training instances
  • Proper procedure uses three sets
    • training set: train models
    • validation set: optimize algorithm parameters
    • test set: evaluate final model
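
A sketch of the three-set procedure, again using scikit-learn as an assumed toolkit (the split sizes and the `max_depth` parameter being tuned are illustrative choices, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Carve off the test set first; it is not touched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Optimize a parameter on the validation set only.
best_depth, best_acc = None, -1.0
for depth in (1, 2, 4, 8, None):
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(
        X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Evaluate the chosen setting exactly once on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```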
Making the most of the data
  • Once evaluation is complete, all the data can be used to build the final classifier
  • Trade-off: performance vs. evaluation accuracy
    • More training data, better model (but returns diminish)
    • More test data, more accurate error estimate
Train and Test

Step 5: Build final model on ALL data (more data, better model)

k-fold Cross-validation
  • Split the data (stratified) into k folds
  • Use k−1 folds for training, 1 fold for testing
  • Repeat k times, so each fold is used once for testing
  • Average the results

[Diagram: the original data is split into folds 1-3; in each round one fold is the test set and the remaining folds form the training set.]
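
A sketch of stratified 10-fold cross-validation with scikit-learn (the decision tree is only a stand-in for whichever learner is being evaluated; the slides themselves use Weka-style classifiers such as J48):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each of the k = 10 stratified folds is used once for testing, the rest for training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```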

Cross-validation
  • Standard method:
    • Stratified ten-fold cross-validation
  • 10? Enough to reduce sampling bias
    • Experimentally determined
Leave-One-Out Cross-validation
  • A particular form of cross-validation:
    • #folds = #instances
    • n instances, build classifier n times
  • Makes best use of the data, no sampling bias
  • Computationally expensive

[Diagram: with 100 instances, the data is split into 100 folds (Fold 1 ... Fold 100), each holding out a single instance for testing.]
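
The same idea with #folds = #instances, sketched with scikit-learn's LeaveOneOut (again my illustration, not the slides' tooling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

# n = 100 instances, so the classifier is trained 100 times,
# each time testing on the single held-out instance.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())
```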

ROC Analysis
  • Stands for “Receiver Operating Characteristic”
  • From signal processing: tradeoff between hit rate and false alarm rate over noisy channel
  • Compute FPR, TPR and plot them in ROC space
  • Every classifier is a point in ROC space
  • For probabilistic algorithms
    • Collect many points by varying prediction threshold
    • Or, make the classifier cost-sensitive and vary the costs (see below)
Confusion Matrix

                   actual +                actual -
predicted +    TP (true positive)      FP (false positive)
predicted -    FN (false negative)     TN (true negative)
column total       TP + FN                 FP + TN

TP rate (sensitivity) = TP / (TP + FN)
FP rate (fall-out) = FP / (FP + TN)
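
A small sketch of these quantities computed from predictions (the labels are made-up; `confusion_matrix` from scikit-learn is an assumed convenience, the formulas are the ones above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # actual classes (+ = 1, - = 0)
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])   # predicted classes

# With labels 0/1, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # sensitivity / hit rate
fpr = fp / (fp + tn)   # fall-out / false alarm rate
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```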

ROC space

[Plot: ROC space with each classifier shown as a single point, e.g. OneR and J48 (with default and with fitted parameters).]

ROC curves

Change the prediction threshold t (predict + when P(+) > t); each threshold gives one (FPR, TPR) point, and sweeping t traces out the ROC curve.

[Plot: example ROC curve with Area Under Curve (AUC) = 0.75.]
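
A sketch of sweeping the threshold and computing the AUC with scikit-learn (logistic regression is just a convenient probabilistic classifier for the illustration; the slides do not use it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A probabilistic classifier: varying the threshold on P(+) sweeps out the ROC curve.
probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC =", roc_auc_score(y_test, probs))
```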

ROC curves
  • Alternative method (easier, but less intuitive):
  • Rank the test instances by predicted probability of +
  • Start the curve at (0,0) and move down the ranked list
  • For each positive instance move up; for each negative move right
  • A jagged curve results from a single test set
  • For a smoother curve, average over cross-validation folds
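
The ranking method sketched directly, with made-up probabilities (moving up 1/P per positive and right 1/N per negative):

```python
import numpy as np

# Made-up test instances: predicted P(+) and true class, one pair per instance.
probs  = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,    1,   0,   1,   1,   0,   1,   0,   0,   0])

order = np.argsort(-probs)                      # rank by decreasing probability
P, N = labels.sum(), len(labels) - labels.sum()

x, y = [0.0], [0.0]                             # start the curve in (0, 0)
for lab in labels[order]:
    if lab == 1:                                # positive: move up
        x.append(x[-1]); y.append(y[-1] + 1 / P)
    else:                                       # negative: move right
        x.append(x[-1] + 1 / N); y.append(y[-1])

# (x, y) now traces the jagged ROC curve for this single test set; it ends at (1, 1).
```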
ROC curves: Method selection
  • Overall: use the method with the largest Area Under the ROC Curve (AUROC)
  • To cover only a small fraction (e.g. 40%) of the true positives in a sample: use method A
  • To cover a large fraction: use method B
  • In between: choose between A and B with appropriate probabilities

ROC Space and Costs

[Plot: ROC space compared under equal costs and under skewed costs.]
Different Costs
  • In practice, FP and FN errors incur different costs
  • Examples:
    • Medical diagnostic tests: does X have leukemia?
    • Loan decisions: approve mortgage for X?
    • Promotional mailing: will X buy the product?
  • Add a cost matrix to the evaluation that weights TP, FP, ...
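
One way to sketch this: weight the confusion-matrix cells by a cost matrix (the 10:1 cost ratio below is a made-up example, not a figure from the slides):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# Hypothetical cost matrix, rows = actual (-, +), columns = predicted (-, +):
# a missed positive (FN) is assumed to be 10x as costly as a false alarm (FP).
costs = np.array([[0, 1],     # actual -: TN costs 0, FP costs 1
                  [10, 0]])   # actual +: FN costs 10, TP costs 0

cm = confusion_matrix(y_true, y_pred)   # same layout: rows = actual, cols = predicted
print("total misclassification cost:", (cm * costs).sum())
```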
Comparing data mining schemes
  • Which of two learning algorithms performs better?
    • Note: this is domain dependent!
  • Obvious way: compare 10-fold CV estimates
  • Problem: variance in estimate
    • Variance can be reduced using repeated CV
    • However, we still don’t know whether results are reliable
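
A sketch of the comparison with repeated cross-validation (the two learners and the 10x10 design are illustrative choices; the spread of each estimate shrinks, but this alone gives no verdict on reliability):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 10 repeats of 10-fold CV reduce the variance of each estimate,
# but by themselves do not tell us whether the difference is reliable.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores_a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scores_b = cross_val_score(GaussianNB(), X, y, cv=cv)
print("A: mean %.3f (std %.3f)" % (scores_a.mean(), scores_a.std()))
print("B: mean %.3f (std %.3f)" % (scores_b.mean(), scores_b.std()))
```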
Significance tests
  • Significance tests tell us how confident we can be that there really is a difference
    • Null hypothesis: there is no “real” difference
    • Alternative hypothesis: there is a difference
  • A significance test measures how much evidence there is in favor of rejecting the null hypothesis
  • E.g. 10 cross-validation scores: B better than A?

[Plot: the distributions P(perf) of the scores of Algorithm A and Algorithm B along the performance axis, with the individual fold scores scattered around mean A and mean B.]

Paired t-test

  • Student’s t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
  • Use a paired t-test when the individual samples are paired
    • i.e., they use the same randomization
    • Same CV folds are used for both algorithms
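
A sketch of the paired setup: both algorithms are scored on the same 10 folds, and `scipy.stats.ttest_rel` performs the paired t-test (the two learners are again stand-ins, not the slides' algorithms):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# The same 10 folds (same randomization) are used for both algorithms, so the scores are paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scores_b = cross_val_score(GaussianNB(), X, y, cv=cv)

t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # small p: evidence against "no real difference"
```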


William Gosset

Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England

Worked as a chemist at the Guinness brewery in Dublin from 1899. He invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student".

Performing the test


  • Fix a significance level α
    • A significant difference at the α% level implies a (100−α)% chance that there really is a difference
    • Scientific work: 5% or smaller (>95% certainty)
  • Divide α by two (two-tailed test)
  • Look up the z-value corresponding to α/2
  • If t ≤ −z or t ≥ z: the difference is significant
    • i.e., the null hypothesis can be rejected
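
A sketch of the test procedure on made-up fold scores; the slide looks up a z-value, whereas with only 10 folds the usual choice is the critical value of the t distribution with k−1 degrees of freedom, which is what is used here:

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_rel

# Made-up paired scores from 10 CV folds.
scores_a = np.array([0.81, 0.79, 0.84, 0.80, 0.78, 0.82, 0.83, 0.79, 0.80, 0.81])
scores_b = np.array([0.84, 0.83, 0.85, 0.84, 0.80, 0.86, 0.85, 0.82, 0.83, 0.84])

alpha = 0.05                                               # significance level
t_stat, _ = ttest_rel(scores_a, scores_b)
t_crit = t_dist.ppf(1 - alpha / 2, df=len(scores_a) - 1)   # two-tailed critical value

if abs(t_stat) >= t_crit:
    print("significant difference: reject the null hypothesis")
else:
    print("no significant difference at this level")
```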
