
Model Assessment and Selection

Lecture Notes for Comp540, Chapter 7

Jian Li

March 2007


Goal

  • Model Selection

  • Model Assessment


A Regression Problem

  • y = f(x) + noise

  • Can we learn f from this data?

  • Let’s consider three methods: linear regression, quadratic regression, and joining the dots.





Which is best?

  • Why not choose the method with the best fit to the data?

“How well are you going to predict future data drawn from the same distribution?”


Model Selection and Assessment

  • Model Selection: estimating the performance of different models in order to choose the best one (the one with the minimum test error)

  • Model Assessment: Having chosen a model, estimating the prediction error on new data


Why Errors

  • Why do we want to study errors?

  • In a data-rich situation, split the data into three parts:

    • Train: fit the models

    • Validation: estimate prediction error in order to select the best model (model selection)

    • Test: assess the error of the final chosen model (model assessment)

  • But, that’s not usually the case
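To make the split concrete, here is a minimal sketch (the 50/25/25 proportions, the synthetic data, and the function name are illustrative assumptions, not from the slides):

```python
import numpy as np

def three_way_split(X, y, rng, frac_train=0.5, frac_val=0.25):
    """Randomly partition the data into train / validation / test sets."""
    N = len(y)
    idx = rng.permutation(N)
    n_tr = int(frac_train * N)
    n_val = int(frac_val * N)
    tr, val, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
train, val, test = three_way_split(X, y, rng)
# Fit candidate models on train, pick the winner on val,
# and report the winner's error on test only once, at the end.
```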


Overall Motivation

  • Errors

    • Measurement of errors (Loss functions)

    • Decomposing Test Error into Bias & Variance

  • Estimating the true error

    • Estimating in-sample error (analytically)

      AIC, BIC, MDL, SRM with VC

    • Estimating extra-sample error (efficient sample reuse)

      Cross Validation & Bootstrapping


Measuring Errors: Loss Functions

  • Typical regression loss functions:

    • Squared error: $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$

    • Absolute error: $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$


Measuring Errors: Loss Functions

  • Typical classification loss functions:

    • 0-1 loss: $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$

    • Log-likelihood (cross-entropy loss / deviance): $L(G, \hat{p}(X)) = -2\log \hat{p}_G(X)$
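As a rough illustration, these losses might be written in NumPy as follows (function names are mine; the deviance version assumes p_hat is an N-by-K matrix of class probabilities and g holds integer class indices):

```python
import numpy as np

def squared_error(y, f_hat):
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    return np.abs(y - f_hat)

def zero_one_loss(g, g_hat):
    # 1 where the predicted class differs from the true class, else 0
    return (g != g_hat).astype(float)

def deviance(g, p_hat):
    # -2 * log of the probability assigned to the true class
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

g = np.array([0, 1, 1])
p_hat = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
print(zero_one_loss(g, p_hat.argmax(axis=1)), deviance(g, p_hat))
```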


The Goal: Low Test Error

  • We want to minimize the generalization error, or test error: $\text{Err} = E[L(Y, \hat{f}(X))]$

  • But all we really know is the training error: $\overline{\text{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$

  • And this is a bad estimate of test error.


Bias, Variance & Complexity

Training error can always be reduced by increasing model complexity, but this risks over-fitting.

Typically, as model complexity grows, training error decreases monotonically while test error first falls and then rises again.


Decomposing Test Error

For squared-error loss and additive noise, i.e. the model $Y = f(X) + \varepsilon$ with $E[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma_\varepsilon^2$:

$$\text{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2$$

  • $\sigma_\varepsilon^2$: irreducible error of the target Y

  • $[E\hat{f}(x_0) - f(x_0)]^2$: squared bias, the deviation of the average estimate from the true function

  • $E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$: variance, the expected squared deviation of our estimate around its mean
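A hedged simulation of this decomposition (the sine target, noise level, and cubic-polynomial fit are illustrative choices, not from the slides): repeatedly redraw the training set, record $\hat{f}(x_0)$, and check that $\sigma^2 + \text{bias}^2 + \text{variance}$ matches a direct estimate of $\text{Err}(x_0)$.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # true function (assumed)
sigma, N, reps, x0, degree = 0.3, 30, 2000, 0.5, 3

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + sigma * rng.normal(size=N)
    coef = np.polyfit(x, y, degree)      # fit a cubic polynomial
    preds[r] = np.polyval(coef, x0)      # its prediction at x0

bias2 = (preds.mean() - f(x0)) ** 2      # [E f_hat(x0) - f(x0)]^2
var = preds.var()                        # E[f_hat(x0) - E f_hat(x0)]^2
print("sigma^2 + bias^2 + var :", sigma ** 2 + bias2 + var)

# Direct estimate of Err(x0): error of each fit on a fresh response at x0.
fresh = f(x0) + sigma * rng.normal(size=reps)
print("direct Err(x0) estimate:", np.mean((fresh - preds) ** 2))
```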


Further Bias Decomposition

  • For linear models (e.g. ridge regression), the average squared bias can be further decomposed:

$$E_{x_0}\big[f(x_0) - E\hat{f}_\alpha(x_0)\big]^2 = E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E(x_0^T\hat{\beta}_\alpha)\big]^2 = \text{Ave[Model Bias]}^2 + \text{Ave[Estimation Bias]}^2$$

    where $\beta_*$ is the best-fitting linear approximation to f.

  • For standard linear regression, Estimation Bias = 0; for restricted fits such as ridge regression, it is positive, and is traded off against reduced variance.


Graphical Representation of Bias & Variance

[Figure: the bias-variance tradeoff pictured in function space. The truth lies in the hypothesis space, outside the model space of basic linear regression. The closest fit in population (if $\varepsilon = 0$) differs from the truth by the model bias. Each realization of the data yields, via model fitting, a closest fit given our observation, which scatters around the population fit (estimation variance). The regularized model space of ridge regression produces a shrunken fit, adding estimation bias but reducing estimation variance.]


Bias & Variance Decomposition Examples

  • kNN regression (averaging over the training set):

$$\text{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$

  • Linear regression (linear weights on y, $\hat{f}_p(x_0) = x_0^T\hat{\beta} = \sum_i h_i(x_0)\, y_i$):

$$\text{Err}(x_0) = \sigma_\varepsilon^2 + \big[f(x_0) - E\hat{f}_p(x_0)\big]^2 + \|h(x_0)\|^2\,\sigma_\varepsilon^2$$


Simulated Example of the Bias-Variance Decomposition

[Figure: prediction error, squared bias, and variance curves from a simulated example.]

  • Regression with squared-error loss: the decomposition is additive, Bias² + Variance = Prediction error.

  • Classification with 0-1 loss: the bias-variance behavior is different, Bias² + Variance ≠ Prediction error. Estimation errors on the right side of the boundary don’t hurt!


Optimism of the Training Error Rate

  • Typically, the training error rate is less than the true error: $\overline{\text{err}} < \text{Err}$

    (the same data is being used to fit the method and assess its error, so $\overline{\text{err}}$ is overly optimistic)


Estimating Test Error

  • Can we estimate the discrepancy between $\overline{\text{err}}$ and Err (the extra-sample error)?

  • An easier target is the in-sample error, Err_in, which takes an expectation over N new responses at each training point $x_i$:

$$\text{Err}_{\text{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^0}\big[L(Y_i^0, \hat{f}(x_i))\big]$$

  • The optimism, $\text{op} = \text{Err}_{\text{in}} - \overline{\text{err}}$, is the adjustment for the optimism of the training error.


Optimism

Summary: for squared error, 0-1, and other loss functions,

$$\omega \equiv E_{\mathbf{y}}(\text{op}) = \frac{2}{N}\sum_{i=1}^{N}\text{Cov}(\hat{y}_i, y_i)$$

For a linear fit with d independent inputs or basis functions, $\sum_i \text{Cov}(\hat{y}_i, y_i) = d\,\sigma_\varepsilon^2$, so

$$E_{\mathbf{y}}(\text{Err}_{\text{in}}) = E_{\mathbf{y}}(\overline{\text{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

Optimism increases linearly with the number of inputs d, and decreases as the training sample size grows.
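A quick Monte Carlo check of this result for ordinary least squares (the Gaussian design and noise are assumptions): the average gap between in-sample error and training error should come out near $2d\sigma^2/N$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, reps = 50, 5, 1.0, 5000
X = rng.normal(size=(N, d))                   # fixed design
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix: y_hat = H y
f_true = X @ rng.normal(size=d)               # true mean function

opt = np.empty(reps)
for r in range(reps):
    y = f_true + sigma * rng.normal(size=N)        # training responses
    y_new = f_true + sigma * rng.normal(size=N)    # fresh responses at same x_i
    y_hat = H @ y
    opt[r] = np.mean((y_new - y_hat) ** 2) - np.mean((y - y_hat) ** 2)

print("average optimism:", opt.mean())        # should be near 2*d*sigma^2/N
print("2 d sigma^2 / N :", 2 * d * sigma ** 2 / N)
```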


Ways to Estimate Prediction Error

  • In-sample error estimates:

    • AIC

    • BIC

    • MDL

    • SRM

  • Extra-sample error estimates:

    • Cross-Validation

      • Leave-one-out

      • K-fold

    • Bootstrap


Estimates of In-Sample Prediction Error

  • General form of the in-sample estimate: $\widehat{\text{Err}}_{\text{in}} = \overline{\text{err}} + \hat{\omega}$

  • For a linear fit with d parameters under squared-error loss, this yields the $C_p$ statistic: $C_p = \overline{\text{err}} + 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2$


AIC & BIC

Similarly, using a log-likelihood loss leads to the Akaike Information Criterion (AIC):

$$\text{AIC} = -\frac{2}{N}\,\text{loglik} + 2\,\frac{d}{N}$$

and the Bayesian Information Criterion (BIC):

$$\text{BIC} = -2\,\text{loglik} + (\log N)\, d$$
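A sketch of these criteria for a family of polynomial fits under a Gaussian model (estimating $\hat{\sigma}^2$ from the largest candidate model is a common convention assumed here; names are mine):

```python
import numpy as np

def gaussian_criteria(y, y_hat, d, sigma2_hat):
    """C_p, AIC and BIC for a linear fit with d parameters,
    using the Gaussian log-likelihood."""
    N = len(y)
    err = np.mean((y - y_hat) ** 2)                       # training error
    cp = err + 2 * d / N * sigma2_hat
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2_hat) + err / sigma2_hat)
    aic = -2 / N * loglik + 2 * d / N
    bic = -2 * loglik + np.log(N) * d
    return cp, aic, bic

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = 1 + 2 * x - x ** 2 + 0.3 * rng.normal(size=100)
# sigma^2 estimated from the most complex candidate (degree 6)
resid = y - np.polyval(np.polyfit(x, y, 6), x)
sigma2_hat = resid.var()
for degree in range(1, 7):                                # d = degree + 1
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    print(degree, gaussian_criteria(y, y_hat, degree + 1, sigma2_hat))
```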



MDL (Minimum Description Length)

  • Regularity ~ Compressibility

  • Learning ~ Finding regularities

[Diagram: a learning model maps input samples in $R^n$ to predictions in $R^1$; the predictions are compared (=?) with the real classes in $R^1$ produced by the real model, and the discrepancy is the error.]


MDL (Minimum Description Length)

  • Regularity ~ Compressibility

  • Learning ~ Finding regularities

  • The description length splits into two parts: the description of the model under an optimal coding, plus the length of transmitting the discrepancy given the model (under the optimal coding for that model):

$$\text{length} = -\log \Pr(\text{model}) - \log \Pr(\text{data} \mid \text{model})$$

  • MDL principle: choose the model with the minimum description length.

  • This is equivalent to maximizing the posterior $\Pr(\text{model} \mid \text{data})$.


SRM with VC (Vapnik-Chervonenkis) Dimension

  • Structural risk minimization: a method of selecting a class F from a family of nested classes of increasing VC dimension h.

  • Vapnik showed that, with probability at least $1 - \eta$ (for classification with VC dimension h),

$$\text{Err} \leq \overline{\text{err}} + \sqrt{\frac{h\,\big[\log(2N/h) + 1\big] - \log(\eta/4)}{N}}$$

  • The penalty term grows as h increases.
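For a feel of how the bound behaves, a small sketch evaluating it numerically (the exact constants vary across statements of the theorem, so treat this as one common variant rather than the definitive form):

```python
import numpy as np

def vc_bound(train_err, N, h, eta=0.05):
    """Upper bound on the true error that holds with probability
    at least 1 - eta, for a classifier class of VC dimension h."""
    penalty = np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)
    return train_err + penalty

# The penalty grows with h and shrinks with N:
for h in (1, 5, 20, 100):
    print(h, vc_bound(train_err=0.1, N=1000, h=h))
```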


Err_in Estimation

  • A trade-off between the fit to the data and the model complexity


Estimation of Extra-Sample Err

  • Cross Validation

  • Bootstrap


Cross-Validation

[Figure: K-fold cross-validation. The data is split into K parts; in each round, one part serves as the test fold while the model is trained on the remaining K - 1 parts, cycling through all K folds.]
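A minimal NumPy implementation of K-fold CV (the polynomial model family and helper names are illustrative assumptions):

```python
import numpy as np

def kfold_cv_error(x, y, degree, K, rng):
    """Average squared test error over K folds for a polynomial fit."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + 0.2 * rng.normal(size=120)
for degree in (1, 3, 5, 9):        # pick the degree with the smallest CV error
    print(degree, kfold_cv_error(x, y, degree, K=10, rng=rng))
```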


How many folds?

  • As k increases from a small k-fold toward leave-one-out (k = N):

    • computation increases,

    • the bias of the error estimate decreases (each training set is nearly the full sample),

    • but the variance of the estimate increases.


Cross-Validation: Choosing K

Popular choices for K: 5, 10, and N (leave-one-out).


Generalized Cross-Validation

  • LOOCV can be computationally expensive for linear fitting with large N.

  • For a linear fitting method, $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$, and under squared-error loss the LOOCV estimate can be computed from a single fit:

$$\frac{1}{N}\sum_{i=1}^{N}\Big[\frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}\Big]^2$$

  • GCV provides a computationally cheaper approximation:

$$\text{GCV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\Big[\frac{y_i - \hat{f}(x_i)}{1 - \text{trace}(\mathbf{S})/N}\Big]^2$$
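A sketch of both shortcut formulas, using ridge regression as the linear fitter (the ridge example and the value of lambda are assumptions; S is the smoother matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, lam = 100, 10, 1.0
X = rng.normal(size=(N, d))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)

# Ridge is a linear fit: y_hat = S y with S = X (X'X + lam I)^-1 X'
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
resid = y - S @ y

loocv = np.mean((resid / (1 - np.diag(S))) ** 2)      # per-point leverages
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)   # trace approximation
print(loocv, gcv)
```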


Bootstrap: Main Concept

“The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas”

(An Introduction to the Bootstrap, Efron and Tibshirani, 1993)

Step 1: Draw B samples of size N, with replacement, from the data.

Step 2: Calculate the statistic of interest on each bootstrap sample.
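Those two steps, sketched for the standard error of a sample mean (the exponential data and B = 2000 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=50)        # the observed sample
B = 2000

# Step 1: draw B samples of size N with replacement from the sample.
boot = rng.choice(x, size=(B, len(x)), replace=True)
# Step 2: calculate the statistic (here, the mean) on each resample.
boot_means = boot.mean(axis=1)

print("bootstrap SE of the mean:", boot_means.std())
print("theory, s / sqrt(N)     :", x.std(ddof=1) / np.sqrt(len(x)))
```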


How does it work?

  • Example: the sampling distribution of the sample mean. Theory tells us this sampling distribution, but in practice we cannot afford to draw a large number of independent random samples from the population.

  • The bootstrap idea: the sample stands for the population, and the distribution of the statistic across many resamples stands for its sampling distribution.


Bootstrap: Error Estimation with Err_boot

  • The prediction error depends on the unknown true distribution F.

  • A straightforward application of the bootstrap to error prediction: fit the model on each bootstrap sample and evaluate it on the original training set,

$$\widehat{\text{Err}}_{\text{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big)$$

  • Problem: the bootstrap samples (acting as training sets) overlap with the original sample (acting as the test set), so $\widehat{\text{Err}}_{\text{boot}}$ is overly optimistic.


Bootstrap: Error Estimation with Err(1)

A CV-inspired improvement on Err_boot, the leave-one-out bootstrap: for each observation, only use predictions from bootstrap samples that do not contain that observation,

$$\widehat{\text{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big)$$

where $C^{-i}$ is the set of indices of bootstrap samples not containing observation i.
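A sketch of Err(1) for a 1-nearest-neighbour classifier (the classifier, data, and helper names are my choices for illustration):

```python
import numpy as np

def nn1_predict(X_tr, y_tr, X_te):
    """1-nearest-neighbour prediction."""
    dists = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[dists.argmin(axis=1)]

rng = np.random.default_rng(7)
N, B = 60, 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

loss_sums = np.zeros(N)   # summed 0-1 loss per point over samples omitting it
counts = np.zeros(N)      # |C^{-i}|: number of bootstrap samples omitting i
for b in range(B):
    idx = rng.integers(0, N, size=N)           # one bootstrap sample
    out = np.setdiff1d(np.arange(N), idx)      # observations left out of it
    if len(out) == 0:
        continue
    pred = nn1_predict(X[idx], y[idx], X[out])
    loss_sums[out] += (pred != y[out])
    counts[out] += 1

err1 = np.mean(loss_sums[counts > 0] / counts[counts > 0])
print("Err(1) estimate:", err1)
```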


Bootstrap: Error Estimation with Err(.632)

An improvement on Err(1) in light-fitting cases: each bootstrap sample contains only about 0.632·N distinct observations, so Err(1) tends to be biased upward; the .632 estimate pulls it back toward the training error,

$$\widehat{\text{Err}}^{(.632)} = 0.368\,\overline{\text{err}} + 0.632\,\widehat{\text{Err}}^{(1)}$$


Bootstrap: Error Estimation with Err(.632+)

An improvement on Err(.632) by adaptively accounting for overfitting:

  • Depending on the amount of overfitting, the best error estimate is as little as Err(.632), or as much as Err(1), or something in between.

  • Err(.632+) is like Err(.632) with adaptive weights, with Err(1) weighted at least .632.

  • Err(.632+) adaptively mixes training error and leave-one-out bootstrap error using the relative overfitting rate (R).


Bootstrap: Error Estimation with Err(.632+)

With the no-information error rate $\hat{\gamma}$ (the error rate of $\hat{f}$ if inputs and labels were independent), define the relative overfitting rate and adaptive weight:

$$\hat{R} = \frac{\widehat{\text{Err}}^{(1)} - \overline{\text{err}}}{\hat{\gamma} - \overline{\text{err}}}, \qquad \hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}}$$

$$\widehat{\text{Err}}^{(.632+)} = (1 - \hat{w})\,\overline{\text{err}} + \hat{w}\,\widehat{\text{Err}}^{(1)}$$
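A sketch combining these pieces (the function name is mine; for a binary classifier under 0-1 loss, $\hat{\gamma} = \hat{p}_1(1 - \hat{q}_1) + (1 - \hat{p}_1)\hat{q}_1$, with $\hat{p}_1$ the observed proportion of class 1 and $\hat{q}_1$ the proportion of class-1 predictions):

```python
import numpy as np

def err_632_plus(err_bar, err1, gamma):
    """Combine training error, leave-one-out bootstrap error, and the
    no-information rate into the .632 and .632+ estimates."""
    err632 = 0.368 * err_bar + 0.632 * err1
    R = 0.0                                  # relative overfitting rate
    if gamma > err_bar and err1 > err_bar:
        R = min((err1 - err_bar) / (gamma - err_bar), 1.0)
    w = 0.632 / (1 - 0.368 * R)              # adaptive weight in [.632, 1]
    return err632, (1 - w) * err_bar + w * err1

# e.g. 1-NN has err_bar = 0; suppose Err(1) = 0.15 and gamma = 0.5:
print(err_632_plus(err_bar=0.0, err1=0.15, gamma=0.5))
```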


Cross-Validation & Bootstrap

  • Why bother with cross-validation and the bootstrap when analytical estimates are known?

    1. AIC, BIC, MDL, and SRM all require knowledge of d, which is difficult to obtain in most situations.

    2. The bootstrap and cross-validation give results similar to the above, but are also applicable in more complex situations.

    3. Estimating the noise variance requires a roughly working model; cross-validation and the bootstrap work well even if the model is far from correct.


Conclusion

  • Test error plays a crucial role in model selection

  • AIC, BIC and SRM with VC have the advantage that you only need the training error

  • If VC-dimension is known, then SRM is a good method for model selection – requires much less computation than CV and bootstrap, but is wildly conservative

  • Methods like CV, Bootstrap give tighter error bounds, but might have more variance

  • Asymptotically AIC and Leave-one-out CV should be the same

  • Asymptotically BIC and a carefully chosen k-fold CV should be the same

  • BIC is what you want if you want the best structure instead of the best predictor

  • Bootstrap has much wider applicability than just estimating prediction error

