
Model assessment and cross-validation - overview


Presentation Transcript


  1. Model assessment and cross-validation - overview
  • Bias-variance decomposition and tradeoff
  • Analytical validation methods (AIC, BIC)
  • Resampling methods (cross-validation, bootstrap methods)

  2. Bias, variance and model complexity

  3. Bias, variance decomposition, and tradeoff
  Expected prediction error at a point x0 (squared-error loss):
  Err(x0) = E[(Y − f̂(x0))² | X = x0] = σ²_ε + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²
          = irreducible error + bias² + variance
  We would like to minimize Err! Increasing the complexity of a model increases its variance and decreases its bias.
  Example: smoothing based on nearest neighbours (see the simulation sketch below).
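
To make the tradeoff concrete, here is a minimal simulation sketch (not part of the original slides): it repeatedly draws training sets, fits a k-nearest-neighbour smoother at a single point, and estimates bias² and variance empirically. The names `true_f` and `knn_fit_at` and all constants are illustrative assumptions.

```python
# Bias-variance simulation for k-nearest-neighbour smoothing (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """The 'true' regression function used to generate data (assumed for this sketch)."""
    return np.sin(2 * np.pi * x)

def knn_fit_at(x0, x, y, k):
    """k-NN estimate at x0: average the responses of the k nearest training points."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

x0, sigma, n, n_sim = 0.25, 0.3, 50, 2000
for k in (1, 5, 15, 30):
    fits = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.uniform(0.0, 1.0, n)
        y = true_f(x) + rng.normal(0.0, sigma, n)
        fits[s] = knn_fit_at(x0, x, y, k)
    bias2 = (fits.mean() - true_f(x0)) ** 2
    variance = fits.var()
    # Small k = complex model: low bias, high variance; large k: the reverse.
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={variance:.4f}")
```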

  4. Loss function, training error, and test error - quantitative response (regression)
  • Loss functions (examples): squared error L(Y, f̂(X)) = (Y − f̂(X))², absolute error L(Y, f̂(X)) = |Y − f̂(X)|
  • Training error: err = (1/N) Σ_i L(y_i, f̂(x_i)), the average loss over the training sample
  • Test error: Err = E[L(Y, f̂(X))], where the expectation is taken over the joint distribution of (X, Y)

  5. Loss function, training error, and test error - qualitative response (classification)
  Model: G ∈ {1, 2, …, K}, p_k(X) = Pr(G = k | X).
  Decision rule: Ĝ(X) = arg max_k p̂_k(X).
  Examples of loss functions:
  • 0-1 loss: L(G, Ĝ(X)) = I(G ≠ Ĝ(X))
  • Cross-entropy loss (log-likelihood): L(G, p̂(X)) = −2 Σ_k I(G = k) log p̂_k(X) = −2 log p̂_G(X)
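
The two losses above can be computed directly. The following sketch (not from the slides) assumes class labels coded 0, …, K−1 and a matrix `p_hat` of estimated class probabilities with one row per observation; the example data are arbitrary.

```python
import numpy as np

def zero_one_loss(g_true, g_pred):
    """Average 0-1 loss, i.e. the misclassification rate."""
    return np.mean(np.asarray(g_true) != np.asarray(g_pred))

def cross_entropy_loss(g_true, p_hat):
    """Deviance form: -2 * average log of the probability assigned to the true class."""
    p_true_class = p_hat[np.arange(len(g_true)), g_true]
    return -2.0 * np.mean(np.log(p_true_class))

# Tiny example with K = 2 classes and three observations.
g = np.array([0, 1, 1])
p = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.6, 0.4]])
print(zero_one_loss(g, p.argmax(axis=1)))  # 1/3: the last observation is misclassified
print(cross_entropy_loss(g, p))
```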

  6. Training error and test error - qualitative response (classification)
  • Training error
    0-1 loss: err = (1/N) Σ_i I(g_i ≠ Ĝ(x_i)), the misclassification rate
    Cross-entropy loss: err = −(2/N) Σ_i log p̂_{g_i}(x_i)
  • Test error: Err = E[L(G, Ĝ(X))]

  7. Model selection
  Assume that the model f_α(x) depends on a tuning parameter (model complexity parameter) α.
  Examples of α:
  • Number of predictors (multiple regression)
  • Degree of a polynomial (polynomial regression)
  • Penalty factor (smoothing splines, ridge regression)
  • Window width (kernels)
  The aim of model selection: to find the model with minimum test error.

  8. Model selection and assessment
  Model selection: estimate the performance of different models in order to choose the best one.
  Model assessment: having chosen the final model, estimate its test error on new data.

  9. Model selection and assessment in data-rich applications
  • Training set (to produce a fit, approx. 50%)
  • Validation set (for model selection, approx. 25%)
  • Test set (for model assessment, approx. 25%)
  Example (splines):
  • Fit the training set using models with smoothing factors λ1, …, λn.
  • Using the fitted splines f1*(x), …, fn*(x), estimate the prediction error on the validation set and choose the model #i producing the minimal error.
  • Estimate the generalization error of model #i using the test set.
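
A minimal sketch (not from the slides) of the 50/25/25 split; the function name, the use of a random permutation, and the seed are illustrative assumptions.

```python
import numpy as np

def split_50_25_25(X, y, seed=0):
    """Randomly split (X, y) into training (~50%), validation (~25%) and test (~25%) sets."""
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)
    n_train, n_val = n // 2, n // 4
    i_train = order[:n_train]
    i_val = order[n_train:n_train + n_val]
    i_test = order[n_train + n_val:]
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

# Usage: fit the candidate models on the training part, pick the one with the smallest
# validation error, and report the test error of that single chosen model.
```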

  10. Model selection and assessment in applications with insufficient data
  • Analytical expressions to select and assess models
    • Cp (correction for the number of inputs or basis functions)
    • AIC (Akaike's information criterion)
    • BIC (Bayesian information criterion)
  • Resampling
    • Cross-validation
    • The bootstrap

  11. Analytical validation methods
  Background: a model typically overfits the data, so the prediction error will on average be higher than the training error.
  Terminology: the difference between the (in-sample) prediction error and the training error is called the optimism.
  Basic idea: find an analytical expression for the optimism.

  12. Optimism
  • Training error rate: err = (1/N) Σ_i L(y_i, f̂(x_i))
  • In-sample error: Err_in = (1/N) Σ_i E_{Y0}[L(Y0_i, f̂(x_i)) | T], the expected prediction error at the training inputs x_i when new responses Y0 are observed there
  • Optimism: op = Err_in − err, with average optimism ω = E_y(op)

  13. Analytical expressions for the optimism
  1. For squared error, 0-1 loss, and some other loss functions: ω = (2/N) Σ_i Cov(ŷ_i, y_i)
  2. For linear models with d inputs or basis functions (squared-error loss): Σ_i Cov(ŷ_i, y_i) = d σ²_ε, so ω = 2 (d/N) σ²_ε

  14. Cp scores
  Basic idea: ω ≈ 2 (d/N) σ̂²_ε when d parameters are fitted under squared loss.
  Compute Cp = err + 2 (d/N) σ̂²_ε, where σ̂²_ε is an estimate of the noise variance, and choose the model with the smallest Cp score.
  Properties:
  • The penalization increases with increasing model complexity (d) and decreases as the training sample size N increases.
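
As a sketch (not from the slides), the Cp score can be computed from the training residuals once the number of fitted parameters d and a noise-variance estimate are available; `sigma2_hat` would typically come from a low-bias (large) model and is an assumed input here.

```python
import numpy as np

def cp_score(y, y_hat, d, sigma2_hat):
    """Cp = training error + 2 * (d / N) * sigma2_hat (squared-error loss)."""
    n = len(y)
    train_err = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return train_err + 2.0 * d / n * sigma2_hat
```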

  15. Akaike's information criterion
  When the likelihood is maximized, it holds asymptotically that
  −2 E[log Pr_θ̂(Y)] ≈ −(2/N) E[loglik] + 2 (d/N), where loglik = Σ_i log Pr_θ̂(y_i).
  Given a tuning parameter α, we set AIC(α) = err(α) + 2 (d(α)/N) σ̂²_ε, where d(α) is the effective number of parameters.
  For Gaussian models, AIC is equivalent to Cp.

  16. Effective number of parameters d(α)
  For linear smoothers, the vector of fitted values can be written ŷ = S y, where the smoother matrix S depends on the inputs x_i (and on α) but not on y.
  Examples:
  • Simple linear regression (here trace(S) equals the exact number of parameters), ridge regression
  • Smoothing splines
  • Kernel smoothers
  Define the effective number of parameters as d(α) = trace(S).
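
For ridge regression the smoother matrix is S = X (XᵀX + λI)⁻¹ Xᵀ, so d(λ) = trace(S) can be computed directly. This short sketch (not from the slides) uses simulated inputs to show how the effective number of parameters shrinks as the penalty grows.

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective number of parameters d(lambda) = trace(S) for the ridge smoother matrix."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

X = np.random.default_rng(0).normal(size=(100, 10))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_effective_df(X, lam), 2))  # decreases from 10 toward 0
```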

  17. Bayesian information criterion
  Based on Bayesian theory we set BIC = −2 loglik + log(N) d.
  • For Gaussian models with noise variance σ²_ε: BIC = (N/σ²_ε) [err + log(N) (d/N) σ²_ε]
  Properties:
  • BIC is proportional to AIC if "2" is substituted for log(N)
  • Since log(N) > 2 for N > 7, BIC penalizes complex models more heavily than AIC
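
A hedged sketch (not from the slides) of AIC and BIC for a Gaussian, squared-error model with d effective parameters, on the scale used on slides 15 and 17; `sigma2_hat` is an externally supplied noise-variance estimate, and scaling constants differ between textbooks.

```python
import numpy as np

def aic_score(y, y_hat, d, sigma2_hat):
    """AIC (Gaussian / squared-error form): training error + 2 * (d / N) * sigma2_hat."""
    n = len(y)
    return np.mean((y - y_hat) ** 2) + 2.0 * d / n * sigma2_hat

def bic_score(y, y_hat, d, sigma2_hat):
    """BIC (Gaussian form): (N / sigma2_hat) * [training error + log(N) * (d / N) * sigma2_hat]."""
    n = len(y)
    return n / sigma2_hat * (np.mean((y - y_hat) ** 2) + np.log(n) * d / n * sigma2_hat)

# Model selection: evaluate the score for each candidate model (each value of the
# tuning parameter) and keep the model with the smallest score.
```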

  18. Features of AIC and BIC
  For large samples (asymptotic property):
  • BIC chooses the right model (if it is present among the alternatives)
  • AIC tends to choose models that are too complex
  For small samples:
  • BIC tends to choose models that are too simple
  • AIC is OK

  19. Resampling methods: cross-validation
  K-fold cross-validation (rough scheme):
  1. Divide the data set into K roughly equally sized subsets.
  2. Remove subset #i and fit the model using the remaining data.
  3. Predict the function values for subset #i using the fitted model.
  4. Repeat steps 2-3 for each i.
  5. CV = average squared difference between observed and predicted values (another loss function is possible).

  20. Resampling methods: cross-validation
  K-fold cross-validation: CV(f̂) = (1/N) Σ_i L(y_i, f̂^(−κ(i))(x_i)), where κ(i) is the fold containing observation i and f̂^(−k) is the model fitted with the k-th fold removed.
  Note: if K = N, the method is leave-one-out cross-validation.

  21. Model selection using cross-validation
  Having a model f(x, α) that depends on a tuning (complexity) parameter α, compute CV(f̂, α) = (1/N) Σ_i L(y_i, f̂^(−κ(i))(x_i, α)) and choose the value of α with the smallest CV.
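
A compact sketch (not from the slides) of K-fold cross-validation used to choose a tuning parameter; the fitting routine (ridge regression), the simulated data, and the candidate grid are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def k_fold_cv(X, y, lam, K=5, seed=0):
    """Average squared prediction error over K folds for a given penalty lam."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return np.mean(errors)

# Choose the tuning parameter with the smallest CV error.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=120)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(grid, key=lambda lam: k_fold_cv(X, y, lam))
print("chosen lambda:", best)
```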

  22. Prediction error and cross-validation curve estimated from a single training set

  23. Generalized cross-validation
  Basic idea: approximate leave-one-out CV to make it faster.
  Used for linear smoothers ŷ = S y: GCV(f̂) = (1/N) Σ_i [ (y_i − f̂(x_i)) / (1 − trace(S)/N) ]²
  Note: in smoothing problems, GCV behaves similarly to AIC.
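
A brief sketch (not from the slides) of the GCV score for a linear smoother ŷ = S y; the ridge smoother matrix and simulated data below are just one illustrative choice of S.

```python
import numpy as np

def gcv_score(y, y_hat, S):
    """GCV = (1/N) * sum_i [(y_i - y_hat_i) / (1 - trace(S)/N)]^2."""
    n = len(y)
    denom = 1.0 - np.trace(S) / n
    return np.mean(((y - y_hat) / denom) ** 2)

# Illustrative use with a ridge smoother matrix S = X (X'X + lam I)^-1 X'.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=80)
for lam in (0.1, 1.0, 10.0):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
    print(lam, round(gcv_score(y, S @ y, S), 4))
```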

  24. The bootstrap
  Why do we need the bootstrap?
  • To estimate the uncertainty of parameter estimates
  Example:
  • Given a sample X1, …, Xn from an unknown distribution, we estimate its mean (expectation) by the sample mean X̄ = (1/n) Σ_i Xi.
  • How do we find the uncertainty of the sample mean?

  25. Implementation of the crude nonparametric bootstrap
  1. Sample with replacement from the observed data to obtain a bootstrap sample.
  2. Repeat step 1 B times.
  3. Compute the parameter estimate on each bootstrap sample.
  4. Compute the variance of the estimates from step 3.
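
A minimal sketch (not from the slides) of these four steps applied to the sample mean from slide 24; the distribution used to generate the data and the number of replicates B are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100)   # stands in for a sample from an unknown distribution
B = 1000

boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=len(x), replace=True)  # step 1: sample with replacement
    boot_means[b] = resample.mean()                       # step 3: estimate on each bootstrap sample

print("sample mean:", x.mean())
print("bootstrap std. error of the mean:", boot_means.std(ddof=1))  # step 4, reported as a std. error
# For comparison, the classical estimate is x.std(ddof=1) / np.sqrt(len(x)).
```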

  26. Bootstrap replicates

  27. Benefits and drawbacks of the nonparametric bootstrap
  • The uncertainty of an estimate can be obtained without any information about the underlying distribution.
  • It is not applicable to small data sets.
  • If the distribution is known (except for the values of some parameters), the bootstrap may be slightly less efficient than conventional parametric methods.

  28. The bootstrap and estimation of prediction errors
  • Fit the model to bootstrap samples (role: training) and examine how well it predicts the original training set (role: prediction).

  29. An improved bootstrap estimator of the prediction error
  The leave-one-out bootstrap is similar to two-fold CV and produces upwardly biased estimates of the expected squared prediction error.
  Solution: the .632 estimator, Err(.632) = 0.368 err + 0.632 Err(1), where Err(1) is the leave-one-out bootstrap estimate and err is the training error (this pulls the bias down).
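
To make slides 28-29 concrete, here is a hedged sketch (not from the slides) of the leave-one-out bootstrap error and the .632 estimator for a plain least-squares fit; the model, the number of replicates B, and all names are illustrative assumptions.

```python
import numpy as np

def err_dot632(X, y, B=200, seed=0):
    """.632 estimator: 0.368 * training error + 0.632 * leave-one-out bootstrap error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    fit = lambda Xtr, ytr: np.linalg.lstsq(Xtr, ytr, rcond=None)[0]

    beta_full = fit(X, y)
    err_train = np.mean((y - X @ beta_full) ** 2)          # training error on the full data

    sq_err_sum = np.zeros(n)   # squared errors for observation i, summed over samples not containing i
    count = np.zeros(n)
    for b in range(B):
        in_idx = rng.integers(0, n, size=n)                # bootstrap sample, drawn with replacement
        out_idx = np.setdiff1d(np.arange(n), in_idx)       # observations left out of this sample
        if out_idx.size == 0:
            continue
        beta_b = fit(X[in_idx], y[in_idx])
        sq_err_sum[out_idx] += (y[out_idx] - X[out_idx] @ beta_b) ** 2
        count[out_idx] += 1

    used = count > 0
    err_loo = np.mean(sq_err_sum[used] / count[used])      # leave-one-out bootstrap estimate
    return 0.368 * err_train + 0.632 * err_loo
```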
