
Presentation Transcript


  1. Prelude of Machine Learning 202: Statistical Data Analysis in the Computer Age (1991), Bradley Efron and Robert Tibshirani

  2. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  3. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  4. Overview • Classical statistical methods, 1920-1950: • Linear regression, hypothesis testing, standard errors, confidence intervals, etc. • New statistical methods, post-1980: • Built on the power of electronic computation • Require fewer distributional assumptions than their predecessors • The question: how to spend this computational wealth wisely?

  5. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  6. Bootstrap • A device for extending standard-error (SE) calculations to estimators other than the mean • Random sample of 164 data points; the statistic is t(x) = 28.58 • How accurate is t(x)? • Suppose t(x) is the 25% trimmed mean

  7. Bootstrap • Why use a trimmed mean rather than mean(x)? • If the data come from a long-tailed probability distribution, the trimmed mean can be substantially more accurate than mean(x) • In practice one does not know a priori whether the true distribution is long-tailed; the bootstrap can help answer this question (a minimal sketch follows)
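The bootstrap recipe on these slides is simple enough to sketch directly: resample the data with replacement many times, recompute the statistic on each resample, and take the standard deviation across resamples as the SE. A minimal sketch in Python, assuming the 164 observations are available as an array (the simulated long-tailed sample here is a hypothetical stand-in for the real data):

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=164)  # hypothetical stand-in: a long-tailed sample

def bootstrap_se(data, statistic, B=200):
    """Resample with replacement B times; the SE estimate is the
    standard deviation of the statistic across the resamples."""
    n = len(data)
    stats = [statistic(rng.choice(data, size=n, replace=True)) for _ in range(B)]
    return np.std(stats, ddof=1)

t_hat = trim_mean(x, 0.25)  # the 25% trimmed mean t(x)
se_hat = bootstrap_se(x, lambda d: trim_mean(d, 0.25))
print(f"t(x) = {t_hat:.2f} +/- {se_hat:.2f}")
```

Comparing bootstrap SEs for mean(x) and the trimmed mean on the same sample is exactly how the bootstrap answers the long-tail question above.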

  8. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  9. Nonparametric Regression • Quadratic regression curve, evaluated at 60% compliance • Estimate: 27.72 +/- 3.08 (a sketch of this calculation follows)
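For the parametric baseline, both the point estimate and its SE come from ordinary least squares. A sketch of the calculation, assuming a quadratic fit evaluated at 60% compliance; the data below are simulated placeholders, not the study's:

```python
import numpy as np

rng = np.random.default_rng(1)
compliance = rng.uniform(0, 100, size=164)   # hypothetical stand-in data
response = 0.004 * compliance**2 + rng.normal(0, 8, size=164)

# Fit y = b2*x^2 + b1*x + b0, keeping the coefficient covariance matrix.
coefs, cov = np.polyfit(compliance, response, deg=2, cov=True)

x0 = 60.0
g = np.array([x0**2, x0, 1.0])   # gradient of the fitted curve at x0
estimate = g @ coefs
se = np.sqrt(g @ cov @ g)        # delta-method SE of the fitted value at x0
print(f"quadratic fit at 60% compliance: {estimate:.2f} +/- {se:.2f}")
```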

  10. Nonparametric Regression • The loess recipe: • Window on the nearest 20% of the data points • Apply a smooth weight function • Fit a weighted linear regression within the window (sketched below) • Nonparametric regression with loess at 60% compliance • 32.38 +/- ? • How to find the SE?
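A compact implementation of that three-step recipe, using the common tricube weight function (an assumption: the slide does not name the weight used); the data names here are hypothetical:

```python
import numpy as np

def loess_at(x, y, x0, span=0.2):
    """Local linear (loess-style) estimate at a single point x0."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]              # step 1: nearest 20% of the data
    h = d[idx].max()
    w = (1 - (d[idx] / h) ** 3) ** 3     # step 2: smooth (tricube) weights
    X = np.column_stack([np.ones(k), x[idx]])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[idx]))
    return beta[0] + beta[1] * x0        # step 3: weighted linear fit at x0

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 164)             # hypothetical compliance data
y = 0.004 * x**2 + rng.normal(0, 8, 164)
print(f"loess estimate at 60: {loess_at(x, y, 60.0):.2f}")
```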

  11. Nonparametric Regression • How to find the SE? • Bootstrap (sketched below) • 32.38 +/- 5.71 with B = 50 • At 60% compliance: • QR: 27.72 +/- 3.08 • NPR: 32.38 +/- 5.71 • On balance, the quadratic estimate should probably be preferred in this case • It would need an unusually large bias to undo its superiority in SE
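The loess curve has no textbook SE formula, which is exactly where the bootstrap earns its keep: resample the (x, y) pairs, refit the curve each time, and take the spread of the refits as the SE. This sketch reuses loess_at and the simulated x, y from the previous block:

```python
import numpy as np

rng = np.random.default_rng(3)
B, n = 50, len(x)                        # B = 50 bootstrap refits, as on the slide
boot = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)     # resample (x, y) pairs with replacement
    boot.append(loess_at(x[idx], y[idx], 60.0))
se = np.std(boot, ddof=1)
print(f"loess at 60% compliance: {loess_at(x, y, 60.0):.2f} +/- {se:.2f}")
```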

  12. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  13. Generalized Additive Models • Generalized linear model (GLM): • Generalizes linear regression • The mean of the response is tied to a linear predictor through a link function g: g(E[Y]) = b0 + b1*X1 + ... + bm*Xm, i.e. E[Y] = g^-1(b0 + b1*X1 + ... + bm*Xm) • Additive model: • A nonparametric regression method • Estimate a nonparametric function for each predictor • Sum the predictor functions to predict the dependent variable • Generalized additive model (GAM): • Blends the additive model with the GLM • Each predictor function fi(xi) is fit by parametric or nonparametric means • Provides good fits to training data at some cost in interpretability
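The standard way to fit an additive model is backfitting: cycle through the predictors, smoothing the partial residuals against each one in turn until the fits stabilize. A minimal sketch under simplifying assumptions (a crude running-mean smoother instead of loess, simulated data, and an identity link rather than the logistic link a full GAM would use):

```python
import numpy as np

def smooth(x, r, k=20):
    """Crude smoother: running mean of residuals r over ~k nearest x-neighbors."""
    out = np.empty_like(r)
    order = np.argsort(x)
    rs = r[order]
    for i in range(len(x)):
        lo, hi = max(0, i - k // 2), min(len(x), i + k // 2)
        out[order[i]] = rs[lo:hi].mean()
    return out

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + x2**2 + rng.normal(0, 0.3, n)   # additive truth: f1 + f2 + noise

alpha, f1, f2 = y.mean(), np.zeros(n), np.zeros(n)
for _ in range(20):                               # backfitting iterations
    f1 = smooth(x1, y - alpha - f2); f1 -= f1.mean()   # update f1 on partial residuals
    f2 = smooth(x2, y - alpha - f1); f2 -= f2.mean()   # then f2 the same way
print(f"residual sd after backfitting: {np.std(y - alpha - f1 - f2):.3f}")
```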

  14. GAM Case Study • Analyze survival of infants after cardiac surgery for heart defects • Dataset: 497 infant records • Explanatory variables: • Age (days) • Weight (kg) • Whether warm-blood cardioplegia (WBC) was applied • Supporting data on WBC: • Of the 57 infants who received the WBC procedure, 7 died • Of the 440 infants who received the standard procedure, 133 died

  15. GAM Case Study: Logistic regression results • Three-parameter logistic regression model • Age, weight: continuous variables • WBC applied: binary variable • Results: • WBC has a strong beneficial effect: odds ratio of 3.8:1 • Higher weight => lower risk of death • Age has no significant effect
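For concreteness, this is how such a three-parameter logistic fit might be set up with statsmodels; the simulated records below are hypothetical stand-ins for the 497 real ones, so the fitted numbers will not match the paper's:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 497
age = rng.uniform(1, 365, n)            # days
weight = rng.uniform(2.0, 6.0, n)       # kg
wbc = rng.integers(0, 2, n)             # 1 if warm-blood cardioplegia applied

# Hypothetical data-generating model: weight and WBC protective, age irrelevant.
logit = 0.5 - 0.6 * (weight - 4.0) - 1.3 * wbc
died = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(np.column_stack([age, weight, wbc]))
fit = sm.Logit(died, X).fit(disp=False)
print(fit.params)                        # log-odds coefficients
print(f"odds ratio in favor of WBC: {np.exp(-fit.params[3]):.1f}:1")
```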

  16. GAM Case Study: GAM Analysis • Add three individual smooth functions • Use the locally weighted scatterplot smoothing (loess) method • Results: • WBC has a strong beneficial effect: odds ratio of 4.2:1 • Lighter infants were 55 times more likely to die than heavier infants • Surprising findings from the log-odds curve for age! (a sketch of the model follows)
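The paper's GAM used loess smooths inside a logistic model. A modern near-equivalent can be sketched with the pygam library (an assumption on my part: it is not part of the original, and its spline smooths stand in for loess). It reuses X and died from the statsmodels sketch:

```python
from pygam import LogisticGAM, s, f

# Columns of X after the constant: 0 = age, 1 = weight, 2 = WBC indicator.
# s() requests a smooth term, f() a factor (categorical) term.
gam = LogisticGAM(s(0) + s(1) + f(2)).fit(X[:, 1:], died.astype(int))
gam.summary()   # per-term smooths, significance, and effective dof
```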

  17. GAM Case Study: Conclusion • Traditional regression models may lead to oversimplification • Linear logistic regression forces the fitted curves to be straight lines • Vital information about the effect of age is lost in a linear model • The problem becomes more acute with a large number of explanatory variables • GAM analysis exploits computational power to reach a new level of flexibility • A personal computer can now do what required a mainframe ten years ago

  18. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  19. Classification and Regression Trees • A nonparametric technique • An analysis method ideally suited to computer algorithms • Splits the data based on how well each candidate split explains variability (see the sketch below) • Once a node is split, the procedure is applied recursively to each child node
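The heart of the method is the split search. A minimal sketch of one step, scoring every threshold on every predictor by the weighted Gini impurity of the resulting children (simplified relative to full CART, which also handles costs, priors, and pruning):

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    """Return (weighted child impurity, feature index, threshold) of the best split."""
    n, m = X.shape
    best = (np.inf, None, None)
    for j in range(m):
        for t in np.unique(X[:, j])[:-1]:       # candidate thresholds
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[0]:
                best = (score, j, t)
    return best   # recurse on each side until a stopping rule fires
```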

  20. CART Case study • Gain insight into the causes of duodenal ulcers • Sample of 745 rats • One of 56 different alkyl nucleophiles was administered to each rat • Response: one of three severity levels (1, 2, 3), with 3 the most severe • Skewed misclassification costs: • Misclassifying a severe ulcer is more expensive than misclassifying a mild one • Analysis tree construction: • Use all 745 observations as training data • Compute 'apparent' misclassification rates • The training-data misclassification rate is biased downward (illustrated below)
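How skewed costs and the apparent error rate look in practice, sketched with scikit-learn; the rat data are not reproduced in the slides, so the arrays below are hypothetical placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(745, 5))           # placeholder features, not the real compounds
y = rng.integers(1, 4, size=745)        # severity levels 1, 2, 3

# Skewed costs: penalize misclassifying severe (level-3) ulcers most heavily.
tree = DecisionTreeClassifier(class_weight={1: 1, 2: 2, 3: 5}).fit(X, y)

# 'Apparent' rate: evaluated on the training data, hence biased downward;
# a fully grown tree scores near zero even on pure noise like this.
print(f"apparent misclassification rate: {1 - tree.score(X, y):.3f}")
```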

  21. CART Case study • [Figure: the fitted classification tree]

  22. CART Case study: Observations • The optimal size of a classification tree is a tradeoff • Higher training error versus overfitting • It is usually better to construct a large tree and prune it from the bottom (sketched below) • How to choose the optimal tree size? • Evaluate test data on trees of different sizes and compare misclassification rates • In the absence of test data, use cross-validation
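Grow-then-prune is exactly what cost-complexity pruning automates: it produces a nested sequence of subtrees to compare on held-out data. A sketch with scikit-learn, reusing the placeholder X, y from the previous block:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The pruning path: each alpha corresponds to one subtree of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    t = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={t.get_n_leaves()}  "
          f"test error={1 - t.score(X_te, y_te):.3f}")
```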

  23. CART: Cross validation • Mimics the use of a test sample • Standard cross-validation approach (sketched below): • Divide the dataset into 10 equal partitions • Use 90% of the data as the training set and the remaining 10% as test data • Repeat over all ten train/test combinations • The cross-validated misclassification errors were found to be 10% higher than the apparent ones • Cross-validation and bootstrapping are closely related • Research on hybrid approaches is in progress
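The ten-fold recipe in one call, again on the placeholder CART data; each fold trains on 90% of the rows and is scored on the held-out 10%:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold CV misclassification rate: {1 - scores.mean():.3f}")
```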

  24. Agenda • Overview • Bootstrap • Nonparametric Regression • Generalized Additive Models • Classification and Regression Trees • Conclusion

  25. Conclusion • Computers have enabled a new generation of statistical methods and tools • Computer algorithms replace traditional mathematical derivations • Freedom from the bell-shaped-curve assumptions of the traditional approach • Modern statisticians need to understand: • That mathematical tractability is not required for computer-based methods • Which computer-based methods to use • When to use each method
