Building useful models: Some new developments and easily avoidable errors

Michael Babyak, PhD


What is a model?

Y = f(x1, x2, x3, …, xn)

Y = a + b1x1 + b2x2 + … + bnxn

Y = e^(a + b1x1 + b2x2 + … + bnxn)


“All models are wrong, some are useful” -- George Box

  • A useful model is

    • Not very biased

    • Interpretable

    • Replicable (predicts in a new sample)


Some Premises

  • “Statistics” is a cumulative, evolving field

  • Newer is not necessarily better, but should be entertained in the context of the scientific question at hand

  • Data analytic practice resides along a continuum, from exploratory to confirmatory. Both are important, but the difference has to be recognized.

  • There’s no substitute for thinking about the problem



Concept of Simulation

Y = b X + error

Draw many samples from this population model, estimate b in each sample (b1, b2, …, bk), then evaluate the distribution of those estimates.


Simulation Example

True model: Y = .4*x1 + e

Draw many samples from this model, estimate the coefficient for x1 in each sample, and evaluate how the estimates are distributed around the true value of .4.
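
A minimal sketch of this kind of simulation in R (the sample size, number of simulated samples, and seed are illustrative choices, not values from the slides):

set.seed(123)
nsim <- 1000                          # number of simulated samples
n    <- 50                            # size of each sample (illustrative)
b    <- 0.4                           # true coefficient
est <- replicate(nsim, {
  x <- rnorm(n)
  y <- b * x + rnorm(n)               # true model: y = .4*x1 + error
  coef(lm(y ~ x))["x"]                # estimate b in this sample
})
mean(est); sd(est)                    # evaluate: estimates center near .4
hist(est)                             # sampling distribution of the estimates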


Ingredients of a Useful Model

  • Correct probability model

  • Based on theory

  • Good measures/no loss of information

  • Comprehensive

  • Parsimonious

  • Tested fairly

  • Flexible


Correct Model

  • Gaussian: General Linear Model

    • Multiple linear regression

  • Binary (or ordinal): Generalized Linear Model

    • Logistic Regression

    • Proportional Odds/Ordinal Logistic

  • Time to event:

    • Cox Regression or parametric survival models


Generalized Linear Model

  • Normal outcome: General Linear Model/Linear Regression, ANOVA/t-test, ANCOVA, regression with transformed DV

  • Binary/Binomial outcome: Logistic Regression, Chi-square

  • Count, heavy skew, lots of zeros: Poisson, ZIP, negative binomial, gamma

  • Can be applied to clustered (e.g., repeated measures) data
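
As an illustration, the glm() function in S-Plus/R covers these cases by changing the family; the data frame d and the variable names below are hypothetical:

fit_norm  <- glm(y ~ x1 + x2, family = gaussian(), data = d)      # normal outcome: ordinary linear regression
fit_bin   <- glm(event ~ x1 + x2, family = binomial(), data = d)  # binary outcome: logistic regression
fit_count <- glm(count ~ x1 + x2, family = poisson(), data = d)   # count outcome: Poisson regression
# Heavily skewed or zero-inflated counts need extensions, e.g., MASS::glm.nb() or pscl::zeroinfl();
# clustered/repeated measures call for GEE or mixed-model versions of these families.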


Factor Analytic Family

  • Structural Equation Models

  • Partial Least Squares

  • Latent Variable Models (Confirmatory Factor Analysis)

  • Common Factor Analysis

  • Principal Components

  • Multiple regression


Use Theory

  • Theory and expert information are critical in helping sift out artifact

  • Numbers can look very systematic when they are in fact random

    • http://www.tufts.edu/~gdallal/multtest.htm


Measure well

  • Adequate range

  • Representative values

  • Watch for ceiling/floor effects


Using all the information

  • Preserving cases in data sets with missing data

    • Conventional approaches:

      • Use only complete cases

      • Fill in with mean or median

      • Use a missing data indicator in the model


Missing Data

  • Imputation or related approaches are almost ALWAYS better than deleting incomplete cases

  • Multiple Imputation

  • Full Information Maximum Likelihood


Multiple Imputation


Modern Missing Data Techniques

  • Preserve more information from original sample

  • Incorporate uncertainty about missingness into final estimates

  • Produce better estimates of population (true) values
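
A minimal multiple-imputation sketch using R's mice package (one of several options; the data frame d with missing values and the model formula are hypothetical):

library(mice)
imp  <- mice(d, m = 5, seed = 1)       # create 5 imputed copies of the incomplete data
fits <- with(imp, lm(y ~ x1 + x2))     # fit the model in each imputed data set
summary(pool(fits))                    # pool the estimates, carrying imputation uncertainty forward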


Don’t throw away information from variables

  • Use all the information about the variables of interest

  • Don’t create “clinical cutpoints” before modeling

  • Model with ALL the data first, then use prediction to make decisions about cutpoints


Dichotomizing for Convenience = Dubious Practice (C.R.A.P.*)

  • Convoluted Reasoning and Anti-intellectual Pomposity

    • Streiner & Norman: Biostatistics: The Bare Essentials


Implausible measurement assumption

[Figure: a continuous depression score with cases A, B, and C marked, split into “not depressed” and “depressed”.]


Loss of power

http://psych.colorado.edu/~mcclella/MedianSplit/

Sometimes, through sampling error, you can get a ‘lucky cut.’

http://www.bolderstats.com/jmsl/doc/medianSplit.html


Dichotomization, by definition, reduces the magnitude of the estimate by a minimum of about 30%.

Dear Project Officer,

In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about 3 or 4 hundred thousand dollars worth of subject recruitment and testing money, we are confident that you will understand.

Sincerely,

Dick O. Tomi, PhD

Prof. Richard Obediah Tomi, PhD


Power to detect non-zero b-weight when x is continuous versus dichotomized

True model: y = .4x + e
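
A rough simulation of that comparison (the sample size, number of replications, and the use of a median split are illustrative choices):

set.seed(1)
nsim <- 2000; n <- 100; b <- 0.4
p_cont <- p_cut <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(n)
  y <- b * x + rnorm(n)                                     # true model: y = .4x + e
  p_cont[i] <- summary(lm(y ~ x))$coefficients["x", 4]      # p-value, x continuous
  xc <- as.numeric(x > median(x))                           # median split
  p_cut[i]  <- summary(lm(y ~ xc))$coefficients["xc", 4]    # p-value, x dichotomized
}
mean(p_cont < .05)                                          # power with continuous x
mean(p_cut  < .05)                                          # power after dichotomizing (noticeably lower)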


Dichotomizing will obscure non-linearity

[Figure: outcome plotted against CESD score split into Low and High.]


Dichotomizing will obscure non-linearity:

Same data as previous slide modeled continuously


Type I error rates for the relation between x2 and y after dichotomizing two continuous predictors. Maxwell and Delaney calculated the effect of dichotomizing two continuous predictors as a function of the correlation between them. The true model is y = .5x1 + 0x2, where all variables are continuous. If x1 and x2 are dichotomized, the Type I error rate for the relation between x2 and y increases as the correlation between x1 and x2 increases.


Is it ever a good idea to categorize quantitatively measured variables?

  • Yes:

    • when the variable is truly categorical

    • for descriptive/presentational purposes

    • for hypothesis testing, if enough categories are made.

      • However, using many categories can lead to problems of multiple significance tests and still run the risk of misclassification


Conclusions

  • Cutting:

    • Doesn’t always make measurement sense

    • Almost always reduces power

    • Can fool you with too much power in some instances

    • Can completely miss important features of the underlying function

  • Modern computing/statistical packages can “handle” continuous variables

  • Want to make good clinical cutpoints? Model first, decide on cuts afterward.


Sample size and the problem of underfitting vs. overfitting

  • Model assumption is that “ALL” relevant variables be included—the “antiparsimony principle”

  • Tempered by fact that estimating too many unknowns with too little data will yield junk


Sample Size Requirements

  • Linear regression

    • minimum of N = 50 + 8 per predictor (Green, 1990)

  • Logistic Regression

    • Minimum of 10-15 events per predictor in the smaller outcome group (Peduzzi et al., 1990a)

  • Survival Analysis

    • Minimum of 10-15 events per predictor (Peduzzi et al., 1990b)
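
The rules of thumb above, written as small helper functions (the function names are mine, not standard):

n_linear   <- function(p) 50 + 8 * p         # Green (1990): at least 50 + 8 cases per predictor
events_glm <- function(p, epv = 10) p * epv  # Peduzzi et al.: 10-15 events per predictor
n_linear(5)        # 90 cases for 5 predictors
events_glm(5)      # 50 events (in the smaller outcome group for logistic regression)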


Consequences of inadequate sample size

  • Lack of power for individual tests

  • Unstable estimates

  • Spurious good fit—lots of unstable estimates will produce spurious ‘good-looking’ (big) regression coefficients


All-noise, but good fit

[Figure: R-squares from a population model of completely random variables, plotted against the events-per-predictor ratio.]


Simulation: number of events/predictor ratio

Y = .5*x1 + 0*x2 + .2*x3 + 0*x4

-- where r(x1, x4) = .4

-- N/p = 3, 5, 10, 20, 50



Peduzzi’s Simulation: number of events/predictor ratio

P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC

-- Events/p = 2, 5, 10, 15, 20, 25

-- % relative bias = ((estimated b – true b)/true b)*100
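
That percent relative bias, written as a one-line helper (the function name is hypothetical):

rel_bias <- function(est, true) 100 * (est - true) / true
rel_bias(est = 0.6, true = 0.5)   # a coefficient estimated at 0.6 when the true value is 0.5 is biased by +20%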




Approaches to variable selection

  • “Stepwise” automated selection

  • Pre-screening using univariate tests

  • Combining or eliminating redundant predictors

  • Fixing some coefficients

  • Theory, expert opinion and experience

  • Penalization/Random effects

  • Propensity Scoring

    • “Matches” individuals on multiple dimensions to improve “baseline balance”

  • Tibshirani’s “Lasso”




Automated Selection

Derksen and Keselman (1992) Simulation Study

  • Studied backward and forward selection

  • Some authentic variables and some noise variables among candidate variables

  • Manipulated correlation among candidate predictors

  • Manipulated sample size


Automated Selection

Derksen and Keselman (1992) Simulation Study

  • “The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.”

  • “The greater the number of candidate predictors, the greater the number of noise variables were included in the model.”

  • “Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”


Simulation results: Number of noise variables included

[Figure: number of noise variables included, by sample size; 20 candidate predictors, 100 samples.]


Simulation results: R-square from noise variables

[Figure: R-square from noise variables, by sample size; 20 candidate predictors, 100 samples.]




SOME of the problems with stepwise variable selection

1. It yields R-squared values that are badly biased high

2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution

3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine)

4. It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem

5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).

6. It has severe problems in the presence of collinearity

7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.

8. Increasing the sample size doesn't help very much (see Derksen and Keselman)

9. It allows us to not think about the problem

10. It uses a lot of paper
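
A small sketch of the kind of behavior Derksen and Keselman describe, using base R's step() on data where y is pure noise (all names and settings here are illustrative):

set.seed(2)
n <- 100; p <- 20
d <- as.data.frame(matrix(rnorm(n * p), n, p))   # 20 candidate predictors, all noise
names(d) <- paste0("x", 1:p)
d$y <- rnorm(n)                                  # outcome unrelated to every predictor
full <- lm(y ~ ., data = d)
sel  <- step(full, direction = "backward", trace = 0)
summary(sel)                                     # several "significant"-looking noise variables typically remain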


Chatfield, C. (1995). Model uncertainty, data mining and statistical inference (with discussion). Journal of the Royal Statistical Society, Series A, 158, 419-466.

  • Bias arises from selecting a model because it fits the data well, and the standard errors are biased as well.

  • P. 420: there is a “need for a better balance in the literature and in statistical teaching between techniques and problem solving strategies.”

  • P. 421: It is ‘well known’ to be ‘logically unsound and practically misleading’ (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a ‘quiet scandal’), they (we) continue to ignore the difficulties because it is not clear what else could or should be done.

  • P. 421: Estimation errors for regression coefficients are usually smaller than errors from failing to take into account model specification.

  • P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping with it.

  • P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually true.


Phantom Degrees of Freedom

  • Faraway (1992) showed that any pre-modeling strategy costs df over and above the df used later in modeling.

  • Premodeling strategies included: variable selection, outlier detection, linearity tests, residual analysis.

  • Thus, although not accounted for in final model, these phantom df will render the model too optimistic


Phantom Degrees of Freedom

  • Therefore, if you transform, select, etc., you must include the DF in (i.e., penalize for) the “Final Model”


Conventional Univariate Pre-selection

  • Non-significant tests also cost a DF

  • Non-significance is NOT necessarily related to importance

  • Variables may not behave the same way in a multivariable model—a variable “not significant” in a univariate test may be very important in the presence of other variables


Conventional Univariate Pre-selection

  • Despite the convention, testing for confounding has not been systematically studied—in many cases it leads to overadjustment and an underestimate of the true effect of the variable of interest.

  • At the very least, pulling variables in and out of models inflates the model fit, often dramatically


Better approach

  • Pick variables a priori

  • Stick with them

  • Penalize appropriately for any data-driven decision about how to model a variable


Spending DF wisely

  • If there is not enough N per predictor, combine covariates using techniques that do not look at Y in the sample: PCA, FA, conceptual clustering, collapsing, scoring, established indexes.

  • Save DF for a finer-grained look at the variables of most interest, e.g., non-linear functions


Help is on the way?

  • Penalization/Random effects

  • Propensity Scoring

    • “Matches” individuals on multiple dimensions to improve “baseline balance”

  • Tibshirani’s Lasso



Validation

  • Apparent fit

    • Usually too optimistic

  • Internal

    • cross-validation, bootstrap

    • honest estimate for model performance

    • provides an upper limit to what would be found on external validation

  • External validation

    • replication with new sample, different circumstances


Validation

  • Steyerberg et al. (1999) compared validation methods

  • Found that split-half was far too conservative

  • Bootstrap was equal or superior to all other techniques


Conclusions

  • Measure well

  • Use all the information

  • Recognize the limitations based on how much data you actually have

  • In the confirmatory mode, be as explicit as possible about the model a priori, test it, and live with it

  • By all means, explore data, but recognize, and state frankly, the limits that post hoc analysis places on inference



Bootstrap

Draw many samples, each of the same size as My Sample and drawn from it WITH REPLACEMENT; compute the estimate in each resample (1 through k), then evaluate the distribution of those estimates.


1, 3, 4, 5, 7, 10

[Figure: example bootstrap resamples drawn with replacement from the original sample.]
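
A minimal bootstrap sketch in R for the sample above (the number of resamples and the choice to bootstrap the mean are illustrative):

set.seed(7)
x <- c(1, 3, 4, 5, 7, 10)                                       # original sample
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))  # resample WITH replacement, evaluate each
quantile(boot_means, c(.025, .975))                             # e.g., a percentile confidence interval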


Can use data to determine where to spend DF

  • Use Spearman’s Rho to test “importance”

  • Not peeking because we have chosen to include the term in the model regardless of relation to Y

  • Use more DF for non-linearity
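
One way to do this screening in Harrell's Hmisc library is the spearman2() function; the data frame d and the variable names below are hypothetical:

library(Hmisc)
s <- spearman2(y ~ x1 + x2 + x3, data = d)   # generalized Spearman rho^2 for each candidate predictor
plot(s)                                      # spend more df (e.g., extra spline knots) on the strongest predictors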


Example: Predict survival from age, gender, and fare on the Titanic, using S-Plus (or R) software


If you have already decided to include them (and promise to keep them in the model) you can peek at predictors in order to see where to add complexity


Non-linearity using splines


Linear Spline

(piecewise regression)

Y = a + b1(x < 10) + b2(10 ≤ x < 20) + b3(x ≥ 20)
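
A hand-built version of that piecewise fit in R, with knots at 10 and 20 (the data and the true curve are simulated here purely for illustration):

set.seed(4)
x <- runif(200, 0, 30)
y <- 1 + 0.5 * x - 0.8 * pmax(x - 15, 0) + rnorm(200)   # illustrative piecewise-linear truth
fit <- lm(y ~ x + pmax(x - 10, 0) + pmax(x - 20, 0))    # slope is allowed to change at each knot
summary(fit)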


Cubic Spline

(non-linear piecewise regression)

[Figure: cubic spline fit with knots marked.]


Logistic regression model

# Restricted cubic spline for fare with 3 knots; ^2 adds all two-way interactions;
# x=T, y=T keep the design matrix and response for later validation
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2, x = T, y = T)

anova(fitfare)

Spline with 3 knots


Wald Statistics (Response: survived)

Factor Chi-Square d.f. P

fare (Factor+Higher Order Factors) 55.1 6 <.0001

All Interactions 13.8 4 0.0079

Nonlinear (Factor+Higher Order Factors) 21.9 3 0.0001

age (Factor+Higher Order Factors) 22.2 4 0.0002

All Interactions 16.7 3 0.0008

sex (Factor+Higher Order Factors) 208.7 4 <.0001

All Interactions 20.2 3 0.0002

fare * age (Factor+Higher Order Factors) 8.5 2 0.0142

Nonlinear 8.5 1 0.0036

Nonlinear Interaction : f(A,B) vs. AB 8.5 1 0.0036

fare * sex (Factor+Higher Order Factors) 6.4 2 0.0401

Nonlinear 1.5 1 0.2153

Nonlinear Interaction : f(A,B) vs. AB 1.5 1 0.2153

age * sex (Factor+Higher Order Factors) 9.9 1 0.0016

TOTAL NONLINEAR 21.9 3 0.0001

TOTAL INTERACTION 24.9 5 0.0001

TOTAL NONLINEAR + INTERACTION 38.3 6 <.0001

TOTAL 245.3 9 <.0001




Bootstrap Validation
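
A sketch of bootstrap validation with the Design/rms library's validate() function, applied to the earlier Titanic fit (the number of bootstrap repetitions is an arbitrary choice; fitfare must have been fit with x=T, y=T as above):

validate(fitfare, B = 200)   # optimism-corrected indexes (e.g., Dxy, R2, slope) from 200 bootstrap resamples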


Summary

  • Think about your model

  • Collect enough data


Summary

  • Measure well

  • Don’t destroy what you’ve measured


Summary

  • Pick your variables ahead of time and collect enough data to test the model you want

  • Keep all your variables in the model unless extremely unimportant


Summary

  • Use more df on important variables, fewer df on “nuisance” variables

  • Don’t peek at Y to combine, discard, or transform variables


Summary

  • Estimate validity and shrinkage with bootstrap


Summary

  • By all means, tinker with the model later, but be aware of the costs of tinkering

  • Don’t forget to say you tinkered

  • Go collect more data


Web links for references, software, and more

  • Harrell’s regression modeling text

    • http://hesweb1.med.virginia.edu/biostat/rms/

  • SAS Macros for spline estimation

    • http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt

  • Some results comparing validation methods

    • http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf

  • SAS code for bootstrap

    • ftp://ftp.sas.com/pub/neural/jackboot.sas

  • S-Plus home page

    • insightful.com

  • Mike Babyak’s e-mail

    • michael.babyak@duke.edu

  • This presentation

    • http://www.duke.edu/~mbabyak


  • www.duke.edu/~mababyak

  • michael.babyak @ duke.edu

  • symptomresearch.nih.gov/chapter_8/

