
Missing Data: Analysis and Design

John W. Graham

The Prevention Research Center

and

Department of Biobehavioral Health

Penn State University

Presentation in Four Parts
  • (1) Introduction: Missing Data Theory
  • (2) A brief analysis demonstration
    • Multiple Imputation with
      • NORM and Proc MI
    • Amos...break...
  • (3) Attrition Issues
  • (4) Planned missingness designs:
    • 3-form Design
Recent Papers
  • Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons.
  • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
  • Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.

jgraham@psu.edu

Problem with Missing Data
  • Analysis procedures were designed for complete data. . .
Solution 1
  • Design new model-based procedures
  • Missing Data + Parameter Estimation in One Step
  • Full Information Maximum Likelihood (FIML)
    • SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)
Solution 2
  • Data based procedures
    • e.g., Multiple Imputation (MI)
  • Two Steps
    • Step 1: Deal with the missing data
      • (e.g., replace missing values with plausible values)
      • Produce a product
    • Step 2: Analyze the product as if there were no missing data
FAQ
  • Aren't you somehow helping yourself with imputation?. . .
NO. Missing data imputation . . .
  • does NOT give you something for nothing
  • DOES let you make use of all data you have

. . .

FAQ
  • Is the imputed value what the person would have given?
NO. When we impute a value . . .
  • We do not impute for the sake of the value itself
  • We impute to preserve important characteristics of the whole data set

. . .

We want . . .
  • unbiased parameter estimation
    • e.g., b-weights
  • Good estimate of variability
    • e.g., standard errors
  • best statistical power
Causes of Missingness
  • Ignorable
    • MCAR: Missing Completely At Random
    • MAR: Missing At Random
  • Non-Ignorable
    • MNAR: Missing Not At Random
MCAR (Missing Completely At Random)
  • MCAR 1: Cause of missingness completely random process (like coin flip)
  • MCAR 2:
    • Cause uncorrelated with variables of interest
    • Example: parents move
  • No bias if cause omitted
MAR (Missing At Random)
  • Missingness may be related to measured variables
  • But no residual relationship with unmeasured variables
    • Example: reading speed
  • No bias if you control for measured variables
MNAR (Missing Not At Random)
  • Even after controlling for measured variables ...
  • Residual relationship with unmeasured variables
  • Example: drug use reason for absence
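
The three mechanisms above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables Z and Y, the effect size 0.6, the cutoffs, and the helper `regression_mean` are all hypothetical, and the MNAR case is deliberately extreme (missingness determined entirely by Y itself).

```python
import random

random.seed(1)

n = 5000
# Z: a measured variable; Y: the variable of interest (true mean is 0).
z = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * zi + random.gauss(0, 1) for zi in z]


def mean(values):
    return sum(values) / len(values)


def regression_mean(z_all, pairs):
    """Estimate mean(Y) from complete cases while controlling for Z."""
    zc = [p[0] for p in pairs]
    yc = [p[1] for p in pairs]
    zbar, ybar = mean(zc), mean(yc)
    slope = (sum((a - zbar) * (b - ybar) for a, b in pairs)
             / sum((a - zbar) ** 2 for a in zc))
    # Shift the complete-case mean to the Z mean of the full sample.
    return ybar + slope * (mean(z_all) - zbar)


# MCAR: a coin flip decides whether Y is observed.
mcar = [(zi, yi) for zi, yi in zip(z, y) if random.random() < 0.5]
# MAR: Y is observed only when the measured variable Z is below 0.
mar = [(zi, yi) for zi, yi in zip(z, y) if zi < 0]
# MNAR: Y is observed only when Y itself is below 0.
mnar = [(zi, yi) for zi, yi in zip(z, y) if yi < 0]

mcar_mean = mean([p[1] for p in mcar])      # complete cases, no adjustment
mar_naive = mean([p[1] for p in mar])       # complete cases, Z ignored
mar_adjusted = regression_mean(z, mar)      # Z controlled
mnar_adjusted = regression_mean(z, mnar)    # Z controlled, cause still unmeasured

print(mcar_mean, mar_naive, mar_adjusted, mnar_adjusted)
```

In this setup the MCAR mean and the Z-adjusted MAR mean should land near the true value of 0, while the unadjusted MAR mean and the MNAR estimate stay clearly negative.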
MNAR Causes
  • The recommended methods assume missingness is MAR
  • But what if the cause of missingness is not MAR?
  • Should these methods be used when MAR assumptions not met?

. . .

YES! These Methods Work!
  • Suggested methods work better than “old” methods
  • Multiple causes of missingness
    • Only small part of missingness may be MNAR
  • Suggested methods usually work very well
Revisit Question: What if THE Cause of Missingness is MNAR?
  • Example model of interest: X → Y

X = Program (prog vs control)

Y = Cigarette Smoking

Z = Cause of missingness: say, Rebelliousness (or smoking itself)

  • Factors to be considered:
    • % Missing (e.g., % attrition)
    • rYZ
    • rZ,Ymis
rYZ
  • Correlation between
    • cause of missingness (Z)
      • e.g., rebelliousness (or smoking itself)
    • and the variable of interest (Y)
      • e.g., Cigarette Smoking
rZ,Ymis
  • Correlation between
    • cause of missingness (Z)
      • e.g., rebelliousness (or smoking itself)
    • and missingness on variable of interest
      • e.g., Missingness on the Smoking variable
  • Missingness on Smoking (Ymis)
    • Dichotomous variable:

Ymis = 1: Smoking variable not missing

Ymis = 0: Smoking variable missing

How Could the Cause of Missingness be Purely MNAR?
  • rZ,Y = 1.0 AND rZ,Ymis = 1.0
  • We can get rZ,Y = 1.0 if smoking is the cause of missingness on the smoking variable
How Could the Cause of Missingness be Purely MNAR?
  • We can get rZ,Ymis = 1.0 like this:
    • If person is a smoker, smoking variable is always missing
    • If person is not a smoker, smoking variable is never missing
  • But is this plausible? ever?
What if the cause of missingness is MNAR?

Problems with this statement

  • MAR & MNAR are widely misunderstood concepts
  • I argue that the cause of missingness is never purely MNAR
  • The cause of missingness is virtually never purely MAR either.
MAR vs MNAR:
  • MAR and MNAR form a continuum
  • Pure MAR and pure MNAR are just theoretical concepts
    • Neither occurs in the real world
  • MAR vs MNAR NOT dimension of interest
MAR vs MNAR: What IS the Dimension of Interest?
  • Question of Interest: How much estimation bias?
    • when cause of missingness cannot be included in the model
Bottom Line ...
  • All missing data situations are partly MAR and partly MNAR
  • Sometimes it matters ...
    • bias affects statistical conclusions
  • Often it does not matter
    • bias has minimal effects on statistical conclusions

(Collins, Schafer, & Kam, Psych Methods, 2001)

Methods: "Old" vs MAR vs MNAR
  • MAR methods (MI and ML)
    • are ALWAYS at least as good as,
    • usually better than "old" methods (e.g., listwise deletion)
  • Methods designed to handle MNAR missingness are NOT always better than MAR methods
References
  • Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128.
  • Graham, J. W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P., & Schafer, J.L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325-366). Washington, D.C.: American Psychological Association.
  • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
Old Procedures: Analyze Complete Cases (listwise deletion)
  • may produce bias
  • you always lose some power
    • (because you are throwing away data)
  • reasonable if you lose only 5% of cases
  • often lose substantial power
Analyze Complete Cases (listwise deletion)
    • 1 1 1 1
    • 0 1 1 1
    • 1 0 1 1
    • 1 1 0 1
    • 1 1 1 0
  • very common situation
  • only 20% (4 of 20) data points missing
  • but discard 80% of the cases
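
The arithmetic behind this slide can be checked directly. A minimal sketch, using the 5 x 4 pattern shown above (1 = observed, 0 = missing); the variable names are illustrative only.

```python
# Missingness pattern from the slide: 1 = observed, 0 = missing.
pattern = [
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

n_points = sum(len(row) for row in pattern)
n_missing = sum(row.count(0) for row in pattern)

# Listwise deletion keeps only the rows with no missing values.
complete_cases = [row for row in pattern if all(row)]
n_discarded = len(pattern) - len(complete_cases)

pct_points_missing = 100 * n_missing / n_points
pct_cases_discarded = 100 * n_discarded / len(pattern)

print(pct_points_missing, pct_cases_discarded)
```

Each incomplete case is missing only one value, so a small share of missing data points turns into a large share of discarded cases.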
Other "Old" Procedures
  • Pairwise deletion
    • May be of occasional use for preliminary analyses
  • Mean substitution
    • Never use it
  • Regression-based single imputation
    • generally not recommended ... except ...
Recommended Model-Based Procedures
  • Multiple Group SEM (Structural Equation Modeling)
  • Latent Transition Analysis (Collins et al.)
    • A latent class procedure
Recommended Model-Based Procedures
  • Raw Data Maximum Likelihood SEM, aka Full Information Maximum Likelihood (FIML)
    • Amos (James Arbuckle)
    • LISREL 8.5+ (Jöreskog & Sörbom)
    • Mplus (Bengt Muthén)
    • Mx (Michael Neale)
Amos 7, Mx, Mplus, LISREL 8.8
  • Structural Equation Modeling (SEM) Programs
  • In Single Analysis ...
    • Good Estimation
    • Reasonable standard errors
    • Windows Graphical Interface
Limitation with Model-Based Procedures
  • That particular model must be what you want
Recommended Data-Based Procedures

EM Algorithm (ML parameter estimation)

  • Norm-Cat-Mix, EMcov, SAS, SPSS

Multiple Imputation

  • NORM, Cat, Mix, Pan (Joe Schafer)
  • SAS Proc MI
  • LISREL 8.5+
EM Algorithm
  • Expectation - Maximization

Alternate between

E-step: predict missing data

M-step: estimate parameters

  • Excellent parameter estimates
  • But no standard errors
    • must use bootstrap
    • or multiple imputation
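
The alternation between the E-step and M-step can be sketched for the simplest case. This is an illustration under assumptions, not the algorithm as implemented in NORM or EMcov: it handles only two variables, assumes x is fully observed and y is sometimes missing (coded as None), and the function name `em_bivariate` and the data are hypothetical.

```python
def em_bivariate(x, y, n_iter=500):
    """EM for a bivariate normal where x is complete and y may be None."""
    n = len(x)
    obs = [yi for yi in y if yi is not None]
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values taken from the observed y values.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current parameters.
        slope = s_xy / s_xx
        intercept = mu_y - slope * mu_x
        resid_var = s_yy - slope ** 2 * s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = intercept + slope * xi
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: re-estimate parameters from the expected statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_y, s_xy / s_xx


# Hypothetical data: y is missing for the cases with the largest x values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, None, None, None, None]
mu_y, slope = em_bivariate(x, y)
print(round(mu_y, 3), round(slope, 3))
```

For this missingness pattern the ML estimates have a closed form (the complete-case regression of y on x, evaluated at the mean of all x values), which gives a check on the result. Note that the function returns point estimates only, in line with the "no standard errors" bullet.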
Multiple Imputation
  • Problem with Single Imputation: Too Little Variability
    • Because of Error Variance
    • Because covariance matrix is only one estimate
Too Little Error Variance
  • Imputed value lies on regression line
Restore Error . . .
  • Add random normal residual
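
A minimal sketch of restoring error variance, assuming a single predictor x and a partly missing y (coded as None). The function name `stochastic_regression_impute` and the data are hypothetical. The residual standard deviation is estimated from the complete cases.

```python
import random


def stochastic_regression_impute(x, y, rng):
    """Fill missing y (None) with the regression prediction plus a random normal residual."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    xbar = sum(xi for xi, _ in pairs) / n
    ybar = sum(yi for _, yi in pairs) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in pairs)
             / sum((xi - xbar) ** 2 for xi, _ in pairs))
    intercept = ybar - slope * xbar
    # Residual standard deviation from the complete cases (n - 2 df).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs)
    resid_sd = (sse / (n - 2)) ** 0.5
    return [
        yi if yi is not None
        else intercept + slope * xi + rng.gauss(0, resid_sd)
        for xi, yi in zip(x, y)
    ]


rng = random.Random(42)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, None, None]
completed = stochastic_regression_impute(x, y, rng)
print(completed)
```

This addresses only the first source of lost variability. The regression line itself is still a single estimate.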
Covariance Matrix (Regression Line) only One Estimate
  • Obtain multiple plausible estimates of the covariance matrix
  • ideally draw multiple covariance matrices from population
  • Approximate this with
    • Bootstrap
    • Data Augmentation (Norm)
    • MCMC (SAS 8.2, 9)
Data Augmentation
  • stochastic version of EM
  • EM
    • E (expectation) step: predict missing data
    • M (maximization) step: estimate parameters
  • Data Augmentation
    • I (imputation) step: simulate missing data
    • P (posterior) step: simulate parameters
Data Augmentation
  • Parameters from consecutive steps ...
    • too related
    • i.e., not enough variability
  • after 50 or 100 steps of DA ...

covariance matrices are like random draws from the population

Multiple Imputation Allows:
  • Unbiased Estimation
  • Good standard errors
    • provided number of imputations is large enough
    • too few imputations → reduced power with small effect sizes

From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (in press). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science.

Multiple Imputation: Basic Steps
  • Impute
  • Analyze
  • Combine results
Imputation and Analysis
  • Impute 40 datasets
    • a missing value gets a different imputed value in each dataset
  • Analyze each data set with USUAL procedures
    • e.g., SAS, SPSS, LISREL, EQS, STATA
  • Save parameter estimates and SE’s
Combine the Results: Parameter Estimates to Report
  • Average of estimate (b-weight) over 40 imputed datasets
Combine the Results: Standard Errors to Report

Square root of the sum of:

  • “within imputation” variance
    • average squared standard error
    • usual kind of variability
  • “between imputation” variance
    • sample variance of parameter estimates over 40 datasets
    • variability due to missing data
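
The combining step can be sketched as below. The function name `combine_mi` and the example numbers are hypothetical. One detail is added that the slide does not show: Rubin's combining rule multiplies the between-imputation variance by (1 + 1/m), a small correction for using a finite number m of imputations.

```python
import math


def combine_mi(estimates, std_errors):
    """Pool one parameter over m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    # Parameter estimate to report: the average over imputations.
    pooled = sum(estimates) / m
    # Within-imputation variance: average squared standard error.
    within = sum(se ** 2 for se in std_errors) / m
    # Between-imputation variance: sample variance of the estimates.
    between = sum((b - pooled) ** 2 for b in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical b-weights and standard errors from three imputed datasets.
pooled_b, pooled_se = combine_mi([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled_b, pooled_se)
```

With 40 imputations the correction factor is 1.025, so the result is close to the plain sum described on the slide.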
Materials for SPSS Regression

Starting place

http://methodology.psu.edu

  • downloads

missing data software

Joe Schafer's Missing Data Programs

John Graham's Additional NORM Utilities

http://mcgee.hhdev.psu.edu/missing/index.html

Materials for SPSS Regression
  • SPSS (NORMSPSS)
    • The following six files provide a new (not necessarily better) way to use SPSS regression with NORM imputed datasets
    • steps.pdf
    • norm2mi.exe
    • selectif.sps
    • space.exe
    • spssinf.bat
    • minfer.exe

Inclusive Missing Data Strategies

Auxiliary Variables:

What’s All the Fuss?

John Graham

IES Summer Research Training Institute, June 27, 2007

What Is an Auxiliary Variable?
  • A variable correlated with the variables in your model
    • but not part of the model
    • not necessarily related to missingness
    • used to "help" with missing data estimation
Benefit of Auxiliary Variables
  • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
  • Graham, J. W., & Collins, L. M. (2007). Using modern missing data methods with auxiliary variables to mitigate the effects of attrition on statistical power. Technical Report, The Methodology Center, Penn State University.
Benefit of Auxiliary Variables
  • Example from Graham & Collins (2007)

X  Y  Z
1  1  1    500 complete cases
1  0  1    500 cases missing Y

  • X, Y variables in the model (Y sometimes missing)
  • Z is auxiliary variable
Benefit of Auxiliary Variables
  • Effective sample size (N')
    • Analysis involving N cases, with auxiliary variable(s)
    • gives statistical power equivalent to N' complete cases without auxiliary variables
Benefit of Auxiliary Variables
  • It matters how highly Y and Z (the auxiliary variable) are correlated
  • For example, the increase in effective sample size:
    • rYZ = .40 N = 500 gives power of N' = 542 (8%)
    • rYZ = .60 N = 500 gives power of N' = 608 (22%)
    • rYZ = .80 N = 500 gives power of N' = 733 (47%)
    • rYZ = .90 N = 500 gives power of N' = 839 (68%)
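
The percentages in parentheses are the relative gain of N' over the 500 complete cases. A short check, using only the N' values quoted on the slide (the formula that produces N' from rYZ is not given here and is not reproduced):

```python
complete_cases = 500

# Effective sample sizes N' from the slide, keyed by the Y-Z correlation.
effective_n = {0.40: 542, 0.60: 608, 0.80: 733, 0.90: 839}

# Percent increase in effective sample size over the complete cases.
increase = {
    r: round(100 * (n_eff - complete_cases) / complete_cases)
    for r, n_eff in effective_n.items()
}
print(increase)
```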
Empirical Illustration: The Model
  • Alcohol-related Harm Prevention (AHP) Project with College Students

[Path diagram. Variables: Intent make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5]

How Much Data?

Intent  Alcohol  VehRisk  Harm   Freq
______  _______  _______  ____   ____
  0        0        0       0      59
  0        0        0       1     109
  0        0        1       0      99
  0        0        1       1     122
  0        1        0       0       1
  0        1        0       1       2
  0        1        1       1       5
  1        1        0       0     100
  1        1        0       1      46
  1        1        1       0     136
  1        1        1       1     344  ← Complete

Total                            1023

1 = data, 0 = missing
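
The pattern table can be tallied to see how much data each variable actually has. A small sketch using the frequencies above; the dictionary layout and variable names are illustrative.

```python
variables = ["Intent", "Alcohol", "VehRisk", "Harm"]

# Missingness patterns from the table (1 = data, 0 = missing) and their frequencies.
patterns = {
    (0, 0, 0, 0): 59,
    (0, 0, 0, 1): 109,
    (0, 0, 1, 0): 99,
    (0, 0, 1, 1): 122,
    (0, 1, 0, 0): 1,
    (0, 1, 0, 1): 2,
    (0, 1, 1, 1): 5,
    (1, 1, 0, 0): 100,
    (1, 1, 0, 1): 46,
    (1, 1, 1, 0): 136,
    (1, 1, 1, 1): 344,
}

total = sum(patterns.values())
complete = patterns[(1, 1, 1, 1)]

# Number of cases with data on each variable, summed over patterns.
observed = {
    name: sum(freq for pat, freq in patterns.items() if pat[i] == 1)
    for i, name in enumerate(variables)
}

print(total, complete, round(100 * complete / total, 1), observed)
```

Only about a third of the cases are complete, yet each variable taken alone has data for roughly 60 to 70 percent of the cases. That gap is what listwise deletion throws away.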

Empirical Illustration: Complete Cases (N = 344)

[Path diagram. Variables: Intent make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5. Path labels as extracted: t = -6; ns, t = 0.2; t = 5]

Empirical Illustration: Simple MI (no Aux Vars)

[Path diagram. Variables: Intent make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5. Path labels as extracted: t = -9; t = 3; t = 7]

N = 1023

Empirical Illustration: MI with Aux Vars

[Path diagram. Variables: Intent make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5. Path labels as extracted: t = -10; t = 6; t = 8]

Auxiliary Variables:

Intent2, Intent3, Intent4, Intent5
Alcohol2, Alcohol3, Alcohol4, Alcohol5
Risks1, Risks3, Risks4, Risks5
Harm1, Harm2, Harm3, Harm4

N = 1023

Adding Auxiliary Variables: MI
  • Simply add Auxiliary variables to imputation model
  • Couldn't be easier
    • Except ...
    • There are limits to how many variables can be included in NORM conveniently
  • My current thinking:
    • add Aux Vars judiciously

Adding Auxiliary Variables: Amos (and other FIML/SEM programs)

Graham, J. W. (2003). Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.

  • Extra DV model
    • Good for manifest variable models
  • Saturated Correlates ("Spider") Model
    • Better for latent variable models
Covariate Model

NOT Adequate

Aux Variable Changes XY Estimate

Extra DV Model

Good for Manifest Variable Models

Aux Variable does NOT Change XY Estimate

Spider Model (Graham, 2003)

Good for Latent Variable Models

Aux Variable does NOT Change XY Estimate

Extra DV Model (Amos)

Real world version gets a little clumsy ...

but Amos does provide some excellent drawing tools

Large models easier in text-based SEM programs (e.g., LISREL)


Using Missing Data Analysis and Design to Develop Cost-Effective Measurement Strategies in Prevention Research

John Graham

IES Summer Research Training Institute, June 27, 2007

Planned Missingness
  • Why would anyone want to plan to have missing data?
  • To manage costs, data quality, and statistical power
  • In fact, we've been doing it for decades. . .
Common Sampling Designs
  • Random sampling of
    • Subjects
    • Items
  • Goal:
    • Collect smaller, more manageable amount of data
    • Draw reasonable conclusions
Why NOT Use Planned Missingness?
  • Past: Not convenient to do analyses
  • Present: Many statistical solutions
  • Now is time to consider design alternatives
Lighten Burden on Respondents
  • The problem:
    • 7th graders can answer only 100 questions
    • We want to ask 133 questions
  • One Solution: The 3-form design
Idea Grew out of Practical Need
  • Project SMART (1982)
    • NIDA-funded drug abuse prevention project
      • Johnson, Flay, Hansen, Graham
3-Form Design

Student Received Item Set?

----------------------------
          X     A     B     C
Form 1    yes   yes   yes   NO
Form 2    yes   yes   NO    yes
Form 3    yes   NO    yes   yes

3-Form Design

Item Sets        X    A    B    C     total asked
                 34   33   33   33    = 133

Form             X    A    B    C     total for each student
1                34   33   33   0     = 100
2                34   33   0    33    = 100
3                34   0    33   33    = 100

  • Think of it as “leveraging” resources
3-Form Design: Item Order

Form 1: X A B
Form 2: X C A
Form 3: X B C

3-Form Design: Item Order

Form 1: X A B C
Form 2: X C A B
Form 3: X B C A

3-Form Design: Item Order

Form 1: X A B C
Form 2: X C A B
Form 3: X B C A

  • Give questions as shown, measure reasons for non-completion
    • poor reading
    • low motivation
    • conscientiousness
  • "Managed" missingness
3-Form Design (Graham, Flay et al., 1984)

        Item Sets
Form    X    A    B    C    total
        33   33   33   33   133
_____ _____________________________________
1       33   33   33   0    100
2       33   33   0    33   100
3       33   0    33   33   100

6-Form Design (e.g., King, King et al., 2002)

        Item Sets
Form    X    A    B    C    D    total
        33   33   33   33   33   167
_____ _____________________________________
1       33   33   33   0    0    100
2       33   33   0    33   0    100
3       33   33   0    0    33   100
4       33   0    33   33   0    100
5       33   0    33   0    33   100
6       33   0    0    33   33   100

Split Questionnaire Survey Design
SQSD (Raghunathan & Grizzle, 1995)

        Item Sets
Form    X    A    B    C    D    E    total
        33   33   33   33   33   33   200
_____ _____________________________________
1       33   33   33   0    0    0    100
2       33   33   0    33   0    0    100
3       33   33   0    0    33   0    ...
4       33   33   0    0    0    33
5       33   0    33   33   0    0
6       33   0    33   0    33   0
7       33   0    33   0    0    33
8       33   0    0    33   33   0
9       33   0    0    33   0    33
10      33   0    0    0    33   33

Family of Designs
  • 3-form Design
    • All combinations of 3 sets taken 2 at a time
  • SQSD (10-form design)
    • All combinations of 5 sets taken 2 at a time
  • 6-form design
    • All combinations of 4 sets taken 2 at a time
  • Complete cases (1-form design)
    • All combinations of 2 sets taken 2 at a time
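
The "k sets taken 2 at a time" rule can be sketched with Python's standard `itertools.combinations`. The function name `two_set_forms` is hypothetical, and the item counts (34 core items, 33 per set) are taken from the 3-form example earlier in the talk.

```python
from itertools import combinations


def two_set_forms(item_sets):
    """Each form gets the core X set plus one pair of the remaining sets."""
    return [("X",) + pair for pair in combinations(item_sets, 2)]


three_form = two_set_forms("ABC")    # 3-form design
six_form = two_set_forms("ABCD")     # 6-form design
ten_form = two_set_forms("ABCDE")    # SQSD (10-form design)
one_form = two_set_forms("AB")       # complete cases (1-form design)

# Items answered per student on each form of the 3-form design.
set_sizes = {"X": 34, "A": 33, "B": 33, "C": 33}
items_per_form = [sum(set_sizes[s] for s in form) for form in three_form]

print(three_form, items_per_form)
print(len(six_form), len(ten_form), len(one_form))
```

The forms come out in the same order as the 3-form table (A and B, then A and C, then B and C), and every student answers 100 of the 133 items.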
Evaluating Designs (Benefits and costs)
  • Number of item sets (4 vs 3)
  • Number of items (133 vs 100)
  • Number of (correlation) effects
  • Sample sizes.....

[Figure: number of effects tested with n = N/3 (100), with n = 2N/3 (200), and with total N (300)]

Evaluating Designs (Benefits and costs)
  • Number of effects tested with good power (power ≥ .80)
  • Take multiple effect sizes into account
Evaluating Designs (Benefits and costs)
  • Number of effects tested with good power (power ≥ .80) ... Still Something Missing
  • It's not how many effects
  • But WHICH effects can be tested:
  • Tradeoff Matrix

[Figure: tradeoff matrix of power ratios. Values as extracted: 1.27, 1.20, 2.13, 1.36]

3-Form Design

Student Received Item Set?

----------------------------
          X      A      B        C
          core   peer   parent   other
Form 1    yes    yes    yes      NO
Form 2    yes    yes    NO       yes
Form 3    yes    NO     yes      yes

3-Form Design: Implementation Strategies
  • Core Questions in "X" set
  • Keep related questions together in A or B or C sets
  • Example for Collaboration (Hansen & Graham)
    • X set (core items)
      • A: Hansen Set
      • B: Graham set
      • C: Other
"Back Against the Wall" Concept

3-form design better received if one of these is true:

  • You CAN ask some number of questions (e.g., 100)
    • You WANT to ask some larger number of questions (e.g., 133)
  • You have been asking 133 questions of respondents
    • Data Collectors (or data gate keepers) say you MUST reduce number of questions
Some Future Directions
  • Current power calculations based on zero-order correlations
    • (beneficial) effect of auxiliary variables not taken into account
  • Current power calculations based on level one correlation analysis
    • loss of power will be discounted in multilevel analyses
Change in FMI adding 15 Aux Vars from X set

DV: Trouble
Dataset: AAPT 7th graders