Missing Data: Analysis and Design

John W. Graham

The Prevention Research Center

and

Department of Biobehavioral Health

Penn State University

- (1) Introduction: Missing Data Theory
- (2) A brief analysis demonstration
- Multiple Imputation with NORM and Proc MI
- Amos
- ...break...
- (3) Attrition Issues
- (4) Planned missingness designs: 3-form Design

- Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons.
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.
jgraham@psu.edu

Part I: A Brief Introduction to Analysis with Missing Data

- Analysis procedures were designed for complete data. . .

- Design new model-based procedures
- Missing Data + Parameter Estimation in One Step
- Full Information Maximum Likelihood (FIML) SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)

- Data based procedures
- e.g., Multiple Imputation (MI)

- Two Steps
- Step 1: Deal with the missing data
- (e.g., replace missing values with plausible values)
- Produce a product

- Step 2: Analyze the product as if there were no missing data

- Aren't you somehow helping yourself with imputation?. . .

- does NOT give you something for nothing
- DOES let you make use of all data you have
. . .

- Is the imputed value what the person would have given?

- We do not impute for the sake of the value itself
- We impute to preserve important characteristics of the whole data set
. . .

- unbiased parameter estimation
- e.g., b-weights

- Good estimate of variability
- e.g., standard errors

- best statistical power

- Ignorable
- MCAR: Missing Completely At Random
- MAR: Missing At Random

- Non-Ignorable
- MNAR: Missing Not At Random

- MCAR 1: Cause of missingness completely random process (like coin flip)
- MCAR 2:
- Cause uncorrelated with variables of interest
- Example: parents move

- No bias if cause omitted

- Missingness may be related to measured variables
- But no residual relationship with unmeasured variables
- Example: reading speed

- No bias if you control for measured variables

- Even after controlling for measured variables ...
- Residual relationship with unmeasured variables
- Example: drug use reason for absence
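
The three mechanisms above can be illustrated with a small simulation. This is only a sketch under assumed values: the correlation of 0.6 between Y and Z, the logistic missingness rules, and the helper names are all hypothetical and do not come from the slides. It compares the complete-case mean of Y with a mean that controls for the measured variable Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Z: a measured variable (e.g., rebelliousness); Y: variable of interest.
# Assumed population: mean(Y) = 0, var(Y) = 1, corr(Y, Z) = 0.6.
z = rng.standard_normal(n)
y = 0.6 * z + 0.8 * rng.standard_normal(n)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def complete_case_mean(y, observed):
    """Mean of Y using only the cases where Y is observed."""
    return y[observed].mean()

def adjusted_mean(y, z, observed):
    """Regress Y on Z among observed cases, then predict Y for everyone."""
    slope, intercept = np.polyfit(z[observed], y[observed], 1)
    return (intercept + slope * z).mean()

# True where Y is observed (hypothetical missingness rules).
observed = {
    "MCAR": rng.random(n) < 0.5,               # coin flip
    "MAR": rng.random(n) > logistic(2 * z),    # depends only on measured Z
    "MNAR": rng.random(n) > logistic(2 * y),   # depends on Y itself
}

results = {
    name: (complete_case_mean(y, obs), adjusted_mean(y, z, obs))
    for name, obs in observed.items()
}
for name, (cc, adj) in results.items():
    print(f"{name}: complete-case mean = {cc:+.3f}, adjusted for Z = {adj:+.3f}")
```

In this setup the true mean is 0. The MCAR complete-case mean stays near 0, the MAR complete-case mean is biased until Z is controlled for, and the MNAR estimate stays biased even after controlling for Z.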

- The recommended methods assume missingness is MAR
- But what if the cause of missingness is not MAR?
- Should these methods be used when MAR assumptions not met?
. . .

- Suggested methods work better than “old” methods
- Multiple causes of missingness
- Only small part of missingness may be MNAR

- Suggested methods usually work very well

- Example model of interest: X → Y
X = Program (prog vs control)

Y = Cigarette Smoking

Z = Cause of missingness: say, Rebelliousness (or smoking itself)

- Factors to be considered:
- % Missing (e.g., % attrition)
- rYZ
- rZ,Ymis

- Correlation between
- cause of missingness (Z)
- e.g., rebelliousness (or smoking itself)

- and the variable of interest (Y)
- e.g., Cigarette Smoking

- Correlation between
- cause of missingness (Z)
- e.g., rebelliousness (or smoking itself)

- and missingness on variable of interest
- e.g., Missingness on the Smoking variable

- Missingness on Smoking (Ymis)
- Dichotomous variable:
Ymis = 1: Smoking variable not missing

Ymis = 0: Smoking variable missing

- rZ,Y = 1.0 AND rZ,Ymis = 1.0
- We can get rZ,Y = 1.0 if smoking is the cause of missingness on the smoking variable

- We can get rZ,Ymis = 1.0 like this:
- If person is a smoker, smoking variable is always missing
- If person is not a smoker, smoking variable is never missing

- But is this plausible? ever?

Problems with this statement

- MAR & MNAR are widely misunderstood concepts
- I argue that the cause of missingness is never purely MNAR
- The cause of missingness is virtually never purely MAR either.

- MAR and MNAR form a continuum
- Pure MAR and pure MNAR are just theoretical concepts
- Neither occurs in the real world

- MAR vs MNAR NOT dimension of interest

- Question of Interest: How much estimation bias?
- when cause of missingness cannot be included in the model

- All missing data situations are partly MAR and partly MNAR
- Sometimes it matters ...
- bias affects statistical conclusions

- Often it does not matter
- bias has minimal effects on statistical conclusions
(Collins, Schafer, & Kam, Psych Methods, 2001)

- MAR methods (MI and ML)
- are ALWAYS at least as good as,
- usually better than "old" methods (e.g., listwise deletion)

- Methods designed to handle MNAR missingness are NOT always better than MAR methods

- Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128.
- Graham, J. W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P., & Schafer, J.L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325-366). Washington, D.C.: American Psychological Association.
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

- may produce bias
- you always lose some power
- (because you are throwing away data)

- reasonable if you lose only 5% of cases
- often lose substantial power

- 1 1 1 1
- 0 1 1 1
- 1 0 1 1
- 1 1 0 1
- 1 1 1 0
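
A rough worked calculation shows why listwise deletion can cost so much power when different cases are missing on different variables. It assumes, purely for illustration, that each of k variables is missing independently at the same rate p; neither the assumption nor the function name comes from the slides.

```python
def complete_case_fraction(p_missing, n_variables):
    """Share of cases with no missing values, assuming each variable is
    missing independently with probability p_missing (an assumption)."""
    return (1.0 - p_missing) ** n_variables

for k in (1, 4, 10, 20):
    share = complete_case_fraction(0.05, k)
    print(f"{k:2d} variables, 5% missing each: {share:.1%} complete cases")
```

Under this assumption, a 5% loss on each variable compounds quickly as more variables enter the analysis.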

- Pairwise deletion
- May be of occasional use for preliminary analyses

- Mean substitution
- Never use it

- Regression-based single imputation
- generally not recommended ... except ...

- Multiple Group SEM (Structural Equation Modeling)
- Latent Transition Analysis (Collins et al.)
- A latent class procedure

- Raw Data Maximum Likelihood SEM, aka Full Information Maximum Likelihood (FIML)
- Amos (James Arbuckle)
- LISREL 8.5+ (Jöreskog & Sörbom)
- Mplus (Bengt Muthén)
- Mx (Michael Neale)

- Structural Equation Modeling (SEM) Programs
- In Single Analysis ...
- Good Estimation
- Reasonable standard errors
- Windows Graphical Interface

- That particular model must be what you want

EM Algorithm (ML parameter estimation)

- Norm-Cat-Mix, EMcov, SAS, SPSS
Multiple Imputation

- NORM, Cat, Mix, Pan (Joe Schafer)
- SAS Proc MI
- LISREL 8.5+

- Expectation - Maximization
Alternate between

E-step: predict missing data

M-step: estimate parameters

- Excellent parameter estimates
- But no standard errors
- must use bootstrap
- or multiple imputation

- Problem with Single Imputation: Too Little Variability
- Because of Error Variance
- Because covariance matrix is only one estimate

- Imputed value lies on regression line

- Add random normal residual
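
The "too little variability" problem can be seen in a small simulation. This is a sketch under assumed values (a single predictor X, 50% of Y missing completely at random, var(Y) = 1); none of the numbers or names come from the slides. It compares imputing values on the regression line with imputing the line plus a random normal residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Assumed population: var(Y) = 1, regression slope of Y on X = 0.5.
x = rng.standard_normal(n)
y = 0.5 * x + np.sqrt(0.75) * rng.standard_normal(n)
missing = rng.random(n) < 0.5  # MCAR, for simplicity

# Fit the regression of Y on X using the observed cases only.
slope, intercept = np.polyfit(x[~missing], y[~missing], 1)
residuals = y[~missing] - (intercept + slope * x[~missing])
resid_sd = residuals.std(ddof=2)

# (a) Imputed value lies on the regression line.
y_line = y.copy()
y_line[missing] = intercept + slope * x[missing]

# (b) Regression line plus a random normal residual.
y_noise = y.copy()
y_noise[missing] = (intercept + slope * x[missing]
                    + rng.normal(0.0, resid_sd, missing.sum()))

print(f"variance, imputed on the line:       {y_line.var():.3f}")
print(f"variance, line plus random residual: {y_noise.var():.3f}")
```

Adding the residual restores the error variance, but the imputations still rest on a single estimate of the regression, which is the second source of lost variability noted above.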

- Obtain multiple plausible estimates of the covariance matrix
- ideally draw multiple covariance matrices from population
- Approximate this with
- Bootstrap
- Data Augmentation (Norm)
- MCMC (SAS 8.2, 9)

- stochastic version of EM
- EM
- E (expectation) step: predict missing data
- M (maximization) step: estimate parameters

- Data Augmentation
- I (imputation) step: simulate missing data
- P (posterior) step: simulate parameters

- Parameters from consecutive steps ...
- too related
- i.e., not enough variability

- after 50 or 100 steps of DA ...
covariance matrices are like random draws from the population

- Unbiased Estimation
- Good standard errors
- provided number of imputations is large enough
- too few imputations: reduced power with small effect sizes

From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (in press). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science.

Part II: Illustration of Missing Data Analysis: Multiple Imputation with NORM and Proc MI

- Impute
- Analyze
- Combine results

- Impute 40 datasets
- a missing value gets a different imputed value in each dataset

- Analyze each data set with USUAL procedures
- e.g., SAS, SPSS, LISREL, EQS, STATA

- Save parameter estimates and SE’s

- Average of estimate (b-weight) over 40 imputed datasets

Sum of:

- “within imputation” variance
average squared standard error

- usual kind of variability

- “between imputation” variance
sample variance of parameter estimates over 40 datasets

- variability due to missing data
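
The combining step can be sketched in a few lines. The slide describes the variance as the sum of the within- and between-imputation parts; the sketch below uses the standard form of Rubin's rules, which weights the between-imputation variance by (1 + 1/m). The function name and the five example estimates are hypothetical.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, standard_errors):
    """Combine one parameter across m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)                          # average b-weight
    within = mean([se ** 2 for se in standard_errors])  # average squared SE
    between = variance(estimates)                     # sample variance of estimates
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical b-weights and SEs from m = 5 imputed datasets.
b_weights = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_b, pooled_se = pool_estimates(b_weights, std_errors)
print(f"pooled estimate = {pooled_b:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled SE is larger than any single-dataset SE because it includes the variability due to missing data.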

Starting place

http://methodology.psu.edu

- downloads
missing data software

Joe Schafer's Missing Data Programs

John Graham's Additional NORM Utilities

http://mcgee.hhdev.psu.edu/missing/index.html

- SPSS (NORMSPSS)
- The following six files provide a new (not necessarily better) way to use SPSS regression with NORM imputed datasets
- steps.pdf
- norm2mi.exe
- selectif.sps
- space.exe
- spssinf.bat
- minfer.exe

exit for sample analysis

Inclusive Missing Data Strategies

Auxiliary Variables:

What’s All the Fuss?

John Graham

IES Summer Research Training Institute, June 27, 2007

- A variable correlated with the variables in your model
- but not part of the model
- not necessarily related to missingness
- used to "help" with missing data estimation

- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Graham, J. W., & Collins, L. M. (2007). Using modern missing data methods with auxiliary variables to mitigate the effects of attrition on statistical power. Technical Report, The Methodology Center, Penn State University.

- Example from Graham & Collins (2007)
X  Y  Z
1  1  1    500 complete cases
1  0  1    500 cases missing Y

- X, Y variables in the model (Y sometimes missing)
- Z is auxiliary variable

- Effective sample size (N')
- Analysis involving N cases, with auxiliary variable(s)
- gives statistical power equivalent to N' complete cases without auxiliary variables

- It matters how highly Y and Z (the auxiliary variable) are correlated
- For example (N = 500):
- rYZ = .40: N = 500 gives power of N' = 542 (8% increase)
- rYZ = .60: N = 500 gives power of N' = 608 (22% increase)
- rYZ = .80: N = 500 gives power of N' = 733 (47% increase)
- rYZ = .90: N = 500 gives power of N' = 839 (68% increase)
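
The percentages are the relative gain of N' over the 500 complete cases. The short check below reproduces them from the N' values listed above; it does not derive N' itself, and the variable names are illustrative.

```python
n_complete = 500

# Effective sample sizes N' from the list above, keyed by rYZ.
effective_n = {0.40: 542, 0.60: 608, 0.80: 733, 0.90: 839}

pct_increase = {
    r: round(100 * (n_eff - n_complete) / n_complete)
    for r, n_eff in effective_n.items()
}
for r, pct in pct_increase.items():
    print(f"rYZ = {r:.2f}: N' = {effective_n[r]} ({pct}% increase)")
```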

- Alcohol-related Harm Prevention (AHP) Project with College Students

[Path diagram with variables: Intent Make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5]

Intent  Alcohol  VehRisk  Harm   Freq
______  _______  _______  ____   ____
0       0        0        0        59
0       0        0        1       109
0       0        1        0        99
0       0        1        1       122
0       1        0        0         1
0       1        0        1         2
0       1        1        1         5
1       1        0        0       100
1       1        0        1        46
1       1        1        0       136
1       1        1        1       344   Complete
Total                            1023

1 = data, 0 = missing
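
A pattern table of this kind (1 = data, 0 = missing) can be tabulated directly from a data matrix. The sketch below uses a tiny made-up array with NaN marking missing values; the function name and the data are hypothetical and are not the AHP data.

```python
from collections import Counter
import numpy as np

def missing_data_patterns(data):
    """Count rows by missingness pattern (1 = data, 0 = missing)."""
    observed = (~np.isnan(data)).astype(int)
    return Counter(tuple(int(v) for v in row) for row in observed)

# Hypothetical data: four variables, NaN marks a missing value.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, np.nan, 4.0],
    [np.nan, np.nan, 3.0, np.nan],
    [5.0, 6.0, 7.0, 8.0],
])

patterns = missing_data_patterns(data)
for pattern, freq in sorted(patterns.items()):
    print(*pattern, freq)
```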

[Path diagram (Intent Make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5) with t-values: t = -6; t = 0.2 (ns); t = 5]

[Path diagram (Intent Make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5) with t-values: t = -9; t = 3; t = 7]

N = 1023

[Path diagram (Intent Make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5) with t-values: t = -10; t = 6; t = 8]

Auxiliary Variables:

Intent2, Intent3, Intent4, Intent5, Alcohol2, Alcohol3, Alcohol4, Alcohol5, Risks1, Risks3, Risks4, Risks5, Harm1, Harm2, Harm3, Harm4

N = 1023

- Multiple Imputation
- Amos

- Simply add Auxiliary variables to imputation model
- Couldn't be easier
- Except ...
- There are limits to how many variables can be included in NORM conveniently

- My current thinking:
- add Aux Vars judiciously

Graham, J. W. (2003). Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.

- Extra DV model
- Good for manifest variable models

- Saturated Correlates ("Spider") Model
- Better for latent variable models

NOT Adequate: Aux Variable Changes XY Estimate

Good for Manifest Variable Models: Aux Variable does NOT Change XY Estimate

Good for Latent Variable Models: Aux Variable does NOT Change XY Estimate

Real world version gets a little clumsy ...

but Amos does provide some excellent drawing tools

Large models easier in text-based SEM programs (e.g., LISREL)

Using Missing Data Analysis and Design to Develop Cost-Effective Measurement Strategies in Prevention Research

John Graham

IES Summer Research Training Institute, June 27, 2007

Planned Missingness Designs: The 3-Form Design

- Why would anyone want to plan to have missing data?
- To manage costs, data quality, and statistical power
- In fact, we've been doing it for decades. . .

- Random sampling of
- Subjects
- Items

- Goal:
- Collect smaller, more manageable amount of data
- Draw reasonable conclusions

- Past: Not convenient to do analyses
- Present: Many statistical solutions
- Now is time to consider design alternatives

- The problem:
- 7th graders can answer only 100 questions
- We want to ask 133 questions

- One Solution: The 3-form design

- Project SMART (1982)
- NIDA-funded drug abuse prevention project
- Johnson, Flay, Hansen, Graham

Student Received Item Set?
----------------------------
          X     A     B     C
Form 1    yes   yes   yes   NO
Form 2    yes   yes   NO    yes
Form 3    yes   NO    yes   yes

Item Sets              X     A     B     C     total
asked                  34    33    33    33    = 133

for each student
Form 1                 34    33    33     0    = 100
Form 2                 34    33     0    33    = 100
Form 3                 34     0    33    33    = 100

- Think of it as “leveraging” resources

Form 1: X A B
Form 2: X C A
Form 3: X B C

Form 1: X A B C
Form 2: X C A B
Form 3: X B C A

- Give questions as shown, measure reasons for non-completion
- poor reading
- low motivation
- conscientiousness

- "Managed" missingness

Item Sets
Form     X     A     B     C     total
         33    33    33    33    133
__________________________________________
1        33    33    33     0    100
2        33    33     0    33    100
3        33     0    33    33    100

Item Sets
Form     X     A     B     C     D     total
         33    33    33    33    33    167
__________________________________________
1        33    33    33     0     0    100
2        33    33     0    33     0    100
3        33    33     0     0    33    100
4        33     0    33    33     0    100
5        33     0    33     0    33    100
6        33     0     0    33    33    100

Item Sets
Form     X     A     B     C     D     E     total
         33    33    33    33    33    33    200
__________________________________________
1        33    33    33     0     0     0    100
2        33    33     0    33     0     0    100
3        33    33     0     0    33     0    ...
4        33    33     0     0     0    33
5        33     0    33    33     0     0
6        33     0    33     0    33     0
7        33     0    33     0     0    33
8        33     0     0    33    33     0
9        33     0     0    33     0    33
10       33     0     0     0    33    33

- 3-form Design
- All combinations of 3 sets taken 2 at a time

- SQSD (10-form design)
- All combinations of 5 sets taken 2 at a time

- 6-form design
- All combinations of 4 sets taken 2 at a time

- Complete cases (1-form design)
- All combinations of 2 sets taken 2 at a time
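
The "combinations taken 2 at a time" rule can be written out directly. In this sketch every form gets the core X set plus one pair of the remaining item sets; the function name is illustrative, and the item counts (X = 34, A = B = C = 33) are the ones given for the 3-form design.

```python
from itertools import combinations

def planned_missing_forms(optional_sets, core="X", sets_per_form=2):
    """One form per combination of optional item sets, each with the core set."""
    return [(core,) + combo for combo in combinations(optional_sets, sets_per_form)]

designs = {
    "1-form (complete cases)": planned_missing_forms("AB"),
    "3-form": planned_missing_forms("ABC"),
    "6-form": planned_missing_forms("ABCD"),
    "10-form (SQSD)": planned_missing_forms("ABCDE"),
}
for name, forms in designs.items():
    print(f"{name}: {len(forms)} forms:", " ".join("".join(f) for f in forms))

# Items answered on each form of the 3-form design.
items_per_set = {"X": 34, "A": 33, "B": 33, "C": 33}
items_per_form = [sum(items_per_set[s] for s in form) for form in designs["3-form"]]
print("items per form:", items_per_form, "of", sum(items_per_set.values()), "asked")
```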

- Number of item sets (4 vs 3)
- Number of items (133 vs 100)
- Number of (correlation) effects
- Sample sizes.....

[Figure: Number of Effects tested with n = N/3 (100), with n = 2N/3 (200), and with total N (300)]
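
The number of correlation effects available at each sample size follows from which item sets appear together on a form. The sketch below counts item pairs for the 3-form design, assuming equal numbers of students per form (an assumption) and the item counts X = 34, A = B = C = 33; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

set_sizes = {"X": 34, "A": 33, "B": 33, "C": 33}
forms = [("X", "A", "B"), ("X", "A", "C"), ("X", "B", "C")]

def share_of_sample(set1, set2):
    """Fraction of students who answer both item sets (equal-sized forms assumed)."""
    both = sum(1 for form in forms if set1 in form and set2 in form)
    return Fraction(both, len(forms))

effects = Counter()
for name, k in set_sizes.items():
    # Pairs of items within the same set.
    effects[share_of_sample(name, name)] += k * (k - 1) // 2
for s1, s2 in combinations(set_sizes, 2):
    # Pairs of items from two different sets.
    effects[share_of_sample(s1, s2)] += set_sizes[s1] * set_sizes[s2]

for share, count in sorted(effects.items()):
    print(f"{count} effects tested with n = {share} of N")
```

Only the pairs that cross two different optional sets (A with B, A with C, B with C) fall to N/3; pairs within the X set keep the full N.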

- Number of effects tested with good power (power ≥ .80)
- Take multiple effect sizes into account
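
Whether an effect is "tested with good power" can be approximated for a zero-order correlation with the Fisher z transformation. This is a sketch of a standard large-sample approximation for a two-sided test at alpha = .05, not necessarily the calculation behind the slides; the function names are illustrative.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correlation_power(r, n, z_crit=1.959964):
    """Approximate power of a two-sided test of H0: rho = 0 (Fisher z)."""
    shift = math.atanh(r) * math.sqrt(n - 3)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

# Sample sizes available in a 3-form design with total N = 300.
for r in (0.10, 0.20, 0.30):
    powers = ", ".join(f"n={n}: {correlation_power(r, n):.2f}" for n in (100, 200, 300))
    print(f"r = {r:.2f} -> {powers}")
```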

30-40 scenario = Mild Leveraging Scenario

Effect Size (r)

- Number of effects tested with good power (power ≥ .80) …Still Something Missing
- It's not how many effects
- But WHICH effects can be tested:
- Tradeoff Matrix

[Tradeoff matrix: power ratios 1.27, 1.20, 2.13, 1.36]

Student Received Item Set?
----------------------------
          X     A     B       C
          core  peer  parent  other
Form 1    yes   yes   yes     NO
Form 2    yes   yes   NO      yes
Form 3    yes   NO    yes     yes

- Core Questions in "X" set
- Keep related questions together in A or B or C sets
- Example for Collaboration (Hansen & Graham)
- X set (core items)
- A: Hansen Set
- B: Graham set
- C: Other

3-form design better received if one of these is true:

- You CAN ask some number of questions (e.g., 100)
- You WANT to ask some larger number of questions (e.g., 133)

- You have been asking 133 questions of respondents
- Data Collectors (or data gate keepers) say you MUST reduce number of questions

- Current power calculations based on zero-order correlations
- (beneficial) effect of auxiliary variables not taken into account

- Current power calculations based on level one correlation analysis
- loss of power will be discounted in multilevel analyses

DV: Trouble; Dataset: AAPT 7th graders

- the end