
### Missing Data: Analysis and Design

### Part II: Illustration of Missing Data Analysis: Multiple Imputation with NORM and Proc MI

### Inclusive Missing Data Strategies

### Using Missing Data Analysis and Design to Develop Cost-Effective Measurement Strategies in Prevention Research

John W. Graham

The Prevention Research Center

and

Department of Biobehavioral Health

Penn State University

Presentation in Four Parts

- (1) Introduction: Missing Data Theory
- (2) A brief analysis demonstration
  - Multiple Imputation with NORM and Proc MI
  - Amos
- ...break...
- (3) Attrition Issues
- (4) Planned missingness designs: 3-form Design

Recent Papers

- Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons.
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.

jgraham@psu.edu

Problem with Missing Data

- Analysis procedures were designed for complete data. . .

Solution 1

- Design new model-based procedures
- Missing Data + Parameter Estimation in One Step
- Full Information Maximum Likelihood (FIML) SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)

Solution 2

- Data-based procedures
  - e.g., Multiple Imputation (MI)
- Two Steps
  - Step 1: Deal with the missing data (e.g., replace missing values with plausible values) and produce a product
  - Step 2: Analyze the product as if there were no missing data

FAQ

- Aren't you somehow helping yourself with imputation?. . .

NO. Missing data imputation . . .

- does NOT give you something for nothing
- DOES let you make use of all data you have

. . .

FAQ

- Is the imputed value what the person would have given?

NO. When we impute a value . .

- We do not impute for the sake of the value itself
- We impute to preserve important characteristics of the whole data set

. . .

We want . . .

- unbiased parameter estimation
- e.g., b-weights
- Good estimate of variability
- e.g., standard errors
- best statistical power

Causes of Missingness

- Ignorable
- MCAR: Missing Completely At Random
- MAR: Missing At Random
- Non-Ignorable
- MNAR: Missing Not At Random

MCAR (Missing Completely At Random)

- MCAR 1: Cause of missingness is a completely random process (like a coin flip)
- MCAR 2:
  - Cause is uncorrelated with the variables of interest
  - Example: parents move
- No bias if the cause is omitted

MAR (Missing At Random)

- Missingness may be related to measured variables
- But no residual relationship with unmeasured variables
- Example: reading speed
- No bias if you control for measured variables

MNAR (Missing Not At Random)

- Even after controlling for measured variables ...
- Residual relationship with unmeasured variables
- Example: drug use is the reason for absence
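
The three mechanisms above can be illustrated with a small simulation. This is only a sketch under assumed values: the variable names, the 0.6 slope, and the logistic missingness functions are illustrative choices, not taken from the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z: a measured covariate; Y: the variable of interest (true mean of Y is 0).
z = rng.normal(size=n)
y = 0.6 * z + rng.normal(scale=0.8, size=n)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def make_missing(p_missing):
    """Return a boolean mask: True where Y is set to missing."""
    return rng.random(n) < p_missing

miss_mcar = make_missing(np.full(n, 0.5))   # coin flip
miss_mar = make_missing(sigmoid(2.0 * z))   # depends only on measured Z
miss_mnar = make_missing(sigmoid(2.0 * y))  # depends on Y itself

def complete_case_mean(miss):
    """Mean of Y using only the cases where Y is observed."""
    return y[~miss].mean()

def z_adjusted_mean(miss):
    """Mean of Y after controlling for Z: fit Y on Z in observed cases, predict for all."""
    slope, intercept = np.polyfit(z[~miss], y[~miss], 1)
    return (intercept + slope * z).mean()

for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    print(f"{name}: complete-case mean = {complete_case_mean(miss):+.3f}, "
          f"Z-adjusted mean = {z_adjusted_mean(miss):+.3f}")
```

In this setup the complete-case mean is close to the truth only under MCAR, controlling for Z removes the bias under MAR, and a residual bias remains under MNAR.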

MNAR Causes

- The recommended methods assume missingness is MAR
- But what if the cause of missingness is not MAR?
- Should these methods be used when MAR assumptions are not met?

. . .

YES! These Methods Work!

- Suggested methods work better than “old” methods
- Multiple causes of missingness
- Only small part of missingness may be MNAR
- Suggested methods usually work very well

Revisit Question: What if THE Cause of Missingness is MNAR?

- Example model of interest: X → Y

X = Program (prog vs control)

Y = Cigarette Smoking

Z = Cause of missingness: say, Rebelliousness (or smoking itself)

- Factors to be considered:
  - % Missing (e.g., % attrition)
  - rYZ
  - rZ,Ymis

rYZ

- Correlation between
- cause of missingness (Z)
- e.g., rebelliousness (or smoking itself)
- and the variable of interest (Y)
- e.g., Cigarette Smoking

rZ,Ymis

- Correlation between
- cause of missingness (Z)
- e.g., rebelliousness (or smoking itself)
- and missingness on variable of interest
- e.g., Missingness on the Smoking variable
- Missingness on Smoking (Ymis)
- Dichotomous variable:

Ymis = 1: Smoking variable not missing

Ymis = 0: Smoking variable missing
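
As a rough illustration of the two correlations defined above, the sketch below simulates Z and Y, builds the dichotomous Ymis indicator with the same coding (1 = not missing, 0 = missing), and computes rYZ and rZ,Ymis. The population values and the logistic missingness rule are assumptions made for the example. In real data rYZ can only be computed where Y is observed; here the full simulated Y is used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Z: cause of missingness (e.g., rebelliousness); Y: smoking.
# The population correlation r_YZ = .50 is an assumed value.
z = rng.normal(size=n)
y = 0.5 * z + rng.normal(scale=np.sqrt(0.75), size=n)

# Assumed rule: higher Z makes the smoking variable more likely to be missing.
p_missing = 1.0 / (1.0 + np.exp(-z))
ymis = (rng.random(n) >= p_missing).astype(int)  # 1 = not missing, 0 = missing

r_yz = np.corrcoef(y, z)[0, 1]
r_z_ymis = np.corrcoef(z, ymis)[0, 1]

print(f"r_YZ = {r_yz:.2f}, r_Z,Ymis = {r_z_ymis:.2f}")
```

With this coding, rZ,Ymis is negative when high-Z respondents are the ones who go missing.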

How Could the Cause of Missingness be Purely MNAR?

- rZ,Y = 1.0 AND rZ,Ymis = 1.0
- We can get rZ,Y = 1.0 if smoking is the cause of missingness on the smoking variable

How Could the Cause of Missingness be Purely MNAR?

- We can get rZ,Ymis = 1.0 like this:
- If person is a smoker, smoking variable is always missing
- If person is not a smoker, smoking variable is never missing
- But is this plausible? ever?

What if the cause of missingness is MNAR?

Problems with this statement

- MAR & MNAR are widely misunderstood concepts
- I argue that the cause of missingness is never purely MNAR
- The cause of missingness is virtually never purely MAR either.

MAR vs MNAR:

- MAR and MNAR form a continuum
- Pure MAR and pure MNAR are just theoretical concepts
- Neither occurs in the real world
- MAR vs MNAR is NOT the dimension of interest

MAR vs MNAR: What IS the Dimension of Interest?

- Question of Interest: How much estimation bias?
  - when the cause of missingness cannot be included in the model

Bottom Line ...

- All missing data situations are partly MAR and partly MNAR
- Sometimes it matters ...
- bias affects statistical conclusions
- Often it does not matter
- bias has minimal effects on statistical conclusions

(Collins, Schafer, & Kam, Psych Methods, 2001)

Methods: "Old" vs MAR vs MNAR

- MAR methods (MI and ML)
  - are ALWAYS at least as good as, and
  - usually better than, "old" methods (e.g., listwise deletion)
- Methods designed to handle MNAR missingness are NOT always better than MAR methods

References

- Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128.
- Graham, J. W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P., & Schafer, J.L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325-366). Washington, D.C.: American Psychological Association.
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Old Procedures: Analyze Complete Cases (listwise deletion)

- may produce bias
- you always lose some power
- (because you are throwing away data)
- reasonable if you lose only 5% of cases
- often lose substantial power

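Analyze Complete Cases (listwise deletion)

Five cases on four variables (1 = data, 0 = missing):

| Case | V1 | V2 | V3 | V4 |
|------|----|----|----|----|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 1 |
| 4 | 1 | 1 | 0 | 1 |
| 5 | 1 | 1 | 1 | 0 |

- very common situation
- only 20% (4 of 20) data points missing
- but discard 80% of the cases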
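
The arithmetic on this slide can be checked with a short sketch. The array simply encodes the five patterns shown above; the variable names are illustrative.

```python
import numpy as np

# Rows are cases, columns are variables; 1 = observed, 0 = missing.
observed = np.array([
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])

# Share of individual data points that are missing.
pct_points_missing = 100 * (observed == 0).mean()

# Listwise deletion keeps only cases with every variable observed.
complete_case = observed.all(axis=1)
pct_cases_discarded = 100 * (~complete_case).mean()

print(f"data points missing: {pct_points_missing:.0f}%")
print(f"cases discarded by listwise deletion: {pct_cases_discarded:.0f}%")
```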

Other "Old" Procedures

- Pairwise deletion
- May be of occasional use for preliminary analyses
- Mean substitution
- Never use it
- Regression-based single imputation
- generally not recommended ... except ...

Recommended Model-Based Procedures

- Multiple Group SEM (Structural Equation Modeling)
- Latent Transition Analysis (Collins et al.)
- A latent class procedure

Recommended Model-Based Procedures

- Raw Data Maximum Likelihood SEM, aka Full Information Maximum Likelihood (FIML)
- Amos (James Arbuckle)
- LISREL 8.5+ (Jöreskog & Sörbom)
- Mplus (Bengt Muthén)
- Mx (Michael Neale)

Amos 7, Mx, Mplus, LISREL 8.8

- Structural Equation Modeling (SEM) Programs
- In Single Analysis ...
- Good Estimation
- Reasonable standard errors
- Windows Graphical Interface

Limitation with Model-Based Procedures

- That particular model must be what you want

Recommended Data-Based Procedures

EM Algorithm (ML parameter estimation)

- Norm-Cat-Mix, EMcov, SAS, SPSS

Multiple Imputation

- NORM, Cat, Mix, Pan (Joe Schafer)
- SAS Proc MI
- LISREL 8.5+

EM Algorithm

- Expectation - Maximization

Alternate between

E-step: predict missing data

M-step: estimate parameters

- Excellent parameter estimates
- But no standard errors
- must use bootstrap
- or multiple imputation
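
A minimal sketch of the E-step / M-step alternation, for the simplest case of two normally distributed variables where X is complete and Y is sometimes missing. This is not the NORM or EMcov implementation; the function name, the simulated data, and the fixed iteration count are assumptions for illustration.

```python
import numpy as np

def em_bivariate(x, y, n_iter=200):
    """EM estimates of the mean vector and covariance matrix of (X, Y).

    x is fully observed; y uses np.nan for missing values.
    """
    miss = np.isnan(y)
    # Start from complete-case estimates.
    mu = np.array([x.mean(), y[~miss].mean()])
    cov = np.cov(x[~miss], y[~miss], bias=True)
    for _ in range(n_iter):
        # E-step: expected Y and Y^2 for the missing cases, given X.
        slope = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - slope * cov[0, 1]
        y_hat = np.where(miss, mu[1] + slope * (x - mu[0]), y)
        y_sq = y_hat**2 + np.where(miss, resid_var, 0.0)
        # M-step: ML estimates from the expected sufficient statistics.
        mu = np.array([x.mean(), y_hat.mean()])
        c_xy = (x * y_hat).mean() - mu[0] * mu[1]
        cov = np.array([[x.var(), c_xy],
                        [c_xy, y_sq.mean() - mu[1] ** 2]])
    return mu, cov

# Simulated example (assumed values): Y is missing whenever X > 0.5.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
y_full = 1.0 + 0.5 * x + rng.normal(size=n)
y = np.where(x > 0.5, np.nan, y_full)

mu, cov = em_bivariate(x, y)
complete_case_mean = np.nanmean(y)
print("EM mean of Y:", round(mu[1], 3), " complete-case mean:", round(complete_case_mean, 3))
```

The sketch returns point estimates only, which matches the slide's caveat: standard errors have to come from elsewhere, such as the bootstrap or multiple imputation.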

Multiple Imputation

- Problem with Single Imputation: Too Little Variability
- Because of Error Variance
- Because covariance matrix is only one estimate

Too Little Error Variance

- Imputed value lies on regression line

Restore Error . . .

- Add random normal residual
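
The loss of error variance, and its restoration by adding a random normal residual, can be seen in a quick simulation. The data-generating values below are assumptions chosen so that Var(Y) = 1 in the population.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=np.sqrt(0.75), size=n)  # Var(Y) = 1
miss = rng.random(n) < 0.5                              # half of Y missing

# Regression of Y on X, estimated from the observed cases.
slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
pred = intercept + slope * x
resid_sd = np.std(y[~miss] - pred[~miss], ddof=2)

# Imputed values that lie exactly on the regression line.
on_line = np.where(miss, pred, y)
# Imputed values with a random normal residual added back.
with_noise = np.where(miss, pred + rng.normal(scale=resid_sd, size=n), y)

print("variance, imputed on the line:  ", round(on_line.var(ddof=1), 3))
print("variance, with residual restored:", round(with_noise.var(ddof=1), 3))
```

This addresses only the first problem; the regression line itself is still a single estimate, which is the point of the next slide.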

Covariance Matrix (Regression Line) only One Estimate

- Obtain multiple plausible estimates of the covariance matrix
- ideally draw multiple covariance matrices from population
- Approximate this with
- Bootstrap
- Data Augmentation (Norm)
- MCMC (SAS 8.2, 9)

Data Augmentation

- stochastic version of EM
- EM
- E (expectation) step: predict missing data
- M (maximization) step: estimate parameters
- Data Augmentation
- I (imputation) step: simulate missing data
- P (posterior) step: simulate parameters

Data Augmentation

- Parameters from consecutive steps ...
  - are too closely related
  - i.e., not enough variability
- After 50 or 100 steps of DA ...
  - covariance matrices are like random draws from the population

Multiple Imputation Allows:

- Unbiased Estimation
- Good standard errors
- provided number of imputations is large enough
- too few imputations: reduced power with small effect sizes

From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (in press). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science.

Multiple Imputation: Basic Steps

- Impute
- Analyze
- Combine results

Imputation and Analysis

- Impute 40 datasets
- a missing value gets a different imputed value in each dataset
- Analyze each data set with USUAL procedures
- e.g., SAS, SPSS, LISREL, EQS, STATA
- Save parameter estimates and SE’s

Combine the Results: Parameter Estimates to Report

- Average of estimate (b-weight) over 40 imputed datasets

Combine the Results: Standard Errors to Report

Square root of the sum of:

- “within imputation” variance
  - average squared standard error
  - usual kind of variability
- “between imputation” variance
  - sample variance of parameter estimates over 40 datasets
  - variability due to missing data
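
The combining step can be sketched as below. The function follows Rubin's standard combining rules, which multiply the between-imputation variance by (1 + 1/m) before adding it to the within-imputation variance; with m = 40 that factor is 1.025, so the slide's plain "sum" is a close approximation. The function name and the three input values are hypothetical.

```python
import math

def pool(estimates, standard_errors):
    """Combine one parameter's estimates and SEs across m imputed datasets."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # estimate to report
    within = sum(se**2 for se in standard_errors) / m       # average squared SE
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # sample variance
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical b-weights and SEs from three imputed datasets.
estimate, se = pool([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {se:.3f}")
```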

Materials for SPSS Regression

Starting place

http://methodology.psu.edu

- downloads

missing data software

Joe Schafer's Missing Data Programs

John Graham's Additional NORM Utilities

http://mcgee.hhdev.psu.edu/missing/index.html

Materials for SPSS Regression

- SPSS (NORMSPSS)
- The following six files provide a new (not necessarily better) way to use SPSS regression with NORM imputed datasets
- steps.pdf
- norm2mi.exe
- selectif.sps
- space.exe
- spssinf.bat
- minfer.exe

Auxiliary Variables:

What’s All the Fuss?

John Graham

IES Summer Research Training Institute, June 27, 2007

What Is an Auxiliary Variable?

- A variable correlated with the variables in your model
- but not part of the model
- not necessarily related to missingness
- used to "help" with missing data estimation

Benefit of Auxiliary Variables

- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Graham, J. W., & Collins, L. M. (2007). Using modern missing data methods with auxiliary variables to mitigate the effects of attrition on statistical power. Technical Report, The Methodology Center, Penn State University.

Benefit of Auxiliary Variables

- Example from Graham & Collins (2007)

| X | Y | Z | Cases |
|---|---|---|-------|
| 1 | 1 | 1 | 500 complete cases |
| 1 | 0 | 1 | 500 cases missing Y |

- X, Y variables in the model (Y sometimes missing)
- Z is auxiliary variable

Benefit of Auxiliary Variables

- Effective sample size (N')
- Analysis involving N cases, with auxiliary variable(s)
- gives statistical power equivalent to N' complete cases without auxiliary variables

Benefit of Auxiliary Variables

- It matters how highly Y and Z (the auxiliary variable) are correlated
- For example, the increase:

| rYZ | N | N' (equivalent power) | Increase |
|-----|-----|-----|-----|
| .40 | 500 | 542 | 8% |
| .60 | 500 | 608 | 22% |
| .80 | 500 | 733 | 47% |
| .90 | 500 | 839 | 68% |
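
The percentages in parentheses are the relative gain of N' over N. The sketch below only reproduces that arithmetic from the values on the slide; it does not derive N' itself, since the formula is not given here.

```python
n = 500

# Effective sample sizes N' reported on the slide, keyed by r_YZ.
effective_n = {0.40: 542, 0.60: 608, 0.80: 733, 0.90: 839}

# Relative gain in effective sample size, as a rounded percentage.
gain_pct = {r: round(100 * (n_eff - n) / n) for r, n_eff in effective_n.items()}

for r, pct in gain_pct.items():
    print(f"r_YZ = {r:.2f}: N' = {effective_n[r]} ({pct}% increase)")
```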

Empirical Illustration: The Model

- Alcohol-related Harm Prevention (AHP) Project with College Students

[Path diagram. Variables: Intent make Vehicle Plans 1, Alcohol Use 1, Took Vehicle Risks 3, Physical Harm 5]

How Much Data?

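| Intent | Alcohol | VehRisk | Harm | Freq |
|--------|---------|---------|------|------|
| 0 | 0 | 0 | 0 | 59 |
| 0 | 0 | 0 | 1 | 109 |
| 0 | 0 | 1 | 0 | 99 |
| 0 | 0 | 1 | 1 | 122 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 2 |
| 0 | 1 | 1 | 1 | 5 |
| 1 | 1 | 0 | 0 | 100 |
| 1 | 1 | 0 | 1 | 46 |
| 1 | 1 | 1 | 0 | 136 |
| 1 | 1 | 1 | 1 | 344 (Complete) |
| Total | | | | 1023 |

1 = data, 0 = missing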
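
The pattern frequencies above can be tallied to see how much data each variable actually has. The sketch just re-enters the table from the slide; the variable names in the code are shorthand for the column labels.

```python
variables = ["Intent", "Alcohol", "VehRisk", "Harm"]

# (pattern, frequency) pairs copied from the table; 1 = data, 0 = missing.
patterns = [
    ((0, 0, 0, 0), 59), ((0, 0, 0, 1), 109), ((0, 0, 1, 0), 99),
    ((0, 0, 1, 1), 122), ((0, 1, 0, 0), 1), ((0, 1, 0, 1), 2),
    ((0, 1, 1, 1), 5), ((1, 1, 0, 0), 100), ((1, 1, 0, 1), 46),
    ((1, 1, 1, 0), 136), ((1, 1, 1, 1), 344),
]

total = sum(freq for _, freq in patterns)
complete = sum(freq for pat, freq in patterns if all(pat))

# Number of cases with data on each variable.
observed_n = {
    name: sum(freq for pat, freq in patterns if pat[i])
    for i, name in enumerate(variables)
}

print(f"total N = {total}, complete cases = {complete} ({100 * complete / total:.1f}%)")
for name, n_obs in observed_n.items():
    print(f"{name}: {n_obs} observed ({100 * n_obs / total:.1f}%)")
```

Each variable is observed for well over half the sample, yet only about a third of the cases are complete, which is why complete-case analysis discards so much usable data here.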

Empirical Illustration: Complete Cases (N = 344)

[Path diagram. Variables: Intent make Vehicle Plans 1, Took Vehicle Risks 3, Physical Harm 5, Alcohol Use 1. Path t-values as labeled: t = -6; t = 0.2 (ns); t = 5]

Empirical Illustration: Simple MI (no Aux Vars), N = 1023

[Path diagram. Variables: Intent make Vehicle Plans 1, Took Vehicle Risks 3, Physical Harm 5, Alcohol Use 1. Path t-values as labeled: t = -9; t = 3; t = 7]

Empirical Illustration: MI with Aux Vars, N = 1023

[Path diagram. Variables: Intent make Vehicle Plans 1, Took Vehicle Risks 3, Physical Harm 5, Alcohol Use 1. Path t-values as labeled: t = -10; t = 6; t = 8]

Auxiliary Variables:

- Intent2, Intent3, Intent4, Intent5
- Alcohol2, Alcohol3, Alcohol4, Alcohol5
- Risks1, Risks3, Risks4, Risks5
- Harm1, Harm2, Harm3, Harm4

Methods for Adding Auxiliary Variables

- Multiple Imputation
- Amos

Adding Auxiliary Variables: MI

- Simply add Auxiliary variables to imputation model
- Couldn't be easier
- Except ...
- There are limits to how many variables can be included in NORM conveniently
- My current thinking:
- add Aux Vars judiciously

Adding Auxiliary Variables: Amos (and other FIML/SEM programs)

Graham, J. W. (2003). Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.

- Extra DV model
- Good for manifest variable models
- Saturated Correlates ("Spider") Model
- Better for latent variable models

Spider Model (Graham, 2003)

[Path diagram with auxiliary variable (Aux)]

- Good for latent variable models
- Aux variable does NOT change the XY estimate

Extra DV Model (Amos)

Real world version gets a little clumsy ...

but Amos does provide some excellent drawing tools

Large models easier in text-based SEM programs (e.g., LISREL)

Planned Missingness

- Why would anyone want to plan to have missing data?
- To manage costs, data quality, and statistical power
- In fact, we've been doing it for decades. . .

Common Sampling Designs

- Random sampling of
- Subjects
- Items
- Goal:
- Collect smaller, more manageable amount of data
- Draw reasonable conclusions

Why NOT Use Planned Missingness?

- Past: Not convenient to do analyses
- Present: Many statistical solutions
- Now is time to consider design alternatives

Lighten Burden on Respondents

- The problem:
- 7th graders can answer only 100 questions
- We want to ask 133 questions
- One Solution: The 3-form design

Idea Grew out of Practical Need

- Project SMART (1982)
- NIDA-funded drug abuse prevention project
- Johnson, Flay, Hansen, Graham

3-Form Design

Student Received Item Set?

| | X | A | B | C |
|--------|-----|-----|-----|-----|
| Form 1 | yes | yes | yes | NO |
| Form 2 | yes | yes | NO | yes |
| Form 3 | yes | NO | yes | yes |

3-Form Design

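| | X | A | B | C | Total |
|---|---|---|---|---|---|
| Item sets: items asked | 34 | 33 | 33 | 33 | 133 |
| Form 1: items for each student | 34 | 33 | 33 | 0 | 100 |
| Form 2: items for each student | 34 | 33 | 0 | 33 | 100 |
| Form 3: items for each student | 34 | 0 | 33 | 33 | 100 |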

- Think of it as “leveraging” resources
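
The item counts for the 3-form design can be verified with a few lines. The set sizes are those from the table (34, 33, 33, 33); the dictionary names are illustrative.

```python
# Number of items in each item set.
set_sizes = {"X": 34, "A": 33, "B": 33, "C": 33}

# Item sets that each form includes (every form omits exactly one of A, B, C).
forms = {1: ["X", "A", "B"], 2: ["X", "A", "C"], 3: ["X", "B", "C"]}

items_asked = sum(set_sizes.values())
items_per_form = {
    form: sum(set_sizes[s] for s in sets) for form, sets in forms.items()
}

print(f"items asked across the study: {items_asked}")
for form, count in items_per_form.items():
    print(f"form {form}: {count} items per student")
```

Each student answers 100 items while the study as a whole collects data on 133.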

3-Form Design: Item Order

- Form 1: X A B
- Form 2: X C A
- Form 3: X B C

3-Form Design: Item Order

- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A

- Give questions as shown, measure reasons for non-completion
- poor reading
- low motivation
- conscientiousness
- "Managed" missingness

3-Form Design (Graham, Flay et al., 1984)

| Form | X | A | B | C | Total |
|------|----|----|----|----|-------|
| Item sets | 33 | 33 | 33 | 33 | 133 |
| 1 | 33 | 33 | 33 | 0 | 100 |
| 2 | 33 | 33 | 0 | 33 | 100 |
| 3 | 33 | 0 | 33 | 33 | 100 |

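6-Form Design (e.g., King, King et al., 2002)

| Form | X | A | B | C | D | Total |
|------|----|----|----|----|----|-------|
| Item sets | 33 | 33 | 33 | 33 | 33 | 167 |
| 1 | 33 | 33 | 33 | 0 | 0 | 100 |
| 2 | 33 | 33 | 0 | 33 | 0 | 100 |
| 3 | 33 | 33 | 0 | 0 | 33 | 100 |
| 4 | 33 | 0 | 33 | 33 | 0 | 100 |
| 5 | 33 | 0 | 33 | 0 | 33 | 100 |
| 6 | 33 | 0 | 0 | 33 | 33 | 100 |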

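Split Questionnaire Survey Design: SQSD (Raghunathan & Grizzle, 1995)

| Form | X | A | B | C | D | E | Total |
|------|----|----|----|----|----|----|-------|
| Item sets | 33 | 33 | 33 | 33 | 33 | 33 | 200 |
| 1 | 33 | 33 | 33 | 0 | 0 | 0 | 100 |
| 2 | 33 | 33 | 0 | 33 | 0 | 0 | 100 |
| 3 | 33 | 33 | 0 | 0 | 33 | 0 | ... |
| 4 | 33 | 33 | 0 | 0 | 0 | 33 | |
| 5 | 33 | 0 | 33 | 33 | 0 | 0 | |
| 6 | 33 | 0 | 33 | 0 | 33 | 0 | |
| 7 | 33 | 0 | 33 | 0 | 0 | 33 | |
| 8 | 33 | 0 | 0 | 33 | 33 | 0 | |
| 9 | 33 | 0 | 0 | 33 | 0 | 33 | |
| 10 | 33 | 0 | 0 | 0 | 33 | 33 | |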

Family of Designs

- 3-form Design
- All combinations of 3 sets taken 2 at a time
- SQSD (10-form design)
- All combinations of 5 sets taken 2 at a time
- 6-form design
- All combinations of 4 sets taken 2 at a time
- Complete cases (1-form design)
- All combinations of 2 sets taken 2 at a time
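
The whole family can be generated from one rule: every form gets the X set plus one pair of the remaining sets. The sketch below enumerates the forms with `itertools.combinations`; the function name `design_forms` and the set labels A to E are illustrative.

```python
from itertools import combinations

def design_forms(n_sets):
    """Forms for a design with n_sets rotating item sets, taken 2 at a time.

    Every form also includes the common X set.
    """
    labels = "ABCDE"[:n_sets]
    return [("X",) + pair for pair in combinations(labels, 2)]

for n_sets, name in [(2, "Complete cases (1-form)"), (3, "3-form"),
                     (4, "6-form"), (5, "SQSD (10-form)")]:
    forms = design_forms(n_sets)
    print(f"{name}: {len(forms)} form(s)")
    for i, form in enumerate(forms, start=1):
        print(f"  Form {i}: {' '.join(form)}")
```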

Evaluating Designs (Benefits and costs)

- Number of item sets (4 vs 3)
- Number of items (133 vs 100)
- Number of (correlation) effects
- Sample sizes ...

[Table: number of effects tested with n = N/3 (100), with n = 2N/3 (200), and with total N (300)]
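
Where the three sample sizes come from can be shown by counting, for each pair of item sets, how many students received both. This sketch assumes N = 300 split equally across the three forms, as on the slide; the helper `pair_n` is a hypothetical name.

```python
from itertools import combinations_with_replacement

N = 300
forms = {1: {"X", "A", "B"}, 2: {"X", "A", "C"}, 3: {"X", "B", "C"}}
n_per_form = N // len(forms)  # equal allocation of students to forms

def pair_n(set1, set2):
    """Number of students who answered items from both item sets."""
    return sum(n_per_form for sets in forms.values()
               if set1 in sets and set2 in sets)

pair_sample_sizes = {
    (a, b): pair_n(a, b) for a, b in combinations_with_replacement("XABC", 2)
}

for (a, b), n in pair_sample_sizes.items():
    print(f"correlation between an item in {a} and an item in {b}: n = {n}")
```

Correlations within X use the total N, correlations between items in two different rotating sets (such as A and B) use N/3, and all other pairs use 2N/3.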

Evaluating Designs (Benefits and costs)

- Number of effects tested with good power (power ≥ .80)
- Take multiple effect sizes into account

30-40 scenario = Mild Leveraging Scenario

[Figure; axis label: Effect Size (r)]
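
Whether an effect reaches power ≥ .80 depends on both the effect size r and the n available for that pair of items. A rough way to check this is the Fisher z approximation for testing a correlation against zero. The two-tailed alpha of .05 and the example values of r are assumptions, and this is not necessarily the calculation used in the presentation.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_r(r, n, z_crit=1.959964):
    """Approximate power to detect correlation r with sample size n.

    Uses the Fisher z transformation; z_crit is for two-tailed alpha = .05.
    """
    delta = math.atanh(r) * math.sqrt(n - 3)
    return normal_cdf(delta - z_crit) + normal_cdf(-delta - z_crit)

for n in (100, 200, 300):
    for r in (0.10, 0.20, 0.30):
        p = power_r(r, n)
        flag = "good" if p >= 0.80 else "low"
        print(f"n = {n}, r = {r:.2f}: power = {p:.2f} ({flag})")
```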

Evaluating Designs (Benefits and costs)

- Number of effects tested with good power (power ≥ .80) ... Still Something Missing
- It's not how many effects
- But WHICH effects can be tested:
- Tradeoff Matrix

3-Form Design

Student Received Item Set?

| | X (core) | A (peer) | B (parent) | C (other) |
|--------|-----|-----|-----|-----|
| Form 1 | yes | yes | yes | NO |
| Form 2 | yes | yes | NO | yes |
| Form 3 | yes | NO | yes | yes |

3-Form Design: Implementation Strategies

- Core Questions in "X" set
- Keep related questions together in A or B or C sets
- Example for Collaboration (Hansen & Graham)
- X set (core items)
- A: Hansen Set
- B: Graham set
- C: Other

"Back Against the Wall" Concept

3-form design better received if one of these is true:

- You CAN ask some number of questions (e.g., 100)
- You WANT to ask some larger number of questions (e.g., 133)
- You have been asking 133 questions of respondents
- Data Collectors (or data gate keepers) say you MUST reduce number of questions

Some Future Directions

- Current power calculations based on zero-order correlations
- (beneficial) effect of auxiliary variables not taken into account
- Current power calculations based on level one correlation analysis
- loss of power will be discounted in multilevel analyses

Change in FMI adding 15 Aux Vars from X set

- DV: Trouble
- Dataset: AAPT 7th graders
