On the Merits of Planning and Planning for Missing Data*

  • On the Merits of Planning and Planning for Missing Data*

  • *You’re a fool for not using planned missing data designs

Todd D. Little

University of Kansas

Director, Quantitative Training Program

Director, Center for Research Methods and Data Analysis

Director, Undergraduate Social and Behavioral Sciences Methodology Minor

Member, Developmental Psychology Training Program

crmda.KU.edu

Workshop presented 05-21-2012 at the Max Planck Institute for Human Development, Berlin, Germany

Very Special Thanks to: Mijke Rhemtulla & Wei Wu


Road Map

  • Learn about the different types of missing data

  • Learn about ways in which the missing data process can be recovered

  • Understand why imputing missing data is not cheating

    • Learn why NOT imputing missing data is more likely to lead to errors in generalization!

  • Learn about intentionally missing designs

  • Discuss imputation with large longitudinal datasets

  • Introduce a simple method for significance testing


Key Considerations

  • Recoverability

    • Is it possible to recover what the sufficient statistics would have been had there been no missing data?

      • (sufficient statistics = means, variances, and covariances)

    • Is it possible to recover what the parameter estimates of a model would have been had there been no missing data?

  • Bias

    • Are the sufficient statistics/parameter estimates systematically different from what they would have been had there been no missing data?

  • Power

    • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?

crmda.KU.edu


Types of missing data

Types of Missing Data

  • Missing Completely at Random (MCAR)

    • Missingness has no association with unobserved variables (no selective process) and no association with observed variables

  • Missing at Random (MAR)

    • Missingness has no association with unobserved variables, but may be related to observed variables

      • "Random" in the statistical sense: the missingness is predictable from the observed data

  • Missing Not at Random (MNAR; non-random, selective missingness)

    • Missingness has some association with unobserved variables, and may also be related to observed variables
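To make these definitions concrete, here is a minimal simulation sketch (Python with NumPy and pandas; it is not from the original slides, and the effect sizes and missingness rates are arbitrary). It deletes values of Y completely at random (MCAR), as a function of an observed X (MAR), and as a function of Y itself (MNAR), and then prints the complete-case mean of Y under each mechanism.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                     # fully observed covariate
y = 0.5 * x + rng.normal(size=n)           # outcome that will receive missing values
df = pd.DataFrame({"x": x, "y": y})

# MCAR: missingness is unrelated to anything
mcar = rng.uniform(size=n) < 0.3

# MAR: missingness depends only on the observed x
mar = rng.uniform(size=n) < 1 / (1 + np.exp(-(x - 0.5)))

# MNAR: missingness depends on the (partly unobserved) y itself
mnar = rng.uniform(size=n) < 1 / (1 + np.exp(-(y - 0.5)))

print("full-data mean of y:", round(df["y"].mean(), 3))
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    y_obs = df["y"].where(~mask)           # masked values become NaN
    print(name, "complete-case mean:", round(y_obs.mean(), 3))

With a large n, the MCAR complete-case mean stays close to the full-data mean, while the MAR and MNAR means drift away from it, which is why the observed correlates of missingness (here, x) need to be brought into the analysis.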


Effects of imputing missing data


Statistical Power: Will always be greater when missing data are imputed than when incomplete cases are simply dropped!


Bad Missing Data Corrections

  • List-wise Deletion

    • If a single data point is missing, delete subject

    • N is uniform but small

    • Variances biased, means biased

    • Acceptable only if power is not an issue and the incomplete data is MCAR

  • Pair-wise Deletion

    • If a data point is missing, delete paired data points when calculating the correlation

    • N varies per correlation

    • Variances biased, means biased

    • Matrix often non-positive definite

    • Acceptable only if power is not an issue and the incomplete data is MCAR


Bad Imputation Techniques

  • Sample-wise Mean Substitution

    • Use the mean of the sample for any missing value of a given individual

    • Variances reduced

    • Correlations biased

  • Subject-wise Mean Substitution

    • Use the mean score of other items for a given missing value

      • Depends on the homogeneity of the items used

      • Is like regression imputation with regression weights fixed at 1.0


Questionable Imputation Techniques

  • Regression Imputation – Focal Item Pool

    • Regress the variable with missing data onto the other items selected for a given analysis

    • Variances reduced

    • Assumes the missingness is MCAR or MAR

  • Regression Imputation – Full Item Pool

    • Variances reduced

    • Attempts to account for MNAR insofar as items in the pool correlate with the unobserved variables responsible for the missingness


Modern Missing Data Analysis

MI or FIML

  • In 1978, Rubin proposed Multiple Imputation (MI)

    • An approach especially well suited for use with large public-use databases.

    • First suggested in 1978 and developed more fully in 1987.

    • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm.

  • Beginning in the 1980s, likelihood-based approaches were developed.

    • Multiple group SEM

    • Full Information Maximum Likelihood (FIML).

      • An approach well suited to more circumscribed models


Full Information Maximum Likelihood

  • FIML estimates model parameters by maximizing the likelihood of each case's observed data (equivalently, minimizing the casewise -2 log-likelihood), using the portions of the model-implied mean vector and covariance matrix that correspond to the variables that case actually observed.

    • Because each case's likelihood is based on its own unique response pattern, there is no need to fill in the missing data.

  • The casewise log-likelihoods are then summed to create a combined likelihood function for the whole data set.

    • Cases with more missing data contribute less information to the combined likelihood than cases with more complete response patterns, so the loss of information is accounted for.

  • Formally, the casewise log-likelihood that FIML sums and maximizes is

    log L_i = K_i - (1/2) log|Σ_i| - (1/2) (y_i - μ_i)' Σ_i^(-1) (y_i - μ_i)

    where y_i contains the variables observed for case i, μ_i and Σ_i are the corresponding portions of the model-implied mean vector and covariance matrix, and K_i is a constant that depends on the number of variables observed for case i.
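As a concrete illustration of that function, the sketch below (Python with NumPy; a minimal illustration, not code from the workshop) evaluates the summed casewise log-likelihood for a small data matrix containing NaNs, given candidate values of the mean vector and covariance matrix. An SEM program's optimizer would maximize this quantity over the model parameters.

import numpy as np

def fiml_loglik(data, mu, sigma):
    """Sum of casewise normal log-likelihoods, using only each case's observed variables."""
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)                      # which variables this case observed
        if not obs.any():
            continue                              # a fully missing case adds no information
        y = row[obs]
        m = mu[obs]                               # sub-vector of the mean vector
        s = sigma[np.ix_(obs, obs)]               # sub-matrix of the covariance matrix
        diff = y - m
        _, logdet = np.linalg.slogdet(s)
        total += -0.5 * (obs.sum() * np.log(2 * np.pi) + logdet
                         + diff @ np.linalg.solve(s, diff))
    return total

# toy usage: evaluate the function at crude sample-based starting values
data = np.array([[1.0, 2.0], [2.0, np.nan], [np.nan, 1.5],
                 [0.5, 1.0], [1.5, 2.5], [0.8, 1.2]])
complete = data[~np.isnan(data).any(axis=1)]
print(fiml_loglik(data, np.nanmean(data, axis=0), np.cov(complete, rowvar=False)))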


Multiple Imputation

  • Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.

    • By filling in m separate estimates for each missing value we can account for the uncertainty in that datum’s true population value.

  • Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong’s (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.

    • SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates).

  • After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin’s Rules (Rubin, 1987).
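For readers who want to see the combining step spelled out, here is a minimal sketch (Python; not tied to any particular imputation program) of Rubin's (1987) rules for pooling m point estimates of a single parameter and their squared standard errors. The same quantities also yield the fraction of missing information discussed later in the talk.

import numpy as np

def pool_rubin(estimates, variances):
    """Combine m estimates of one parameter and their sampling variances (SE^2)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()                 # pooled point estimate
    w = variances.mean()                    # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    t = w + (1 + 1 / m) * b                 # total variance of the pooled estimate
    fmi = (1 + 1 / m) * b / t               # approximate fraction of missing information
    return qbar, np.sqrt(t), fmi

# example: a regression slope estimated in m = 5 imputed datasets (made-up numbers)
estimate, pooled_se, fmi = pool_rubin([0.42, 0.45, 0.40, 0.44, 0.43],
                                      [0.010, 0.011, 0.009, 0.010, 0.012])
print(round(estimate, 3), round(pooled_se, 3), round(fmi, 3))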


Good Data Imputation Techniques

  • (But only if variables related to missingness are included in analysis, or missingness is MCAR)

  • EM Imputation

    • Imputes the missing data values iteratively, starting with the E-step

    • The E (estimation) step is a stochastic regression-based imputation of the missing values

    • The M (maximization) step calculates a complete covariance matrix based on the estimated values

    • The E-step is then repeated, with the regressions now based on the covariance matrix from the previous M-step

    • The E- and M-steps cycle until the estimates no longer change from one iteration to the next

  • MCMC imputation is a more flexible (but computer-intensive) algorithm.


Good Data Imputation Techniques

  • (But only if variables related to missingness are included in analysis, or missingness is MCAR)

  • Multiple (EM or MCMC) Imputation

    • Impute N (say 20) datasets

    • Each data set is based on a resampling plan of the original sample

      • Mimics a random selection of another sample from the population

    • Run your analyses N times

      • Calculate the mean and standard deviation of the N analyses


Fraction Missing

  • Fraction Missing is a measure of the efficiency lost due to missing data: the extent to which parameter estimates have larger standard errors than they would have had if all data had been observed.

  • It is (approximately) a ratio of variances: the between-imputation variance (with its finite-m correction) over the total parameter variance,

    FMI ≈ (B + B/m) / (W + B + B/m)

    where W is the within-imputation variance, B is the between-imputation variance, and m is the number of imputations.


  • Fraction of Missing Information (asymptotic formula)

  • Varies by parameter in the model

  • Is typically smaller for MCAR than MAR data


Estimate Missing Data With SAS

 Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6

   1     65     95     95    100     23     25     25     27
   2     10     10     40     25     25     27     28     27
   3     95    100    100    100     27     29     29     28
   4     90    100    100    100     30     30     27     29
   5     30     80     90    100     23     29     29     30
   6     40     50      .      .     28     27      3      3
   7     40     70    100     95     29     29     30     30
   8     95    100    100    100     28     30     29     30
   9     50     80     75     85     26     29     27     25
  10     55    100    100    100     30     30     30     30
  11     50    100    100    100     30     27     30     24
  12     70     95    100    100     28     28     28     29
  13    100    100    100    100     30     30     30     30
  14     75     90    100    100     30     30     29     30
  15      0      5     10      .      3      3      3      .

 ("." denotes a missing value in SAS)

PROC MI

PROC MI data=sample out=outmi
        seed=37851 nimpute=100;

   EM maxiter=1000;

   MCMC initial=em (maxiter=1000);

   VAR BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;

run;

  • out= designates the output data set for the imputed data

  • nimpute= is the number of imputed datasets (the default is 5)

  • VAR lists the variables to use in the imputation


PROC MI output: Imputed dataset

 Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6

   1             1     65     95     95    100     23     25     25     27
   2             1     10     10     40     25     25     27     28     27
   3             1     95    100    100    100     27     29     29     28
   4             1     90    100    100    100     30     30     27     29
   5             1     30     80     90    100     23     29     29     30
   6             1     40     50     21     12     28     27      3      3
   7             1     40     70    100     95     29     29     30     30
   8             1     95    100    100    100     28     30     29     30
   9             1     50     80     75     85     26     29     27     25
  10             1     55    100    100    100     30     30     30     30
  11             1     50    100    100    100     30     27     30     24
  12             1     70     95    100    100     28     28     28     29
  13             1    100    100    100    100     30     30     30     30
  14             1     75     90    100    100     30     30     29     30
  15             1      0      5     10      8      3      3      3      2

 (the values that were missing in rows 6 and 15 are now filled in)


What to Say to Reviewers:

  • I pity the fool who does not impute

    • Mr. T

  • If you compute you must impute

    • Johnnie Cochran

  • Go forth and impute with impunity

    • Todd Little

  • If math is God’s poetry, then statistics are God’s elegantly reasoned prose

    • Bill Bukowski


Planned missing data designs

  • In planned missing data designs, participants are randomly assigned to conditions in which they do not respond to all items, all measures, and/or all measurement occasions

  • Why would you want to do this?

    • Long assessments can reduce data quality

    • Repeated assessments can induce practice effects

    • Collecting data can be time- and cost-intensive

    • Less taxing assessments may reduce unplanned missingness


  • Cross-Sectional Designs

    • Matrix sampling (brief)

    • Three-Form Design (and Variations)

    • Two-Method Measurement (very cool)

  • Longitudinal Designs

    • Developmental Time-Lag

    • Wave- to Age-based designs

    • Monotonic Sample Reduction

    • Growth-Curve Planned Missing


Multiple matrix sampling


Test a few participants on the full item bank


Or, randomly sample items and people…


  • Assumptions

    • The K items are a random sample from a population of items (just as N participants are a random sample from a population)

  • Limitations

    • Properties of individual items or relations between items are not of interest

  • Not used much outside the ability-testing domain.


3-Form Intentionally Missing Design

  • Graham, Taylor, Olchowski, & Cumsille (2006)

  • Raghunathan & Grizzle (1995) “split questionnaire design”

  • Wacholder et al. (1994) “partial questionnaire design”


3-form design

  • What goes in the Common Set?


3-form design: Example

  • 21 questions made up of seven 3-question subtests
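Here is a small sketch of how such a design can be generated (Python; the item counts, labels, and simulated responses are illustrative placeholders rather than the 21-item example above). Every participant answers the common set X, and each of the three forms omits one of the sets A, B, or C.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# item sets: X is common to all forms; each form omits one of A, B, C
sets = {s: [f"{s}{i}" for i in range(1, 6)] for s in ["X", "A", "B", "C"]}
forms = {1: ["X", "A", "B"], 2: ["X", "A", "C"], 3: ["X", "B", "C"]}

n = 9
assigned_form = rng.integers(1, 4, size=n)        # random form (1, 2, or 3) per participant

all_items = sum(sets.values(), [])
responses = pd.DataFrame(rng.normal(size=(n, len(all_items))), columns=all_items)

# blank out the item set that each participant's form does not include
for person, form in enumerate(assigned_form):
    omitted = set("ABC") - {s for s in forms[form] if s != "X"}
    for s in omitted:
        responses.loc[person, sets[s]] = np.nan

print(responses.round(2))    # every row is complete on X and missing exactly one of A, B, C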


  • Common Set (X)


  • Set A

I start conversations.

I get stressed out easily.

I am always prepared.

I have a rich vocabulary.

I am interested in people.


  • Set B

I am the life of the party.

I get irritated easily.

I like order.

I have excellent ideas.

I have a soft heart.


  • Set C

I am comfortable around people.

I have frequent mood swings.

I pay attention to details.

I have a vivid imagination.

I take time out for others.


Expansions of 3-Form Design

  • (Graham, Taylor, Olchowski, & Cumsille, 2006)


2-Method Planned Missing Design


2-Method Measurement

  • Expensive Measure 1

    • Gold standard: a highly valid (unbiased) measure of the construct under investigation

    • Problem: Measure 1 is time-consuming and/or costly to collect, so it is not feasible to collect it from a large sample

  • Inexpensive Measure 2

    • Practical: inexpensive and/or quick to collect on a large sample

    • Problem: Measure 2 is systematically biased, so it is not ideal on its own


  • e.g., measuring stress

    • Expensive Measure 1 = collect spit samples, measure cortisol

    • Inexpensive Measure 2 = survey querying stressful thoughts

  • e.g., measuring intelligence

    • Expensive Measure 1 = WAIS IQ scale

    • Inexpensive Measure 2 = multiple choice IQ test

  • e.g., measuring smoking

    • Expensive Measure 1 = carbon monoxide measure

    • Inexpensive Measure 2 = self-report

  • e.g., Student Attention

    • Expensive Measure 1 = Classroom observations

    • Inexpensive Measure 2 = Teacher report


  • How it works

    • ALL participants receive Measure 2 (the cheap one)

    • A subset of participants also receive Measure 1 (the gold standard)

    • Using both measures (on a subset of participants) enables us to estimate and remove the bias from the inexpensive measure (for all participants) using a latent variable model


2-Method Planned Missing Design

[Path diagram: a latent Smoking factor with four indicators (Self-Report 1, Self-Report 2, CO, and Cotinine); a Self-Report Bias factor loads on the two self-report indicators.]

2-Method Measurement

  • Example

    • Does child’s level of classroom attention in Grade 1 predict math ability in Grade 3?

    • Attention Measures

      • 1) Direct Classroom Assessment (2 items, N = 60)

      • 2) Teacher Report (2 items, N = 200)

    • Math Ability Measure, 1 item (test score, N = 200)

[Path diagram: a latent Attention (Grade 1) factor predicts Math Score (Grade 3; N = 200). Attention is indicated by Teacher Report 1 and Teacher Report 2 (N = 200) and by Direct Assessment 1 and Direct Assessment 2 (N = 60); a Teacher Bias factor loads on the two teacher reports.]




2-Method Planned Missing Design

  • Assumptions:

    • expensive measure is unbiased (i.e., valid)

    • inexpensive measure is systematically biased

    • both measures access the same construct

  • Goals

    • Optimize cost

    • Optimize power


  • All participants get the inexpensive measure

  • Only a subset get the expensive measure

  • Cost: roughly, total cost = N_total × (cost of the inexpensive measure) + N_expensive × (cost of the expensive measure)
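A small numerical sketch of that trade-off (Python; the budget and per-measure costs are made-up numbers): holding total cost fixed, a larger total sample leaves room for fewer administrations of the expensive measure.

# Hypothetical costs: everyone gets the cheap measure, a subset also gets the
# expensive one, and the total budget is fixed.
budget = 10_000
cost_inexpensive = 10     # per participant
cost_expensive = 100      # per participant

for n_total in range(100, 1001, 100):
    remaining = budget - n_total * cost_inexpensive
    n_expensive = max(0, min(n_total, remaining // cost_expensive))
    print(f"N_total = {n_total:4d}   N_expensive = {n_expensive:4d}")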


  • Holding cost constant, as N_total increases, N_expensive decreases

  • As N_total increases, SEs begin to decrease (power increases); but as N_total continues to increase, SEs increase again, driving power back down


  • Goal: find the sweet spot!


Longitudinal Missing Designs

  • Rather than specific items missing, longitudinal planned missing designs tend to focus on whole waves missing for individual participants

  • Researchers have long turned complete data into planned missing data with more time points

    • e.g., data at 3 grades transformed into 8 ages


Developmental Time-Lag Model

  • Use 2-time point data with variable time-lags to measure a growth trajectory + practice effects (McArdle & Woodcock, 1997)


Each student's age (years;months) at the two test occasions:

 student   Age at T1   Age at T2

    1        5;6         5;7
    2        5;3         5;8
    3        4;9         4;11
    4        4;6         5;0
    5        4;11        5;4
    6        5;7         5;10
    7        5;2         5;3
    8        5;4         5;8

[In the figure these observations were arranged along a common time axis running from 0 to 6.]


[Figure sequence: growth models fit to the time-lag data, each with occasions T0 through T6.]

[Panel: intercept-only model; intercept loadings fixed at 1 for every occasion.]

[Panel: linear growth model; intercept loadings fixed at 1, growth loadings fixed at 0 through 6.]

[Panel: constant practice effect; a practice factor is added with loadings fixed at 0 for T0 and 1 for T1 through T6.]

[Panel: exponential practice decline; the practice loadings are instead fixed at 0, 1, .87, .67, .55, .45, and .35 across T0 through T6.]

The Equations for Each Time Point

[The slide showed the model-implied equation for each occasion under the constant and the declining practice-effect models, i.e., Y(t) = Intercept + (t × Growth) + (p_t × Practice), where p_t is the practice loading given above.]



Developmental Time-Lag model

  • Summary

    • 2 measured time points are formatted according to time-lag

    • This formatting allows a growth-curve to be fit, measuring growth and practice effects


Wave- to Age-based Data

  • The idea of reformatting data to answer a different question is not limited to time-lag designs

  • Wave-based data collection (e.g., data collected at Grade 1-3) can be transformed into age-based data with missingness
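A minimal sketch of that reformatting step (Python with pandas; the variable names, ages, and the half-year bin width are illustrative assumptions). Wave-based scores are reshaped to long format, assigned to age bins, and pivoted back so that each child has observed scores in only a few of the age columns.

import pandas as pd

# wave-based data: one score per grade, plus each child's age (in years) at that wave
waves = pd.DataFrame({
    "id":      [1, 2, 3],
    "score_K": [10, 12, 9],   "age_K": [5.5, 5.2, 4.8],
    "score_1": [14, 15, 13],  "age_1": [6.6, 6.0, 5.9],
    "score_2": [18, 20, 17],  "age_2": [7.3, 7.4, 6.8],
})

long = pd.wide_to_long(waves, stubnames=["score", "age"],
                       i="id", j="wave", sep="_", suffix=r"\w+").reset_index()

# assign each observation to a half-year age bin, then pivot to age-based columns
long["age_bin"] = (long["age"] * 2).astype(int) / 2
age_based = long.pivot(index="id", columns="age_bin", values="score")
print(age_based)   # mostly NaN: each child contributes scores to only three of the bins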


Each child's age (years;months) at the three grade-based waves, and the seven half-year age bins used for the age-based layout:

 student   Age at K   Age at Grade 1   Age at Grade 2

    1        5;6          6;7              7;3
    2        5;3          6;0              7;4
    3        4;9          5;11             6;10
    4        4;6          5;5              6;4
    5        4;11         5;9              6;10
    6        5;7          6;7              7;5
    7        5;2          6;1              7;3
    8        5;4          6;5              7;6

 Age bins: 4;6-4;11, 5;0-5;5, 5;6-5;11, 6;0-6;5, 6;6-6;11, 7;0-7;5, 7;6-7;11



  • Out of 3 waves, we create 7 waves of data with high missingness

  • Allows for more fine-tuned age-specific growth modeling

  • Even high amounts of missing data are not typically a problem for estimation


Monotonic Sample Reduction

  • Advantages:

    • Cost reduction

    • A lot of power to estimate effects at earlier waves

  • Disadvantages:

    • Very little power to estimate effects dependent on the last wave of data, e.g., growth curve models (may be missing 80% of data)

    • It is important to be able to estimate attrition rates before beginning data collection


  • Sometimes used in large datasets (e.g., Early Childhood Longitudinal Study) to reduce costs

  • At each wave, a randomly-selected subgroup of the original sample is observed again

  • The remainder of the original participants do not need to be kept track of, dramatically reducing costs


Growth-Curve Planned Missing

  • With a particular analysis in mind, missingness may be tailored to maximize power

    • In growth-curve designs, the most important parameters are the growth parameters (e.g., estimate the steepness and the shape of the curve)

    • Estimation precision depends heavily on the first and last time points

    • A planned missing design can take advantage of this by putting missingness in the middle
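One way to lay out such a pattern (a minimal Python/NumPy sketch; the number of occasions and the rule of dropping two middle waves per person are arbitrary choices, not the design from the slides): everyone is measured at the first and last occasions, and each participant is randomly missing some of the occasions in between.

import numpy as np

rng = np.random.default_rng(3)
n, waves = 12, 6

observed = np.ones((n, waves), dtype=int)
middle = np.arange(1, waves - 1)          # the middle occasions; first and last always kept

for person in range(n):
    drop = rng.choice(middle, size=2, replace=False)   # two planned-missing middle waves
    observed[person, drop] = 0

print(observed)                 # 1 = measured, 0 = planned missing
print(observed.mean(axis=0))    # coverage per occasion: full at the ends, reduced in the middle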

[Figure-only slides followed: Growth-Curve Design; Growth-Curve Design II; Efficiency of Planned Missing Designs; Combined Elements; The Sequential Designs; Transforming to Accelerated Longitudinal; Transforming to Episodic Time.]

Planned Missing Designs: Summary

  • Purposeful missing data can address several issues in study design

    • Cost of data collection

    • Participant burden/fatigue

    • Practice effects

    • Participant dropout

  • Rearranging data can turn one complete design into a more nuanced missing data design

    • Developmental time-lag designs

    • Wave-missing into age-missing


The Impact of Auxiliary Variables

  • Consider the following Monte Carlo simulation:

    • 60% MAR missing data (missingness related to an auxiliary variable, Aux1)

    • 1,000 samples of N = 100


Excluding A Correlate of Missingness


Simulation Results Showing the Bias Associated with Omitting a Correlate of Missingness.


MNAR improvements


Simulation Results Showing the Bias Reduction Associated with Including Auxiliary Variables in a MNAR Situation.


[Figure: simulation results showing the relative power associated with including auxiliary variables in an MCAR situation; improvement in power is shown relative to the power of a model with no auxiliary variables.]


PCA Auxiliary Variables

  • Use PCA to reduce the dimensionality of the auxiliary variables in a data set.

    • A new, smaller set of auxiliary variables (the principal component scores) is created that contains all of the useful information, both linear and non-linear, in the original set of auxiliary variables.

  • These principal component scores are then used to inform the missing data handling procedure (i.e., FIML, MI).
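A minimal sketch of that reduction step (Python with scikit-learn; the simulated data, the choice of two components, and the assumption that the auxiliaries themselves are complete are all illustrative). Squared terms are appended before extraction so the components can also carry non-linear information.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
aux = rng.normal(size=(300, 8))                   # 8 candidate auxiliary variables (complete)

# append squared terms so the component scores can capture non-linear relations
aux_expanded = np.hstack([aux, aux ** 2])

z = StandardScaler().fit_transform(aux_expanded)
scores = PCA(n_components=2).fit_transform(z)     # 2 principal component auxiliary variables

print(scores.shape)   # (300, 2): these columns are added to the data set before FIML or MI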


The Use of PCA Auxiliary Variables

  • Consider a series of simulations:

    • MCAR, MAR, MNAR (10-60%) missing data

    • 1,000 samples of N = 50-1000

[Figure-only slides followed; their captions are summarized here.]

  • 60% MAR correlation estimates with no auxiliary variables: simulation results showing X-Y correlation estimates (with 95% and 99% confidence intervals) for a 60% MAR situation.

  • Bias under a linear MAR process (ρ(Aux,Y) = .60; 60% MAR).

  • Non-linear missingness, and bias under a non-linear MAR process (ρ(Aux,Y) = .60; 60% non-linear MAR).

  • Bias comparisons (ρ(Aux,Y) = .60; 60% MAR), shown across several panels.

  • 60% MAR correlation estimates with all possible auxiliary variables (r = .60): simulation results with 8 auxiliary variables included.

  • 60% MAR correlation estimates with 1 PCA auxiliary variable (r = .60).

  • Auxiliary variable power comparison: all 8 auxiliary variables vs. 1 auxiliary variable vs. 1 PCA auxiliary variable.

  • Faster and more reliable convergence.

Summary

  • Including principal component auxiliary variables in the imputation model improves parameter estimation

    • relative to having no auxiliary variables, and

    • beyond the improvement from typical auxiliary variables in most cases, particularly with the non-linear MAR type of missingness.

  • PCA auxiliary variables also keep missing data handling practical when the number of potential auxiliary variables is beyond a practical limit.


Simple Significance Testing with MI

  • Generate multiply imputed datasets (m).

  • Calculate a single covariance matrix on all N*m observations.

    • By combining information from all m datasets, this matrix should represent the best estimate of the population associations.

  • Run the Analysis model on this single covariance matrix and use the resulting estimates as the basis for inference and hypothesis testing.

    • The fit function from this approach should be the best basis for making inferences about model fit and significance.

  • Using a Monte Carlo Simulation, we test the hypothesis that this approach is reasonable.
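A minimal sketch of the first two steps (Python with pandas; this is an illustration, not the authors' code): stack the m imputed datasets and compute one covariance matrix and mean vector across all N*m rows, which can then serve as the single input matrix for the analysis model.

import numpy as np
import pandas as pd

def stacked_moments(imputed_list):
    """Covariance matrix and means computed across all rows of the m imputed datasets."""
    stacked = pd.concat(imputed_list, ignore_index=True)   # N*m rows, no missing values
    return stacked.cov(), stacked.mean()

# toy example: m = 3 imputations of an N = 4 dataset with variables x, y, z
rng = np.random.default_rng(5)
imputations = [pd.DataFrame(rng.normal(size=(4, 3)), columns=list("xyz")) for _ in range(3)]
cov, means = stacked_moments(imputations)
print(cov.round(3))   # the analysis model would typically still use the original N, not N*m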


Population Model

[Path diagram: Factor A and Factor B correlate .52. Each factor has 10 indicators (A1-A10 and B1-B10) with fully standardized loadings between .67 and .81 and residual variances between .35 and .55.]

Note: These are fully standardized parameter estimates

RMSEA = .047, CFI = .967, TLI = .962, SRMR = .021

Change in Chi-squared Test: Correlation Matrix Technique


  • On the Merits of Planning and Planning for Missing Data*

  • *You’re a fool for not using planned missing data designs

Thanks for your attention!

Questions?

crmda.KU.edu

Workshop presented 05-21-2012

Max Planck Institute for Human Development, Berlin, Germany




References


Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.

Graham, J. W., Hofer, S. M., & Piccinin, A. M. (1994). Analysis with Missing Data in Drug Prevention Research. In L. M. Collins & L. Seitz (Eds.), National Institute on Drug Abuse Research Monograph Series (pp. 13-62). Washington, DC: National Institute on Drug Abuse.

Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (1996). Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research, 31, 197-218.

Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323−343.

Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in the analysis of change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 335-353). Washington, DC: American Psychological Association.

McArdle, J. J. & Woodcock, R. W. (1997). Expanding test-retest designs to include developmental time-lag components. Psychological Methods, 2, 403-435.

Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the American Statistical Association, 90, 54-63.

Shoemaker, D. M. (1971). Principles and procedures of multiple matrix sampling. Southwest regional library technical report 34.

Wacholder, S., Carroll, R. J., Pee, D., & Gail, M. H. (1994). The partial questionnaire design for case-control studies. Statistics in Medicine, 13, 623-634.



Update

Dr. Todd Little is currently at

Texas Tech University

Director, Institute for Measurement, Methodology, Analysis and Policy (IMMAP)

Director, “Stats Camp”

Professor, Educational Psychology and Leadership

Email: [email protected]

IMMAP (immap.educ.ttu.edu)

Stats Camp (Statscamp.org)

www.Quant.KU.edu

