
Missing Data in Epidemiology: Issues & Approaches

N. Birkett, September 4, 2014. Presented to EPI8166 (PhD seminar course).


Presentation Transcript


  1. Missing Data in Epidemiology: Issues & Approaches N. Birkett, September 4, 2014 Presented to EPI8166 (PhD seminar course)

  2. There are known knowns; • There are things we know we know. • We also know there are known unknowns; • That is to say we know there are some things we do not know. • But there are also unknown unknowns; • The ones we don’t know we don’t know. U.S. Secretary of Defense, Donald H. Rumsfeld Department of Defense news briefing, February 12, 2002

  3. Example (1) • Full data: RR = 1.5 • Now, assume 50% of the females refuse to give you information about their final outcome (decline that question but continue in the study). • Observed data: RR = 1.5

  4. Example (2) • We are missing the outcome status on 50% of the females • Using available data, we find: • Overall estimate of the rate of disease is biased • The RR for risk in females compared to males is OK • Why? • Subjects missing the outcome status are a random subset of all females • Female-specific incidence risk is correct • Prevalence of female sex is lower in study ‘complete cases’ • fails to reflect the 50:50 distribution of sex in the target population • External validity

  5. Example (3) • Full data: RR = 1.5 • Now, assume 50% of the females refuse to give you information about their final outcome, BUT only people who do not get the outcome refuse. • Observed data: RR = 3.0

  6. Example (4) • The chance the outcome data is missing depends on the true status of the outcome • Using available data, we find: • Overall estimate of the rate of disease is biased • The RR for risk in females compared to males is biased • Why? • Female-specific incidence risk is biased • Over-estimated • Prevalence of female sex is lower in study ‘complete cases’ • Fails to reflect the 50:50 distribution of sex in the target population
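
The arithmetic behind Examples (1) to (4) can be sketched with a short calculation. The slide tables did not survive in the transcript, so the counts below (1,000 males with 100 cases, 1,000 females with 150 cases) are hypothetical values chosen only so that the full-data RR is 1.5.

```python
def risk(cases, total):
    """Incidence risk as a simple proportion."""
    return cases / total

# Hypothetical cohort (not the slide's own table): female vs male RR = 1.5
male_cases, male_n = 100, 1000
female_cases, female_n = 150, 1000
rr_full = risk(female_cases, female_n) / risk(male_cases, male_n)
overall_full = risk(male_cases + female_cases, male_n + female_n)

# Examples (1)-(2): a random half of the females have no outcome data
rand_cases, rand_n = female_cases // 2, female_n // 2   # 75 of 500
rr_random = risk(rand_cases, rand_n) / risk(male_cases, male_n)
overall_random = risk(male_cases + rand_cases, male_n + rand_n)

# Examples (3)-(4): half of the females are missing, all of them non-cases
sel_cases, sel_n = female_cases, female_n // 2          # 150 of 500
rr_selective = risk(sel_cases, sel_n) / risk(male_cases, male_n)
overall_selective = risk(male_cases + sel_cases, male_n + sel_n)

print(f"Full data:         RR = {rr_full:.2f}, overall risk = {overall_full:.3f}")
print(f"Random missing:    RR = {rr_random:.2f}, overall risk = {overall_random:.3f}")
print(f"Selective missing: RR = {rr_selective:.2f}, overall risk = {overall_selective:.3f}")
```

With these assumed counts the RR stays at 1.5 when the missing females are a random subset, while the overall risk shifts because females are under-represented. When only non-cases are missing, the female risk doubles and so does the RR.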

  7. Why missing data matters (1) • All studies have missing data • People drop out of studies • People decline one of several questionnaires • People decline to complete certain questions (e.g. income) • People miss questions (pages get stuck together) • Lab tests fail • biological levels are ‘below threshold of detection’ • Missing data is usually not the focus of a study • In many cases, missing data is just ignored

  8. Why missing data matters (2) • Failing to adjust properly for missing data can cause serious problems. • Introduce potential bias in parameter estimation • Weaken the generalizability of the results • Ignoring cases with missing data leads to the loss of information • Decreases statistical power • Increases standard errors • Failing to adjust data properly for missing values can make the data unsuitable for a statistical procedure • Can also make the statistical analyses vulnerable to violations of assumptions

  9. Levels of missing data • Data can be missing at two ‘levels’ • Unit-level non-response • A subject included in the study declines to take part and provides no information at all. • Serious issue in much research • Mainly affects external generalizability • Not the focus of further discussions • Item-level non-response • Subject participates in the study • Fails to provide information for some items • Applies a skip sequence wrongly • Two pages get stuck together

  10. Types of missing data patterns (1) • Three patterns are generally recognized: • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Missing Not at Random (MNAR or NMAR)

  11. Types of missing data patterns (2) • Missing Completely at Random (MCAR) • The probability of a data value being missing is independent of all observed and non-observed data. • Missing data is a random sample of all data • Observed data is an unbiased estimator of the results from total data • Complete-case (listwise deletion) methods work fine • Can check for departures from MCAR by comparing the observed variables in cases with and without missing data • Example • Biosamples collected for genotyping • Some results are missing because the instrument failed for one batch of samples

  12. Types of missing data patterns (3) • Missing at Random (MAR) • The probability of a data value being missing is related to observed data but not to non-observed data. • Can be analyzed using Multiple Imputation methods or likelihood-based methods • Example • Looking at prognostic value of SNPs for sub-types of breast cancer • Eligible subjects with advanced stage breast cancer (III/IV) were more likely to be missing SNP information • Subjects with advanced disease are less cooperative with the study. • Conditional on disease stage, the probability of missing the SNP is unrelated to the value of the SNP.

  13. Types of missing data patterns (4) • Missing Not at Random (MNAR or NMAR) • The probability of a data value being missing is related to the unobserved values. • e.g. high values are more likely to be missing than low values • Can be analyzed using Multiple Imputation methods or likelihood-based methods • much more complex to use • requires modeling the process yielding the missing values • Example • Looking at a study which requires measurement of tumor size. • Smaller tumors are less likely to have size recorded • Harder to measure size of small tumors • Requires more complex methods (e.g. MRI or PET scanning). • Probability of size being missing relates to the size of the tumor
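
The three mechanisms can be contrasted with a small simulation. This is only a sketch under assumed settings: the variables `x` and `y`, the missingness probabilities (0.3, 0.6, 0.1) and the sample size are all hypothetical choices, not values from the slides. The true mean of `y` is 0, and the code compares it with the mean among subjects whose `y` is observed.

```python
import random
from statistics import mean

random.seed(2014)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]   # always observed
y = [xi + random.gauss(0, 1) for xi in x]    # subject to missingness, true mean 0

def observed_mean(p_missing):
    """Mean of y among subjects whose y is observed.

    p_missing(xi, yi) returns the probability that yi is missing.
    """
    kept = [yi for xi, yi in zip(x, y) if random.random() >= p_missing(xi, yi)]
    return mean(kept)

# MCAR: missingness ignores all data
mcar = observed_mean(lambda xi, yi: 0.3)
# MAR: missingness depends only on the observed x
mar = observed_mean(lambda xi, yi: 0.6 if xi > 0 else 0.1)
# MNAR: missingness depends on the unobserved y itself
mnar = observed_mean(lambda xi, yi: 0.6 if yi > 0 else 0.1)

print(f"MCAR: {mcar:.3f}  MAR: {mar:.3f}  MNAR: {mnar:.3f}")
```

In this setup the complete-case mean is close to 0 under MCAR and clearly negative under MAR and MNAR. Under MAR the bias could be removed by using `x` (for example by imputing `y` from `x`); under MNAR that information is not enough.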

  14. Another classification of Patterns • Univariate missing data • Data are missing on only one variable in the analysis set • Monotonic missing data • You can rearrange the data so the following is true: • If a subject is missing data on variable ‘i’, then they are missing data on all variables after that • Longitudinal study with drop-outs. • Arbitrary missing data • Doesn’t meet the above conditions.
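
Whether a data set has a monotonic pattern can be checked mechanically. The sketch below assumes a list-of-rows layout with `None` for missing values; the function name `is_monotone` and the two example data sets are hypothetical. It orders the variables from fewest to most missing values and then verifies that, within each subject, no observed value follows a missing one.

```python
def is_monotone(rows):
    """Return True if the columns can be reordered so that, for every row,
    a missing value (None) is followed only by missing values."""
    n_cols = len(rows[0])
    # If a monotone ordering exists, sorting columns by missing count finds it
    order = sorted(range(n_cols), key=lambda j: sum(r[j] is None for r in rows))
    for r in rows:
        seen_missing = False
        for j in order:
            if r[j] is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value after a missing one
    return True

# Longitudinal drop-out: once a visit is missed, all later visits are missed
dropout = [[1, 2, 3], [1, 2, None], [1, None, None]]
# Arbitrary pattern: each subject misses a different single variable
scattered = [[1, None, 3], [None, 2, 3], [1, 2, None]]

print(is_monotone(dropout), is_monotone(scattered))
```

Univariate missing data passes this check as well, since it is a special case of a monotonic pattern.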

  15. Ignorability • And now, some confusing terminology • Rubin introduced the term ‘ignorability’ • If data is MCAR or MAR, then the mechanism which produces the missing data is not important and can be ignored in analysis. • He called this ‘Ignorability’ • This does not mean that the missing data can be ignored!

  16. Missing data in the literature (1) • Peng et al (2006) • Education & psychology journals • 36% had no missing data • 48% had missing data • 16% were unclear • 97% used listwise deletion or pairwise deletion methods.

  17. Missing data in the literature (2) • Klebanoff & Cole (2008) • Looked at the use of multiple imputation methods • 2 years of articles from Amer J Epidem, Annals Epi, Epidemiology & Int J Epidem • 1,105 original research articles • 16 papers (1.4%) used one of • Multiple Imputation (n=12) • Inverse probability weighting • Expectation-maximization algorithm • 99 papers contained the text ‘imput’

  18. Missing data in the literature (3) • Desai et al (2011) • Focused on molecular epidemiology studies in Cancer Epidemiology, Biomarkers and Prevention • 15 month period (2009-2010) • 278 eligible articles • 95% either had missing data or excluded cases with missing data • Only 23 papers (13%) used missing data methods for analysis • 9 dealt with ‘assays below detection limit’ • Single imputation • 7 used ‘missing data indicators’ • 26 (14%) reported differences between subjects with and without missing data.

  19. Methods to handle missing data (1) • Need to decide on a model for missing data • MCAR • MAR • MNAR • If MNAR, how is the data related to the unobserved value? • Set a statistical model for the full data • Commonly assumed to be multivariate normal • Limiting, especially for categorical data • Some other form

  20. Methods to handle missing data (2) • Complete Case (Listwise deletion) • Pairwise deletion (e.g. PROC CORR) • Corrected complete case method • Weighted regression model with complete cases • Weights related to inverse of probability that a case is complete • Fill the contingency table • Allocate subjects with missing values of a row/column to cells in proportion to the complete cases. • Replacement with the frequency or mean of complete cases • For categorical variables, create multiple variables (one per level) • Impute the percent of the group at each level • Indicator variable for missing data

  21. Methods to handle missing data (3) • Simple/Single imputation • Multiple imputation • Full MLE methods • SAS can use FIML (Full Information Maximum Likelihood) • Assumes multivariate normality and MAR • Linked to Structural Equation Modeling (PROC CALIS) • Reweighting estimation equations • Used in complex survey studies • Sample weights are adjusted to reflect missing data patterns.

  22. Complete Case (listwise deletion) • Subjects missing any values for any variable included in analysis or model are excluded. • Most commonly used method (‘the default’) • Usually used without any thought to missing data patterns, etc. • Acceptable if data is MCAR • Leads to loss of sample size and reduced power/precision • Often produces reasonable results • especially if amount of missing data is small • Can be strongly biased if data is MAR • Methodological results from • multiple papers and • theory

  23. Pairwise deletion • Similar to listwise (casewise) deletion, BUT only subjects with missing data for variables involved in the specific analysis are subject to exclusion. • Consider a case where x1 is missing some data but x2 and x3 are complete. Suppose the analysis looks at these two models: • Model 1: Y = B2 * x2 + B3 * x3 • Model 2: Y = B1 * x1 + B2 * x2 + B3 * x3 • In the complete case method, subjects missing x1 will be excluded for both models. • Pairwise deletion: • All subjects would be used in model 1; • Some cases would be excluded in model 2. • Leads to different sub-sets being used for different analyses • Complicates interpretation. • PROC CORR in SAS uses this approach
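
The difference in analytic sample size between the two deletion rules can be sketched as follows. The five records and the helper `usable` are hypothetical; only x1 has missing values, as in the slide's scenario.

```python
# Hypothetical data: x1 is missing for two subjects, x2 and x3 are complete
records = [
    {"y": 3.1, "x1": 1.0,  "x2": 0.5, "x3": 2.0},
    {"y": 2.4, "x1": None, "x2": 0.7, "x3": 1.5},
    {"y": 4.0, "x1": 1.8,  "x2": 0.2, "x3": 2.6},
    {"y": 3.3, "x1": None, "x2": 0.9, "x3": 1.9},
    {"y": 2.9, "x1": 0.6,  "x2": 0.4, "x3": 2.2},
]

def usable(rows, variables):
    """Rows with no missing value on any of the named variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

all_vars = ["y", "x1", "x2", "x3"]

# Complete case: drop anyone missing any variable used anywhere in the analysis
model1_complete_case = len(usable(records, all_vars))
model2_complete_case = len(usable(records, all_vars))

# Pairwise: drop only those missing a variable used in the model at hand
model1_pairwise = len(usable(records, ["y", "x2", "x3"]))
model2_pairwise = len(usable(records, all_vars))

print(model1_complete_case, model2_complete_case, model1_pairwise, model2_pairwise)
```

Here the complete-case rule fits both models on the same 3 subjects, while the pairwise rule fits the first model on all 5 and the second on 3, which is the source of the differing sub-sets noted in the slide.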

  24. Corrected Complete Case Method • Subjects missing any values for any variable included in analysis or model are excluded. • Regression models use weighted regression. • Weights are computed to reflect the inverse of the probability that a subject will have complete data. • Works OK if data is MAR but can be seriously biased if that assumption does not hold. • Figuring out the weights is difficult • Finding SE’s can be difficult • Results from Vach et al, 1991
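
The weighting idea can be shown in its simplest form, a weighted mean rather than a weighted regression. The sketch assumes a single fully observed stratum variable (disease stage, with hypothetical labels "early" and "late") and estimates the probability of being complete as the observed fraction of complete cases in each stratum. The data are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (stratum, y) pairs; y is None when missing
data = [("early", 0), ("early", 0), ("early", 0), ("early", 1),
        ("late", 1), ("late", 1), ("late", None), ("late", None)]

# Estimate P(complete | stratum) from the observed completion fractions
counts = defaultdict(lambda: [0, 0])   # stratum -> [n complete, n total]
for stratum, y in data:
    counts[stratum][1] += 1
    if y is not None:
        counts[stratum][0] += 1
p_complete = {s: c / t for s, (c, t) in counts.items()}

complete = [(s, y) for s, y in data if y is not None]

# Unweighted complete-case mean
cc_mean = sum(y for _, y in complete) / len(complete)

# Each complete case is weighted by 1 / P(complete | stratum)
weights = [1 / p_complete[s] for s, _ in complete]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, complete)) / sum(weights)

print(f"complete-case mean = {cc_mean:.3f}, weighted mean = {ipw_mean:.3f}")
```

Each "late" complete case stands in for two subjects, so the weighted mean restores the 50:50 stage mix of the full sample. This only removes bias if missingness depends on stage alone, which is the MAR assumption.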

  25. Fill the Contingency Table • Under MAR, the distribution of subjects with missing data across the 4 cells in a contingency table is the same as the distribution of the complete cases. • Modify the Contingency table by allocating ‘counts’ of missing subjects to the table • Similar to the ‘corrected complete case’ method. • Leads to non-integer counts in the cells • Computing variance is tricky because standard formulae don’t work • Logistic Regression needs integer counts in the tables. • Results from Vach et al, 1991
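
A minimal sketch of the allocation step, assuming a 2x2 table of disease status by exposure in which exposure is missing for some subjects whose disease status is known. The counts and the function name `fill_table` are hypothetical.

```python
def fill_table(complete, missing_by_row):
    """Spread each row's subjects with missing column values across that
    row's cells in proportion to the complete-case counts."""
    filled = {}
    for row, cells in complete.items():
        row_total = sum(cells.values())
        extra = missing_by_row.get(row, 0)
        filled[row] = {col: n + extra * n / row_total for col, n in cells.items()}
    return filled

# Hypothetical complete-case counts
complete = {
    "case":    {"exposed": 40, "unexposed": 60},
    "control": {"exposed": 20, "unexposed": 80},
}
# Subjects with known disease status but missing exposure
missing_by_row = {"case": 7, "control": 15}

filled = fill_table(complete, missing_by_row)
for row, cells in filled.items():
    print(row, {col: round(n, 1) for col, n in cells.items()})
```

The filled cells for the cases are non-integer (about 42.8 and 64.2), which illustrates why standard variance formulae and count-based software do not apply directly.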

  26. Replacement with the frequency or mean of complete cases • Really a type of single imputation • For each subject with missing data, replace the missing value by the mean of the complete cases • For categorical data, define indicator variables • 0/1 if there is valid data • If data is missing, use the proportion of the complete cases with that level of the variable. • Leads to indicator variables which have non-integer components. • Strongly biased method, even with MAR • more biased than Complete Case method • Henry et al, 2013

  27. Indicator variable for missing data • Treat ‘missing values’ as if they are a valid response to the questionnaire • Assign them a code value • Example (Do you drink alcohol?): • Yes: 1 • No: 2 • Missing: 3 • Analysis is done using three levels • 2 dummy variables • This is a very bad method which is strongly biased.

  28. Indicator variable for missing data • Commonly used and commonly taught in epidemiology courses. • Studied by multiple authors (Vach, Greenland) • Very strongly biased in every study, including theoretical analyses • Consider two situations: • Variable is the main effect of interest:

  29. Full Population data • OR = 5.44 • Now, assume 30% of data is missing, MCAR. • Define the ‘missing data’ indicator variable • What is OR of Exp +ve to Exp –ve? • It is still 5.44

  30. Confounding example (1) • So, we gain nothing by defining the missing category. • But, suppose the missing data is in a confounder. • Here is the population data. Crude table is as before (OR=5.44): • Level 1: OR = 9.0 • Level 2: OR = 9.0 • Adjusted OR would be 9.0 → strong confounding

  31. Confounding example (2) • Now, 30% of the data on the confounder is missing. We create the missing value indicator level. • Means we now have three 2x2 tables for our confounding analysis. • Level 1: OR = 9.0 • Level 2: OR = 9.0 • Level 3 (Missing): OR = 5.44

  32. Confounding example (3) • When there is no missing data, the OR’s are as follows. Clearly, there is confounding with the adjusted OR being 9.0 • When we have the missing indicator in the data, the adjusted OR is not 9.0 but around 8. Very strongly biased.
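
The bias from the missing-indicator level can be reproduced numerically. The 2x2 tables from the slides are not in the transcript, so the counts below are hypothetical, chosen because they give the same stratum ORs (9.0) and crude OR (5.44). The Mantel-Haenszel estimator is used here as one common way to compute an adjusted OR; the slides do not say which method was used.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases/non-cases; c, d = unexposed."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def scale(table, fraction):
    """Expected counts when a random fraction of the table is retained."""
    return tuple(fraction * n for n in table)

# Hypothetical strata of the confounder (each has OR = 9.0)
level1 = (90, 10, 50, 50)
level2 = (50, 50, 10, 90)
crude = tuple(x + y for x, y in zip(level1, level2))   # (140, 60, 60, 140)

crude_or = odds_ratio(*crude)
adjusted_full = mantel_haenszel_or([level1, level2])

# 30% of confounder values missing (MCAR): the two known levels keep 70% of
# their subjects and the 'missing' level is a 30% sample of the crude table
with_indicator = [scale(level1, 0.7), scale(level2, 0.7), scale(crude, 0.3)]
adjusted_indicator = mantel_haenszel_or(with_indicator)

print(f"crude OR = {crude_or:.2f}")
print(f"adjusted OR, no missing data = {adjusted_full:.2f}")
print(f"adjusted OR, missing indicator = {adjusted_indicator:.2f}")
```

With these assumed counts the adjusted OR falls from 9.0 to roughly 7.5, pulled toward the crude value by the unadjusted ‘missing’ stratum. That is in line with the slide's figure of around 8, even though the data are MCAR.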

  33. Indicator Variable for Missing Data • This method has no role in handling missing data • Is strongly biased, even with MCAR data. • One core requirement for any method to address missing data is that it gives the ‘right’ answer for MCAR data.

  34. Single Imputation (1) • Replace a missing value with an estimate of what the value should have been • Various methods are possible • Overall mean • Group-specific mean • Last observation carried forward (in follow-up studies) • An extreme value (e.g. missing = heavy alcohol use) • Regression modeling • Works best with monotonic missing data. • To impute Yj, regress Yj on Y1 to Yj-1 for all subjects with valid data for Yj • This gives a group of Betas with SE’s. • Select a value of each beta at random from the distributions. • For single imputation, you often use the actual estimated Beta values • Use the regression equation to estimate the mean value of Yj for subjects with missing data. • Hot-deck imputation • MCMC methods

  35. Single Imputation (2) • Hard to generate valid variance estimates • Greenland found regression-based single imputation to be subject to serious errors in the face of mis-specified models.
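
One reason variance estimates are hard to get right can be seen with the simplest single-imputation method, overall mean imputation. The six values below are hypothetical; the point is only that filling gaps with the mean adds observations without adding any spread.

```python
from statistics import mean, variance

# Hypothetical measurements; None marks a missing value
observed = [2.0, 4.0, None, 8.0, None, 12.0]

seen = [v for v in observed if v is not None]
fill = mean(seen)                                   # overall mean of observed values
imputed = [fill if v is None else v for v in observed]

var_seen = variance(seen)        # sample variance of the observed values
var_imputed = variance(imputed)  # sample variance after mean imputation

print(f"mean: {mean(seen):.2f} -> {mean(imputed):.2f}")
print(f"variance: {var_seen:.2f} -> {var_imputed:.2f}")
```

The mean is unchanged but the variance shrinks, because the imputed values sit exactly at the centre and are then treated as if they were real data. Standard errors computed from the filled-in data are therefore too small.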

  36. Multiple Imputation (1) • MI handles missing data in three steps: • Impute missing data ‘m’ times to produce ‘m’ complete data sets; • Analyze each data set using a standard statistical procedure; • Combine the ‘m’ results into one using formulae from Rubin (1987) or Schafer (1997). • Most MI methods assume • MAR • Multivariate normality • If the assumptions are met, and if these three steps are done correctly, multiple imputation produces estimates that have nearly optimal statistical properties. They are: • Consistent (and, hence, approximately unbiased in large samples), • Asymptotically efficient (almost), and • Asymptotically normal.
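
The combining step can be sketched with Rubin's rules for a single parameter: the pooled estimate is the average of the ‘m’ estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the five estimates and variances below are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors
    using Rubin's rules. Returns (pooled estimate, total variance)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed data sets
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, SE = {total_var ** 0.5:.3f}")
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty due to not knowing the missing values.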

  37. Multiple Imputation (2) • One common method uses regression models in step #1 • Three kinds of variables are included in an imputation regression model: • Variables that are of theoretical interest, • Variables that are associated with the missing mechanism, & • Variables that are correlated with the variables with missing data. • Consider adding interaction terms for continuous variables. • Bayesian ideas can be used in step #1 • Regression based • Set a prior distribution for the regression parameters and error term • Fit model to generate posterior distribution • Select at random from posterior distribution to generate several imputation equations

  38. Multiple Imputation (2) • Bayesian ideas can be used in step #1 • MCMC (Markov Chain Monte Carlo) • Divide sample into subsets with the same missing data for variables • e.g. Group #1: missing x1 & x2 Group #2: missing x1, x3 & x4 • Fit regression models within each pattern of missingness • Impute using these models • Uses full data set to update means, variances and covariances • Make a random selection from the posterior distribution of these parameters • Update the regression models • Repeat • FCS (Fully Conditional Specification) • Similar to above but handles categorical data better • No strong theoretical justification

  39. Multiple Imputation (3) • Most MI models assume variables are multivariate normal • Issues arise with categorical variables • Can treat as continuous and then round to generate a suitable categorical value • Round based on the normal approximation to the binomial distribution • Most studies find MI methods to be the most valid of missing variable methods • Some issues/questions • How many replicates (multiples) to include? • What variables to include in model? • How to handle non-normal variables, including categorical variables? • Software limitations

  40. Full MLE methods (1) • Suppose we have a data set and we want to fit a regression model (could be linear, logistic, etc.) • With no missing data, we use Maximum Likelihood methods • n observations on k variables: $y_{i1}, \dots, y_{ik}$ for $i = 1, \dots, n$ • Based on regression model assumptions, the likelihood of the data can be given as: $L = \prod_{i=1}^{n} f(y_{i1}, \dots, y_{ik}; \theta)$ • θ is the set of parameters to estimate • We find the values of θ to make ‘L’ as big as possible

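  41. Full MLE methods (2) • What if we have some missing data? • Suppose y1 & y2 have missing data which is MAR • For a subject with missing values, we can not generate the likelihood contribution since we don’t know y1 & y2 • Instead, consider all possible values which they might have, combined with the probability of those values. • Add up the likelihood contribution for every possible value: $f^{*}(y_{i3}, \dots, y_{ik}; \theta) = \sum_{y_{i1}} \sum_{y_{i2}} f(y_{i1}, y_{i2}, y_{i3}, \dots, y_{ik}; \theta)$, where f is the joint probability (density) of subject i's data and integrals replace the sums for continuous variables • Substitute this into the MLE equation and estimate ‘θ’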

  42. Full MLE methods (3) • FIML is one way to do this in SAS • Part of PROC CALIS • Assumes multivariate normality • MPlus • Software which can handle non-linear models • Can use various regression models • Logistic • Poisson • Tobit • Cox • Etc.

  43. Reweighting estimation equations • Discussed by Henry et al (2013) • Applies to complex surveys • Differential probability of selection from target population • Analysis requires ‘weights’ to adjust for this • Standard weights are proportional to the inverse of the probability of selection • With missing data, complete case analysis leads to different subsets for each set of variables • weights are incorrect • Adjust each weight to account for probability of being a complete case • Do analysis using new weights and complete cases only • Henry shows it produces very good estimates • Limited area of application

  44. Summary • Missing data can be very important • More than 5-10% of data missing is considered a potential source of serious bias • Need to consider the model which produces the missing data • ‘ad hoc’ methods are poor and should not be used • Multiple Imputation or Full MLE methods give excellent results in most situations • If missing data is MNAR, need to consider the model which gives rise to the missing data • If missingness is strongly related to value of variable, problem is complex

  45. One suggested approach (1) § • Describe target population • Clearly describe derivation of analytic data set • Describe population characteristics of analytic data set, including missing values • Describe differences in population characteristics for subjects with valid and missing data for key variables §adapted from Desai et al Cancer Epidemiol Biomarkers Prev; 20(8), 2011

  46. One suggested approach (2) • Investigate possible assumptions for missing data • Assume MCAR if • no data to suggest it is violated & • no mechanism to generate MNAR • Assume MAR if • MCAR is not acceptable, • no mechanism to generate MNAR & • candidate ancillary variables exist • Assume MNAR if • a priori knowledge exists that missing data are related to unknown values • Conduct a CC analysis

  47. One suggested approach (3) • Choose an additional analysis as appropriate • For MAR, • Use Multiple Imputation with suitable ancillary variables • For MNAR, • Use Multiple Imputation, • Need to model the method which generated the missing data. • If a variable is limited by sensitivity of a lab detection device, • Use a likelihood-based method • Implement the additional analysis • Include all potential ancillary variables • Use SAS if you can postulate a joint distribution for ancillary variables • Use STATA or R (fully conditional method).

  48. One suggested approach (4) • Perform sensitivity analyses • Do both CC & MI • Use different subsets of ancillary variables for MI • Use different models for MNAR missing generation • Interpret the results • If all analyses give same results, this is easy • If they differ, need to present a more complex result in the paper.
