
Deirdre Hennessy and Claude Nadeau STAR webinar September 30th 2011




Presentation Transcript


  1. An assessment of methods to impute risk exposure into model actor’s risk profile for microsimulation. Deirdre Hennessy and Claude Nadeau STAR webinar September 30th 2011

  2. Missing data in general
  • ……a common and very frustrating problem in survey research!
  • Variables such as income and self-reported body mass index (BMI) might be regarded as sensitive and are prone to non-response.
  • Types of missing data and why they occur:
  • Missing completely at random (MCAR) – admin/collection errors
  • Missing at random (MAR) – missingness related to known respondent characteristics but not to the value of the predictor
  • Missing not at random (MNAR) or informative missing – missingness related to the value of the predictor
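The MCAR/MAR distinction above can be illustrated with a small simulation. This is a minimal sketch with entirely synthetic data; the age/BMI relationship and the missingness rates are invented for illustration only:

```python
import random

random.seed(0)

# Hypothetical population: (age, BMI) pairs where BMI rises with age on average.
population = [(age, 20 + 0.1 * age + random.gauss(0, 2))
              for age in range(20, 80) for _ in range(50)]

# MCAR: every BMI value has the same 30% chance of being missing.
mcar = [(age, None if random.random() < 0.3 else bmi) for age, bmi in population]

# MAR: missingness depends only on the observed age, not on BMI itself --
# older respondents are far more likely to skip the question.
mar = [(age, None if random.random() < (0.5 if age > 50 else 0.1) else bmi)
       for age, bmi in population]

def observed_mean(data):
    vals = [bmi for _, bmi in data if bmi is not None]
    return sum(vals) / len(vals)

true_mean = sum(bmi for _, bmi in population) / len(population)

# Complete-case analysis is roughly unbiased under MCAR but biased downward
# under MAR here, because high-BMI (older) respondents drop out more often.
print(round(true_mean, 2), round(observed_mean(mcar), 2), round(observed_mean(mar), 2))
```

Under MCAR the complete-case mean stays close to the true mean; under MAR it is pulled down, which is exactly the non-response bias imputation tries to repair.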

  3. Missing data in general
  • Why impute?
  • Reduce non-response bias
  • Maintain sample size and statistical efficiency
  • The aim: to produce an approximately unbiased and efficient estimator by choosing the appropriate imputation method, which should be…
  • Robust under misspecification
  • Chosen with the type of analysis to be conducted/the purpose of the imputed data in mind
  • Practically appropriate…computing time, availability of variance estimation formulae, etc.

  4. Missing data in general
  • Potential solutions:
  • Very simple solutions – complete case (respondent) analysis: easy to implement and understand, but only valid under limited conditions; can result in estimates that are biased/imprecise; limited applicability.
  • Less simple solutions – regression methods: still easy to implement and understand. Regression, linear or logistic, can be used to model numeric or categorical variables and can make use of many auxiliary variables. However, it may distort the distribution of the predictor variable and inflate the association between the predictor and other variables. In addition, the imputed values are predicted, not actually observed in another data source. This is a parametric approach and may be sensitive to misspecification of the regression model.

  5. Missing data in general
  • Less simple solutions – "hot-deck" methods: still easy to implement and understand. These methods assign the value from a record with observed data ("donor" data) to a record with missing data. This approach is suitable for dealing with categorical data. This method is non-/semi-parametric, making few distributional assumptions. However, to work well a reasonably large sample size is required.
  • "Modern" imputation methods – multiple imputation: a newer method which involves imputing missing data using an appropriate imputation model that incorporates random imputation; repeat many times (3-10), carry out the analysis of interest in each of the resulting datasets and combine the estimates using prescribed rules.
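The combining step of proper multiple imputation follows Rubin's rules: average the m point estimates, and add the within-imputation variance to the between-imputation variance (inflated by 1 + 1/m). A minimal sketch; the estimates and variances below are invented numbers purely for illustration:

```python
# Rubin's combining rules for m completed (imputed) datasets.
m = 5
estimates = [2.1, 2.4, 1.9, 2.2, 2.0]       # point estimate from each dataset
variances = [0.10, 0.12, 0.09, 0.11, 0.10]  # within-imputation variance of each

q_bar = sum(estimates) / m           # pooled point estimate
w = sum(variances) / m               # average within-imputation variance
b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
total_var = w + (1 + 1 / m) * b      # Rubin's total variance

print(round(q_bar, 3), round(total_var, 4))  # 2.12 0.1484
```

The between-imputation term is what a single imputation cannot capture: it reflects the extra uncertainty due to the missing data themselves.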

  6. Imputation: Probability theory insight
  Let X and Y be vectors of random variables
  Z = (X, Y) ~ f(z) = f(x, y)
  Lose/misplace Y
  Generate Y* ~ f(y|x) = f(x, y)/∫f(x, y)dy
  Z* = (X, Y*) ~ f(z)  (Z and Z* have the same distribution)
  Let Y** = E[Y|X]
  Z** = (X, Y**)  (Z and Z** have different distributions)

  7. Imputation: Probability theory insight
  Let Y be a random variable and X a vector of r.v.
  Z = (X, Y) ~ f(z) = f(x, y)
  Hide Y
  Electric shock = (guess − Y)²
  Two guesses:
  Generate Y* ~ f(y|x) = f(x, y)/∫f(x, y)dy
  Y** = E[Y|X]
  E[shock*] = 2 E[shock**]
  Better off with Y** = E[Y|X]
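The factor of two can be checked by simulation. A minimal sketch, assuming a toy joint model Y = X + noise, so that E[Y|X] = X and Var(Y|X) = 1:

```python
import random

random.seed(2)

# Toy joint model: Y = X + N(0,1) noise, so E[Y|X] = X and Var(Y|X) = 1.
n = 100_000
shock_star = 0.0   # squared error from guessing a random draw Y* ~ f(y|x)
shock_dstar = 0.0  # squared error from guessing the conditional mean Y** = E[Y|X]
for _ in range(n):
    x = random.gauss(0, 1)
    y = x + random.gauss(0, 1)       # the hidden value
    y_star = x + random.gauss(0, 1)  # Y*: an independent draw from f(y|x)
    y_dstar = x                      # Y**: E[Y|X]
    shock_star += (y_star - y) ** 2
    shock_dstar += (y_dstar - y) ** 2

# The averages come out near 2 and 1: E[shock*] = 2 E[shock**].
print(round(shock_star / n, 2), round(shock_dstar / n, 2))
```

So for point prediction of a single value the conditional mean wins, yet Y* is the guess whose joint distribution with X matches the real data; which one to impute depends on the purpose, exactly the theme of this talk.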

  8. Missing data in microsimulation
  • ……a slightly different proposition!
  • Data for microsimulation is assembled from multiple sources and requires imputation of both missing items and missing variables (i.e. important variables that can be gleaned only from "donor" data).
  • In reality, assembling a database for microsimulation modelling is one big imputation!
  • Microsimulation modellers are very practical people!

  9. Patchwork quilt of microsimulation data sources Disclaimer: Not my work!

  10. Missing data in microsimulation
  • Microsimulation modellers are very practical people!
  • ……so they have used a variety of approaches to impute missing variables.
  • The approach depends on the data sources available
  • How the resulting imputed variables will be used in the microsimulation model/the purpose of the imputed data
  • How important the resulting imputed variable is: is it a main outcome or exposure of interest?
  • Imputation for microsimulation is NOT standardized, and it is unclear which approach produces the best results; however, the standard imputation approaches described above can be used and assessed for best results!

  11. Population health model (POHEM)
  • Developed at Statistics Canada, POHEM is an example of a microsimulation tool that has been used to inform health policy issues such as chronic disease screening and treatment.
  • Has been applied to cancer, osteoarthritis (OA) and cardiovascular disease (CVD/acute myocardial infarction (AMI))…..with plans to develop models for diabetes and stroke.
  • POHEM integrates data distributions and equations derived from a wide range of sources, including nationally representative cross-sectional and longitudinal surveys, cancer registries, hospitalization databases, vital statistics, the Census, as well as parameters in the published literature – a very complicated patchwork quilt!

  12. POHEM……more technically
  • Starts with a cross-sectional sample of the Canadian adult population (CCHS 1.1) and generates individual life histories by simulating various types of events (e.g., births, deaths, migration, changes in risk factors, disease onset and progression, treatments, changes in quality of life).
  • It is a case-by-case, longitudinal, continuous-time, stochastic, Monte Carlo microsimulation.
  • It directly encompasses competing risks and comorbidity.
  • Has disease-specific sub-modules (OA and AMI) and incorporates models of risk factor and disease development.
  • It generates plausible health biographies over the life course of synthetic individuals from empirical observations.

  13. Why do we impute BP and cholesterol?
  • Not available by self-report!
  • These are core risk factors for CVD development and progression.
  • Calculations to determine AMI incidence in POHEM use the Framingham risk function, derived from the famous Framingham Heart Study, a long-term follow-up study of CVD risk factors (including physical and laboratory measures) started in 1948.

  14. Current imputation of blood pressure and cholesterol
  • Uses an old data source, the Canadian Heart Health Study (CHHS), collected 1986-1992 on a provincial basis.
  • Data not exactly comparable to CCHS 2.2 in terms of other data elements collected, geographic coverage, etc.
  • Specifically, using variables common to the CCHS and CHHS, individuals' BP and cholesterol categories were imputed using "hot-deck" methods. In other words, individuals in the CCHS were matched to those in the CHHS based on 5-year age group, sex, BMI category and diabetes status, and were assigned the corresponding BP and total cholesterol categories available in the CHHS.
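The cell-matching step of this kind of hot-deck can be sketched as follows. All records and category labels below are hypothetical; only the matching keys (5-year age group, sex, BMI category, diabetes status) come from the description above:

```python
import random

random.seed(1)

# Hypothetical CHHS-style donor records with an observed BP category.
donors = [
    {"age_grp": "45-49", "sex": "F", "bmi_cat": "overweight", "diabetes": False, "bp_cat": "normal"},
    {"age_grp": "45-49", "sex": "F", "bmi_cat": "overweight", "diabetes": False, "bp_cat": "high"},
    {"age_grp": "65-69", "sex": "M", "bmi_cat": "obese", "diabetes": True, "bp_cat": "high"},
]

# Hypothetical CCHS-style recipient records missing the BP category.
recipients = [
    {"age_grp": "45-49", "sex": "F", "bmi_cat": "overweight", "diabetes": False, "bp_cat": None},
    {"age_grp": "65-69", "sex": "M", "bmi_cat": "obese", "diabetes": True, "bp_cat": None},
]

MATCH_KEYS = ("age_grp", "sex", "bmi_cat", "diabetes")

def hot_deck(recipient, donors):
    """Pick a random donor from the same matching cell and copy its observed value."""
    cell = [d for d in donors if all(d[k] == recipient[k] for k in MATCH_KEYS)]
    return random.choice(cell)["bp_cat"] if cell else None

for r in recipients:
    r["bp_cat"] = hot_deck(r, donors)

print([r["bp_cat"] for r in recipients])
```

Note how the method needs a donor in every cell: a recipient whose cell is empty gets nothing, which is one reason hot-deck wants a reasonably large donor sample.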

  15. New model--POHEM nutrition and health outcomes
  • In preparation for constructing a model of nutrition and health outcomes, we revisited the imputation of BP and cholesterol.
  • Why?……to incorporate nutrition and food intake into POHEM (CCHS 2.2), and to improve and update the imputation.
  Data/graph courtesy of Meltem Tuna

  16. New model--Population model of nutrition and health outcomes
  • Times have changed!
  • Awareness and treatment of CVD risk factors have changed drastically.
  Graph taken from: F.A. McAlister, K. Wilkins et al. Changes in awareness, treatment and control of hypertension in Canada over the past 2 decades. CMAJ June 14, 2011.

  17. New model--Population model of nutrition and health outcomes
  • We also have new data: the Canadian Health Measures Survey (CHMS) 2007-2009 collects BP and cholesterol using validated methods.
  • In addition, it collects many common variables, i.e. variables that are available in both the CHMS and CCHS 2.2, which makes imputation a bit easier.

  18. Simplified POHEM nutrition to outcome model
  [Flow diagram. Elements: sex, region, income, education; risk factors (smoking, alcohol, nutrition, obesity, physical activity, diabetes) with initial values & transition models; total cholesterol & HDL; blood pressure; health outcome*, progression and death. Data sources: CCHS 2.2 and CHMS 2007-2009 (initial values); NPHS 1994-2004 (transition models); Health Person-Oriented Information (HPOI) (HIRD) and the Registered Persons database for Ontario (ICES) (CCORT I) (survival data for each transition); vital statistics and other POHEM disease modules (incidence rates by province, age and sex; competing risk of death from other causes).]
  * Outcomes associated with high cholesterol and high blood pressure include hypertension, heart disease, AMI, stroke, heart failure and gastric cancer.

  19. Objective of the study • To investigate various techniques to create imputed variables for BP and cholesterol, using the CHMS 2007-2009 as the donor data and CCHS 2.2 as the recipient data.

  20. Methods: Donor Data Source CHMS 2007-2009
  • Sample size: 5,604, of whom 3,719 were >18 years.
  • Collects data on self-reported health, chronic disease status, physical activity, etc. in the same or a very similar manner to CCHS 2.2.
  • Collects physical measures of BMI, CVD risk factors, physical activity and fitness – a very innovative survey.
  • Uses validated measures to collect BP and cholesterol, even using a fasting sub-sample to collect cholesterol (~2,600).

  21. Methods: Donor Data Source
  • Disadvantages of CHMS 2007-2009:
  • Small sample size, limited age range.
  • Limited geographic coverage compared to the CCHS. Because of cost and logistics considerations, 15 collection sites (primary sampling units) were chosen from 5 regional strata.
  • Analysts are advised to perform analysis at the national level only.
  • Analytic options are somewhat limited, due to the small number of degrees of freedom (11). This needs to be considered in the analysis to obtain proper results in statistical tests or confidence intervals.

  22. Methods: Recipient Data Source CCHS 2.2 2004
  • Sample size: 35,107, of whom 21,160 were >18 years.
  • Nationally representative.
  • Collects data on self-reported health, chronic disease status, physical activity, etc., but also detailed food intake data including a 24-hour dietary recall (the gold standard of food intake data available in Canada).
  • Collects measured BMI, only on a subsample (~12,500), but calculates a special weight to account for missing BMI data.

  23. Methods: Study Sample (donor and recipient data)
  [Diagram: CHMS (n = 5,604; 3,719 adults) serves as donor of SBP/DBP and total cholesterol/HDL/LDL; imputation into the CCHS (n = 35,107; 21,160 adults) yields a complete microdata file including food intake and CVD risk factors.]

  24. Results: Regression imputation v1.0
  • Case of Y** = E[Y|X]
  • At first attempted a complete case analysis….not much missing data in the CHMS…so the sample size was even smaller.
  • Construct a linear regression model of variables common to the CHMS and CCHS, including income, education, home ownership, marital status, immigrant status, racial/ethnic origin, chronic disease status, smoking status and BMI.
  • Modelled the data by sex.
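Deterministic regression imputation (the Y** = E[Y|X] case) reduces to fitting a model on the donor data and assigning each recipient the fitted mean. A minimal single-predictor sketch with invented donor values; the real models include many more covariates, BRR replication and survey weights:

```python
# Hypothetical donor data: (age, SBP) pairs. Fit SBP on age by ordinary least
# squares, then impute recipients' SBP with the fitted mean (Y** = E[Y|X]).
donor = [(30, 115.0), (40, 121.0), (50, 126.0), (60, 133.0), (70, 138.0)]

n = len(donor)
mean_x = sum(a for a, _ in donor) / n
mean_y = sum(s for _, s in donor) / n
slope = (sum((a - mean_x) * (s - mean_y) for a, s in donor)
         / sum((a - mean_x) ** 2 for a, _ in donor))
intercept = mean_y - slope * mean_x

def impute_sbp(age):
    """Deterministic imputation: every recipient of this age gets the same value."""
    return intercept + slope * age

# A recipient record (e.g. a CCHS respondent) gets the predicted value.
print(round(impute_sbp(45), 1))  # 123.7
```

Because every recipient in the same covariate pattern receives the identical value, the imputed distribution is compressed around the regression line, which is exactly the distortion the slides warn about.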

  25. Results: Female model

Survey: Linear regression

Number of obs   =     1919
Population size = 11900450
Replications    =      500
Design df       =      499
F(4, 496)       =   167.36
Prob > F        =   0.0000
R-squared       =   0.4140

------------------------------------------------------------------------------
             |      BRR *
adjusted_SBP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dhh_age |   .4405467   .0186056    23.68   0.000     .4039918    .4771016
         bmi |   .3878245   .0883162     4.39   0.000     .2143071    .5613418
   hbp_aware |    8.74016   1.249431     7.00   0.000     6.285367    11.19495
         edu |   2.645542   .8434096     3.14   0.002     .9884707    4.302614
       _cons |   81.89744     2.5815    31.72   0.000     76.82549    86.96939
------------------------------------------------------------------------------

  26. Results: Male model

Survey: Linear regression

Number of obs   =     1712
Population size = 11822052
Replications    =      500
Design df       =      499
F(4, 496)       =    87.87
Prob > F        =   0.0000
R-squared       =   0.2038

------------------------------------------------------------------------------
             |      BRR *
adjusted_SBP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dhh_age |   .4569414   .0980163     4.66   0.000     .2643658     .649517
         bmi |   .5470305   .1854258     2.95   0.003     .1827189     .911342
   hbp_aware |   5.371427   1.280766     4.19   0.000     2.855068    7.887786
 age_bmi_int |  -.0074843   .0037551    -1.99   0.047    -.0148619   -.0001066
       _cons |   91.21298     5.3256    17.13   0.000     80.74962    101.6763
------------------------------------------------------------------------------

  27. Results: Comparing measured and imputed data relationship with age
  [Graphs: CCHS imputed SBP and CHMS measured SBP plotted against age]

  28. Results: Comparing measured and imputed data relationship with age
  [Graphs: CCHS imputed SBP and CHMS measured SBP plotted against age]

  29. Results: Comparing the distributions
  [Graphs: distribution of CCHS imputed SBP vs. CHMS measured SBP]

  30. Results: Relationship of imputed BP with salt intake

xi: regress imputed_SBP i.na_quint dhhd_age
i.na_quint        _Ina_quint_1-5      (naturally coded; _Ina_quint_1 omitted)

      Source |       SS       df       MS              Number of obs =   12310
-------------+------------------------------           F(5, 12304)   = 9455.27
       Model |  1037614.8      5   207522.96           Prob > F      =  0.0000
    Residual | 270046.623  12304  21.9478725           R-squared     =  0.7935
-------------+------------------------------           Adj R-squared =  0.7934
       Total | 1307661.42  12309  106.236203           Root MSE      =  4.6849

------------------------------------------------------------------------------
 imputed_SBP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ina_quint_2 |  -.0927673   .1375922    -0.67   0.500    -.3624695     .176935
_Ina_quint_3 |     .01377   .1364071     0.10   0.920    -.2536093    .2811493
_Ina_quint_4 |   .1428075   .1356282     1.05   0.292    -.1230451      .40866
_Ina_quint_5 |   1.061184   .1363676     7.78   0.000     .7938822    1.328486
    dhhd_age |   .4537406   .0021169   214.34   0.000     .4495912    .4578901
       _cons |   95.81601   .1510044   634.52   0.000     95.52001      96.112
------------------------------------------------------------------------------

  31. Next steps in imputation of BP and cholesterol from CHMS
  • Repeat for DBP and total cholesterol/HDL – all important variables in the Framingham risk equation.
  • Repeat, modelling categories of BP/cholesterol.
  • Try random regression imputation – impute from a conditional distribution, the case of Y* ~ f(y|x) = f(x,y)/∫f(x,y)dy.
  • Try hot-deck to impute categorical BP and cholesterol and compare to the regression results.
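Random regression imputation can be sketched as the deterministic prediction plus a draw from the residual distribution, approximating a draw from f(y|x). The intercept, slope and residual standard deviation below are illustrative values only, not fitted results from the models above:

```python
import random

random.seed(3)

# Hypothetical fitted model for SBP on age (all three numbers invented).
intercept, slope, resid_sd = 97.6, 0.58, 8.0

def impute_sbp_random(age):
    """Random regression imputation: fitted mean plus a residual draw."""
    return intercept + slope * age + random.gauss(0, resid_sd)

draws = [impute_sbp_random(45) for _ in range(50_000)]
mean_draw = sum(draws) / len(draws)
var_draw = sum((d - mean_draw) ** 2 for d in draws) / len(draws)

# The imputations centre on the deterministic prediction but keep their spread,
# so the imputed distribution is not artificially compressed around the line.
print(round(mean_draw, 1), round(var_draw ** 0.5, 1))
```

This preserves the conditional variance that the Y** = E[Y|X] approach discards, at the cost of extra noise in any single imputed record.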

  32. Case study: POHEM BMI model • Over to Claude….

  33. Overall conclusions:
  • Imputation technique must be fit for purpose
  • Purpose of the data/its eventual role in microsimulation.
  • The model used must be assessed and its performance reported in a standard way.
  • It may not be possible to fully standardize an approach to imputation for microsimulation, because it is heavily dependent on the data source and purpose, but at least we can make the process and techniques more transparent.

  34. Can multiple imputation be used in microsimulation?
  • This technique may not be fit for purpose because it is computationally intensive.
  • Proper multiple imputation, which involves running the analysis in multiple datasets and then combining the estimates according to prescribed rules, may not be appropriate for microsimulation….how would we combine multiple runs of POHEM??
  • Improper multiple imputation, which runs a regression or hot-deck model multiple times and then incorporates imputed values using a random process, may be more appropriate……I shall investigate.

  35. Acknowledgements and contact:
  • Carol Bennett
  • Tracey Bushnik
  • Bill Flanagan
  • Doug Manuel
  • Claude Nadeau
  • Meltem Tuna
  • deirdre.hennessy@statcan.gc.ca
  • 613-951-3725
