Weighting and imputation

Weighting and imputation PHC 6716 July 13, 2011 Chris McCarty

Weighting
• Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions
• Before weighting, the implied weight of each observation is 1.0
• After weighting, some observations will have weights >1.0, some <1.0, and some exactly 1.0
• No observation should have a weight of 0
• Two general types of weighting:
  • Design weights: adjusting for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions)
  • Post-stratification weights: adjusting for differences in population or households when the release of sample is intended to be representative (e.g. adjustments for non-response of young people)

Common sources for calculating weights
• U.S. Census
• Current Population Survey
• American Community Survey
• For Florida county, age, race, and ethnicity: the BEBR Population Program

How frequency procedures use weights
• All statistical packages have options on procedures to incorporate weights
• For frequency procedures, the weights are multiplied by the unweighted frequencies, then percentages are calculated on the result
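A minimal sketch of the weighted frequency calculation described above, in plain Python. The function name, the data, and the weights are hypothetical. Summing the weights within a category gives the same result as multiplying the weight by the unweighted frequency whenever every observation in that category carries the same weight.

```python
from collections import defaultdict

def weighted_percentages(values, weights):
    """Return {category: percent}, counting each observation by its weight."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight          # weighted frequency per category
    grand_total = sum(totals.values())
    return {cat: 100.0 * total / grand_total for cat, total in totals.items()}

# Hypothetical data: three employed respondents down-weighted,
# one unemployed respondent up-weighted.
employment = ["employed", "employed", "employed", "unemployed"]
weights = [0.8, 0.8, 0.8, 1.6]

percentages = weighted_percentages(employment, weights)
for category, percent in percentages.items():
    print(f"{category}: {percent:.1f}%")
```

Unweighted, the split in this made-up sample is 75/25; with the weights applied it moves to about 60/40.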

How to make a simple weight

What that would look like in data set

Original and adjusted frequency of Employment variable
* This is the sum of the weights for the category
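One common way to build a simple weight, sketched here as an assumption rather than as the method used in the slides, is to divide the known population proportion of each category by its proportion in the sample. The counts and target proportions below are hypothetical. The adjusted frequency of a category is then the sum of the weights of its observations.

```python
def simple_weights(sample_counts, target_proportions):
    """Weight per category = target proportion / sample proportion."""
    n = sum(sample_counts.values())
    return {
        category: target_proportions[category] / (count / n)
        for category, count in sample_counts.items()
    }

# Hypothetical sample of 100 respondents and hypothetical population targets.
sample_counts = {"employed": 70, "unemployed": 30}
target_proportions = {"employed": 0.60, "unemployed": 0.40}

category_weights = simple_weights(sample_counts, target_proportions)

# Adjusted frequency = sum of the weights in the category,
# which here is count * weight because the weight is constant per category.
adjusted = {c: sample_counts[c] * category_weights[c] for c in sample_counts}

for c in sample_counts:
    print(c, sample_counts[c], round(category_weights[c], 3), round(adjusted[c], 1))
```

In a data set, each respondent would carry the weight of their own category in a separate weight column.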

Notes on weighting
• Typically you don’t want weights to make enormous differences
• Keep in mind that with weighting you are saying you have information extraneous to the survey process that informs you of the proper distribution
• You could conceivably up-weight results from a small sample stratum
• Weights are typically used for accurate estimates of prevalence
• Models where you test relationships do not need weights if you include the variables you would use to weight

Weighting with more than one variable
• Combined weight with multiplication
  • Create individual weights for each variable, then multiply the weights to get a single weight (Wage*Wgender)
  • Not a good solution with a lot of variables
• Combine weights iteratively
  • Calculate weight for a variable using frequency table
  • Use that weight in frequency of second variable to create weight
  • Use that weight in frequency of third variable to create weight
  • And so on
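The iterative approach above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for any survey in these slides: the function name `iterative_weights`, the records, and the target proportions are all hypothetical. Each step computes the weighted frequency of one variable using the current weights, then multiplies every weight by target proportion / weighted sample proportion for that observation's category.

```python
def iterative_weights(records, targets, passes=1):
    """Adjust weights one variable at a time, carrying weights forward."""
    weights = [1.0] * len(records)          # implied weight before weighting
    for _ in range(passes):
        for variable, target in targets.items():
            # Weighted frequency table for this variable under current weights.
            totals = {}
            for record, w in zip(records, weights):
                totals[record[variable]] = totals.get(record[variable], 0.0) + w
            grand_total = sum(totals.values())
            # Adjustment factor per category.
            factors = {c: target[c] / (totals[c] / grand_total) for c in totals}
            weights = [w * factors[r[variable]] for r, w in zip(records, weights)]
    return weights

# Hypothetical respondents and hypothetical population targets.
records = [
    {"age": "young", "gender": "male"},
    {"age": "young", "gender": "male"},
    {"age": "young", "gender": "female"},
    {"age": "old", "gender": "female"},
    {"age": "old", "gender": "female"},
    {"age": "old", "gender": "female"},
    {"age": "old", "gender": "male"},
    {"age": "old", "gender": "male"},
]
targets = {
    "age": {"young": 0.5, "old": 0.5},
    "gender": {"male": 0.5, "female": 0.5},
}

final_weights = iterative_weights(records, targets)
male_share = sum(
    w for r, w in zip(records, final_weights) if r["gender"] == "male"
) / sum(final_weights)
print(round(male_share, 3))
```

After a single pass, only the last variable is guaranteed to match its target exactly. Running more passes (the `passes` argument) is a common way to bring the earlier variables back in line.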

Consumer Confidence
• Survey of approximately 500 Florida households each month
• RDD (random-digit-dialing) landline survey
• Five questions (components) averaged into an Overall Index
• Until now, the only weighting has been post-stratification by the proportion of households in each county

Potential weighting variables: County
• Typically we get under-representation from large south Florida counties (Miami-Dade) and over-representation from northern counties (Alachua)
• Household proportions are estimated between census years by BEBR
• Weights June 2011.xls

Potential weighting variables: Age
• RDD tends to lead to over-sampling of seniors with landlines
• Cell phones emerged as a problem around 2005
• No reliable age group data until 2010 Census
• Elderly tend to be less confident than younger respondents due to fixed incomes
• Weights June 2011.xls

Potential weighting variables: Hispanic Ethnicity
• Cell phones tend to be used disproportionately by Hispanics
• 2010 Census provided reliable data about proportion of Hispanic Floridians
• Hispanics tend to have lower confidence than non-Hispanics
• Weights June 2011.xls

Potential weighting variables: Gender
• Cell phones are disproportionately used by young males
• Monthly CCI uses youngest male/oldest female respondent selection
• Weights June 2011.xls

Example 2: FHIS
• The state of Florida wanted to estimate rates of the uninsured
• They stratified the state into 17 regions and wanted to be able to make estimates for the state and each region with a tolerable margin of error
• On the state level they wanted to be able to say something about Blacks, Hispanics and those under 200 percent of the poverty level
• Data on these demographics for each Florida telephone exchange were obtained prior to sampling
• Strata were created from exchanges
• This made it possible to create weights based on known households in each exchange
• This required design weights to adjust for disproportionate sampling
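Design weights of this kind are commonly computed as the number of known households in a stratum divided by the number of completed interviews in that stratum. The sketch below assumes that form. The strata names, household counts, and survey counts are made up for illustration and are not FHIS figures.

```python
# Hypothetical strata (not FHIS data): equal sample sizes drawn from
# strata of very different size, i.e. disproportionate sampling.
strata = {
    "region_a": {"households": 900_000, "completes": 500, "uninsured": 100},
    "region_b": {"households": 100_000, "completes": 500, "uninsured": 50},
}

def design_weight(stratum):
    """Households represented by each completed interview in the stratum."""
    return stratum["households"] / stratum["completes"]

def unweighted_rate(strata):
    uninsured = sum(s["uninsured"] for s in strata.values())
    completes = sum(s["completes"] for s in strata.values())
    return uninsured / completes

def weighted_rate(strata):
    # Every respondent in a stratum carries that stratum's design weight.
    numerator = sum(design_weight(s) * s["uninsured"] for s in strata.values())
    denominator = sum(design_weight(s) * s["completes"] for s in strata.values())
    return numerator / denominator

print(f"unweighted: {unweighted_rate(strata):.3f}")
print(f"weighted:   {weighted_rate(strata):.3f}")
```

In this made-up case the small stratum is over-sampled, so the unweighted statewide rate leans toward it; the design weights restore each stratum's share of households.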

Example 3: Medicaid survey
• The state wanted to evaluate Medicaid Reform being conducted in Duval and Broward counties
• They wanted to administer a modified CAHPS instrument to Adults and Children separately
• They wanted to stratify by plan as well, sampling a minimum number of observations per plan
• In the end they wanted to compare plans, counties and adults and children
• These weights required knowledge about total enrollment for each one of these characteristics (plan, age, county)

Imputation
• Like weighting, imputation involves adjusting the analysis after data collection
• Unlike weighting, imputation is the deliberate creation of data that were not actually collected
• The main reason for imputation is to retain observations in a statistical analysis that would otherwise be left out
• Your ability to discover significant results may be compromised by too many missing values
• In some cases there may be systematic bias associated with missing data, so that not imputing presents an unrepresentative result

Imputation and regressions
• Imputation is particularly common when data are analyzed with regression analysis
• A regression model explains the variability in a dependent variable using one or more independent variables
• Observations can only be included in the regression if they have values for all variables in the model
• Models with a lot of variables increase the probability that an observation will have at least one missing value for them
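A small sketch of why more variables mean fewer usable observations. It assumes, purely for illustration, that each variable is missing independently at the same rate; real missingness is rarely that tidy. The function names and the records are hypothetical.

```python
def share_complete(missing_rate, n_variables):
    """Expected share of observations with no missing values, assuming
    each variable is missing independently at the same rate."""
    return (1.0 - missing_rate) ** n_variables

def complete_cases(records, variables):
    """Keep only records with a value for every variable in the model."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]

# With 5% missing per variable, compare a 3-variable and a 10-variable model.
print(round(share_complete(0.05, 3), 3))
print(round(share_complete(0.05, 10), 3))

# Hypothetical records; None marks a missing value.
records = [
    {"income": 52000, "age": 41, "education": 16, "employed": 1},
    {"income": None, "age": 35, "education": 12, "employed": 1},
    {"income": 31000, "age": 67, "education": None, "employed": 0},
    {"income": 47000, "age": 29, "education": 14, "employed": 1},
]
kept = complete_cases(records, ["income", "age", "education", "employed"])
print(len(kept), "of", len(records), "records usable")
```

Under that independence assumption, a ten-variable model with 5% missing per variable would keep only about 60% of observations.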

Example Model
Income = β1(Age) + β2(Education) + β3(Employed)

Imputation algorithms
• Two general categories
  • Random imputation assigns values randomly, often based on a desired statistical distribution
  • Deterministic imputation typically assigns values based on existing knowledge
    • Existing knowledge could be in the data
• Single imputation fills missing data with one value, such as the mean of all non-missing values for a continuous variable
• Multiple imputation fills in missing data with a set of plausible values
• Hot deck imputation fills in missing values with those of an observation that matches on key variables
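Two of these approaches can be sketched in a few lines: single imputation with the mean, and a simple hot deck. The function names, the matching variables, and the records are hypothetical, and the hot deck here draws a random donor from the matching observations, which is only one of several possible donor rules.

```python
import random
from statistics import mean

def mean_impute(values):
    """Single imputation: replace missing values (None) with the observed mean."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def hot_deck_impute(records, target, match_keys, rng):
    """Fill a missing target with the value of a donor matching on match_keys."""
    donors = {}
    for record in records:
        if record[target] is not None:
            key = tuple(record[k] for k in match_keys)
            donors.setdefault(key, []).append(record[target])
    imputed = []
    for record in records:
        record = dict(record)                      # leave the input untouched
        if record[target] is None:
            pool = donors.get(tuple(record[k] for k in match_keys))
            if pool:                               # stays missing if no donor
                record[target] = rng.choice(pool)
        imputed.append(record)
    return imputed

incomes = mean_impute([10, None, 30])

# Hypothetical records; the second has a missing income and exactly one
# donor that matches on education and employment.
records = [
    {"education": "college", "employed": 1, "income": 60000},
    {"education": "college", "employed": 1, "income": None},
    {"education": "high school", "employed": 0, "income": 18000},
]
filled = hot_deck_impute(records, "income", ["education", "employed"],
                         random.Random(0))
print(incomes)
print(filled[1]["income"])
```

Mean imputation pulls every missing value to the center of the distribution, while the hot deck keeps imputed values tied to respondents who look similar on the key variables.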
