1 / 28

# Weighting and imputation - PowerPoint PPT Presentation

Weighting and imputation. PHC 6716 July 13, 2011 Chris McCarty. Weighting. Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Weighting and imputation' - luther

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Weighting and imputation

PHC 6716

July 13, 2011

Chris McCarty

• Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions

• Before weighting the implied weight of each observation is 1.0

• After weighting, some observations will have weights >1.0 and some <1.0, and some at 1.0

• No observations should have a weight of 0

• Two general types of weighting:

• Design weights -- Adjusting for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions)

• Post-stratification weights -- Adjusting for differences in population or households when release of sample is intended to be representative (e.g. adjustments for non-response of young people)

• U.S. Census

• Current Population Survey

• American Community Survey

• For Florida County, Age, Race, Ethnicity the BEBR Population Program

• All statistical packages have options on procedures to incorporate weights

• For frequency procedures the weights are multiplied by the unweighted frequencies, then percentages are calculated on the result

*This is the sum of the weights for the category

• Typically you don’t want weights to make enormous differences

• Keep in mind that with weighting you are saying you have information extraneous to the survey process that informs you of the proper distribution

• You could conceivably up-weight results from a small sample strata

• Weights are typically used for accurate estimates of prevalence

• Models where you test relationships do not need weights if you include the variables you would use to weight

• Combined weight with multiplication

• Create individual weights for each variable then multiply weights to get a single weight (Wage*Wgender)

• Not a good solution with a lot of variables

• Combine weights iteratively

• Calculate weight for a variable using frequency table

• Use that weight in frequency of second variable to create weight

• Use that weight in frequency of third variable to create weight

• And so on

• Survey of approximately 500 Florida households each month

• RDD Landline Survey

• Five questions (components) averaged into an Overall Index

• Until now only post-stratification weighting by proportion of households by county

• Typically we get under-representation from large south Florida counties (Miami-Dade) and over-representation from northern counties (Alachua)

• Household proportions are estimated between census years by BEBR

• Weights June 2011.xls

• RDD tends to lead to over-sampling of seniors with landlines

• Cell phones emerged as a problem around 2005

• No reliable age group data until 2010 Census

• Elderly tend to be less confident than younger respondents due to fixed incomes

• Weights June 2011.xls

Potential weighting variablesHispanic Ethnicity

• Cell phones tend to be used disproportionately by Hispanics

• 2010 Census provided reliable data about proportion of Hispanic Floridians

• Hispanics tend to have lower confidence than non-Hispanics

• Weights June 2011.xls

• Cell phones are disproportionately used by young males

• Monthly CCI uses youngest male/oldest female respondent selection

• Weights June 2011.xls

• The state of Florida wanted to estimate rates of the uninsured

• They stratified the state into 17 regions and wanted to be able to make estimates for the state and each region with a tolerable margin of error

• On the state level they wanted to be able to say something about Blacks, Hispanics and those under 200 percent of the poverty level

• Data on these demographics for each Florida telephone exchange were obtained prior to sampling

• Strata were created from exchanges

• This made it possible to create weights based on known households in each exchange

• This required design weights to adjust for disproportionate sampling

• The state wanted to evaluate Medicaid Reform being conducted in Duval and Broward counties

• They wanted to administer a modified CAHPS instrument to Adults and Children separately

• They wanted to stratify by plan as well, sampling a minimum number of observations per plan

• In the end they wanted to compare plans, counties and adults and children

• These weights required knowledge about total enrollment for each one of these characteristics (plan, age, county)

• Like weighting, imputation involves adjusting the analysis after data collection

• Unlike weighting, imputation is the deliberate creation of data that were not actually collected

• The main reason for imputation is to retain observations in a statistical analysis that would otherwise be left out

• Your ability to discover significant results may be compromised by too many missing values

• In some case there may be systematic bias associated with missing data so that not imputing presents an unrepresentative result

• Imputation is particularly common when data are analyzed with regression analysis

• A regression model explains the variability in a dependent variable using one or more independent variables

• Observations can only be included in the regression if they have values for all variables in the model

• Models with a lot of variables increase the probability that an observation will have at least one missing value for them

Example ModelIncome = β1(Age) + β2(Education)+ β3(Employed)

• Two general categories

• Random imputation assigns values randomly, often based on a desired statistical distribution

• Deterministic imputation typically assigns values based on existing knowledge

• Existing knowledge could be in the data

• Single imputation fills missing data with one value, such as the mean of all non-missing values for a continuous variable

• Multiple imputation fills in missing data with a set of plausible values

• Hot deck imputation fills in missing values with those of an observation that matches on key variables