Chapter 8: Nonresponse

Chapter 8: Nonresponse • Reading • 8.1-8.3 • 8.4 (read for concepts) • 8.5 (intro, 8.5.2 are focus) • 8.6 • 8.8 • (no 8.7)

Outline • What is nonresponse (NR)? • Why should we do something about NR? • Strategies to reduce NR • Design phase • After data collection • Callbacks to gain info on nonrespondents (double sampling) • Weighting adjustments – post-stratification only • Imputation of missing values (item NR), plus a little on mechanisms for NR • Response rate calculations

What is nonresponse? • Failure to obtain data through some part of the data collection process • Nonresponse occurs during the data collection process, after the sample is selected • Separate from ineligible cases • Cannot locate (may not know if eligible) • Locate but refuse to participate (may or may not know eligibility) • Participate but don’t answer all questions (eligibility known) • …

Types of nonresponse • Unit nonresponse • Missing data for entire observation unit • All variables have missing data • Item nonresponse • Missing data for one or more variables for the observation unit • Failure to obtain a response to an individual item = question

Example: random digit dialing (RDD) phone calls • Some case (= phone number) dispositions • Non-working • Rings, but get no answer • Get answer, determine it’s not a household • Get a household, refuse survey participation • Get a household, answer all but a few questions • Get a household and answer all questions • Eligible, unit NR, item NR?

Example: soil survey • Cannot reach sample unit (in canyon) • Can reach, but can’t collect data (denied permission by land owner) • Collect data, data sheet destroyed • Forget to collect data for an item

Ignoring nonresponse (is bad) • Impacts are related to differences between nonresponding and responding subpopulations in relation to analysis variables • If population mean is different for responding and nonresponding subpopulations, will get a biased estimate when analyzing data from only the responding subpopulation • Bias depends on • Nonresponse rate • Difference between population means for responding and nonresponding subpopulations • p. 258 subpopulation table and equations
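The two ingredients of the bias listed above (nonresponse rate, and the gap between the responding and nonresponding subpopulation means) can be combined in a small sketch. It relies on the standard decomposition of the population mean into the two subpopulation means; the function names and the numbers are hypothetical, not taken from the text.

```python
def population_mean(nonresponse_rate, mean_respondents, mean_nonrespondents):
    """Population mean as a mix of the two subpopulation means."""
    return ((1 - nonresponse_rate) * mean_respondents
            + nonresponse_rate * mean_nonrespondents)


def nonresponse_bias(nonresponse_rate, mean_respondents, mean_nonrespondents):
    """Approximate bias from using the respondent mean for the whole population:
    (nonresponse rate) x (respondent mean - nonrespondent mean)."""
    return nonresponse_rate * (mean_respondents - mean_nonrespondents)


# Hypothetical values: 10% nonresponse, a proportion of 0.80 among
# respondents and 0.50 among nonrespondents.
rate, mean_r, mean_m = 0.10, 0.80, 0.50
bias = nonresponse_bias(rate, mean_r, mean_m)
true_mean = population_mean(rate, mean_r, mean_m)
print(f"population mean = {true_mean:.3f}, bias of respondent mean = {bias:.3f}")
```

Even a low nonresponse rate gives a noticeable bias when the two subpopulations differ a lot, and the bias is zero only if the rate is zero or the two means are equal.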

Ignoring nonresponse – 2 • Hard to determine if distributions (parameters) for responding and nonresponding subpopulations are different • Often no information on nonrespondents • Examine causes of NR • Is mechanism generating NR related to analysis variables? • Figure 8.2 – framework for factors • Data collectors (interviewers, field observers) • Survey content (questionnaire, field protocols) • Respondent or field site characteristics

Ignoring nonresponse – 3 • Sample size reductions affect precision • Low response rate → low sample size → higher variances • Increasing sample size will NOT mitigate bias problems • Literary Digest Survey • Precision loss is less of a concern than bias because you can often anticipate NR attrition and design the sample size for it
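A rough way to see why a larger sample does not fix bias: mean squared error is bias squared plus variance, and only the variance term shrinks with sample size. The sketch below assumes an SRS-style variance S²/n with no finite population correction; the function name and all numbers are hypothetical.

```python
def mse_of_respondent_mean(bias, variance_y, n_respondents):
    """MSE = bias^2 + variance, with variance approximated by S^2 / n
    (finite population correction ignored for simplicity)."""
    return bias ** 2 + variance_y / n_respondents


# Hypothetical values: bias of 0.03 and S^2 = 0.16.
for n_r in (100, 1_000, 10_000, 1_000_000):
    print(n_r, round(mse_of_respondent_mean(0.03, 0.16, n_r), 6))
```

The MSE levels off at bias squared (0.0009 here) no matter how many respondents are added.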

Example: Norwegian voting behavior survey (Table 8.1) • Survey with good follow-up methodology • Examined differences between nonrespondents and full sample • Age-specific voting rates lower for NR portion, especially for younger voters • Low nonresponse, but high bias potential • 90% response rate, but differences are large with respect to main analysis variables • Mechanisms causing NR • Absence or illness → less likely to respond, lower voting rates • Impact: overestimate prevalence of positive voting behaviors

Strategies • Best: design survey to prevent NR • Post-data collection • Perform nonresponse study (call-backs) • Use weights to adjust for NR units • Use a model to impute (fill in) values for missing items

Strategy 1: Design to prevent • Consider likely mechanisms for NR when designing survey • Reduce respondent burden to extent possible • Two main areas • Data collection methodology • Burden for individual, population • Sample design • Burden for population • Remedies for avoiding NR also tend to improve data quality

Factors to consider • Survey content • Salience of topic to respondent • Sensitive topics (socially undesirable behaviors, medical issues) • Timing • Farm surveys avoid peak work times • Holidays associated with higher NR • Interviewers • Training to improve technique • Refusal conversion staff • Observer variation for bird counts

Factors to consider – 2 • Data collection method • Mail/fax/web has highest NR, then phone, then in-person • Interviewer assists in locating process, gaining cooperation to participate, avoiding item NR • Computer-assisted data collection instruments prevent item NR due to data collector error • Guides data collection, checks for completeness

Factors to consider – 3 • Questionnaire design • Key: reduce respondent burden (effort to respond, frustration in responding) • Cognitive psych principles used to simplify, clarify, test questions and questionnaire flow • Examples of factors follow … • Wording of individual questions • Can respondent answer the question? • Does s/he understand the question? • Single concept, simple wording, transition

Factors to consider – 4 • Questionnaire flow/design • Content: is the flow logical, and does it assist the respondent's cognitive process? • Mail, web, fax: visual interface is very important to helping respondent accurately complete questionnaire • Length of questionnaire • Shorten to extent possible • Allowable length depends on how vested the respondent is likely to be

Factors to consider – 5 • Survey introduction • First contact between respondent and data collector • Want to motivate respondent to participate • Positive: contributions to knowledge base • Negative: confidentiality concerns • Methods (use both if possible) • Advance letter to respondent or land owner (need address) • Phone or written introduction to questionnaire

Factors to consider – 6 • Incentives • Money, gifts, coupons, lottery; penalties • Hard to determine what is appropriate • Generally has a positive effect • Worry: incentive creep, increases cost of survey • Respondents get used to it → increases difficulty and cost in gaining response • Follow-up to obtain response • Mail: repeated notifications after initial mailing • Postcard reminder, 2nd questionnaire mailing • Phone: protocols for repeated attempts to get an answer, refusal conversion

Factors to consider – 7 • Sample design • Use design and estimation principles that increase precision for a given sample size • Stratification, ratio/regression estimation • Less burden on population by using smaller sample size to achieve a given precision level

Example: Census study • Decennial census • Start with a mail survey, then do in-person nonresponse follow-up • Small increases in response rates save big $$ • Much cheaper to do a mail survey • Entire US population, so “sample size” is large • Impact of three methods on response rates • Advance letter notifying household that census forms are coming • Stamped return envelope included with form • Reminder postcard sent a few days after the form • Figure 8.1: letter, postcard > envelope • Increased from 50% → 65%

Mechanisms for nonresponse • Define a new random variable that indicates whether a unit responds to the survey • We use a random variable because willingness to respond is not a fixed characteristic of a unit • Define the probability that a unit will respond to the survey = propensity score

Types of nonresponse • MCAR: missing completely at random • MAR: missing at random given covariates • Also called ignorable nonresponse • Nonignorable nonresponse

Missing completely at random (MCAR) • Propensity to respond is completely random • Default assumption in many analyses • Often not true • Propensity score is not related to • Known information about the respondent or design factors (x) • Response variables to be observed (y) • Implies • If we take an SRS of n units, responding portion of sample is an SRS of nR units • $\bar{y}_R$ (sample mean of responding units) is unbiased for $\bar{y}_U$ (population mean for whole pop)

Missing at random given covariates (ignorable) • Propensity score • Depends on known information about respondent or variables used in sample design (x) • Does not depend on response (y) • Since know values of x for all units in the population, can create adjustments for the nonresponse • Adjustment methods depend on a model for nonresponse • Example: propensity score depends only on gender and age, but does not depend on responses to questions in survey

Nonignorable nonresponse • Propensity score depends on response (y) and cannot be completely explained by other factors (x) • Example: crime victims less likely to respond to victimization questions (y) on a survey • Models will not fully adjust for potential nonresponse bias • Very difficult to verify if nonresponse mechanism is nonignorable
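The three mechanisms can be compared on a toy population. This is a sketch under stated assumptions, not anything from the text: each unit has a known class x, a survey value y, and a response propensity phi, and the respondent mean is approximated by its large-sample value, sum(phi*y)/sum(phi). All data values and function names are hypothetical.

```python
def expected_respondent_mean(y, phi):
    """Large-sample approximation to the mean of y among respondents
    when unit i responds with probability phi[i]."""
    return sum(p * v for p, v in zip(phi, y)) / sum(phi)


def class_adjusted_mean(y, phi, x):
    """Respondent means computed within classes of x, then recombined
    using the population share of each class."""
    n_units = len(y)
    total = 0.0
    for c in set(x):
        idx = [i for i in range(n_units) if x[i] == c]
        y_c = [y[i] for i in idx]
        phi_c = [phi[i] for i in idx]
        total += (len(idx) / n_units) * expected_respondent_mean(y_c, phi_c)
    return total


# Hypothetical population of 8 units; the true mean of y is 4.0.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 3, 1, 3, 5, 7, 5, 7]

mcar = [0.5] * 8                                   # same propensity for everyone
mar = [0.8 if xi == 0 else 0.4 for xi in x]        # depends on x only
nonignorable = [0.8, 0.4, 0.8, 0.4, 0.8, 0.4, 0.8, 0.4]  # larger y responds less, within class

for name, phi in (("MCAR", mcar), ("MAR", mar), ("nonignorable", nonignorable)):
    print(name,
          round(expected_respondent_mean(y, phi), 3),
          round(class_adjusted_mean(y, phi, x), 3))
```

In this toy setup the unadjusted respondent mean is right only under MCAR, the class adjustment recovers 4.0 under MAR, and neither recovers it under the nonignorable mechanism.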

Strategy 2: Call-backs and double sampling • Basic idea • Select a subsample of nonrespondents • Collect data from contacted nonrespondents • Use these data to estimate population mean for nonrespondents • This subsample is referred to by Lohr as the “call-back” sample • It is a telephone follow-up to a mail survey • Method is more general than that • The sampling design is an example of “double” or “2-phase” sampling (we won’t cover this in general) • We will make the (very unrealistic) assumption that all of the “call-back” sample provides responses to the survey

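Framework • [Diagram: whole population of N units, split into NM nonrespondents and NR respondents; the sample of n units splits the same way into nM nonrespondents and nR respondents]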

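Subsample the nonresponding portion of population • [Diagram: whole population of N units, split into NM nonrespondents and NR respondents; the sample contributes nR respondents plus a call-back subsample of the nM sample nonrespondents] • Sample 100ν% of the nonresponding part of sample: nMCB = ν nM units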

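Estimation • Sample mean from responding population: $\bar{y}_R$ (mean of the $n_R$ respondents) • Sample mean from “call-back” subset of nonresponding population: $\bar{y}_M$ (mean of the $n_{MCB}$ call-backs)

Estimation – 2 • Estimator for population mean: $\hat{\bar{y}} = \frac{n_R}{n}\bar{y}_R + \frac{n_M}{n}\bar{y}_M$ • Estimator for population total: $\hat{t} = N\hat{\bar{y}}$

Estimation – 3 • Analysis weights • Respondents in original sample: $N/n$ • Nonrespondent “call-backs”: $(N/n)(n_M/n_{MCB})$ • Estimator for variance of the estimator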
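As a sketch of the call-back estimators, assuming an SRS of n from N and full response in the call-back subsample: the mean estimator combines the respondent mean and the call-back mean in proportion to the sample counts of respondents and nonrespondents, and the same total is obtained from the analysis weights. The function name and data are hypothetical, and no variance estimator is attempted here.

```python
def callback_estimates(N, n, y_resp, y_callback, n_M):
    """Two-phase (call-back) estimates of the population mean and total.

    N, n        population size and original sample size (SRS assumed)
    y_resp      values from the n_R original respondents
    y_callback  values from the call-back subsample of nonrespondents
    n_M         number of nonrespondents in the original sample
    """
    n_R = len(y_resp)
    n_MCB = len(y_callback)
    if n_R + n_M != n:
        raise ValueError("respondents plus nonrespondents must equal n")
    if not 0 < n_MCB <= n_M:
        raise ValueError("call-back subsample must be a nonempty subset")

    ybar_R = sum(y_resp) / n_R
    ybar_M = sum(y_callback) / n_MCB
    ybar_hat = (n_R / n) * ybar_R + (n_M / n) * ybar_M
    t_hat = N * ybar_hat

    # Analysis weights: each respondent represents N/n units; each call-back
    # also stands in for the nonrespondents that were not subsampled.
    w_resp = N / n
    w_callback = (N / n) * (n_M / n_MCB)
    return {"mean": ybar_hat, "total": t_hat,
            "w_resp": w_resp, "w_callback": w_callback}


# Hypothetical data: 6 respondents, 4 nonrespondents, 2 of them called back.
est = callback_estimates(N=1000, n=10, y_resp=[2, 4, 6, 8, 5, 5],
                         y_callback=[10, 12], n_M=4)
weighted_total = est["w_resp"] * 30 + est["w_callback"] * 22
print(est, weighted_total)
```

The weighted sum of the observed values reproduces the estimated total, which is a quick check that the weights and the estimator agree.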

Strategy 3: weighting methods for nonresponse • Approaches • Weighting-class adjustment • Post-stratification • In previous chapters • Assume that all SUs/OUs provided a response • Weights were typically inverse of inclusion probability $w_i = 1/\pi_i$ • Interpretation of weight • Number of units in the population represented by unit i in the sample

Weighting methods for nonresponse • What if not all SUs/OUs provide a response? • Second probability = probability of responding for unit i = propensity score $\phi_i$ • Weight for unit i: $1/(\pi_i \phi_i)$ • Interpretation • Number of units in the population represented by responding unit i • Assumes data are missing at random (MAR, ignorable given covariates)

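Weighting-class adjustment • Create a set of “weighting” classes such that we can assume propensity score is same within each class • Example: age classes • 15-24, 25-34, 35-44, 45-64, 65+ • Estimate propensity score for class c using initial sampling weights, $w_i = 1/\pi_i$: $\hat{\phi}_c$ = (sum of weights for respondents in class c) / (sum of weights for all sampled units in class c)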

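Weighting-class adjustment – 2 • New analysis weight for responding portion of sample: $\tilde{w}_i = w_i/\hat{\phi}_c$ for responding unit i in class c, where $\hat{\phi}_c$ is the estimated propensity score for the class • Estimators for population total tU and mean: $\hat{t}_{wc} = \sum_{i \in R} \tilde{w}_i y_i$ and $\hat{\bar{y}}_{wc} = \hat{t}_{wc} / \sum_{i \in R} \tilde{w}_i$, summing over responding units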

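Example: SRS design (p. 266) • Inclusion probability for unit i: $\pi_i = n/N$ • Estimated propensity score for unit i in class c: $\hat{\phi}_c = n_{cR}/n_c$ (respondents in class c over sampled units in class c) • Analysis weight for responding unit i: $\tilde{w}_i = \frac{N}{n}\cdot\frac{n_c}{n_{cR}}$

Example: SRS design – 2 • Table 8.2 for analysis weight (= weight factor in table) • Estimator for population total under SRS: $\hat{t}_{wc} = N \sum_c \frac{n_c}{n}\bar{y}_{cR}$ • Estimator for population mean under SRS: $\hat{\bar{y}}_{wc} = \sum_c \frac{n_c}{n}\bar{y}_{cR}$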
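A minimal sketch of the weighting-class adjustment under SRS, assuming the estimated propensity in a class is the class response rate in the sample and the analysis weight is the sampling weight N/n divided by that rate. The class labels, counts, and values are hypothetical and are not the Table 8.2 data.

```python
def weighting_class_estimates(N, n_by_class, resp_values_by_class):
    """Weighting-class adjusted total and mean under an SRS of n from N.

    n_by_class            sampled units per class (respondents + nonrespondents)
    resp_values_by_class  observed y values for the respondents in each class
    """
    n = sum(n_by_class.values())
    base_weight = N / n                      # 1 / inclusion probability
    weights, t_hat, weight_sum = {}, 0.0, 0.0
    for c, n_c in n_by_class.items():
        values = resp_values_by_class[c]
        if not values:
            raise ValueError(f"class {c!r} has no respondents; collapse classes")
        phi_hat = len(values) / n_c          # estimated propensity score
        weights[c] = base_weight / phi_hat   # analysis weight for respondents
        t_hat += weights[c] * sum(values)
        weight_sum += weights[c] * len(values)
    return {"weights": weights, "total": t_hat, "mean": t_hat / weight_sum}


# Hypothetical sample of n = 10 from N = 100 in two age classes.
wc = weighting_class_estimates(
    N=100,
    n_by_class={"15-24": 4, "25+": 6},
    resp_values_by_class={"15-24": [10, 20], "25+": [30, 30, 60, 40]},
)
print(wc)
```

Under SRS the adjusted weights of the respondents add up to N, so the mean here is the estimated total divided by N.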

Weighting-class adjustment - 3 • Selecting weighting classes • Use principles for selecting strata • Classes should be groups of similar units in relation to • Propensity score (likelihood of responding) • Response variable • Should maximize variation across classes for these two factors

Post-stratification • Assume SRS • Very similar to weighting-class adjustment • Classes are post-strata • Use population counts rather than sample counts • Weighting-class approach essentially estimates $N_h$ with $N n_h / n$

Post-stratification (under SRS) • Assume SRS of n from N • Estimator for population mean: $\bar{y}_{post} = \sum_{h=1}^{H} \frac{N_h}{N}\bar{y}_{hR}$ • For a particular survey data set (condition on $n_{hR}$, h = 1, 2, …, H)
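The post-stratified mean can be sketched as the respondent means in each post-stratum combined with known population shares, rather than sample shares. The function name, post-stratum labels, and population counts below are hypothetical.

```python
def poststratified_mean(pop_counts, resp_values_by_stratum):
    """Post-stratified estimate of the population mean.

    pop_counts              known population count N_h for each post-stratum
    resp_values_by_stratum  observed y values for respondents in each post-stratum
    """
    N = sum(pop_counts.values())
    estimate = 0.0
    for h, N_h in pop_counts.items():
        values = resp_values_by_stratum[h]
        if not values:
            raise ValueError(f"post-stratum {h!r} has no respondents")
        estimate += (N_h / N) * (sum(values) / len(values))
    return estimate


# Hypothetical: the population is 30% aged 15-24 and 70% aged 25+.
ybar_post = poststratified_mean(
    pop_counts={"15-24": 30, "25+": 70},
    resp_values_by_stratum={"15-24": [10, 20], "25+": [30, 30, 60, 40]},
)
print(ybar_post)
```

The only change from a weighting-class calculation is the source of the class shares: population counts here, sample counts there.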

Strategy 4: Imputation • Missing item (question) data are typical in a survey • Refusals, data collector error, edit erroneous value after data collection • Imputation is a statistical method for “filling in” missing values • If impute all missing values, can get a complete rectangular data set (rows = units, columns = variables) • An indicator variable should be developed to identify which values are imputed

Imputation methods • Deductive imputation • Common method, rarely applicable • Cell mean imputation • Leads to incorrect distribution of y in dataset • Hot-deck imputation (random) • Most common and generally applicable • Regression imputation • Between hot-deck and cell mean • Multiple imputation • Accounting for variation due to imputation process

Deductive imputation • Sufficient information exists to identify the missing value • Relatively uncommon (especially with computer-based systems) • Example for NCVS • Person 7 • Crime victim = no • Violent crime victim = ? • Deductive imputation • Crime victim = no → Violent crime victim = no

Cell mean imputation • Procedure • Divide responding units into imputation classes • Within a given imputation class: • Calculate the average value for available item data in class • Fill in missing value for nonresponding unit with average value • Properties • Assumes MAR (covariates = classes) • Retains mean estimate for an imputation class • Underestimates variance, distorts distribution of y • All missing values in a class are equal to the class mean
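The cell mean procedure can be sketched as follows. Records are (imputation class, value) pairs with None marking a missing item, and the output carries an indicator for imputed values; the function name and data are hypothetical.

```python
def cell_mean_impute(records):
    """Fill each missing value with the mean of the observed values in the
    same imputation class. Returns (class, value, imputed_flag) triples."""
    sums, counts = {}, {}
    for cls, value in records:
        if value is not None:
            sums[cls] = sums.get(cls, 0.0) + value
            counts[cls] = counts.get(cls, 0) + 1

    completed = []
    for cls, value in records:
        if value is None:
            if cls not in counts:
                raise ValueError(f"no observed values in class {cls!r}")
            completed.append((cls, sums[cls] / counts[cls], True))
        else:
            completed.append((cls, value, False))
    return completed


# Hypothetical item data with two imputation classes.
data = [("A", 1.0), ("A", None), ("A", 3.0), ("B", 10.0), ("B", None)]
completed_mean = cell_mean_impute(data)
print(completed_mean)
```

Every imputed value in a class is identical, which is why the completed data set understates the variability of y.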

(Random) hot deck imputation • Procedure • Divide responding units into imputation classes (like weighting classes) • Choose like strata – group similar units in relation to variable with missing value • Within a given imputation class • Randomly select a donor from responding units in class • Fill in missing value for nonresponding unit with value from donor unit • Properties • Retains variation in individual values • Assumes MAR (imputation class = covariate) • Can impute for many variables from same donor
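A minimal sketch of random hot deck imputation for a single variable, assuming donors are drawn with replacement from the responding units in the same imputation class. The function name, the seed argument, and the data are hypothetical.

```python
import random


def hot_deck_impute(records, seed=0):
    """Fill each missing value with the value of a randomly chosen donor
    from the same imputation class. Returns (class, value, imputed_flag)."""
    rng = random.Random(seed)          # seeded so the result is reproducible
    donors = {}
    for cls, value in records:
        if value is not None:
            donors.setdefault(cls, []).append(value)

    completed = []
    for cls, value in records:
        if value is None:
            if cls not in donors:
                raise ValueError(f"no donors available in class {cls!r}")
            completed.append((cls, rng.choice(donors[cls]), True))
        else:
            completed.append((cls, value, False))
    return completed


# Hypothetical item data with two imputation classes.
data_hd = [("A", 1.0), ("A", None), ("A", 3.0), ("B", 10.0), ("B", 12.0), ("B", None)]
completed_hd = hot_deck_impute(data_hd, seed=1)
print(completed_hd)
```

Because imputed values are actual observed values, the completed data keep the spread of individual values that cell mean imputation loses.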

Regression imputation • Procedure • Use a regression model to relate covariate(s) to variable with missing data • Estimate regression parameters with data from responding units • Fill in missing value with predicted value, or derived value from prediction (if > .5, binary y = 1) • Properties • Assumes MAR • Useful when number of responding units in imputation class is too small • Useful if a strong relationship exists that provides a better predicted value for the missing data • May be a form of (conditional) mean imputation • Requires separate model for each variable with missing data
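As an illustration of the procedure, the sketch below fits a simple linear regression of y on one covariate x by least squares, using only the responding units, and fills each missing y with its predicted value. A single covariate and a linear model are assumptions made for the example; the function name and data are hypothetical.

```python
def regression_impute(x, y):
    """Impute missing y (None) from a least-squares line fitted to the
    complete (x, y) pairs. Returns (completed_y, imputed_flags)."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two responding units to fit a line")
    x_bar = sum(xi for xi, _ in pairs) / len(pairs)
    y_bar = sum(yi for _, yi in pairs) / len(pairs)
    s_xx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    if s_xx == 0:
        raise ValueError("covariate has no variation among respondents")
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    completed, flags = [], []
    for xi, yi in zip(x, y):
        if yi is None:
            completed.append(intercept + slope * xi)   # predicted value
            flags.append(True)
        else:
            completed.append(yi)
            flags.append(False)
    return completed, flags


# Hypothetical data: the third unit is missing y.
y_completed, y_flags = regression_impute([1, 2, 3, 4], [2.0, 4.0, None, 8.0])
print(y_completed, y_flags)
```

Like cell mean imputation, filling in the fitted value puts every imputed point exactly on the regression line, so it is a form of conditional mean imputation.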

Multiple imputation • Procedure • Select an imputation method • Impute m > 1 values for each missing data item • Result is m (different) data sets with no missing values • Properties • Variation in estimates across data sets provides an estimate of the variability associated with the imputation process • Solution to problem with other methods • Most analysts treat imputed data as “real” rather than “estimated” data • Underestimate variance of estimates
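One common way to combine the m completed data sets is Rubin's combining rules, which are not spelled out in these slides and are brought in here as an assumption: average the m point estimates, then add the average within-data-set variance to the between-data-set variance inflated by (1 + 1/m). The function name and numbers are hypothetical.

```python
def combine_multiple_imputations(estimates, variances):
    """Combine m point estimates and their estimated variances
    (one pair per imputed data set) using Rubin's rules."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m > 1 estimates with matching variances")
    q_bar = sum(estimates) / m                               # combined estimate
    within = sum(variances) / m                              # average within variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total_var = within + (1 + 1 / m) * between               # total variance
    return {"estimate": q_bar, "within": within,
            "between": between, "total_variance": total_var}


# Hypothetical results from m = 3 imputed data sets.
mi = combine_multiple_imputations(estimates=[10.0, 12.0, 11.0],
                                  variances=[4.0, 4.0, 4.0])
print(mi)
```

The between-data-set term is the piece that an analysis of a single imputed data set leaves out, which is why treating imputed data as real underestimates the variance.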

Imputation summary • Most imputation methods assume MAR given covariates • Variation in methods associated with model used to account for covariate • Good methods exist that do not lead to a distorted distribution of y in the data set • Avoid cell mean imputation • Hot deck imputation allows us to perform imputation for >1 variable at a time • Most imputation methods do not account for the fact that you are “estimating” the data when estimating the variance of an estimate • This is the motivation for multiple imputation • Need special estimators for variance in multiple imputation

Outcome rates • MANY ways to describe results of processes between sample selection and completing data collection • Phases • Locating unit • Contacting unit (for people, businesses) • Gaining cooperation of a unit (refusals) • Determining eligibility • Obtaining complete item data for a unit • AAPOR reference • http://www.aapor.org/default.asp?page=survey_methods/response_rate_calculator
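As one example of an outcome rate, the sketch below computes what AAPOR's Standard Definitions call Response Rate 1 (RR1): complete interviews divided by all cases that are eligible or of unknown eligibility. The grouping of dispositions follows that definition as understood here; the function name and the counts are hypothetical, and the AAPOR reference should be checked for the other rates.

```python
def response_rate_1(complete, partial, refusal, non_contact, other,
                    unknown_household, unknown_other):
    """AAPOR RR1: completes over (interviews + eligible nonresponse
    + cases of unknown eligibility). Known ineligible cases are excluded."""
    denominator = ((complete + partial)
                   + (refusal + non_contact + other)
                   + (unknown_household + unknown_other))
    if denominator == 0:
        raise ValueError("no eligible or unknown-eligibility cases")
    return complete / denominator


# Hypothetical final disposition counts for a phone survey.
rr1 = response_rate_1(complete=600, partial=50, refusal=200, non_contact=100,
                      other=10, unknown_household=30, unknown_other=10)
print(f"RR1 = {rr1:.3f}")
```

RR1 is the most conservative choice because partial interviews count against the rate and every unknown-eligibility case is treated as eligible.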
