A Stata program for calibration weighting

John D’Souza, National Centre for Social Research

Outline
• Description of calibration
  • Adjust selection weights so that a weighted sample exactly matches the population
  • Generalizes post-stratification
  • Several methods: linear, logistic, …
• SAS, GenStat
• A new Stata program
• Limitations and extensions

Sampling
• Selection weights: $d_k = 1/P(\text{person } k \text{ is chosen})$
• Sample frame variables $X_{k1}, \dots, X_{kJ}$ with known population totals $P_1, \dots, P_J$
• Horvitz-Thompson estimator of $P_i$: $\sum_{k \in S} d_k X_{ki} \approx P_i$ for $i = 1, 2, \dots, J$
• Calibration: adjust $d_k$ to get calibration weights $w_k$ that give exact equality: $\sum_{k \in S} w_k X_{ki} = P_i$ for $i = 1, 2, \dots, J$
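
A small numeric sketch of the gap that calibration closes. The five-person sample, the weights and the population total are invented for illustration. With a single calibration variable the linear method has a closed form, which is what the second half uses.

```python
# Hypothetical five-person sample (all numbers invented for illustration).
d = [100.0, 100.0, 150.0, 150.0, 200.0]  # selection weights d_k
x = [1, 0, 1, 1, 0]                      # binary frame variable, e.g. FSM eligibility
P = 420.0                                # assumed known population total of x

# Horvitz-Thompson estimate: close to P, but not equal to it.
ht_total = sum(dk * xk for dk, xk in zip(d, x))

# Linear calibration with a single variable has a closed form:
# w_k = d_k * (1 + lam * x_k), with lam chosen so the weighted total hits P.
lam = (P - ht_total) / sum(dk * xk * xk for dk, xk in zip(d, x))
w = [dk * (1.0 + lam * xk) for dk, xk in zip(d, x)]
cal_total = sum(wk * xk for wk, xk in zip(w, x))

print("Horvitz-Thompson total:", ht_total)
print("Calibrated total:      ", cal_total)
```

Only the cases with x = 1 have their weights changed here; with several variables the adjustment is shared across all of them.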

Example: School Census
Variables include
• Age, gender, ethnic group, exam results
• Type of school, region
• Pupil’s free school meal (FSM) eligibility
We calibrate to J variables, e.g.
• Boy (binary): 1
• Girl (binary): 1
• Region (e.g. four categories): 4 - 1 = 3
• FSM eligibility (binary): 1
J = 1 + 1 + (4 - 1) + 1 = 6

Special case: post-stratification
• Simplest case:
  • One categorical variable
  • Easy to deal with (post-stratification)
  • svyset , poststrata() postweight()
• More general case:
  • Several variables (categorical and numerical)
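
To make the special case concrete, here is a minimal Python sketch of post-stratification to one categorical variable. The sample, the stratum labels and the population counts are all hypothetical. Each weight is scaled so that the weights in a post-stratum sum to that stratum's population count.

```python
from collections import defaultdict

# Hypothetical sample: (selection weight d_k, post-stratum) for each person.
sample = [(100.0, "north"), (120.0, "north"),
          (80.0, "south"), (90.0, "south"), (110.0, "south")]

# Assumed known population count in each post-stratum.
pop_count = {"north": 250.0, "south": 300.0}

# Sum of selection weights within each post-stratum.
weight_sum = defaultdict(float)
for d_k, g in sample:
    weight_sum[g] += d_k

# Post-stratified weight: d_k times (population count / weighted sample count).
post_weights = [d_k * pop_count[g] / weight_sum[g] for d_k, g in sample]

# Check: the new weights reproduce the population count in every stratum.
totals = defaultdict(float)
for (d_k, g), w_k in zip(sample, post_weights):
    totals[g] += w_k
print(dict(totals))
```

This is calibration with one indicator variable per category; the general case below replaces the per-stratum scaling with a system of J equations.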

Deville and Särndal (1992)
Minimize the “distance” between w and d subject to the J calibration constraints.
• Linear calibration: minimize $\sum_{k \in S} (w_k - d_k)^2 / d_k$. Involves solving J simultaneous linear equations.
• Logistic calibration: minimize $\sum_{k \in S} \left( w_k \log(w_k/d_k) - w_k + d_k \right)$. Involves solving J simultaneous non-linear equations.
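
Where the J equations come from: a short sketch of the Lagrangian argument, written in vector notation that is not on the slides. Here $x_k = (X_{k1}, \dots, X_{kJ})^{\top}$, $P = (P_1, \dots, P_J)^{\top}$ and $\lambda$ is the vector of J Lagrange multipliers, with constant factors absorbed into it.

```latex
% Linear distance: setting the derivative of the Lagrangian to zero gives
% weights that are linear in lambda, so the constraints are J linear equations.
w_k = d_k \left( 1 + x_k^{\top} \lambda \right),
\qquad
\Big( \sum_{k \in S} d_k \, x_k x_k^{\top} \Big) \lambda
  = P - \sum_{k \in S} d_k \, x_k .

% Distance \sum_k ( w_k \log(w_k / d_k) - w_k + d_k ):
% its derivative in w_k is \log(w_k / d_k), so the weights are exponential
% in lambda and the constraints are J non-linear equations.
w_k = d_k \exp\!\left( x_k^{\top} \lambda \right),
\qquad
\sum_{k \in S} d_k \exp\!\left( x_k^{\top} \lambda \right) x_k = P .
```

The second form always gives positive weights, whereas the linear form can produce negative ones.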

GenStat, SAS, Stata
• GenStat and SAS
  • Methods: linear, logistic and bounded.
  • Estimation: GenStat gives SEs.
  • SAS handles categorical variables directly; in GenStat, enter them as indicator variables.
• Stata
  • Post-stratification (calibration to one categorical variable). Gives SEs.
  • No routine for general calibration.

A new Stata program
• Typical syntax:
    matrix M = (10000, 10000, 3000, 4000, 3000, 8000)
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(linear) print(final)
• Population totals: 10,000 boys, 10,000 girls, 3,000 FSM; the remaining entries of M correspond to ireg1-ireg3
• Variables boy, girl and FSM are binary
• Categorical variable region (4 categories) is turned into 4 binary indicator variables; only 3 are entered in the syntax (collinearity)
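
For readers who want to see what a linear calibration computes in principle, here is a minimal pure-Python sketch of the linear method of Deville and Särndal (1992). It is not the code of the calibrate program: the helper names solve and linear_calibrate, the six-person sample and the totals are all hypothetical, and only three of the marginals (boy, girl, FSM) are used to keep it short.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: check for collinear variables")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_calibrate(d, X, P):
    """Return w_k = d_k * (1 + x_k' lam), with lam solving the J linear equations."""
    J = len(P)
    # A = sum_k d_k x_k x_k',  b = P - sum_k d_k x_k
    A = [[sum(dk * xk[i] * xk[j] for dk, xk in zip(d, X)) for j in range(J)]
         for i in range(J)]
    b = [P[i] - sum(dk * xk[i] for dk, xk in zip(d, X)) for i in range(J)]
    lam = solve(A, b)
    return [dk * (1.0 + sum(l * v for l, v in zip(lam, xk)))
            for dk, xk in zip(d, X)]


# Hypothetical data: columns are boy, girl, FSM.
d = [10.0, 10.0, 12.0, 12.0, 8.0, 8.0]
X = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
P = [35.0, 30.0, 28.0]  # assumed population totals for boy, girl, FSM

w = linear_calibrate(d, X, P)
weighted_totals = [sum(wk * xk[i] for wk, xk in zip(w, X)) for i in range(len(P))]
print(weighted_totals)
```

The weighted totals match P up to rounding error, which is the exact-equality property the slides describe.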

Output

Options
• Options available to:
  • Control amount of output/graphs
  • Set max number of iterations/tolerance
• Methods
  • linear, logistic, bounded linear (blinear) and nonresp
  • blinear sets bounds for $w_k/d_k$ (GenStat and SAS have something very similar)
  • nonresp adjusts for non-response (see below)
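
The iteration limit and tolerance matter for the iterative methods. As an illustration only, the sketch below solves the non-linear equations for the distance the slides label logistic, which gives weights of the form d_k exp(x_k'λ), using Newton's method. The function names, the default max_iter and tol values and the data are assumptions, not the program's actual options or defaults.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def exp_calibrate(d, X, P, max_iter=50, tol=1e-8):
    """Newton iteration for weights w_k = d_k * exp(x_k' lam)."""
    J = len(P)
    lam = [0.0] * J
    for iteration in range(1, max_iter + 1):
        w = [dk * math.exp(sum(l * v for l, v in zip(lam, xk)))
             for dk, xk in zip(d, X)]
        # Gap between the target totals and the current weighted totals.
        gap = [P[i] - sum(wk * xk[i] for wk, xk in zip(w, X)) for i in range(J)]
        if max(abs(g) for g in gap) < tol:
            return w, iteration
        # Jacobian of the weighted totals with respect to lam.
        H = [[sum(wk * xk[i] * xk[j] for wk, xk in zip(w, X)) for j in range(J)]
             for i in range(J)]
        step = solve(H, gap)
        lam = [l + s for l, s in zip(lam, step)]
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical data: columns are boy, girl, FSM.
d = [10.0, 10.0, 12.0, 12.0, 8.0, 8.0]
X = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
P = [35.0, 30.0, 28.0]

w, n_iter = exp_calibrate(d, X, P)
weighted_totals = [sum(wk * xk[i] for wk, xk in zip(w, X)) for i in range(len(P))]
print(n_iter, weighted_totals)
```

If the targets cannot be reached with positive weights, the loop runs out of iterations and raises an error, which mirrors the non-convergence the later limitations slide mentions.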

Limitations (1)
• Solves the equations by finding a matrix inverse
  1. Won’t work if J is large
  2. Can have problems with singular or nearly singular matrices
  3. Iterative methods (logistic, blinear) won’t always converge
• No obvious solution to problem 1. Problems 2 and 3 are usually down to problems with the data.
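
One common data problem behind a singular matrix is collinear calibration variables. The sketch below, with invented pupils, shows why entering boy, girl and all four region indicators fails: boy + girl and the four indicators both sum to 1 on every row, so the columns are linearly dependent.

```python
# Hypothetical pupils: (boy, girl, region), with region coded 1 to 4.
pupils = [(1, 0, 1), (0, 1, 2), (1, 0, 3), (0, 1, 4), (1, 0, 2)]

# Design rows with boy, girl and ALL four region indicators.
rows_all4 = []
for boy, girl, region in pupils:
    ireg = [1 if region == r else 0 for r in (1, 2, 3, 4)]
    rows_all4.append([boy, girl] + ireg)

# boy + girl - (ireg1 + ... + ireg4) is zero on every row, so the columns
# are linearly dependent and the matrix sum_k d_k x_k x_k' is singular.
dependent = all(r[0] + r[1] - sum(r[2:]) == 0 for r in rows_all4)

# Dropping ireg4 (as in the slides' syntax) breaks this dependency.
rows_3 = [r[:5] for r in rows_all4]
still_dependent = all(r[0] + r[1] - sum(r[2:]) == 0 for r in rows_3)

print(dependent, still_dependent)
```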

Limitations (2)
• We need to recode categorical variables (SAS doesn’t)
  • Stata: tab region, gen(ireg)
• More complicated (e.g. two-phase) problems aren’t handled directly
  • Need a bit of syntax to handle this
  • Other packages can handle this directly
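
The recoding step is simple to picture. This Python sketch mimics what the tab region, gen(ireg) line produces for a hypothetical region variable: one 0/1 indicator per distinct value, numbered in sorted order.

```python
# Hypothetical region codes for six pupils.
region = [1, 3, 2, 1, 4, 2]

# One indicator per distinct value, named ireg1, ireg2, ... in sorted order.
levels = sorted(set(region))
indicators = {
    f"ireg{i}": [1 if r == lev else 0 for r in region]
    for i, lev in enumerate(levels, start=1)
}

for name, values in indicators.items():
    print(name, values)
```

Exactly one indicator equals 1 for each pupil, which is why only three of the four are entered as marginals.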

Extensions – Standard errors
Calibration weights are often incorrectly treated as selection weights.
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3)
    calibmean , selwt(w1) calibwt(w2) yvar(y) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        psu(school) designops(strata(region))
This generalizes the poststrata() option of Stata’s svyset.
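
Why the distinction matters: a sketch of the standard linearization variance for a calibrated total, following Deville and Särndal (1992). Whether calibmean implements exactly this form is an assumption; the notation ($x_k$ for the vector of marginals, $h$ for strata, $i$ for PSUs, $n_h$ for the number of sampled PSUs in stratum $h$) is introduced here for illustration.

```latex
% Calibrated estimator of a total and the regression residuals it relies on
\hat{Y}_{\mathrm{cal}} = \sum_{k \in S} w_k \, y_k ,
\qquad
e_k = y_k - x_k^{\top} \hat{B},
\qquad
\hat{B} = \Big( \sum_{k \in S} d_k \, x_k x_k^{\top} \Big)^{-1}
          \sum_{k \in S} d_k \, x_k \, y_k .

% With-replacement approximation for a stratified, clustered design:
% PSU totals are formed from weighted residuals, not from weighted y-values
z_{hi} = \sum_{k \in \mathrm{PSU}\ hi} w_k \, e_k ,
\qquad
\hat{V}\big( \hat{Y}_{\mathrm{cal}} \big)
  = \sum_{h} \frac{n_h}{n_h - 1}
    \sum_{i=1}^{n_h} \left( z_{hi} - \bar{z}_h \right)^2 .
```

Treating the calibration weights as selection weights amounts to using $w_k y_k$ in place of $w_k e_k$, which typically overstates the variance when the marginals predict $y$ well.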

Extension: Method nonresp (1)
Example
• Select schools, then classes, then pupils
• Assume all schools respond; pupils might not
• Variables available on responders (population totals available): gender, exam results, FSM, region
• Variables available on non-responders too (population totals not available):
  • PTratio: pupil-teacher ratio
  • topset: is the pupil in the top set?

Extension: Method nonresp (2)
     serial   region   topset   outc   sex   FSM
  ----------------------------------------------
  1.   1001        1        1      0     .     .
  2.   1002        1        0      1     1     0
  3.   1003        2        0      0     .     .
  4.   1004        1        0      1     1     1
  5.   1005        3        1      0     .     .
  ----------------------------------------------
  6.   1006        1        0      1     0     1
  7.   1007        3        1      1     1     0
  8.   1008        2        1      0     .     .
  9.   1009        1        0      1     1     0

Extension: Method nonresp (3)
Population totals are unknown for PTratio and topset, but both variables are available for the whole sample (including non-responders).
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(nonresp) outc(outc) ///
        svars(PTratio topset)
Responders are weighted to population totals on “marginals” and to selected-sample totals on “svars” (Lundström & Särndal, 2005).
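
The idea can be sketched in a few lines of Python: build one target vector from the population totals (for the marginals) and the design-weighted totals over the whole selected sample (for the svars), then calibrate the responders to it. The linear distance is used here for simplicity, which is an assumption about the method, and the helper names, the eight-person sample and the totals are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_calibrate(d, X, P):
    """Linear calibration: w_k = d_k * (1 + x_k' lam)."""
    J = len(P)
    A = [[sum(dk * xk[i] * xk[j] for dk, xk in zip(d, X)) for j in range(J)]
         for i in range(J)]
    b = [P[i] - sum(dk * xk[i] for dk, xk in zip(d, X)) for i in range(J)]
    lam = solve(A, b)
    return [dk * (1.0 + sum(l * v for l, v in zip(lam, xk)))
            for dk, xk in zip(d, X)]


# Hypothetical sample rows: (d_k, outc, boy, girl, topset).
# boy and girl are missing (None) for non-responders (outc == 0);
# topset is known for everyone in the selected sample.
sample = [
    (10.0, 0, None, None, 1), (10.0, 1, 1, 0, 0),
    (10.0, 0, None, None, 0), (10.0, 1, 1, 0, 1),
    (10.0, 1, 0, 1, 1),       (10.0, 0, None, None, 1),
    (10.0, 1, 0, 1, 0),       (10.0, 1, 1, 0, 0),
]
pop_totals = [45.0, 40.0]  # assumed population totals for boy, girl

# Target for topset: design-weighted total over the WHOLE selected sample.
svar_total = sum(row[0] * row[4] for row in sample)

# Calibrate responders only, to population totals plus the sample total.
resp = [row for row in sample if row[1] == 1]
d_resp = [row[0] for row in resp]
X_resp = [[row[2], row[3], row[4]] for row in resp]
targets = pop_totals + [svar_total]

w_resp = linear_calibrate(d_resp, X_resp, targets)
achieved = [sum(wk * xk[i] for wk, xk in zip(w_resp, X_resp)) for i in range(3)]
print(targets, achieved)
```

The responders' weights now reproduce the population totals for boy and girl and the full-sample total for topset, so information observed on non-responders feeds into the adjustment.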

Conclusions
• We’ve found the program can handle many practical problems
• Easy to calculate SEs (but theory assumes no non-response)
• Method nonresp isn’t available in many packages
• We don’t have to calibrate to population totals
  • E.g. calibrate Wave n+1 of a survey to totals from Wave n
  • Calibrate one sample to look like another

Questions

References
• Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
  • Background and theory behind calibration
• Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
  • Deals with non-response
• Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
  • Discusses several methods of doing bounded calibration
