This program in Stata allows for the adjustment of selection weights to match population totals, using methods such as linear and logistic calibration. It is useful for post-stratification and can handle both categorical and numerical variables.
A Stata program for calibration weighting
John D’Souza
National Centre for Social Research
Outline
• Description of calibration
• Adjust selection weights so that a weighted sample exactly matches the population
• Generalizes post-stratification
• Several methods: linear, logistic, …
• SAS, GenStat
• A new Stata program
• Limitations and extensions
Sampling
• Selection weights: d_k = 1 / P(person k is chosen)
• Sample frame variables X_k1, …, X_kJ with known population totals P_1, …, P_J
• Horvitz-Thompson estimator of P_i: ∑_{k∈S} d_k X_ki ≈ P_i for i = 1, 2, …, J (sums run over the sample S)
• Calibration: adjust d_k to get calibration weights w_k that give exact equality: ∑_{k∈S} w_k X_ki = P_i for i = 1, 2, …, J
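To make the gap between the Horvitz-Thompson estimates and the known totals concrete, here is a minimal Python sketch. The four-person sample, its weights and the population totals are invented for illustration, and the helper name weighted_totals is hypothetical; none of it comes from the talk.

```python
# Toy data, invented for illustration only.
d = [100.0, 100.0, 200.0, 200.0]          # selection weights d_k = 1 / P(person k is chosen)
X = [[1, 0], [0, 1], [1, 0], [0, 1]]      # one row per person; columns: boy, girl
P = [320.0, 280.0]                        # assumed known population totals P_1, P_2


def weighted_totals(weights, X):
    """Return sum_k weights[k] * X[k][i] for every column i."""
    n_cols = len(X[0])
    return [sum(w * row[i] for w, row in zip(weights, X)) for i in range(n_cols)]


ht = weighted_totals(d, X)                   # Horvitz-Thompson estimates of P_i
gaps = [p - t for p, t in zip(P, ht)]        # what calibration has to remove
print("HT estimates:", ht)
print("Gaps to population totals:", gaps)
```

Calibration replaces d by weights w for which every entry of gaps is exactly zero.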
Example: School Census
Variables include
• Age, gender, ethnic group, exam results
• Type of school, region
• Pupil’s Free School Meal (FSM) eligibility
We calibrate to J variables, e.g.
• Boy (binary)
• Girl (binary)
• Region (e.g. four categories)
• FSM eligibility (binary)
J = 1 + 1 + (4 − 1) + 1 = 6
Special case: post-stratification
• Simplest case:
  • One categorical variable
  • Easy to deal with (post-stratification)
  • svyset , poststrata() postweight()
• More general case:
  • Several variables (categorical and numerical)
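For the simplest case, the post-stratified weight is the selection weight scaled by (population total of the stratum) / (weighted sample total of the stratum). A minimal Python sketch of that rule follows; the data and the function name poststratify are assumptions made for illustration, not part of the Stata program.

```python
from collections import defaultdict


def poststratify(d, stratum, pop_totals):
    """Scale each weight so that weighted stratum counts match pop_totals."""
    weighted_count = defaultdict(float)
    for dk, h in zip(d, stratum):
        weighted_count[h] += dk
    return [dk * pop_totals[h] / weighted_count[h] for dk, h in zip(d, stratum)]


# Toy data, invented for illustration only.
d = [100.0, 100.0, 200.0, 200.0]
stratum = ["boy", "girl", "boy", "girl"]
pop = {"boy": 320.0, "girl": 280.0}

w = poststratify(d, stratum, pop)
boys = sum(wk for wk, h in zip(w, stratum) if h == "boy")
girls = sum(wk for wk, h in zip(w, stratum) if h == "girl")
print(w, boys, girls)
```

With several calibration variables there is no single scaling factor per person, which is why the general case needs the methods on the following slides.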
Deville and Särndal (1992)
Minimize the “distance” between w and d subject to the J calibration constraints.
• Linear calibration: minimize ∑_{k∈S} (w_k − d_k)^2 / d_k
  • Involves solving J simultaneous linear equations
• Logistic calibration: minimize ∑_{k∈S} (w_k log(w_k / d_k) − w_k + d_k)
  • Involves solving J simultaneous non-linear equations
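For the linear distance, the Lagrangian conditions give w_k = d_k (1 + x_k'λ), where λ solves the J linear equations (∑ d_k x_k x_k') λ = P − ∑ d_k x_k. The Python sketch below implements that standard closed form on an invented four-person sample. It is an illustration of the method, not the code of the Stata program; the function names and data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular matrix: check for collinear variables")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def linear_calibrate(d, X, P):
    """Return w_k = d_k * (1 + x_k' lam) with lam from the J linear equations."""
    J = len(P)
    T = [[sum(dk * x[i] * x[j] for dk, x in zip(d, X)) for j in range(J)]
         for i in range(J)]
    gap = [P[i] - sum(dk * x[i] for dk, x in zip(d, X)) for i in range(J)]
    lam = solve(T, gap)
    return [dk * (1.0 + sum(l * xi for l, xi in zip(lam, x))) for dk, x in zip(d, X)]


# Toy data, invented for illustration: columns are boy, girl, FSM.
d = [100.0, 100.0, 100.0, 100.0]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
P = [220.0, 180.0, 150.0]

w = linear_calibrate(d, X, P)
totals = [sum(wk * x[i] for wk, x in zip(w, X)) for i in range(3)]
print(w, totals)
```

Nothing in the linear method stops a weight from going negative when the gaps are large, which is one motivation for the logistic and bounded methods.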
GenStat, SAS, Stata
• GenStat and SAS
  • Methods: linear, logistic and bounded
  • Estimation: GenStat gives SEs
  • SAS handles categorical variables directly; in GenStat they must be entered as indicator variables
• Stata
  • Post-stratification (calibration to one categorical variable), with SEs
  • No routine for general calibration
A new Stata program
• Typical syntax:
    matrix M = (10000, 10000, 3000, 4000, 3000, 8000)
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(linear) print(final)
• Population totals in M: 10,000 boys, 10,000 girls, 3,000 FSM, followed by the totals for ireg1-ireg3
• Variables boy, girl and FSM are binary
• The categorical variable region (4 categories) is turned into 4 binary indicator variables; only 3 are entered in the syntax (to avoid collinearity)
Options
• Options are available to:
  • Control the amount of output and graphs
  • Set the maximum number of iterations and the tolerance
• Methods: linear, logistic, blinear (bounded linear) and nonresp
  • blinear sets bounds for w_k/d_k; GenStat and SAS have something very similar
  • nonresp adjusts for non-response (see below)
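The iteration and tolerance options matter for the iterative methods. As an illustration, the logistic distance leads to weights w_k = d_k exp(x_k'λ), and λ can be found by Newton's method. The Python sketch below shows that approach on invented data; the parameter names maxit and tol and the use of plain Newton steps are assumptions for illustration, and the Stata program's own algorithm may differ.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular matrix: check for collinear variables")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def logistic_calibrate(d, X, P, maxit=25, tol=1e-8):
    """Newton iteration for w_k = d_k * exp(x_k' lam) matching the totals P."""
    J = len(P)
    lam = [0.0] * J
    for _ in range(maxit):
        w = [dk * math.exp(sum(l * xi for l, xi in zip(lam, x)))
             for dk, x in zip(d, X)]
        gap = [P[i] - sum(wk * x[i] for wk, x in zip(w, X)) for i in range(J)]
        if max(abs(g) for g in gap) < tol:
            return w
        # Jacobian of the weighted totals with respect to lam
        H = [[sum(wk * x[i] * x[j] for wk, x in zip(w, X)) for j in range(J)]
             for i in range(J)]
        step = solve(H, gap)
        lam = [l + s for l, s in zip(lam, step)]
    raise RuntimeError("no convergence within maxit iterations")


# Toy data, invented for illustration: columns are boy, girl, FSM.
d = [100.0, 100.0, 100.0, 100.0]
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
P = [220.0, 180.0, 150.0]

w = logistic_calibrate(d, X, P)
totals = [sum(wk * x[i] for wk, x in zip(w, X)) for i in range(3)]
print(w, totals)
```

Because each weight is d_k times an exponential, the logistic weights stay positive whenever the iteration converges.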
Limitations (1)
1. Solves the equations by finding a matrix inverse
   • Won’t work if J is large
2. Can have problems with singular or nearly singular matrices
3. Iterative methods (logistic, blinear) won’t always converge
There is no obvious solution to problem 1. Problems 2 and 3 are usually down to problems with the data.
Limitations (2)
• We need to recode categorical variables as indicator variables (SAS doesn’t)
  • Stata: tab region, gen(ireg)
• More complicated (e.g. two-phase) problems aren’t handled directly
  • Need a bit of syntax to handle this
  • Other packages can handle this directly
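The recoding step turns one categorical variable into one 0/1 indicator per category. The indicators of a variable always sum to 1 for each person, exactly as boy + girl does, so entering all of them alongside boy and girl would make the calibration equations singular; hence one indicator is left out. A small Python sketch of the idea, with invented data and a hypothetical helper name:

```python
def indicators(values):
    """Return the sorted category levels and one 0/1 column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == lev else 0 for lev in levels] for v in values]
    return levels, rows


# Toy data, invented for illustration: region codes for six pupils.
region = [1, 1, 2, 1, 3, 4]

levels, ireg = indicators(region)            # analogous to: tab region, gen(ireg)
ireg_used = [row[:-1] for row in ireg]       # drop the last level before calibrating

print(levels)
for full, used in zip(ireg, ireg_used):
    print(full, used)
```

A pupil in the omitted category simply has zeros in all the retained indicators.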
Extensions: standard errors
Calibration weights are often incorrectly treated as selection weights.
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3)
    calibmean , selwt(w1) calibwt(w2) yvar(y) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        psu(school) designops(strata(region))
This generalizes Stata’s poststrata() option.
Extension: Method nonresp (1)
Example
• Select schools, then classes, then pupils
• Assume all schools respond, but pupils might not
• Variables available on responders (population totals available):
  • Gender, exam results, FSM, region
• Variables also available on non-responders (population totals not available):
  • PTratio: pupil-teacher ratio
  • topset: is the pupil in the top set?
Extension: Method nonresp (2)
      serial  region  topset  outc  sex  FSM
      --------------------------------------
  1.    1001       1       1     0    .    .
  2.    1002       1       0     1    1    0
  3.    1003       2       0     0    .    .
  4.    1004       1       0     1    1    1
  5.    1005       3       1     0    .    .
      --------------------------------------
  6.    1006       1       0     1    0    1
  7.    1007       3       1     1    1    0
  8.    1008       2       1     0    .    .
  9.    1009       1       0     1    1    0
(Rows with outc = 0 are non-responders; their sex and FSM values are missing.)
Extension: Method nonresp (3)
Population totals are unknown for some variables, but those variables are available on all the sample (including non-responders).
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(nonresp) outc(outc) ///
        svars(PTratio topset)
Responders are weighted to population totals on “marginals” and to selected sample totals on “svars” (Lundström & Särndal, 2005).
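Written as equations, the two sets of constraints described here look as follows. The notation is introduced only for this sketch and is not taken from the talk: r is the set of responders, s the full selected sample, x_k the vector of "marginals" variables with population totals P, and z_k the vector of "svars" variables.

```latex
\sum_{k \in r} w_k \mathbf{x}_k = \mathbf{P},
\qquad
\sum_{k \in r} w_k \mathbf{z}_k = \sum_{k \in s} d_k \mathbf{z}_k
```

The right-hand side of the second constraint is the Horvitz-Thompson estimate from the whole selected sample, which stands in for the unknown population totals of the svars variables.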
Conclusions
• We’ve found the program can handle many practical problems
• Easy to calculate SEs (but theory assumes no non-response)
• Method nonresp isn’t available in many packages
• We don’t have to calibrate to population totals
  • E.g. calibrate Wave n+1 of a survey to totals from Wave n
  • Calibrate one sample to look like another
References
• Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
  • Background and theory behind calibration
• Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
  • Deals with non-response
• Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
  • Discusses several methods of doing bounded calibration