1. Analyzing “fancy” random samples: Clusters, weights, and (maybe) strata Paul von Hippel
Ohio State University
May 2004
2. Overview Classical theory assumes
simple random sampling
people sampled independently
and with equal probability
Modern surveys use
complex random sampling
Complexities
clusters
people not sampled independently
weights
people not sampled with equal probability
3. Example data ECLS-K
Early Childhood Longitudinal Study, Kindergarten cohort
~20,000 kindergarteners
~1000 schools
4. If ECLS-K were a simple random sample…
5. Why ECLS-K isn’t a simple random sample Simple random sampling is impractical
List all 3 million kindergarteners (sampling frame)
How?
Sample 20,000 names
Track them down
probably in close to 20,000 different schools
6. 1. Clusters
7. Cluster: Definition Divide population into clusters
containing several units each
Sample the clusters
then sample within them
8. Clusters in ECLS-K Divide country into 1335 geographic regions
PSUs (primary sampling units)
Sample 100 PSUs
Within each PSU
sample ~10 schools
Within each school
sample ~20 children
Two levels of clustering
children clustered in schools
schools clustered in PSUs
ECLS-K only provides enough information to handle the first level (children clustered in schools)
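The staged design on this slide can be sketched in Python. This is only an illustration, not the actual ECLS-K procedure: the sampling frame is synthetic, the function name `draw_cluster_sample` and the frame layout are hypothetical, and equal-probability draws at every stage are an assumption made for simplicity.

```python
import random

def draw_cluster_sample(frame, n_psus=100, schools_per_psu=10,
                        children_per_school=20, seed=0):
    """Sample PSUs, then schools within PSUs, then children within schools.

    `frame` maps PSU id -> {school id -> list of child ids} (hypothetical layout).
    Equal-probability draws at each stage are an assumption of this sketch.
    """
    rng = random.Random(seed)
    sample = []
    for psu in rng.sample(sorted(frame), n_psus):
        schools = frame[psu]
        for school in rng.sample(sorted(schools), schools_per_psu):
            for child in rng.sample(schools[school], children_per_school):
                sample.append((psu, school, child))
    return sample

# Synthetic frame: 1335 PSUs, each with 12 schools of 25 children.
frame = {
    p: {s: [f"{p}-{s}-{c}" for c in range(25)] for s in range(12)}
    for p in range(1335)
}
sample = draw_cluster_sample(frame)
print(len(sample))                           # children sampled
print(len({(p, s) for p, s, _ in sample}))   # distinct schools visited
```

Only the 100 sampled PSUs ever need a school list, and only the sampled schools need a child list, which is the convenience the design is buying.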
9. Benefit: Clusters are convenient Easier sampling frame
In SRS
list 3 million children
In cluster sample
List 1335 PSUs
Within 100 PSUs, list all schools
Easier data collection
In SRS, visit almost 20,000 schools
In cluster sample, visit 1000 schools
10. Cost: Clusters increase sampling variation Children from the same school are similar
intraclass correlation
Children 2, 3, …, 20 in a familiar school
provide less new information
than child 1 in a new school
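One common way to put a number on this cost is the design-effect approximation 1 + (m - 1)ρ, where m is the cluster size and ρ the intraclass correlation. The formula is a standard textbook approximation for equal-sized clusters and is not taken from these slides; the value of `rho` below is hypothetical, not an ECLS-K estimate.

```python
import math

def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

n_children = 20_000
per_school = 20
rho = 0.10  # hypothetical intraclass correlation, not an ECLS-K estimate

deff = design_effect(per_school, rho)
n_eff = n_children / deff          # size of an SRS with the same precision
se_inflation = math.sqrt(deff)     # factor by which the SE grows vs. SRS

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
print(f"SE inflation factor: {se_inflation:.2f}")
```

With ρ = 0 the design effect is 1 and clustering costs nothing; the cost grows with both the cluster size and the similarity of children within a school.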
11. Cost: Clusters increase sampling variation
12. Tradeoff If children in the same school aren’t too similar
i.e., rods aren’t too short
i.e., intraclass correlation isn’t too high
then benefits outweigh costs
13. Estimating with clusters
14. 2. Weights
15. Weights: Definition Each case is assigned a sampling weight
Definitions
# of cases like this in population
inverse probability of sampling this case
These definitions are proportional
give identical inferences
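A small Python check of why proportional weights give the same answer: a weighted mean divides by the sum of the weights, so any constant multiplier cancels. The scores and weights are made up for illustration.

```python
import math

def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

scores = [20.0, 25.0, 30.0]          # hypothetical test scores
counts = [1000.0, 500.0, 250.0]      # population cases each sampled case represents
total = sum(counts)
rescaled = [w / total for w in counts]   # same weights, rescaled to sum to 1

m_counts = weighted_mean(scores, counts)
m_rescaled = weighted_mean(scores, rescaled)
print(f"{m_counts:.4f} {m_rescaled:.4f}")
print(math.isclose(m_counts, m_rescaled))
```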
16. Why ECLS-K doesn’t sample with equal probability If every school / child had equal probability
ECLS-K would have
few private school children
few Asian-Americans
But researchers want to study those groups
So ECLS-K gives them higher sampling probabilities
Weights are assigned to compensate
17. Benefit: Weights can reduce bias Compared to the population, ECLS-K has
too many Asians
too many private schools
These are high-scoring groups
Without weights, ECLS-K’s mean score
probably exceeds population mean
Weights compensate for this
giving lower weights to Asians and private schools
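The bias and its correction can be shown with a toy calculation in Python. All numbers are hypothetical (they are not ECLS-K figures), and every child is given exactly the group mean score to keep the arithmetic transparent.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

# Hypothetical population: 9000 public-school children (mean score 20)
# and 1000 private-school children (mean score 30).
population_mean = (9000 * 20.0 + 1000 * 30.0) / 10000

# Sample of 100 that oversamples the private sector: 60 public, 40 private.
scores = [20.0] * 60 + [30.0] * 40
# Weight = population cases represented by each sampled case.
weights = [9000 / 60] * 60 + [1000 / 40] * 40   # 150 and 25

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(population_mean, unweighted, weighted)
```

The unweighted mean overstates the population mean because the high-scoring group is overrepresented; the lower weights on that group pull the estimate back.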
18. Benefit: Weights can reduce bias… …if weights correlate with Y
19. Cost: Weights can increase sampling variation
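A rough gauge of this cost is Kish's approximate effective sample size, (Σw)² / Σw². The formula is a standard approximation that is not taken from these slides, and it is most appropriate when the weights are unrelated to the outcome; the weights below are hypothetical.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

equal = [2.0] * 50                       # equal weights: no loss of precision
unequal = [150.0] * 60 + [25.0] * 40     # hypothetical unequal weights

print(kish_effective_n(equal))
print(f"{kish_effective_n(unequal):.1f}")
```

With equal weights the effective sample size equals the actual sample size; the more the weights vary, the further it falls below the number of cases.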
20. Weighted vs. unweighted sample means: A tradeoff
21. Estimating with weights
22. 3. Strata
23. Strata: Definition Divide population into subpopulations (strata)
Sample within each stratum
(With clusters,
we’d only sample within some.)
24. Strata in ECLS-K Schools divided into two strata
public
private
Schools sampled within each stratum
934 public
346 private
25. Benefit: Strata reduce sampling variation Private schools tend to score higher
Suppose ECLS-K used SRS of 1280 schools
Scores would have 2 sources of sampling variation
mix of public and private schools
mix of high- and low-scoring schools
among publics
among privates
Instead, ECLS-K stratifies,
sampling
934 public
346 private
The mix of public and private schools is fixed by design
eliminating the first source of sampling variation
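A small simulation illustrates the variance reduction. It is a sketch under simplifying assumptions rather than a model of ECLS-K: the population is synthetic, the two strata are equally large, and the stratified sample is allocated proportionally so that no weights are needed.

```python
import random
import statistics

def simulate(reps=2000, n=20, seed=1):
    """Return (variance of SRS means, variance of stratified means)."""
    rng = random.Random(seed)
    # Hypothetical population: two equal-sized strata with different mean scores.
    public = [rng.gauss(20, 1) for _ in range(5000)]
    private = [rng.gauss(30, 1) for _ in range(5000)]
    population = public + private

    srs_means, strat_means = [], []
    half = n // 2
    for _ in range(reps):
        # SRS: the public/private mix varies from sample to sample.
        srs_means.append(statistics.fmean(rng.sample(population, n)))
        # Stratified: the mix is fixed by design at half and half.
        strat = rng.sample(public, half) + rng.sample(private, half)
        strat_means.append(statistics.fmean(strat))
    return statistics.variance(srs_means), statistics.variance(strat_means)

var_srs, var_strat = simulate()
print(f"SRS variance: {var_srs:.3f}, stratified variance: {var_strat:.3f}")
```

The stratified means vary far less because only the within-stratum mix of high- and low-scoring cases is left to chance.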
26. Benefit: Strata reduce sampling variation
27. Costs: None!
28. Estimating with strata
29. Clusters vs. strata Clusters
chosen for convenience
only sample within some
hope for low internal similarity
Strata
chosen for theoretical interest
sample within all
hope for high internal similarity
30. Putting it all together
31. Summary ECLS-K is not a simple random sample
most surveys aren’t
If you analyze it like a simple random sample
mean is biased high
~.3 points higher than population mean
mainly because of weights
SE is biased low
half as large as it should be
mainly because of clusters
Stratification in ECLS-K is trivial
could be more important in other surveys
32. 4. Software
33. Stata svy commands
svymean
svyregress
svylogit
etc
Documentation
Stata, Survey Data [svy] in SRL
34. SAS PROC SURVEYREG
built in
normal regression only
PROC MIXED, NLMIXED, GENMOD
built in
require more explicit models, more user knowledge
IVEWare
SAS callable
“a variety of descriptive and model based analyses”
http://www.isr.umich.edu/src/smp/ive/
SUDAAN
SAS callable
35. SPSS Base SPSS
historically bad with weights
Version 12 improved
still no clusters or strata
SPSS Complex Samples (add-on)
clusters, strata
descriptive statistics only
not at OSU
36. 5. Reading
37. Reading Levy & Lemeshow, Sampling of Populations
simple introduction
focuses on means, proportions, totals, ratios
examples in Stata
Stata, Survey Data (in SRL)
discusses more complex models
including theory
Winship & Radbill (1994) “Sampling Weights and Regression Analysis”, Sociological Methods and Research 23(2): 230-257
more on bias/variance tradeoff
38. 6. Bonus topic: The trouble with weights
39. Weighted vs. unweighted sample means: A tradeoff
40. Addressing the tradeoff: Simple approaches Sometimes it’s better to use weights, sometimes not
Compare weighted and unweighted estimates
Are they roughly the same?
Say so
Is the difference in point estimates more than a couple of standard errors?
Use the weights
or add regressors
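The rule of thumb on this slide could be written as a small Python helper. The function name `weighting_advice`, the default threshold of 2, and the choice of which standard error to pass in are all assumptions of this sketch; the slide only says "a couple of standard errors".

```python
def weighting_advice(unweighted, weighted, se, threshold=2.0):
    """Compare weighted and unweighted point estimates.

    `se` is whichever standard error the analyst judges relevant (an assumption:
    the slide does not specify); `threshold` stands in for "a couple".
    """
    gap = abs(weighted - unweighted)
    if gap > threshold * se:
        return "use weights or add regressors"
    return "estimates agree; report that they do"

# Hypothetical estimates, not ECLS-K results.
print(weighting_advice(unweighted=24.0, weighted=21.0, se=0.5))
print(weighting_advice(unweighted=21.2, weighted=21.0, se=0.5))
```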
41. When weights help, when they hurt
42. When weights help in ECLS-K
Weights, scores negatively correlated
. svyreg C1RSCALE C1CW0
------------------------------------------
    C1RSCALE |      Coef.   Std. Err.
-------------+----------------------------
       C1CW0 |  -.0050289    .0014183
       _cons |   23.24951    .3444241
because weights are lower for high-scoring groups
Asians
private students
So weights reduced bias in estimating mean
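The same diagnostic can be sketched in Python on made-up data: regress the outcome on the weight and look at the sign of the slope. This uses a plain least-squares slope rather than Stata's survey-adjusted `svyreg`, and the scores and weights are hypothetical, so it shows only the logic, not the ECLS-K result.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical sample: the oversampled, high-scoring group carries low weights.
weights = [150.0] * 60 + [25.0] * 40
scores = [20.0] * 60 + [30.0] * 40

slope = ols_slope(weights, scores)
print(f"slope of score on weight: {slope:.3f}")
```

A negative slope means low-weight cases score higher, so the unweighted mean sits above the weighted one.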
43. Why weights help less in regression Regressors X
(usually) reduce residual correlation between weights and Y
If regress scores on race, school sector
little residual correlation with weight
. svyreg C1RSCALE ASIAN S2KPUPRI C1CW0
------------------------------------------------------
    C1RSCALE |      Coef.   Std. Err.      t    P>|t|
-------------+----------------------------------------
       ASIAN |   3.928595    .5334304    7.36   0.000
    S2KPUPRI |   4.528321    .3694436   12.26   0.000
       C1CW0 |   .0012251    .0014001    0.88   0.382
       _cons |   16.38356    .6275102   26.11   0.000
In a regression with race and sector
weights might hurt more than they help