Analyzing fancy random samples: Clusters, weights, and maybe strata

Presentation Transcript


    1. Analyzing “fancy” random samples: Clusters, weights, and (maybe) strata. Paul von Hippel, Ohio State University, May 2004

    2. Overview. Classical theory assumes simple random sampling: people are sampled independently and with equal probability. Modern surveys use complex random sampling. The complexities are clusters (people not sampled independently) and weights (people not sampled with equal probability).

    3. Example data: ECLS-K (Early Childhood Longitudinal Study, Kindergarten cohort), with ~20,000 kindergarteners in ~1000 schools.

    4. If ECLS-K were a simple random sample…

    5. Why ECLS-K isn’t a simple random sample. Simple random sampling is impractical: list all 3 million kindergarteners (the sampling frame; how?), sample 20,000 names, then track them down, probably in close to 20,000 different schools.

    6. 1. Clusters

    7. Cluster: Definition. Divide the population into clusters containing several units each. Sample the clusters, then sample within them.

    8. Clusters in ECLS-K. Divide the country into 1335 geographic regions, the PSUs (primary sampling units). Sample 100 PSUs. Within each PSU, sample ~10 schools. Within each school, sample ~20 children. This gives two levels of clustering: (1) children clustered in schools, and (2) schools clustered in PSUs. ECLS-K only provides enough information to handle (1).
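The staged design on this slide can be sketched in a few lines of Python. This is only an illustration of the sample-then-subsample idea, not the actual ECLS-K procedure: the number of schools per PSU (50), the number of kindergarteners per school (60), and the use of equal-probability draws at every stage are assumptions made up for the example.

```python
import random

def three_stage_sample(n_psus=1335, psus_sampled=100, schools_per_psu=10,
                       children_per_school=20, seed=0):
    """Return (psu, school, child) index triples for a toy three-stage cluster sample."""
    rng = random.Random(seed)
    sample = []
    # Stage 1: sample PSUs from the list of all PSUs (the only nationwide list needed).
    for psu in rng.sample(range(n_psus), psus_sampled):
        # Stage 2: list schools only inside sampled PSUs.
        # Assumption: every PSU has exactly 50 schools.
        for school in rng.sample(range(50), schools_per_psu):
            # Stage 3: list children only inside sampled schools.
            # Assumption: every school has exactly 60 kindergarteners.
            for child in rng.sample(range(60), children_per_school):
                sample.append((psu, school, child))
    return sample

sample = three_stage_sample()
n_children = len(sample)
n_schools = len({(psu, school) for psu, school, _ in sample})
n_psus_visited = len({psu for psu, _, _ in sample})
print(n_children, n_schools, n_psus_visited)
```

With the slide's round numbers, the sketch reaches about 20,000 children while visiting only 1000 schools in 100 PSUs.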

    9. Benefit: Clusters are convenient. Easier sampling frame: in an SRS, list 3 million children; in a cluster sample, list 1335 PSUs, then list all schools within the 100 sampled PSUs. Easier data collection: in an SRS, visit almost 20,000 schools; in a cluster sample, visit 1000 schools.

    10. Cost: Clusters increase sampling variation. Children from the same school are similar (intraclass correlation). Children 2, 3, …, 20 in an already-sampled school provide less new information than child 1 in a new school.
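One common way to put a number on this cost is Kish's design effect for equal-sized clusters, deff = 1 + (m − 1)ρ, where m is the cluster size and ρ the intraclass correlation. The slide does not give this formula, so treat the sketch below as a standard approximation brought in for illustration; the value ρ = 0.10 is hypothetical, not an ECLS-K estimate.

```python
import math

def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that would give the same variance."""
    return n / design_effect(cluster_size, icc)

# Hypothetical intraclass correlation; 20 children per school as on the slide.
deff = design_effect(cluster_size=20, icc=0.10)
se_inflation = math.sqrt(deff)  # factor by which the standard error grows
n_eff = effective_sample_size(n=20000, cluster_size=20, icc=0.10)
print(deff, se_inflation, n_eff)
```

When ρ is 0, the design effect is 1 and clustering costs nothing; the larger ρ gets, the less each additional child in the same school adds.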

    11. Cost: Clusters increase sampling variation

    12. Tradeoff. If children in the same school aren’t too similar (i.e., rods aren’t too short; i.e., intraclass correlation isn’t too high), then the benefits outweigh the costs.

    13. Estimating with clusters

    14. 2. Weights

    15. Weights: Definition. Each case is assigned a sampling weight. Two definitions: (a) the number of cases like this one in the population; (b) the inverse of the probability of sampling this case. These definitions are proportional and give identical inferences.
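Why proportional weights give the same answer can be seen from the weighted mean, where any constant factor cancels between numerator and denominator. A minimal sketch, with made-up scores and sampling probabilities:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

# Hypothetical scores and sampling probabilities for three cases.
scores = [20.0, 22.0, 30.0]
sampling_probs = [0.01, 0.01, 0.05]

# Definition (b): inverse probability of selection.
inverse_prob_weights = [1.0 / p for p in sampling_probs]

# The same weights rescaled by an arbitrary constant (here, to sum to the sample size).
scale = len(scores) / sum(inverse_prob_weights)
rescaled_weights = [w * scale for w in inverse_prob_weights]

mean_a = weighted_mean(scores, inverse_prob_weights)
mean_b = weighted_mean(scores, rescaled_weights)
print(mean_a, mean_b)
```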

    16. Why ECLS-K doesn’t sample with equal probability. If every school and child had equal probability, ECLS-K would have few private school children and few Asian-Americans. But researchers want to study those groups, so ECLS-K gives them higher sampling probabilities. Weights are assigned to compensate.

    17. Benefit: Weights can reduce bias. Compared to the population, ECLS-K has too many Asians and too many private schools. These are high-scoring groups, so without weights ECLS-K’s mean score probably exceeds the population mean. Weights compensate for this by giving lower weights to Asians and private schools.
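The bias argument can be checked with simple arithmetic. In the sketch below, all counts and scores are invented for illustration (they are not ECLS-K figures), and every child in a sector is given the same score so that the only thing varying is the public/private mix.

```python
# Hypothetical population: (number of children, mean score) per school sector.
population = {"public": (9000, 22.0), "private": (1000, 27.0)}
# Hypothetical sample sizes: the high-scoring private sector is oversampled.
sample_counts = {"public": 300, "private": 100}

population_size = sum(count for count, _ in population.values())
population_mean = sum(count * score for count, score in population.values()) / population_size

sample = []
for sector, n_sampled in sample_counts.items():
    n_in_population, score = population[sector]
    weight = n_in_population / n_sampled  # cases in the population per sampled case
    sample.extend([(score, weight)] * n_sampled)

unweighted_mean = sum(score for score, _ in sample) / len(sample)
weighted_mean_score = sum(s * w for s, w in sample) / sum(w for _, w in sample)
print(population_mean, unweighted_mean, weighted_mean_score)
```

The unweighted mean overshoots because private students make up a quarter of the sample but only a tenth of the population; the weights restore the population mix.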

    18. Benefit: Weights can reduce bias… …if weights correlate with Y

    19. Cost: Weights can increase sampling variation
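A rough way to gauge this cost is Kish's approximate effective sample size for unequal weights, n_eff = (Σw)² / Σw². This formula is not on the slide; it is a standard approximation that assumes the weights are unrelated to the outcome, and the weights below are hypothetical.

```python
def kish_effective_n(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical sample: 300 cases with weight 30 and 100 cases with weight 10.
weights = [30.0] * 300 + [10.0] * 100
n_eff = kish_effective_n(weights)
variance_inflation = len(weights) / n_eff  # design effect due to weighting
print(n_eff, variance_inflation)
```

Equal weights give n_eff equal to the actual sample size; the more the weights vary, the smaller the effective sample.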

    20. Weighted vs. unweighted sample means: A tradeoff

    21. Estimating with weights

    22. 3. Strata

    23. Strata: Definition. Divide the population into subpopulations (strata). Sample within each stratum. (With clusters, we’d only sample within some.)

    24. Strata in ECLS-K. Schools are divided into two strata: public and private. Schools are sampled within each stratum: 934 public, 346 private.

    25. Benefit: Strata reduce sampling variation. Private schools tend to score higher. Suppose ECLS-K used an SRS of 1280 schools. Scores would have 2 sources of sampling variation: (1) the mix of public and private schools, and (2) the mix of high- and low-scoring schools among publics and among privates. Instead, ECLS-K stratifies, sampling 934 public and 346 private schools. The mix of public and private schools is fixed by design, eliminating the first source of sampling variation.
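A small simulation can make the two sources of variation visible. The population below is invented (750 public and 250 private schools with hypothetical score distributions), and the stratified sample uses proportional allocation so that its unweighted mean needs no weights; this is a simplification and not a description of the ECLS-K allocation.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical school-level mean scores: private schools score higher on average.
public = [rng.gauss(22.0, 3.0) for _ in range(750)]
private = [rng.gauss(27.0, 3.0) for _ in range(250)]
all_schools = public + private

def srs_mean(n=100):
    """Mean of a simple random sample: the public/private mix varies by chance."""
    return statistics.fmean(rng.sample(all_schools, n))

def stratified_mean(n_public=75, n_private=25):
    """Mean of a stratified sample: the public/private mix is fixed by design."""
    drawn = rng.sample(public, n_public) + rng.sample(private, n_private)
    return statistics.fmean(drawn)

reps = 2000
var_srs = statistics.variance([srs_mean() for _ in range(reps)])
var_stratified = statistics.variance([stratified_mean() for _ in range(reps)])
print(var_srs, var_stratified)
```

Across repeated samples, the stratified means vary less because only the within-sector variation remains.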

    26. Benefit: Strata reduce sampling variation

    27. Costs: None!

    28. Estimating with strata

    29. Clusters vs. strata. Clusters: chosen for convenience; only sample within some; hope for low internal similarity. Strata: chosen for theoretical interest; sample within all; hope for high internal similarity.

    30. Putting it all together

    31. Summary. ECLS-K is not a simple random sample (most surveys aren’t). If you analyze it like a simple random sample, the mean is biased high (~.3 points higher than the population mean), mainly because of weights, and the SE is biased low (half as large as it should be), mainly because of clusters. Stratification in ECLS-K is trivial, but it could be more important in other surveys.

    32. 4. Software

    33. Stata. svy commands: svymean, svyregress, svylogit, etc. Documentation: Stata, Survey Data [svy], in SRL.

    34. SAS. PROC SURVEYREG: built in; normal regression only. PROC MIXED, NLMIXED, GENMOD: built in; require more explicit models and more user knowledge. IVEWare: SAS callable; “a variety of descriptive and model based analyses” (http://www.isr.umich.edu/src/smp/ive/). SUDAAN: SAS callable.

    35. SPSS. Base SPSS: historically bad with weights; Version 12 improved; still no clusters or strata. SPSS Complex Samples (add-on): clusters and strata; descriptive statistics only; not at OSU.

    36. 5. Reading

    37. Reading. Levy & Lemeshow, Sampling of Populations: simple introduction; focuses on means, proportions, totals, ratios; examples in Stata. Stata, Survey Data (in SRL): discusses more complex models, including theory. Winship & Radbill (1994), “Sampling Weights and Regression Analysis”, Sociological Methods and Research 23(2): 230-257: more on the bias/variance tradeoff.

    38. 6. Bonus topic: The trouble with weights

    39. Weighted vs. unweighted sample means: A tradeoff

    40. Addressing the tradeoff: Simple approaches. Sometimes it’s better to use weights, sometimes not. Compare the weighted and unweighted estimates. Are they roughly the same? Say so. Is the difference in point estimates more than a couple of standard errors? Use the weights or add regressors.
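The rule of thumb on this slide could be written as a small helper. The slide does not say which standard error to use or exactly how many count as "a couple"; the sketch assumes the standard error of the weighted estimate and a threshold of 2, and the numbers in the usage lines are hypothetical.

```python
def compare_estimates(unweighted, weighted, se, threshold=2.0):
    """Apply the slide's rule of thumb to one pair of point estimates.

    Assumption: `se` is the standard error of the weighted estimate and
    `threshold` is how many standard errors count as "a couple".
    """
    gap_in_se = abs(weighted - unweighted) / se
    if gap_in_se > threshold:
        return "estimates differ: use the weights or add regressors"
    return "estimates are roughly the same: say so"

# Hypothetical estimates and standard error.
large_gap = compare_estimates(unweighted=22.8, weighted=22.5, se=0.1)
small_gap = compare_estimates(unweighted=22.55, weighted=22.5, se=0.1)
print(large_gap)
print(small_gap)
```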

    41. When weights help, when they hurt

    42. When weights help in ECLS-K. E.g., in ECLS-K, weights and scores are negatively correlated:

        . svyreg C1RSCALE C1CW0
        ------------------------------------------
            C1RSCALE |      Coef.   Std. Err.
        -------------+----------------------------
               C1CW0 |  -.0050289   .0014183
               _cons |   23.24951   .3444241

        because weights are lower for high-scoring groups (Asians, private students). So weights reduced bias in estimating the mean.

    43. Why weights help less in regression. Regressors X (usually) reduce the residual correlation between weights and Y. If we regress scores on race and school sector, there is little residual correlation with the weight:

        . svyreg C1RSCALE ASIAN S2KPUPRI C1CW0
        ---------------------------------------------------------
            C1RSCALE |      Coef.   Std. Err.       t    P>|t|
        -------------+-------------------------------------------
               ASIAN |   3.928595   .5334304     7.36    0.000
            S2KPUPRI |   4.528321   .3694436    12.26    0.000
               C1CW0 |   .0012251   .0014001     0.88    0.382
               _cons |   16.38356   .6275102    26.11    0.000

        In a regression with race and sector, weights might hurt more than they help.
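The idea that regressors soak up the correlation between weights and Y can be illustrated with a toy dataset. The eight cases below are invented so that weights differ by sector but are unrelated to scores within a sector; subtracting sector means stands in for regressing scores on a sector dummy. None of these numbers come from ECLS-K.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: private students have higher scores and lower weights.
sector = ["public"] * 4 + ["private"] * 4
weights = [30, 40, 30, 40, 10, 20, 10, 20]
scores = [21, 21, 23, 23, 26, 26, 28, 28]

# Residual scores after removing sector means (a regression on a sector dummy).
sector_mean = {}
for name in set(sector):
    group = [y for s, y in zip(sector, scores) if s == name]
    sector_mean[name] = sum(group) / len(group)
residuals = [y - sector_mean[s] for s, y in zip(sector, scores)]

raw_corr = pearson(weights, scores)          # strongly negative
residual_corr = pearson(weights, residuals)  # zero in this constructed example
print(raw_corr, residual_corr)
```

Once sector is accounted for, the weights carry no further information about scores in this example, so weighting would add variance without removing bias.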
