Analyzing fancy random samples: Clusters, weights, and maybe strata

1. Analyzing "fancy" random samples: Clusters, weights, and (maybe) strata. Paul von Hippel, Ohio State University, May 2004

2. Overview. Classical theory assumes simple random sampling: people are sampled independently and with equal probability. Modern surveys use complex random sampling. Complexities: clusters (people not sampled independently) and weights (people not sampled with equal probability).

3. Example data: ECLS-K. Early Childhood Longitudinal Study, Kindergarten Cohort. ~20,000 kindergarteners in ~1000 schools.

4. If ECLS-K were a simple random sample...

5. Why ECLS-K isn't a simple random sample. Simple random sampling is impractical: (1) list all 3 million kindergarteners (the sampling frame), but how? (2) sample 20,000 names; (3) track them down, probably in close to 20,000 different schools.

6. 1. Clusters

7. Cluster: Definition. Divide the population into clusters containing several units each. Sample the clusters, then sample within them.

8. Clusters in ECLS-K. Divide the country into 1335 geographic regions, the PSUs (primary sampling units). Sample 100 PSUs. Within each PSU, sample ~10 schools. Within each school, sample ~20 children. Two levels of clustering: (1) children clustered in schools; (2) schools clustered in PSUs. ECLS-K only provides enough information to handle (1).
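The three-stage draw described on this slide can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the slides: every PSU is given the same hypothetical number of schools (`schools_in_psu`) and every school the same hypothetical number of kindergarteners (`children_in_school`), and units are drawn with equal probability at each stage, which the real ECLS-K design does not do.

```python
import random

def multistage_sample(n_psus=1335, psus_sampled=100,
                      schools_in_psu=30, schools_sampled=10,
                      children_in_school=60, children_sampled=20,
                      seed=0):
    """Draw PSUs, then schools within sampled PSUs, then children within sampled schools.

    schools_in_psu and children_in_school are hypothetical population sizes.
    Returns a list of (psu, school, child) identifiers.
    """
    rng = random.Random(seed)
    sample = []
    for psu in rng.sample(range(n_psus), psus_sampled):
        for school in rng.sample(range(schools_in_psu), schools_sampled):
            for child in rng.sample(range(children_in_school), children_sampled):
                sample.append((psu, school, child))
    return sample

sample = multistage_sample()
schools_visited = {(psu, school) for psu, school, _ in sample}
print(len(sample), "children in", len(schools_visited), "schools")
```

With the slide's numbers (100 PSUs, 10 schools each, 20 children each) the sketch yields 20,000 children concentrated in 1000 schools, which is the convenience that clustering buys.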

9. Benefit: Clusters are convenient. Easier sampling frame: in SRS, list 3 million children; in a cluster sample, list 1335 PSUs, then list all schools within the 100 sampled PSUs. Easier data collection: in SRS, visit almost 20,000 schools; in a cluster sample, visit 1000 schools.

10. Cost: Clusters increase sampling variation. Children from the same school are similar (intraclass correlation). Children 2, 3, ..., 20 in a familiar school provide less new information than child 1 in a new school.
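One common way to put a number on this cost is the textbook design-effect approximation for equal-sized clusters, deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation. The formula is not on the slides, and the rho values below are illustrative assumptions rather than ECLS-K estimates.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation from equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that would give about the same precision."""
    return n / design_effect(cluster_size, icc)

# 20 children per school, as in ECLS-K; the ICC values are assumed for illustration.
for icc in (0.0, 0.05, 0.2):
    deff = design_effect(20, icc)
    n_eff = effective_sample_size(20000, 20, icc)
    print(f"icc={icc:.2f}  deff={deff:.2f}  effective n={n_eff:.0f}")
```

When rho is 0, children in a school are no more alike than strangers and nothing is lost; as rho grows, the 20,000 sampled children carry the information of a much smaller simple random sample.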

11. Cost: Clusters increase sampling variation

12. Tradeoff. If children in the same school aren't too similar (i.e., rods aren't too short; i.e., intraclass correlation isn't too high), then the benefits outweigh the costs.

13. Estimating with clusters

14. 2. Weights

15. Weights: Definition. Each case is assigned a sampling weight. Two definitions: (a) the number of cases like this one in the population; (b) the inverse of the probability of sampling this case. The two definitions are proportional to each other, so they give identical inferences.
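A small check of the proportionality claim, using made-up scores and sampling probabilities: rescaling every weight by the same constant leaves a weighted mean unchanged, because the constant cancels between numerator and denominator.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [20.0, 25.0, 30.0]          # made-up test scores
sampling_probs = [0.01, 0.02, 0.04]  # made-up sampling probabilities

inverse_prob_weights = [1.0 / p for p in sampling_probs]    # 100, 50, 25
rescaled_weights = [w / 25.0 for w in inverse_prob_weights]  # 4, 2, 1

print(weighted_mean(scores, inverse_prob_weights))
print(weighted_mean(scores, rescaled_weights))
```

Both calls give the same answer (4000 / 175, about 22.86), so it does not matter for the point estimate which of the two definitions the weights follow.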

16. Why ECLS-K doesn't sample with equal probability. If every school and child had equal probability, ECLS-K would have few private-school children and few Asian-Americans. But researchers want to study those groups, so ECLS-K gives them higher sampling probabilities. Weights are assigned to compensate.

17. Benefit: Weights can reduce bias. Compared to the population, ECLS-K has too many Asians and too many private schools. These are high-scoring groups. Without weights, ECLS-K's mean score probably exceeds the population mean. Weights compensate for this by giving lower weights to Asians and private schools.
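The bias argument can be traced with a toy population. All numbers here are hypothetical and chosen for easy arithmetic: 90 children in a lower-scoring group (score 20) and 10 in a higher-scoring group (score 30), with the small group oversampled.

```python
# Hypothetical population: (group size, score, sampling probability).
groups = [
    (90, 20.0, 0.1),  # large group, sampled at 10%
    (10, 30.0, 0.5),  # small high-scoring group, oversampled at 50%
]

population_mean = sum(size * score for size, score, _ in groups) / sum(size for size, _, _ in groups)

# Expected sample: 9 children from the first group, 5 from the second.
sample = []
for size, score, prob in groups:
    n_sampled = round(size * prob)
    sample.extend([(score, 1.0 / prob)] * n_sampled)  # (score, inverse-probability weight)

unweighted_mean = sum(score for score, _ in sample) / len(sample)
weighted_mean = sum(score * w for score, w in sample) / sum(w for _, w in sample)

print(population_mean, round(unweighted_mean, 2), weighted_mean)
```

The unweighted mean (about 23.57) overshoots the population mean of 21 because the high-scoring group is overrepresented; the inverse-probability weights pull the estimate back to 21.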

18. Benefit: Weights can reduce bias... if weights correlate with Y

19. Cost: Weights can increase sampling variation
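A rough gauge of this cost is Kish's approximation for the effective sample size under unequal weights, n_eff = (sum of w)^2 / (sum of w^2). It is a standard rule of thumb rather than something taken from the slides, and it assumes the weights are unrelated to the outcome. The weights below are made up.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size: (sum w)^2 / (sum w^2)."""
    total = sum(weights)
    total_of_squares = sum(w * w for w in weights)
    return total * total / total_of_squares

def weighting_design_effect(weights):
    """Approximate variance inflation due to unequal weights: n / n_eff."""
    return len(weights) / kish_effective_n(weights)

equal_weights = [2.0, 2.0, 2.0, 2.0]    # made-up: every case counts the same
unequal_weights = [1.0, 1.0, 1.0, 5.0]  # made-up: one case dominates

print(kish_effective_n(equal_weights), weighting_design_effect(equal_weights))
print(kish_effective_n(unequal_weights), weighting_design_effect(unequal_weights))
```

Equal weights cost nothing (4 cases behave like 4), while the unequal set behaves like roughly 2.3 cases, a variance inflation of 1.75.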

20. Weighted vs. unweighted sample means: A tradeoff

21. Estimating with weights

22. 3. Strata

23. Strata: Definition. Divide the population into subpopulations (strata). Sample within each stratum. (With clusters, we'd only sample within some.)

24. Strata in ECLS-K. Schools are divided into two strata: public and private. Schools are sampled within each stratum: 934 public, 346 private.

25. Benefit: Strata reduce sampling variation. Private schools tend to score higher. Suppose ECLS-K used an SRS of 1280 schools. Scores would have 2 sources of sampling variation: (1) the mix of public and private schools; (2) the mix of high- and low-scoring schools among publics and among privates. Instead, ECLS-K stratifies, sampling 934 public and 346 private schools. The mix of public and private schools is fixed by design, eliminating the first source of sampling variation.
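The two sources of variation can be separated with the usual variance decomposition. The sketch below is a simplified illustration, not the ECLS-K calculation: it assumes proportional allocation (each stratum's share of the sample equals its share of the population, which ECLS-K does not use), ignores finite-population corrections, and uses hypothetical stratum shares, means, and standard deviations.

```python
# Hypothetical strata: (population share, mean score, within-stratum SD).
strata = [
    (0.75, 22.0, 8.0),  # public
    (0.25, 27.0, 8.0),  # private
]

def srs_variance_of_mean(strata, n):
    """Variance of the sample mean under SRS: (within + between variance) / n."""
    overall_mean = sum(share * mean for share, mean, _ in strata)
    within = sum(share * sd ** 2 for share, _, sd in strata)
    between = sum(share * (mean - overall_mean) ** 2 for share, mean, _ in strata)
    return (within + between) / n

def stratified_variance_of_mean(strata, n):
    """Variance of the stratified mean with proportional allocation (n_h = share * n)."""
    return sum(share ** 2 * sd ** 2 / (share * n) for share, _, sd in strata)

n_schools = 1280
print(srs_variance_of_mean(strata, n_schools))
print(stratified_variance_of_mean(strata, n_schools))
```

Only the within-stratum term survives stratification; the between-stratum term, which reflects the chance mix of public and private schools, drops out. How much that helps depends on how different the stratum means are.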

26. Benefit: Strata reduce sampling variation

27. Costs: None!

28. Estimating with strata

29. Clusters vs. strata. Clusters: chosen for convenience; only sample within some; hope for low internal similarity. Strata: chosen for theoretical interest; sample within all; hope for high internal similarity.

30. Putting it all together

31. Summary. ECLS-K is not a simple random sample (most surveys aren't). If you analyze it like a simple random sample, the mean is biased high (~.3 points higher than the population mean), mainly because of weights, and the SE is biased low (half as large as it should be), mainly because of clusters. Stratification in ECLS-K is trivial, but could be more important in other surveys.

32. 4. Software

33. Stata. svy commands: svymeans, svyregress, svylogit, etc. Documentation: Stata, Survey Data [svy], in SRL.

34. SAS. PROC SURVEYREG: built in; normal regression only. PROC MIXED, NLMIXED, GENMOD: built in; require more explicit models and more user knowledge. IVEWare: SAS-callable; "a variety of descriptive and model based analyses" (http://www.isr.umich.edu/src/smp/ive/). SUDAAN: SAS-callable.

35. SPSS. Base SPSS: historically bad with weights; Version 12 improved, but still no clusters or strata. SPSS Complex Samples (add-on): clusters and strata; descriptive statistics only; not at OSU.

36. 5. Reading

37. Reading. Levy & Lemeshow, Sampling of Populations: simple introduction; focuses on means, proportions, totals, ratios; examples in Stata. Stata, Survey Data (in SRL): discusses more complex models, including theory. Winship & Radbill (1994), "Sampling Weights and Regression Analysis", Sociological Methods & Research 23(2): 230-257: more on the bias/variance tradeoff.

38. 6. Bonus topic: The trouble with weights

39. Weighted vs. unweighted sample means: A tradeoff

40. Addressing the tradeoff: Simple approaches. Sometimes it's better to use weights, sometimes not. Compare weighted and unweighted estimates. Are they roughly the same? Say so. Is the difference in point estimates more than a couple of standard errors? Use the weights or add regressors.
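The rule of thumb on this slide can be written as a small helper. The function name, the choice of which standard error serves as the yardstick, and the cutoff of 2 are assumptions made for illustration; the slide only says "a couple of standard errors". The estimates in the example are made up.

```python
def compare_estimates(weighted, unweighted, standard_error, threshold=2.0):
    """Express the weighted/unweighted gap in standard-error units and suggest an action.

    standard_error is whichever SE the analyst uses as a yardstick (an assumption here);
    threshold=2.0 stands in for "a couple of standard errors".
    """
    gap_in_se = abs(weighted - unweighted) / standard_error
    if gap_in_se > threshold:
        advice = "use the weights or add regressors"
    else:
        advice = "estimates are roughly the same; say so"
    return gap_in_se, advice

# Made-up numbers: unweighted mean 0.3 points above the weighted mean.
print(compare_estimates(22.0, 22.3, standard_error=0.1))
print(compare_estimates(22.0, 22.1, standard_error=0.1))
```

The first comparison shows a gap of about three standard errors, so the weights (or extra regressors) matter; the second gap is about one standard error, so either estimate tells the same story.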

41. When weights help, when they hurt

42. When weights help in ECLS-K. In ECLS-K, for example, weights and scores are negatively correlated:

    . svyreg C1RSCALE C1CW0
    ------------------------------------------
        C1RSCALE |      Coef.   Std. Err.
    -------------+----------------------------
           C1CW0 |  -.0050289    .0014183
           _cons |   23.24951    .3444241

This is because weights are lower for high-scoring groups (Asians, private-school students). So weights reduced bias in estimating the mean.

43. Why weights help less in regression. Regressors X (usually) reduce the residual correlation between weights and Y. If we regress scores on race and school sector, there is little residual correlation with the weight:

    . svyreg C1RSCALE ASIAN S2KPUPRI C1CW0
    ---------------------------------------------------------
        C1RSCALE |      Coef.   Std. Err.       t     P>|t|
    -------------+-------------------------------------------
           ASIAN |   3.928595    .5334304     7.36    0.000
        S2KPUPRI |   4.528321    .3694436    12.26    0.000
           C1CW0 |   .0012251    .0014001     0.88    0.382
           _cons |   16.38356    .6275102    26.11    0.000

In a regression with race and sector, weights might hurt more than they help.
