
Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey. Joint work with F. Jay Breidt and Jean Opsomer. September 8, 2005. Research supported by EPA Cooperative Agreements R829095 and R829096.


Presentation Transcript


  1. Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design • Mark Delorey • Joint work with F. Jay Breidt and Jean Opsomer • September 8, 2005 • Research supported by EPA Cooperative Agreements R829095 and R829096

  2. Motivation • In resource monitoring and assessment, time and expense constraints may make two-stage sampling more efficient • Select a sample of watersheds; sample different bodies of water within selected watersheds • Select a sample of lakes; sample at different locations in selected lakes • Samples are not always sufficiently dense in small watersheds; availability of cheap auxiliary information (primarily from GIS) suggests incorporating a model • Auxiliary information may be available on different scales • Often there are many study variables; rather than fit a model for each one, we would like one set of weights that can be applied reasonably well to all variables, i.e., weights that do not depend on the study variable

  3. Outline • Two-stage structure • Model-free, model-assisted, and model-based estimators • Penalized splines • Simulation results • Properties of model-assisted estimator using penalized spline

  4. Two-Stage Structure • Population of elements $U = \{1, \dots, k, \dots, N\}$ is partitioned into clusters, or primary sampling units (PSUs), $U_1, \dots, U_i, \dots$, so that $N = \sum_i N_i$, where the sum runs over all PSUs and $N_i$ is the number of elements, or secondary sampling units (SSUs), in $U_i$.

  5. Case A: Cluster Level Auxiliaries (Our focus) • The auxiliary information is available for all clusters in the population • Leads to regression modeling of quantities associated with the clusters, such as cluster totals and means • Cluster quantities can be computed for all clusters • Population quantities can be computed from cluster estimates • Example: Lake represents a cluster; auxiliary information is elevation

  6. Case B: Complete Element Level Auxiliaries • The auxiliary information is available for all elements in the population • Leads to regression modeling of quantities associated with the elements • Cluster and population quantities can then be computed from element estimates and observations • Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation

  7. Case C: Limited Element Level Auxiliaries • The auxiliary information is available for all elements in selected clusters only • Leads to regression modeling of quantities associated with the elements • Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample • Population-level quantities can be estimated using design-based estimators • Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial

  8. Case D: Limited Cluster Level Auxiliaries • The auxiliary information is available only for the clusters in the first-stage sample • Not a very interesting case • Design-based estimator can be used for population quantities • Example: Cluster is lake; auxiliary information is a measure of size which is not available until the site is visited

  9. Sampling • First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$ • $\pi_{Ii}$ and $\pi_{Iij}$ are the first- and second-order inclusion probabilities, respectively • Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$ • Typically require the second-stage design to be invariant and independent of the first stage
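
A minimal Python sketch of the two stages described above, assuming simple random sampling without replacement at both stages (the slide does not fix a particular design). The function name `draw_two_stage_sample` and its arguments are hypothetical. The second-stage draw for a PSU uses only that PSU's size, which is one way to satisfy the invariance and independence requirement.

```python
import numpy as np

def draw_two_stage_sample(cluster_sizes, n_psu, n_ssu, rng):
    """Draw PSUs by SRS without replacement, then SSUs within each selected PSU."""
    n_clusters = len(cluster_sizes)
    # First stage: every PSU has the same inclusion probability n_psu / n_clusters.
    psus = rng.choice(n_clusters, size=n_psu, replace=False)
    pi_I = n_psu / n_clusters
    sample = {}
    for i in psus:
        N_i = int(cluster_sizes[i])
        m = min(n_ssu, N_i)
        # Second stage: SRS within PSU i; the draw ignores which other PSUs
        # were selected (invariance) and uses its own random numbers (independence).
        elements = rng.choice(N_i, size=m, replace=False)
        sample[int(i)] = {"elements": elements, "pi_k_given_i": m / N_i}
    return sample, pi_I
```

For example, `draw_two_stage_sample(np.array([50, 60, 70, 80]), 2, 10, np.random.default_rng(0))` returns two sampled PSUs, each with ten element indices and its within-PSU inclusion probability.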

  10. Other Notation • $t_y = \sum_{k \in U} y_k$ is the total for the variable y over the entire population • Where required, we will assume the population model [equation not captured in transcript], where $\mu_i$ is the mean of the y’s in PSU i • $x_i$ is some auxiliary variable that is a known quantity (usually a total or mean) for PSU i

  11. The Estimators (for population totals) • Model-free • Model-assisted • Model-based

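  12. Model-Free Estimator • If no information other than the sampling design is available, the Horvitz-Thompson estimator is often used: $\hat{t}_{HT} = \sum_{i \in s_I} \hat{t}_i / \pi_{Ii}$, where $\hat{t}_i = \sum_{k \in s_i} y_k / \pi_{k|i}$ estimates the total of PSU i and $\pi_{k|i}$ is the second-stage inclusion probability of element k within PSU i • Notes: • Always design unbiased • Variance is large for small sample sizes • Does not make use of auxiliary information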
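
As an illustration of the Horvitz-Thompson estimator in a two-stage design, the sketch below expands each sampled PSU's observations to an estimated PSU total and then expands those totals by the first-stage inclusion probabilities. It assumes equal-probability sampling within each PSU (one within-PSU probability per PSU); the function name `ht_two_stage` and its argument layout are hypothetical.

```python
import numpy as np

def ht_two_stage(y_samples, pi_psu, pi_within):
    """Two-stage Horvitz-Thompson estimate of a population total.

    y_samples : list of arrays, observed y values in each sampled PSU
    pi_psu    : first-stage inclusion probability of each sampled PSU
    pi_within : second-stage inclusion probability within each sampled PSU
                (assumed constant within a PSU)
    """
    total = 0.0
    for y_i, pi_Ii, pi_ki in zip(y_samples, pi_psu, pi_within):
        # Estimated total for PSU i from its subsample.
        t_hat_i = np.sum(np.asarray(y_i, dtype=float) / pi_ki)
        # Expand the PSU total by the first-stage inclusion probability.
        total += t_hat_i / pi_Ii
    return total
```

No auxiliary information enters the calculation, which is why the estimate can be noisy when few PSUs are sampled.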

  13. Model-Assisted Estimator • [equation not captured in transcript] • The estimator uses the PSU total predicted by the model • Properties: • Asymptotically unbiased and consistent even if the model is misspecified • Variance is generally smaller than with HT, but larger than with the model-based estimator • Can incorporate auxiliary information
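
The slide's formula is missing from the transcript, so the sketch below assumes the generalized difference form commonly used for model-assisted estimators with cluster-level auxiliaries: sum the model-predicted PSU totals over every PSU in the population, then add a design-weighted correction built from the sampled PSUs' residuals. The function name `model_assisted_total` and its arguments are hypothetical, and this may differ in detail from the estimator presented in the talk.

```python
import numpy as np

def model_assisted_total(pred_all, sampled_idx, t_hat_sampled, pi_sampled):
    """Difference-type model-assisted estimate of a population total.

    pred_all      : model-predicted total for every PSU in the population
    sampled_idx   : indices of the PSUs in the first-stage sample
    t_hat_sampled : design-based estimated totals for the sampled PSUs
    pi_sampled    : first-stage inclusion probabilities of the sampled PSUs
    """
    pred_all = np.asarray(pred_all, dtype=float)
    idx = np.asarray(sampled_idx)
    # Residuals between estimated and predicted PSU totals on the sample.
    resid = np.asarray(t_hat_sampled, dtype=float) - pred_all[idx]
    # Model prediction for the whole population plus weighted residual correction.
    return pred_all.sum() + np.sum(resid / np.asarray(pi_sampled, dtype=float))
```

If every prediction is zero the correction term is all that remains and the result reduces to the Horvitz-Thompson estimate, which shows how the design-based part protects against a poor model.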

  14. Model-Based Estimator • [equation not captured in transcript] • The estimator uses the ith PSU mean predicted by the model • Properties: • Unbiased if the model is correctly specified • Variance is generally smaller than with HT • Can incorporate auxiliary information

  15. Notes on the Models • Three different models considered: • Linear • Penalized spline with random effect for PSU • Penalized spline with no random effect for PSU • The model specification is extended for the penalized spline with random effect for PSU: [equation not captured in transcript], where $y_{ij}$ is the response for the jth element in PSU i

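  16. Penalized Splines (P-Splines) • With a linear model, we assume [equation not captured in transcript] • For a penalized spline, [equation not captured in transcript], where $\kappa_1 < \dots < \kappa_K$ are K fixed knots and [equation not captured in transcript]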
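
To make the idea of a penalized spline with K fixed knots concrete, here is a minimal sketch assuming a truncated linear basis (intercept, x, and one term max(x - knot, 0) per knot) with a ridge penalty on the knot coefficients only. The basis degree, the penalty form, and the names `pspline_basis`, `fit_pspline`, and `predict_pspline` are assumptions for illustration; the slide's own specification was not captured.

```python
import numpy as np

def pspline_basis(x, knots):
    """Design matrix: intercept, x, and truncated lines (x - knot)_+ for each knot."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, trunc])

def fit_pspline(x, y, knots, lam):
    """Penalized least squares; only the knot coefficients are penalized."""
    C = pspline_basis(x, knots)
    # Penalty matrix: zeros for intercept and slope, ones for knot terms.
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    return np.linalg.solve(C.T @ C + lam * D, C.T @ np.asarray(y, dtype=float))

def predict_pspline(x_new, knots, coef):
    """Evaluate the fitted spline at new x values."""
    return pspline_basis(x_new, knots) @ coef
```

A large `lam` shrinks the knot coefficients toward zero, so the fit approaches the linear model; a small `lam` lets the fit bend at each knot.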

  17. Simulation Study • 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400) • $\mu_{PSU} = f(\cdot) + \varepsilon$, where f(·) is one of eight functions and $\varepsilon \sim N(0, \sigma^2 I)$ • We use first-order inclusion probabilities proportional to size (pps) • Auxiliary data is often proportional to size of cluster • Generate the response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the jth element in the ith cluster and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$
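
A rough sketch of how a population like the one in this simulation study could be generated. Several details are assumptions, not taken from the slides: the mean function `f` (a simple line here, standing in for one of the eight functions), the auxiliary variable (cluster size rescaled to at most 1), the two standard deviations, the first-stage sample size, and treating Uniform(50, 400) as a discrete uniform on the integers.

```python
import numpy as np

def simulate_population(n_psu=500, sigma_psu=0.1, sigma_elem=0.5,
                        n_sample_psu=50, seed=0):
    """Generate PSU sizes, PSU means, element responses, and pps probabilities."""
    rng = np.random.default_rng(seed)
    # Number of SSUs per PSU, integers from 50 to 400 inclusive.
    sizes = rng.integers(50, 401, size=n_psu)
    # Assumed auxiliary variable: proportional to cluster size.
    x = sizes / sizes.max()
    # Hypothetical mean function standing in for one of the eight functions.
    f = lambda u: 1.0 + 2.0 * (u - 0.5)
    # PSU means: smooth function of the auxiliary plus PSU-level noise.
    mu = f(x) + rng.normal(0.0, sigma_psu, size=n_psu)
    # Element responses: PSU mean plus iid element-level noise.
    y = [mu_i + rng.normal(0.0, sigma_elem, size=N_i) for mu_i, N_i in zip(mu, sizes)]
    # First-order inclusion probabilities proportional to size.
    pi_I = n_sample_psu * sizes / sizes.sum()
    return sizes, x, mu, y, pi_I
```

With these defaults every inclusion probability stays below 1, so no PSU needs to be taken with certainty.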

  18. First Four Functions

  19. Second Four Functions

  20. Some Simulation Results

  21. More Simulation Results

  22. Why not use model-based? • In survey contexts, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this: • The smoothing parameter for the spline is selected by fixing the degrees of freedom for the smooth rather than using a data-driven approach • With model-based estimation, the sampling design is ignored and estimates rely solely on the form of f(·)
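
One way to fix the degrees of freedom rather than choose the smoothing parameter from the data is to solve for the penalty that gives a target trace of the smoother matrix. The sketch below assumes a ridge-type penalized fit with design matrix `C` and penalty matrix `D`, and searches by bisection on a log scale; the function names and the search bounds are hypothetical. Because the calculation never touches y, the resulting smoother can be shared across study variables.

```python
import numpy as np

def effective_df(C, D, lam):
    """Degrees of freedom of the smoother: trace of (C'C + lam*D)^{-1} C'C."""
    G = C.T @ C
    return np.trace(np.linalg.solve(G + lam * D, G))

def lambda_for_df(C, D, target_df, lo=1e-8, hi=1e8, iters=200):
    """Find lam whose effective degrees of freedom match target_df.

    Effective df decreases as lam grows, so bisect on the geometric mean.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_df(C, D, mid) > target_df:
            lo = mid  # too flexible: increase the penalty
        else:
            hi = mid  # too stiff: decrease the penalty
    return np.sqrt(lo * hi)
```

The target must lie between the number of unpenalized columns and the total number of columns of `C` for a solution to exist within the search bounds.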

  23. Relative MSE (Fitting to bump)

  24. Relative MSE (Fitting to bump)

  25. Relative Bias (Fitting to bump)

  26. Relative Bias (Fitting to bump)

  27. Relative Variance (Fitting to bump)

  28. Relative Variance (Fitting to bump)

  29. Properties of Model-Assisted Estimator • The penalized spline estimator is a linear operator • It is location and scale invariant, in the sense that [equation not captured in transcript], provided an intercept is kept in the model and [equation not captured in transcript]

  30. Properties of Model-Assisted Estimator • Under mild assumptions, the penalized spline estimator is design-consistent for $t_y$, in the sense that [equation not captured in transcript], and has the following asymptotic distributional property: [equation not captured in transcript]

  31. Properties of Model-Assisted Estimator • Again, under mild assumptions, the estimator [equation not captured in transcript] • The previous two results lead to: [equation not captured in transcript]

  32. Summary • Two-stage sampling designs are used frequently in natural resource monitoring and assessment • Samples are often sparse; model-free estimators will have high variance • Model-based estimators make use of auxiliary information and have good properties provided the model is correctly specified • Modeling with P-splines addresses the problem of correctly specifying the model • Often, a model can’t be fit to every study variable; model-assisted estimators still have reasonably good properties when weights from one model are applied to all study variables
