Sample Design Aspects of Analytical Planning. November 12, 2007 The First Arab Statistical Conference. Introduction.
We never want to be unable to answer key analytical questions because the data were not collected properly. This can happen when questions are worded incorrectly, are incomplete, or are omitted altogether; when they are asked of the wrong persons or of only part of the group needed; or when the domains of estimation were not identified in advance and the sample size is therefore insufficient.
To avoid these situations (and avoid wasting money, time, and resources), we conduct analytical planning.
We use the term to mean “planning for data analysis”, in this case statistical analysis. Analytical planning thus consists of planning, before data collection, the various aspects that contribute to a quality publication: report content, tables, the questionnaire, and the sample design.
This presentation focuses on the sample design aspects of analytical planning. We will talk about:
1. The population of interest
2. Specifying the types of estimators and the domains of estimation
3. The validity and efficiency of sample design for analytical purposes
For sample design, the main questions are:
What kind of analysis is needed to meet the objectives? What data do we need to do the analysis?
What is the population of interest? For what separate subdomains are estimates required? What level of precision is required?
What statements need to be made? What questions need to be answered? What subgroups or time periods need to be compared?
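The link between the required precision, the number of domains, and the sample size can be made concrete with a rough planning calculation. The sketch below uses the usual normal-approximation formula for a proportion, inflated by an assumed design effect and an assumed response rate. The function name and all input values are hypothetical, and the finite population correction is ignored.

```python
import math
from statistics import NormalDist


def required_sample_size(p, margin, confidence=0.95, deff=1.0, response_rate=1.0):
    """Approximate sample size to estimate a proportion p within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    n_srs = z ** 2 * p * (1 - p) / margin ** 2        # size under simple random sampling
    # Inflate for clustering (design effect) and for expected nonresponse.
    return math.ceil(n_srs * deff / response_rate)


# Hypothetical planning values, chosen only for illustration.
per_domain = required_sample_size(p=0.5, margin=0.05, deff=2.0, response_rate=0.9)
domains = 4
print("sample size per domain:", per_domain)
print("total for", domains, "domains:", per_domain * domains)
```

Because the requirement applies to each domain separately, the total sample grows roughly in proportion to the number of domains for which estimates of this precision are needed.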
In a household survey:
In an establishment survey:
The universe of interest, the group of establishments the survey intends to cover, is defined in terms of:
The target universe for the survey includes all operating non-agricultural establishments, while excluding establishments in public administration and diplomatic/international agencies.
In Qatar, a Disability Survey was conducted. It did not use an existing frame because it was important to control how many households containing people with disabilities were interviewed.
The Disability Survey thus began with the construction of an exhaustive frame identifying all households in Qatar that contained people with disabilities. This required listing and contacting all households in the country. The questions asked during listing were the name of the head of household, the number of household members by nationality, sex, and existence of a disability.
Then, in the second phase, the households containing people with disabilities were revisited to collect detailed health, social, cultural, and economic data through a questionnaire.
Although a sample includes only part of a population, it would be misleading to call a collection of numbers a "sample" merely because it includes part of a population. To be acceptable for statistical analysis, a sample must represent the population and must have measurable reliability. In addition, the sampling plan should be practical and efficient. There are four basic criteria for the acceptability of a sampling method:
Note: we are not referring here to nonsampling biases which can affect all types of sampling.
Sometimes, independence of samples is a prerequisite for certain analyses. If independence is an essential condition, then it should be incorporated into the design from the beginning (by adopting a stratification scheme and selecting independent samples).
If samples are not independent, variance calculations must take the covariances into account. In fact, because a positive covariance reduces the standard error of an estimated difference, a common sample design feature is to build in a deliberate overlap between samples. This is also called sample rotation, because different sampling units (e.g., PSUs or households) rotate in and out of the sample.
Example: to measure the change in a variable from year to year, we purposely keep part of the sample from the previous year (1/4, 1/2, or 3/4) and only add fresh sample to make up the remaining part. The covariance of the new estimate with the previous year’s estimate (due to the correlation between the two samples) reduces the variance of the difference.
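The effect of overlap on the variance of a year-to-year change can be sketched as follows. This is a minimal illustration, not the method of any particular survey. It assumes two simple random samples of equal size n, the same unit variance s2 in both years, a simple difference of means as the estimator, and a correlation rho between the two years' values for households kept in the sample. All numbers are made up.

```python
def variance_of_change(s2, n, overlap, rho):
    """Variance of (mean_year2 - mean_year1) for two samples of size n.

    Assumptions (illustrative only): simple random sampling, equal unit
    variance s2 in both years, a fraction `overlap` of the sample retained
    from the previous year, and correlation `rho` between the two years'
    values for retained units.
    """
    var_each = s2 / n                 # variance of each yearly mean
    cov = overlap * rho * var_each    # covariance induced by the retained units
    return var_each + var_each - 2 * cov


# Hypothetical values: s2 = 100, n = 400, year-to-year correlation 0.8.
for overlap in (0.0, 0.25, 0.5, 0.75):
    v = variance_of_change(s2=100.0, n=400, overlap=overlap, rho=0.8)
    print(f"overlap={overlap:.2f}  variance of change={v:.3f}")
```

Under these assumptions, the variance of the change falls steadily as the retained fraction grows, which is the motivation for rotation designs.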
The letters refer to “panels” or random subsamples of households, such that four panels make up a full sample.
Validity and Efficiency of Sample Design for Analytical Purposes
The overall design of the study must keep the total error in mind. The goal is to balance sampling errors (SEs) and nonsampling errors (NSEs), that is, to control both biases and variance. While large sample sizes reduce variance, they can also increase biases if they place an undue burden on resources and make quality control difficult to implement. Likewise, if the questionnaire is flawed or the interviewers are not well trained, no sample design will remedy the resulting response errors.
Types of Errors in Sample Estimates and How They Affect Analyses
Sampling Error
The difference between the estimate based on a sample and the same measure that would be derived from a complete count under otherwise identical conditions.
Mostly variable error (variance), represented by the standard error (the square root of the variance), which gives an inverse measure of the precision (reliability) of the estimates.
Lower when the sample size is large and when efficient measures are built into the sample design, such as efficient stratification and sample allocation, selection with probability proportional to size (or other schemes) when primary sampling units vary in size, and the creation of certainty strata in which large units are selected with 100% probability.
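As one illustration of efficient sample allocation, the sketch below compares proportional allocation with Neyman allocation, which assigns more sample to strata that are larger or more variable. The stratum sizes and standard deviations are hypothetical, and the code does not round the results or cap them at the stratum size.

```python
def proportional_allocation(n, sizes):
    """Allocate n in proportion to the stratum population sizes N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]


def neyman_allocation(n, sizes, sds):
    """Allocate n in proportion to N_h * S_h (size times standard deviation)."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical strata: the smallest stratum is the most variable.
sizes = [5000, 3000, 2000]
sds = [10, 20, 40]

print("proportional:", proportional_allocation(380, sizes))
print("Neyman:      ", neyman_allocation(380, sizes, sds))
```

When an allocation of this kind calls for more units than a stratum contains, the whole stratum is taken, which is one way a certainty stratum arises.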
Sampling Error (cont.)
In addition to variance, sampling error may also include sampling biases, which are systematic rather than random in nature. Sampling biases, such as the bias of ratio estimators, tend to be negligible when sample sizes are large. Another possible source of sampling bias is the use of incorrect sampling weights.
Uses of the standard error:
coefficient of variation (CV)
Effects on the data: statistical inferences based on estimates with poor precision (high sampling error) are often inconclusive
in hypothesis tests, real differences in the population may go unnoticed because observed differences may be attributed totally to sampling variability
in confidence intervals, the range of possibilities for the true values may be so wide that no useful conclusion is possible
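The quantities mentioned above (the coefficient of variation, a confidence interval, and a test of a difference) can all be derived from an estimate and its standard error. The following sketch assumes approximately normal estimates. The function names and input values are hypothetical.

```python
from statistics import NormalDist


def coefficient_of_variation(estimate, se):
    """Standard error relative to the size of the estimate."""
    return se / abs(estimate)


def confidence_interval(estimate, se, confidence=0.95):
    """Normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * se, estimate + z * se


def z_test_difference(est1, se1, est2, se2, cov=0.0):
    """Two-sided z test for est1 - est2; cov is 0 for independent samples."""
    se_diff = (se1 ** 2 + se2 ** 2 - 2 * cov) ** 0.5
    z = (est1 - est2) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical estimates for two subgroups.
print("CV:", coefficient_of_variation(0.30, 0.03))
print("95% CI:", confidence_interval(0.30, 0.03))
print("z, p-value:", z_test_difference(0.30, 0.03, 0.36, 0.04))
```

With these made-up inputs, an observed difference of six percentage points is not statistically significant at the 5% level, which illustrates how poor precision can leave an analysis inconclusive.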
Non-Sampling Error
Errors resulting mainly from the procedures used to measure the characteristics and from poor quality in the implementation of the survey methodology:
coverage error (e.g., use of an inadequate frame)
response or content error (e.g., a poorly designed questionnaire, recording errors, measurement errors)
data processing errors (coding, keying, etc.)
Non-Sampling Error (cont.)
Usually difficult to measure; can be partially measured through coverage evaluation studies, reinterviews, or validation against administrative records.
Effects on the data:
in hypothesis tests, a significant difference or the lack of one may be entirely due to (hidden) nonsampling errors
in confidence intervals, the true value may be far from the limits based on sampling error alone
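The effect of a hidden bias on confidence intervals can be quantified under simple assumptions. If the estimate is normally distributed around the true value plus a bias, the real coverage of a nominal 95% interval depends only on the ratio of the bias to the standard error. The sketch below computes that coverage. The bias ratios shown are illustrative, since in practice the bias is usually unknown.

```python
from statistics import NormalDist


def actual_coverage(bias_ratio, confidence=0.95):
    """Real coverage of a nominal interval when bias = bias_ratio * SE.

    Assumes the estimate is normal with mean (true value + bias) and that
    the interval is estimate +/- z * SE, based on sampling error alone.
    """
    nd = NormalDist()
    z = nd.inv_cdf(0.5 + confidence / 2)
    return nd.cdf(z - bias_ratio) - nd.cdf(-z - bias_ratio)


for ratio in (0.0, 0.5, 1.0, 2.0):
    print(f"bias/SE = {ratio:.1f}  actual coverage = {actual_coverage(ratio):.3f}")
```

Under these assumptions, coverage stays close to the nominal level while the bias is small relative to the standard error and drops sharply once the bias is as large as the standard error or larger.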
Calculation of Biases and Variances
Even though the total error is the most important, it is generally the variance that is considered during analysis. The reason is that, as we mentioned, the variance can be estimated directly from the sample while the measurement of biases requires more complex techniques. Still, even if the quantification of biases is not possible, analytical reports should discuss their sources, efforts made to control them and their possible effects on the data.
Variances, on the other hand, should be calculated and reported in all sample surveys for as many estimates as possible. Appropriate formulas, responding to the sample design specifications, should be used. Simple-random-sampling formulas or computer programs that assume a simple random sample should not be used to estimate variances in a complex survey (e.g., a three-stage cluster design) because such calculations tend to result in large underestimates of the true variance. The underestimation of the variance is a serious error which obviously leads to incorrect analytical conclusions.
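The size of the underestimate can be illustrated with simulated data. The sketch below is deliberately simpler than a three-stage design: it assumes a single-stage sample of equal-size clusters selected with equal probability, ignores the finite population correction, and uses made-up values in which households within a cluster resemble one another. It compares the simple-random-sampling formula with a variance estimate based on the variation between cluster means.

```python
import random
from statistics import mean, variance

random.seed(12345)

# Simulate 50 clusters of 20 households each (made-up values): a shared
# cluster effect makes households in the same cluster similar.
clusters = []
for _ in range(50):
    cluster_effect = random.gauss(0, 1)
    clusters.append([10 + cluster_effect + random.gauss(0, 1) for _ in range(20)])

values = [y for cluster in clusters for y in cluster]

# Naive estimate: treats all households as a simple random sample.
var_srs = variance(values) / len(values)

# Cluster-based estimate: variance between cluster means divided by the
# number of clusters (valid here because the clusters are of equal size).
cluster_means = [mean(cluster) for cluster in clusters]
var_cluster = variance(cluster_means) / len(cluster_means)

design_effect = var_cluster / var_srs
print("SRS-formula variance:    ", var_srs)
print("cluster-based variance:  ", var_cluster)
print("estimated design effect: ", design_effect)
```

In this setup the simple-random-sampling formula understates the variance several times over, so confidence intervals and tests based on it would appear far more precise than the design supports.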
EXAMPLES OF STATEMENTS REQUIRING TESTS