Sample Design Aspects of Analytical Planning

November 12, 2007

The First Arab Statistical Conference

Introduction

We never want to be in a situation where we cannot answer key analytical questions because the data were not collected properly: the questions were worded incorrectly, were incomplete, or were omitted altogether; they were asked of the wrong persons or of only part of the group needed; or the domains of estimation were not identified in advance and the sample size is insufficient.

To avoid these situations (and avoid wasting money, time, and resources), we conduct analytical planning.

We use the term to mean “planning for data analysis”, in this case, statistical analysis. Analytical planning thus consists of planning prior to data collection of the various aspects that will contribute to a quality publication, including report content planning, table planning, questionnaire planning, and sample design planning.

Sample Design Aspects of Analytical Planning

This presentation focuses on the sample design aspects of analytical planning. We will talk about:

1. The population of interest

2. Specifying the types of estimators and the domains of estimation

3. The validity and efficiency of sample design for analytical purposes

4. The types of errors in sample estimates and how they affect analyses

5. Statements made in substantive analysis

Questions to Be Answered in Analytical Planning

For sample design, the main questions are:

What kind of analysis is needed to meet the objectives? What data do we need to do the analysis?

What is the population of interest? For what separate subdomains are estimates required? What level of precision is required?

What statements need to be made? What questions need to be answered? What subgroups or time periods need to be compared?

Examples of Statements to Be Made

Examples:

• In health care analysis, what percentage of the population presents a particular health condition? What percentage of the presenting cases receive treatment? How do these figures compare for different demographic groups? How do these figures compare over time?
• In labor force analysis, the unemployment rate has/has not increased since last year, and by how much. Differences (read: statistically significant differences) in the rate exist/do not exist among various population or geographic subgroups.
Population of Interest
• As one of the first steps in analytical planning, it is necessary to define the exact target population.
• This is the population in scope for the survey, the “Survey Population” or “Target Population”. It is the set of all elements that we want the analysis to cover, that is, all the units of analysis. It is extremely important to specify which units / elements are included as well as those that are excluded.
Population of Interest

In a household survey:

• Will the survey cover the total population or the household population? What is meant by that? Does this include persons living in regular households as well as collective households? Depending on the analytical objectives, some surveys choose to include collective households and others don’t. Collective households are those where unrelated persons share living arrangements in a residential unit, such as labour gatherings (group of workers), students in dormitories, nurses in hospitals, and others. Often, small collective households behave much like regular households and so, only the large ones would require different sampling and data collection approaches.
• Are transient quarters, such as hotels, included? We might include them in a census (as visitors), but they are normally excluded in a household survey because it is not practically possible to capture these people at different field phases. They are also not meaningful for analysis in a household survey where we are usually interested in the conditions of the resident population.
Population of Interest (cont.)
• Thus the household population can be defined to normally include both regular and collective households, and to exclude persons in transient quarters and inmates of institutions (this is not the same as persons living in households located on institutional grounds, such as caretakers or staff, who are not inmates).
• In each country, the scope of coverage is defined appropriately and translated operationally.
• In some countries, a residence criterion is applied to each individual in the household rather than to the dwelling. Rules are developed based on how much time out of the year the person spends at the housing unit and whether the person has another place of residence elsewhere.
• In other countries, residence is defined by whether the person lives in a permanent residence, as opposed to transient quarters, regardless of how long they plan on staying.
Population of Interest (cont.)

In an establishment survey:

The universe of interest, the group of establishments the survey intends to cover, is defined in terms of:

• Size (for example, all establishments employing 1 or more persons, as opposed to sole proprietorships, partnerships, etc.)
• Economic activity: e.g., all manufacturing establishments, or all economic establishments in all activities
• Sector: e.g., private industry, government corporations, mixed ownership establishments, and public administration.

Example:

The target universe for the survey includes all operating non-agricultural establishments, while excluding establishments in public administration and diplomatic/international agencies.

Survey Frame
• The target population in turn determines what sort of sampling frame we are going to use. That is, what sampling unit are we going to use to arrive at the unit of analysis? Does the frame exist already? Does it have to be developed?
• Some surveys use area frames, which means that a random sample of geographic areas (e.g., enumeration blocks) is selected first. In subsequent stages, other smaller areas can be defined or a list can be constructed.
• For example, in a household survey, a field operation is conducted after the selection of primary sampling units which could be census enumeration areas, to list all the housing units/households in a sample of PSUs. Then a random sample of housing units/households is drawn from the listings for interviewing purposes.
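
The two-stage selection described above (PSUs first, then households from the listing) gives each sampled household a known selection probability, and its design weight is the inverse of that probability. The following is a minimal sketch, assuming equal-probability selection at both stages; the function name and the counts are hypothetical and only illustrate the arithmetic.

```python
from fractions import Fraction

def two_stage_weight(psus_selected, psus_in_frame, hh_selected, hh_listed):
    """Design weight of a household under equal-probability selection
    at both stages (an assumption made for this sketch)."""
    p_psu = Fraction(psus_selected, psus_in_frame)  # first-stage probability
    p_hh = Fraction(hh_selected, hh_listed)         # second-stage probability within the PSU
    return 1 / (p_psu * p_hh)                       # weight = 1 / overall probability

# Hypothetical numbers: 50 of 1,000 enumeration areas are selected,
# then 20 of the 120 households listed in one selected area.
weight = two_stage_weight(50, 1000, 20, 120)
print(weight)  # number of population households represented by one sample household
```

With selection proportional to size at the first stage, the first-stage probability would differ from PSU to PSU, but the weight would still be the inverse of the product of the stage probabilities.
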
Survey Frame
• Agricultural production surveys can also use area frames such as census enumeration areas in the first stage and then list farm households in selected areas in a second stage. Such a survey would be for the purpose of estimating characteristics of farm households and, for this purpose, it would be desirable to control the number of farm households falling in the sample.
• Other agricultural surveys, where the main variables of interest are cultivated area, production, yield and similar measures, also use an area frame approach, but land segments are formed and listed instead of households.
Survey Frame
• List frames are also used. For example, in a household survey, as mentioned, within selected PSUs, a listing of households is conducted prior to household selection.
• In establishment surveys, list frames are the most common type of frame because of the existence of business registers or other lists of registered establishments.
Example of Frame Development

In Qatar, a Disability Survey was conducted. It did not use an existing frame because it was important to control how many households containing people with disabilities were interviewed.

The Disability Survey thus began with the construction of an exhaustive frame identifying all households in Qatar that contained people with disabilities. This required listing and contacting all households in the country. The questions asked during listing were the name of the head of household, the number of household members by nationality, sex, and existence of a disability.

Then, in the second phase, households containing people with disabilities were revisited to collect detailed health, social, cultural and economic data through a questionnaire.

Specifying Types of Estimators
• The estimates calculated should be reliable and effective for answering the questions in the survey objectives. This includes presenting both absolute measures (totals or aggregates) and relative measures that permit comparisons between groups. It also means presenting reliable measures for separate subgroups of interest.
• The survey tabulation plans must be reviewed before specifying the sample design in order to identify the type of estimators and the domains for which separate estimates are needed. The choice of sampling procedures must then ensure that the parameters of interest and the variance of their estimators will be estimated correctly and efficiently from the sample.
• Most survey estimates consist of:
  • totals (aggregates), both quantitative (the value of a variable) and qualitative (the number of elements that possess an attribute)
  • means
  • proportions/percentages, and
  • ratios.
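
The four types of estimates listed above can be sketched with weighted sums. This is a minimal illustration, assuming each sampled person carries a design weight; the function names and the tiny data set are hypothetical and not taken from any actual survey.

```python
def weighted_total(values, weights):
    """Estimated population total: sum of weight * value."""
    return sum(w * y for y, w in zip(values, weights))

def weighted_mean(values, weights):
    """Estimated mean (or proportion, when values are 0/1 indicators)."""
    return weighted_total(values, weights) / sum(weights)

def weighted_ratio(numerator, denominator, weights):
    """Estimated ratio of two totals, e.g. unemployed / labor force."""
    return weighted_total(numerator, weights) / weighted_total(denominator, weights)

# Hypothetical sample of four persons with their design weights
weights = [100, 100, 150, 150]
in_labor_force = [1, 1, 1, 0]   # 1 = in the labor force
unemployed = [0, 1, 0, 0]       # 1 = unemployed

print(weighted_total(in_labor_force, weights))              # qualitative total
print(weighted_mean(in_labor_force, weights))               # proportion
print(weighted_ratio(unemployed, in_labor_force, weights))  # ratio of two totals
```

The ratio is a quotient of two estimated totals, which is why its variance needs a different treatment from that of a simple total.
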
Specifying Domains of Estimation
• Estimates are usually required for subgroups of the population as well as for the population as a whole. In principle, whenever separate estimates of a given precision are desired for subgroups of the population, a stratified design should be used in which "strata" are defined beforehand and the sample is appropriately allocated and selected.
• This approach ensures an adequate sample size in each stratum which is important for estimation and analytical purposes.
• However, the definition of strata requires advance knowledge of the value of the classification variable for each population unit. It is not always possible to have this information in advance.
• For instance, it is usually possible to classify households by geographic area before selection. On the other hand, it usually is not possible to classify persons as male or female, employed or unemployed, etc. until after data collection. Of course, estimates are still possible for subpopulations that are defined after the sample selection phase. However, we cannot control their precision in advance and their resulting variance can be large if small numbers of observations occur in the cells.
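
Controlling precision in advance for each domain amounts to fixing a sample size per stratum before selection. A rough sketch of that calculation for a proportion is given below. It assumes the normal approximation, a design effect supplied by the user, and an optional finite population correction; none of these settings comes from the presentation, and the function name is hypothetical.

```python
import math

def required_sample_size(p, margin, z=1.96, deff=1.0, population=None):
    """Sample size needed to estimate a proportion p within +/- margin.

    Assumptions of this sketch: normal approximation, a design effect
    `deff` that inflates the simple-random-sampling size, and an
    optional finite population correction.
    """
    n0 = deff * z**2 * p * (1 - p) / margin**2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return math.ceil(n0)

# Hypothetical requirement: +/- 5 percentage points at 95% confidence in each domain
for domain in ("domain A", "domain B"):
    print(domain, required_sample_size(p=0.5, margin=0.05))
```

Because the requirement applies to each domain separately, the total sample grows roughly in proportion to the number of domains, which is why domains must be identified before the design is fixed.
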
Specifying Domains of Estimation
• In Qatar, separate estimates by urban/rural are not of real interest since Qatar is small and predominantly urban. Municipalities are very small in population size, so it is not efficient to design for separate estimates of a given precision at that level.
• One important interest, however, is in separate estimates of a given precision for Qatari and non-Qatari households. If left to chance, we would not be able to control the number of households of each type falling in the sample and, hence, we would not be able to control the precision of the separate estimates beforehand.
• However, since residential areas that are predominantly Qatari can be physically distinguished in the field from non-Qatari ones, an area frame is constructed which classifies each primary sampling unit (PSU) as Qatari or non-Qatari. This permits selecting separate samples of a target size from each domain, before data collection.

Although a sample includes only part of a population, it would be misleading to call a collection of numbers a "sample" merely because it includes part of a population. To be acceptable for statistical analysis, a sample must represent the population and must have measurable reliability. In addition, the sampling plan should be practical and efficient. There are four basic criteria for the acceptability of a sampling method:

• KNOWN PROBABILITY OF SELECTION OF EACH POPULATION UNIT
  • Must be known in advance and must be greater than zero
  • Only probability sampling meets this criterion
• MEASURABLE RELIABILITY
  • Must be able to measure the variance of the estimates from the sample
  • Only probability sampling meets this criterion
• UNBIASEDNESS
  • The estimates calculated from the sample should be reasonably free of bias
  • This requires using proper weights during the estimation
  • Also requires the use of unbiased and/or consistent estimators
  • Only probability sampling can control sampling biases
  • Biases arising from nonsampling errors must also be controlled
• ECONOMY, FEASIBILITY, AND EFFICIENCY
  • The chosen sample design should be practical (feasible) and within the resources of the survey organization to ensure timeliness and allow for the proper control of nonsampling errors
  • Efficiency means the design should provide the most information with the most accuracy for the lowest cost
• We see that the mathematical properties of sample design, such as the unbiased calculation of estimates and variances, apply only to a specific type of sampling called probability sampling or random sampling.
• A probability sampling scheme is the only sampling scheme that has the following features:
  • population units have a known (a priori) probability of selection greater than zero;
  • sampling biases (selection and/or estimation biases) can be controlled to obtain virtually unbiased estimates of population parameters (note: this does not refer to nonsampling biases, which can affect all types of sampling);
  • it is possible to measure the variance (the reliability or precision) of the estimates directly from the sample.
• Examples of probability sampling: simple random sampling, stratified random sampling, cluster sampling, sampling with probability proportional to size, systematic sampling
• Examples of non-probability sampling: haphazard (without a scheme), convenience, judgment (quota)
• Probability sampling is not only a requirement for the calculation of estimates, it is also one for major statistical analysis methods – hypothesis tests, confidence intervals, regression and others. These methods usually require random samples, unbiased estimates of the population parameters from the sample, and unbiased estimates of their standard errors.

Sometimes, independence of samples is a prerequisite for certain analyses. If independence is an essential condition, then it should be incorporated into the design from the beginning (by adopting a stratification scheme and selecting independent samples).

If samples are not independent, variance calculations must take the covariances into account. In fact, because a positive covariance between two estimates reduces the standard error of their difference, overlap between samples is often built into the design on purpose. This is also called sample rotation, because different sampling units (e.g., PSUs or households) rotate in and out of the sample.

Example: to measure the change in a variable from year to year, we purposely keep part of the sample from the previous year (1/4, 1/2, or 3/4) and add fresh sample only to make up the remaining part. The covariance of the new estimate with the previous year’s estimate (due to the correlation between the two samples) reduces the variance of the difference.
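
The effect of overlap on the variance of a year-to-year difference follows from Var(y2 - y1) = Var(y1) + Var(y2) - 2 Cov(y1, y2). The sketch below uses a deliberately simple working assumption that is not from the presentation: the covariance is taken as overlap × rho × SE1 × SE2, where `overlap` is the fraction of the sample retained and `rho` is the year-to-year correlation for retained units. All numbers are hypothetical.

```python
import math

def se_of_change(se_prev, se_curr, overlap=0.0, rho=0.0):
    """Standard error of (current - previous) under a simplified covariance model.

    Assumption of this sketch: Cov = overlap * rho * se_prev * se_curr.
    With overlap = 0 the two samples are independent.
    """
    cov = overlap * rho * se_prev * se_curr
    return math.sqrt(se_prev**2 + se_curr**2 - 2 * cov)

# Hypothetical standard errors of 0.2 percentage points in each year
print(se_of_change(0.2, 0.2))                         # independent samples
print(se_of_change(0.2, 0.2, overlap=0.75, rho=0.8))  # 3/4 of the sample retained
```

Under these assumed values, retaining three quarters of the sample gives a visibly smaller standard error for the change than two independent samples would.
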

Illustration of Annual Sample Rotation

The letters refer to “panels” or random subsamples of households, such that four panels make up a full sample.
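
A rotation pattern of this kind can be generated programmatically. The sketch below is a hypothetical scheme, not a reproduction of the slide's illustration: it assumes four panels per annual sample and a fixed number of panels replaced each year, with panels labelled by letters.

```python
from string import ascii_uppercase

def rotation_schedule(years, panels_per_sample=4, replaced_per_year=1):
    """Return the panel labels in sample for each year (hypothetical scheme)."""
    schedule = []
    for year in range(years):
        first = year * replaced_per_year  # index of the oldest panel still in sample
        schedule.append(list(ascii_uppercase[first:first + panels_per_sample]))
    return schedule

def overlap_fraction(previous, current):
    """Share of the current sample that was also in the previous sample."""
    return len(set(previous) & set(current)) / len(current)

schedule = rotation_schedule(4)
for year, panels in enumerate(schedule, start=1):
    print(f"Year {year}: {' '.join(panels)}")
print(overlap_fraction(schedule[0], schedule[1]))
```

Replacing one panel per year yields a 3/4 overlap between consecutive years; setting `replaced_per_year=2` gives a 1/2 overlap.
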

For all estimates obtained from a sampling procedure, the accuracy of the estimates is a function of both sampling and nonsampling error. Other data sources are also subject to error, for example, error in models, biases due to incomplete registers, recording errors, etc.

The overall design of the study must keep in mind the total error. The goal is to balance sampling errors (SEs) and nonsampling errors (NSEs), that is, to control both variance and biases. While large sample sizes reduce variance, they can also increase biases if they place an undue burden on resources and make it difficult to implement quality control. Likewise, if the questionnaire is flawed or the interviewers are not well trained, no sample design is going to remedy the resulting response errors.

Types of Error

1. Total Error

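• Should be kept in mind when making inferences based on survey estimates
• Measured by the mean square error (MSE) of an estimator $\hat{\theta}$:

$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}^2(\hat{\theta})$$

• Bias is the difference between the expected value of the estimator and the true value of the parameter: $\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$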
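
The decomposition of the MSE into variance plus squared bias can be checked exactly on a toy population by enumerating every possible sample. The population values, the sample size, and the deliberately biased estimator below are hypothetical choices made for this sketch.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10]
true_mean = Fraction(sum(population), len(population))

def error_components(estimator, n=2):
    """Variance, bias and MSE of an estimator over all equally likely samples of size n."""
    estimates = [estimator(s) for s in combinations(population, n)]
    k = len(estimates)
    expected = sum(estimates, Fraction(0)) / k
    variance = sum((e - expected) ** 2 for e in estimates) / k
    bias = expected - true_mean
    mse = sum((e - true_mean) ** 2 for e in estimates) / k
    return variance, bias, mse

def sample_mean(sample):
    return Fraction(sum(sample), len(sample))

def sample_max(sample):
    # Deliberately biased estimator of the population mean
    return Fraction(max(sample))

for estimator in (sample_mean, sample_max):
    variance, bias, mse = error_components(estimator)
    print(estimator.__name__, variance, bias, mse, mse == variance + bias**2)
```

Exact fractions are used so that the identity holds without rounding error; the sample mean shows zero bias, while the biased estimator's MSE is larger than its variance by exactly the squared bias.
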

2. Sampling Error

• The difference between the estimate based on a sample and the same measure that would be derived from a complete count under otherwise identical conditions
• Mostly variable error (variance), represented by the standard error (the square root of the variance), which gives an inverse measure of the precision (reliability) of the estimates
• Lower when the sample size is large and when efficiency measures are built into the sample design, such as efficient stratification and sample allocation, selection with probability proportional to size (or other schemes) when primary sampling units vary in size, and the creation of certainty strata where large units are selected with a probability of 100%
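
One common way to make sample allocation efficient is Neyman allocation, which assigns more sample to strata that are larger or more variable. The presentation does not name a specific allocation method, so this is offered only as an illustrative sketch; the stratum sizes and standard deviations are hypothetical.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n proportionally to N_h * S_h (Neyman allocation).

    Rounding can make the allocations sum to slightly more or less than n.
    """
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [round(n * p / total) for p in products]

def proportional_allocation(n, stratum_sizes):
    """Allocate n proportionally to stratum size only."""
    total = sum(stratum_sizes)
    return [round(n * N_h / total) for N_h in stratum_sizes]

# Hypothetical strata: a small but highly variable stratum and a large, homogeneous one
sizes = [1000, 3000]
sds = [10, 5]
print(proportional_allocation(500, sizes))
print(neyman_allocation(500, sizes, sds))
```

In this hypothetical case the smaller stratum receives a larger share under Neyman allocation than under proportional allocation because its values vary more.
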

2. Sampling Error (cont.)

• In addition to variance, sampling error may also include sampling biases, which are systematic rather than random in nature. Some sampling biases, such as the bias of ratio estimators, tend to be negligible when sample sizes are large. Another possible source of sampling bias is the use of incorrect sampling weights.
• Uses of the standard error:
  • hypothesis tests
  • confidence intervals
  • coefficient of variation (CV)
• Effects on the data: statistical inferences based on estimates with poor precision (high sampling error) are often inconclusive
  • In hypothesis tests, real differences in the population may go unnoticed because observed differences may be attributed entirely to sampling variability
  • In confidence intervals, the range of possibilities for the true values may be so wide that no useful conclusion is possible
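
Two of the uses of the standard error listed above, the confidence interval and the coefficient of variation, reduce to short formulas. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval; the estimate and standard errors are hypothetical.

```python
def confidence_interval(estimate, se, z=1.96):
    """Normal-approximation confidence interval (95% by default)."""
    return estimate - z * se, estimate + z * se

def coefficient_of_variation(estimate, se):
    """Standard error relative to the estimate."""
    return se / estimate

# Hypothetical unemployment rate of 4.4% estimated with two different standard errors
for se in (0.2, 1.5):
    low, high = confidence_interval(4.4, se)
    cv = coefficient_of_variation(4.4, se)
    print(f"SE={se}: CI=({low:.2f}, {high:.2f}), CV={cv:.1%}")
```

With the larger standard error the interval spans several percentage points, which illustrates how poor precision can make a conclusion about the true value impossible.
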

3. Non-Sampling Error

• Errors resulting mainly from the procedures used to measure the characteristics and from poor quality in the implementation of the survey methodology
• Mostly biases
• Major types:
  • coverage error (e.g., use of an inadequate frame)
  • response or content error (e.g., a poorly designed questionnaire, recording errors, measurement errors)
  • nonresponse error
  • data processing errors (coding, keying, etc.)

3. Non-Sampling Error (cont.)

• Usually difficult to measure; can be partially measured through coverage evaluation studies, reinterviews or validation against administrative records
• Effects on the data:
  • In hypothesis tests, a significant difference or the lack of one may be entirely due to (hidden) nonsampling errors
  • In confidence intervals, the true value may be far from the limits based on sampling error alone

Calculation of Biases and Variances

Even though the total error is the most important, it is generally the variance that is considered during analysis. The reason is that, as we mentioned, the variance can be estimated directly from the sample while the measurement of biases requires more complex techniques. Still, even if the quantification of biases is not possible, analytical reports should discuss their sources, efforts made to control them and their possible effects on the data.

Variances, on the other hand, should be calculated and reported in all sample surveys for as many estimates as possible. Appropriate formulas, responding to the sample design specifications, should be used. Simple-random-sampling formulas or computer programs that assume a simple random sample should not be used to estimate variances in a complex survey (e.g., a three-stage cluster design) because such calculations tend to result in large underestimates of the true variance. The underestimation of the variance is a serious error which obviously leads to incorrect analytical conclusions.

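
The underestimation produced by simple-random-sampling formulas can be seen on a small clustered data set. The sketch below is a simplified illustration, not the three-stage case mentioned in the text: it assumes a one-stage sample of equal-sized clusters selected with equal probability, ignores the finite population correction, and uses made-up values in which units within a cluster resemble one another.

```python
from statistics import mean, variance

# Hypothetical data: four sampled clusters of three units each
clusters = [[1, 1, 2], [4, 5, 5], [8, 9, 9], [12, 12, 13]]

def cluster_variance_of_mean(clusters):
    """Variance of the estimated mean based on variation between cluster means."""
    cluster_means = [mean(c) for c in clusters]
    return variance(cluster_means) / len(clusters)

def srs_variance_of_mean(clusters):
    """Naive variance that treats all units as a simple random sample."""
    values = [y for c in clusters for y in c]
    return variance(values) / len(values)

design_based = cluster_variance_of_mean(clusters)
naive = srs_variance_of_mean(clusters)
print(design_based, naive, design_based / naive)  # last value: design effect
```

The ratio of the two variances is the design effect; a value above 1 means the simple-random-sampling formula understates the variance for these data.
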
Substantive Analysis
• This section can be given different titles, for example, “Key Findings” or “Main Results”. Under any title, this part consists of a series of analytical statements (comparisons and conclusions) based on descriptive analysis and significance testing. It states conclusions that can be drawn from the data and answers the key data questions that the survey intended to answer. For example, what was the unemployment rate (or whatever variable the survey was measuring)? Did the unemployment rate go up, go down, or stay the same? Did it differ among groups of people? Is there any surprising or new information?
• Comparability factors must be taken into account and reported to the users. Are observed differences due to changes in survey scope, changes in the definitions of survey concepts, etc., or are they actual changes in the value of the variable that occurred over time? The comparability statements can be part of the substantive analysis or they can be presented instead in the technical appendix.
• The substantive analysis must be free of erroneous statements or deductions, which can undermine the quality of the results. The analyst must be technically well-qualified to avoid these errors. Analytical statements should not be made without significance (hypothesis) testing.
Substantive Analysis

EXAMPLES OF STATEMENTS REQUIRING TESTS

• UNEMPLOYMENT RATE
  • decreased overall by 0.1 percentage point to 4.4%. The male unemployment rate decreased by 0.1 percentage point to 4.0%, and the female unemployment rate remained at 4.9%.
• PARTICIPATION RATE
  • increased by 0.2 percentage points to 64.9%.
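
Statements like these can be backed by a test of whether the observed change differs significantly from zero. The sketch below uses a two-sided z-test under a normal approximation. The standard errors of the changes are hypothetical, since the presentation does not report them, and in practice they would have to reflect the sample design and any overlap between the two samples.

```python
from statistics import NormalDist

def z_test_of_change(change, se_change, alpha=0.05):
    """Two-sided z-test of H0: true change = 0 (normal approximation)."""
    z = change / se_change
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical standard errors for the changes quoted above
for label, change, se in [("unemployment rate", -0.1, 0.14),
                          ("participation rate", 0.2, 0.08)]:
    z, p, significant = z_test_of_change(change, se)
    print(f"{label}: z={z:.2f}, p={p:.3f}, significant={significant}")
```

Under these assumed standard errors, the 0.1-point decrease would not be statistically significant while the 0.2-point increase would be, which shows why the wording of an analytical statement should depend on the test and not on the point estimates alone.
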