Ch 2: probability sampling, SRS

# Ch 2: probability sampling, SRS - PowerPoint PPT Presentation

## PowerPoint Slideshow about 'Ch 2: probability sampling, SRS' - wendi

Ch 2: probability sampling, SRS
• Overview of probability sampling
• Establish basic notation and concepts
• Population distribution of Y : object of inference
• Sampling distribution of an estimator under a design: assessing the quality of the estimate used to make inference
• Apply these to SRS
• Selecting a SRS sample
• Estimating population parameters (means, totals, proportions)
• Estimating standard errors and confidence intervals
• Determining the sample size
Assume ideal setting
• Sampled population = target population
• Sampling frame is complete and does not contain any observation units (OUs) beyond the target population
• No unit nonresponse
• Measurement process is perfect
• All measurements are accurate
• No missing data (no item nonresponse)
• That is, nonsampling error is absent
Survey error model

Total Survey Error

Sampling Error

Nonsampling Error

=

+

Due to the sampling process (i.e., we observe only part of population)

Measurement errorNonresponse errorFrame error

Assessed via bias and variance

Probability sample
• DEFN: A sample in which each unit in the population has a known, nonzero probability of being included in the sample
• Known probability ⇒ we can quantify the probability of a sampling unit (SU) being included in the sample
• Assign during design, use in estimation
• Nonzero probability ⇒ every SU has a positive chance of being included in the sample
• Proper survey estimates represent entire target population (under our ideal setting)
Probability sampling relies on random selection methods
• Random sampling is NOT a haphazard method of selection
• Involves very specific rules that include an element of chance as to which unit is selected
• Only the outcome of the probability sampling process (i.e., the resulting sample) is random
• More complicated than non-random samples, but provides important advantages
• Avoid bias that can be induced by selector
• Required to calculate valid statistical estimates (e.g., mean) and measures of the quality of the estimates (e.g., standard error of mean)
Representative sample
• Goal is to have a “representative sample”
• Probability sampling is used to achieve this by giving each OU in target population an explicit chance to be included in the sample
• Sample reflects variability in the population
• Applies to the sample as a whole, not to individual OUs/SUs (don’t expect each observation to be a “typical” population unit)
• Can create legitimate sample designs that deliberately skew the sample to include adequate numbers of important parts of the variation
• Common example: oversampling minorities, women
• MUST use estimation procedures that take into account the sample design to make inferences about the target population (e.g., sample weights)
Basic sampling designs
• Simple selection methods
• Simple random sampling (Ch 2 & 3)
• Select the sample using, e.g., a random number table
• Systematic sampling (2.6, 5.6)
• Random start, take every k-th SU
• Probability proportional to size (6.2.3)
• “Larger” SU’s have a higher chance of being included in sample
• Selection methods with explicit structure
• Stratified sampling (Ch 4)
• Divide population into groups (strata)
• Take sample in every stratum
• Cluster sampling (Ch 5 & 6)
• OUs aggregated into larger units called clusters
• SU is a cluster
Examples
• Select a sample of n faculty from the 1500 UNL faculty on campus
• Goal: estimate total (or average) number of hours faculty spend per week teaching courses
• Simple random sampling (SRS)
• Number faculty from 1 to 1500
• Select a set of n random numbers (integers) between 1 and 1500
• Faculty with ids that match the random numbers are included in the sample
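The three SRS steps above can be sketched in Python. The sample size n = 20 and the seed are illustrative assumptions (the slide leaves n unspecified); `random.sample` draws without replacement, so no id can appear twice.

```python
import random

N = 1500   # UNL faculty on the frame, numbered 1..N (from the slide)
n = 20     # assumed sample size; the slide does not fix n

rng = random.Random(2024)  # assumed seed, used only for reproducibility

# Draw n distinct ids between 1 and N; every subset of size n is equally likely.
sample_ids = sorted(rng.sample(range(1, N + 1), n))
print(sample_ids)
```

Faculty whose ids appear in `sample_ids` would form the sample.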
Examples - 2
• Systematic sampling (SYS)
• Choose a random start between 1 and k, where the sampling interval is k = 1500/n
• Select the faculty member with that id, and then take every k-th faculty member in the list
• SRS / SYS
• Each faculty member has an equal chance of being included in sample
• Under SRS, each sample of n faculty is equally likely; under SYS, only the k possible systematic samples can occur, each with equal probability
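A minimal sketch of the systematic selection described above. It assumes n divides N evenly (n = 20 gives k = 75); the function name and seed are hypothetical.

```python
import random

def systematic_sample(N, n, rng):
    """Random start in 1..k, then every k-th unit; assumes n divides N."""
    k = N // n                    # sampling interval
    start = rng.randint(1, k)     # random start, inclusive on both ends
    return [start + j * k for j in range(n)]

rng = random.Random(7)            # assumed seed
ids = systematic_sample(1500, 20, rng)
print(ids)
```

Only k = 75 distinct samples can ever be produced this way, one per starting value, which is why SYS gives equal inclusion probabilities without making every subset of size n equally likely.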
Examples - 3
• Probability proportional to size (PPS)
• With pps design, we assign a selection probability to each faculty member that is proportional to the number of courses taught by a faculty member that semester
• “Size” measure = # of courses taught by faculty member
• Faculty who teach more courses are more likely to be included in the sample, but those that teach less still have a positive chance of being included
• Motivation: faculty that spend more hours on courses are more critical to getting good estimate of total hours spent
• Data from faculty with higher inclusion probabilities will be “down weighted” relative to those with lower probabilities during the estimation process
• Typically accomplished using weights for each observation in the dataset
Examples - 4
• Stratified random sampling (STS)
• Organize list of faculty by college
• Stratum = college
• Allocate n (divide sample size) among colleges so that we select n_h faculty in the h-th college
• Sum of n_h over strata equals n
• Use SRS, e.g., to select sample in each of the college strata
• Could use SYS or PPS rather than SRS
• Could have different selection methods in each stratum
Examples - 5
• Cluster sampling (CS)
• Aggregate faculty into departments
• OU = faculty member, SU = dept
• Select a sample of departments, e.g., using SRS
• Very common to use PPS for selecting clusters
• “Size” measure = number of OUs in the cluster SU
• Many variants for cluster sampling
• After selecting clusters, may want to select a sample of OUs in the cluster rather than taking data on every OU
• E.g., select 15 depts in the first stage of sampling, then select 10 faculty in each dept in a second stage of sampling
• This is called 2-stage sampling
Examples - 6
• Complex sample designs (Ch 7)
• Combine basic selection methods (SRS, SYS, PPS) with different methods of organizing the population for sampling (strata, clusters)
• Typically have more than one stage of sampling (multi-stage design)
• Often cannot create a frame of all OUs in the population
• Need to select larger units first and then construct a frame
• Stratification and systematic sampling are often used to encourage spread across the population
• This improves chances of obtaining a representative sample
• Costs are often reduced by selecting clusters of OUs, although cluster sampling may lead to less precision in estimates
Notation for target population
• The total number of OUs in the population (also called the universe) is denoted by N
• Note UPPER CASE
• Ideally for SRS, sampling frame is list of N OUs in the pop
• EX: there are N = 4 households in our class
• Index set (labels) for all OUs in the population (or universe) is called U
• U = {1, 2, …, N}
• A different index set could be our names, or our SSNs
• Each person has a value for the characteristic of interest or random variable Y , the number of people in the household
• The value of Y for household i is denoted by yi
• Values in the population are y1, y2, …, yN
Notation for sample
• Sample size is denoted by n
• Note lower case
• n is always less than or equal to N (n = N is a census)
• Index set (labels) for OUs in the sample is denoted by S
• To select a sample, we are selecting n indices (labels) from the universe U , consisting of N indices for the population
• U is our sampling frame in this simple setting
• Labels in S may not be sequential because we are selecting a subset of U
Class example
• Suppose n = 2 households are selected from a population of N = 4 households in the class
• U = {1, 2, 3, 4}
• Randomly select sample using SRS and get 2 and 3
• S = {2, 3}
• The data collected on OUs in the sample are values for Y = number of people in the household
• Data: y2, y3
Summary of probability sampling framework
• Assumptions (for now)
• Observation unit = sampling unit
• Target population = sampling universe = sampling frame
• N = finite number of OUs in the population
• U = {1, 2, …, N} is the index set for the OUs in the population
• Sample
• n = sample size (n is less than or equal to N )
• S = index set for n elements selected from population of N units (S is a subset of U)
Conceptual basis for probability sampling
• Conceptual framework for selecting samples
• Enumerate all possible samples of size n from the population of size N
• Each sample has a known probability of being selected
• P(S) = probability of selecting sample S
• Use this probability scheme to randomly choose the sample
• Using the probability scheme for the samples, can determine the inclusion probability for each SU
• $\pi_i$ = probability that a sample is selected that includes unit i
Simple example
• Population of 4 students in study group, take a random sample of 2 students
• Setting
• U = {1, 2, 3, 4}
• N = 4
• n = 2
• All possible samples of size n = 2 from N = 4 elements
• Note: n < N and $S \subset U$
Simple example - 2
• All possible samples

S1 = {1, 2} S3 = {1, 4} S5 = {2, 4}

S2 = {1, 3} S4 = {2, 3} S6 = {3, 4}

• Design is determined by assigning a selection probability to each possible sample

P(S1) = 1/3 P(S3) = 1/2 P(S5) = 0

P(S2) = 1/6 P(S4) = 0 P(S6) = 0

Simple example - 3
• Inclusion probability definition?
• What is the probability that student 1 is included in the sample?
• 1 =
• Inclusion probability for student 2, 3, 4?
• 2 =
• 3 =
• 4 =
• Is this a probability sample?
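One way to check the answers to these questions is to sum P(S) over the samples that contain each student. The sketch below uses the selection probabilities from the "Simple example - 2" slide; the variable names are illustrative.

```python
from fractions import Fraction

# Design from the slides: selection probability P(S) for each possible sample.
design = {
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (1, 4): Fraction(1, 2),
    (2, 3): Fraction(0),
    (2, 4): Fraction(0),
    (3, 4): Fraction(0),
}

# pi_i = sum of P(S) over all samples S that contain unit i.
pi = {i: sum(p for s, p in design.items() if i in s) for i in (1, 2, 3, 4)}

for i, prob in pi.items():
    print(f"pi_{i} = {prob}")

# A probability sample needs every pi_i to be known and strictly positive.
is_probability_sample = all(prob > 0 for prob in pi.values())
print(is_probability_sample)
```

Student 1 is in every sample that can occur, so the design is far from equal-probability even though each inclusion probability is positive.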
Population distribution
• Response variables represent values associated with a characteristic of interest for i-th OU
• Y is the random variable for the characteristic of interest (CAP Y)
• yi = value of characteristic for OU i (small y)
• The population distribution is the distribution of Y for the target population
• Y is a discrete random variable with a finite number of possible values (<= N values)
• Use discrete probability distribution to represent the distribution of Y
Population distribution - 2
• A discrete probability distribution is denoted by a series of pairs corresponding to
• Value of the random variable Y, denoted by y
• Relative frequency of the value y for the random variable Y in the population, denoted by P(Y = y)
• Pair is { y , P(Y = y) }
• Constructing a probability distribution
• List all unique values y of random variable Y
• Record the relative frequency of y in the population, P(Y = y)
Class example - 2
• Back to # of people in household for each class member
• What are the unique values in the pop?
• What is the frequency of each value?
• What is the relative frequency of each value?
• Construct a histogram depicting the variation in values
Summarizing the population distribution
• Use population parameters to summarize population distribution
• Mean or expected value of y (parameter: $\bar{y}_U$)
• Proportion of population having a particular characteristic = mean of a binary (0, 1) variable (parameter: p)
• For finite populations, population total of y is often of interest (parameter: t)
• Variance of y (parameter: $S^2$)
Mean of Y for population
• Expected value, or population mean, of Y: $\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
• Mean is in y-units per OU-unit
• Measure of central tendency (middle of distn)
• Related to population total (t) and proportion (p)
• Examples
• Average number of miles driven per week by adults in the US
• Average number of phone lines per household
Class example - 3
• What is the mean household size for people in this classroom?
Total of Y in population
• Population total of Y: $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$
• Total number of y-units in the population
• Examples
• Number of households in market area with DSL
• yi = 1 if household i has DSL, yi = 0 if not
• N = number of households in market area
• Number of deer in Iowa
• yi = number of deer observed in area i
• N = number of observation areas in Iowa
Class example - 4
• What is the total number of people living in households of people in the classroom?
Proportion
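• Proportion (p) of population having a particular characteristic
• Mean of binary variable: $p = \frac{1}{N}\sum_{i=1}^{N} y_i$, where yi = 1 if OU i has the characteristic and yi = 0 if not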
Class example - 5
• What proportion of people in the classroom have a cell phone?
Population variance of Y
• Population variance of Y: $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2$
• Measure of spread or variability in population’s response values
• Analogous to $\sigma^2$ in other stat classes
• Not the standard error of an estimate
• Note this is CAP $S^2$
Coefficient of variation for Y
• Variation relative to mean (unitless): $CV = S / \bar{y}_U$
Class example - 6
• What is the population variance for number of people in households of people in the classroom?
• What is the CV?
Summary of population distribution of Y
• Basic pop unit: OU (i)
• Number of units or size of pop: N
• Random variable: Y
• Parameters: characterize the target population
• Mean $\bar{y}_U$
• Total t
• Proportion (mean) p
• Variance $S^2$
• Coefficient of variation CV = S / $\bar{y}_U$
• STATIC: it is the object of inference and never changes with design or estimator
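As an illustration of these parameters, the sketch below computes them for the four-unit population (y = 3, 5, 1, 3) that these slides use in the sampling-distribution example. The N − 1 divisor in the variance and the binary characteristic "yi ≥ 3" are assumptions made for the example.

```python
from math import sqrt

y = [3, 5, 1, 3]          # population values y_1..y_N
N = len(y)

total = sum(y)                                     # t
mean = total / N                                   # ybar_U
S2 = sum((yi - mean) ** 2 for yi in y) / (N - 1)   # S^2, assuming the N - 1 divisor
CV = sqrt(S2) / mean                               # unitless

# Proportion as the mean of a binary variable (assumed characteristic: y_i >= 3).
p = sum(1 for yi in y if yi >= 3) / N

print(mean, total, S2, CV, p)
```

None of these values depend on how a sample is later drawn, which is the sense in which the population distribution is static.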
What’s next
• Population distribution of Y is object of inference
• Use SRS to select a sample and estimate the parameters of the population distribution
• How to select a sample
• Estimators for population parameters of Y under SRS
• Sample mean estimates population mean
• N x sample mean estimates population total
• Sample variance estimates population variance
• Assessing the quality of an estimator of a population parameter under SRS
• Sampling distribution
• Bias, standard error, confidence intervals for the estimator
Simple random sample (SRS)
• DEFN: A SRS is a sample in which every possible subset of n SUs has an equal chance of being selected as the sample
• ⇒ every sampling unit has equal chance of being included in the sample
• Example of an “equal probability” sample
• Does not imply that a sample in which each SU has the same inclusion probability is a SRS
• Other non-SRS designs can generate equal probability samples
Simple random sampling (SRS)
• Two types
• SRSWR (SRS with replacement)
• Return SU after each step in the selection process
• SRSWOR (SRS without replacement)
• Do not return SU after it has been selected
• Selection probability
• Probability that a unit is selected in a single draw
• Constant throughout SRSWR process
• Changes with each draw in the SRSWOR process
• NOT an inclusion probability, which considers the probability of drawing a sample that includes unit i
SRSWR (SRS with replacement)
• Selection procedure
• Select one OU with probability 1/N from N OUs
• This is the selection probability for each draw
• Return the selected OU to the universe
• Repeat n times
• Procedure is like drawing n independent samples of size 1
• Can draw a sampling unit twice – duplicate units
• Unappealing for finite populations – no additional info in having a duplicate unit
• Useful in theoretical development for large populations
Focus: SRSWOR (SRS without replacement)
• Selection procedure
• Select one OU from universe of size N with probability 1/N
• DON’T return selected unit to universe
• Select 2nd OU from remaining units in universe with probability 1/(N - 1)
• DON’T return selected unit to universe
• Repeat until n sampling units have been selected
• Selection probabilities change with each draw
• 1/N, then 1/(N -1), then 1/(N -2), …, 1/(N – n +1)
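The draw-by-draw procedure above can be sketched as follows. The function name, the small N = 10, n = 3 case, and the seed are illustrative assumptions.

```python
import random

def srswor(N, n, rng):
    """Draw n units one at a time without replacement from labels 1..N."""
    remaining = list(range(1, N + 1))
    sample, draw_probs = [], []
    for _ in range(n):
        draw_probs.append(1 / len(remaining))   # 1/N, then 1/(N-1), ...
        pick = rng.randrange(len(remaining))    # each remaining unit equally likely
        sample.append(remaining.pop(pick))      # the unit is NOT returned
    return sample, draw_probs

sample, draw_probs = srswor(N=10, n=3, rng=random.Random(1))
print(sample, draw_probs)
```

Replacing `remaining.pop(pick)` with `remaining[pick]` would turn this into SRSWR, where the selection probability stays at 1/N on every draw.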
SRSWOR (SRS without replacement)
• Probability of selecting a sampling unit in a single draw depends on number of SUs already selected (conditional probability)
• On the c-th step of the process, c − 1 SUs have already been selected for a sample of size n
• Probability of selecting any one of the remaining N − c + 1 SUs in the next draw is 1/(N − c + 1)
• Inclusion probability for SU i (unconditional probability): $\pi_i = n/N$
• (see p. 44 in text)
SRSWOR (SRS without replacement)
• Number of possible SRSWOR samples of size n from universe of size N: $\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$
• Probability of selecting a sample S: $P(S) = 1 / \binom{N}{n}$

(Probability is the same for all samples)
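A short check of the counting argument, using the N = 4, n = 2 study-group example. Enumerating the samples also confirms that the inclusion probability under SRSWOR works out to n/N; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N, n = 4, 2
samples = list(combinations(range(1, N + 1), n))  # all subsets of size n
num_samples = comb(N, n)                          # N choose n
p_sample = Fraction(1, num_samples)               # same for every sample

# Inclusion probability of unit 1: add P(S) over the samples containing it.
pi_1 = sum(p_sample for s in samples if 1 in s)

print(num_samples, p_sample, pi_1)
# For a realistic frame, listing the samples is hopeless:
print(len(str(comb(3078, 300))), "digits in the number of samples for N=3078, n=300")
```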

Selecting a SRS using SRSWOR
• Create a sampling frame
• List of sampling units in the universe or population
• Assigns an index to each sampling unit
• Determine a selection procedure that performs SRSWOR
• Procedure must generate n unique sampling units such that each SU has an equal chance of being included in the sample
• Random number generator or table is common basis
• Need rules to identify when the selected unit is included in the sample or tossed
• Select random numbers and determine sampled units
Using random numbers to select a SRSWOR sample
• Determine a rule to assign random numbers to the sampling universe index set U
• Rule must give each unit an equal chance of being included in the sample
• Select the set of random numbers, e.g., using computer or printed random number table
• Apply the rule to each random number to determine the sampled OU
• Check to see if this OU has already been selected
• If already selected, ignore it
• Keep going until you have n SUs in the sample
Census of Agriculture example

Select 300 counties from 3078 counties in the US

• N = 3078
• n = 300
• Sampling frame = ?
• Generate random numbers between 0 and 1 on the computer
• Need n or more random numbers depending on rule
• Multiply each random number by N = 3078 and round up to the nearest integer
• Random number = .61663
• Multiply random # by N = 3078 x .61663 = 1897.98714
• Round up to 1898
• Take 1898th county in the frame
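The multiply-and-round-up rule, together with the "ignore duplicates" rule from the previous slide, can be sketched like this. The helper name and seed are hypothetical; `random.random()` returns values in [0, 1), so the sketch flips them to (0, 1] to keep the rounded index between 1 and N.

```python
import math
import random

N, n = 3078, 300

def unit_from_uniform(u, N):
    """Map u in (0, 1] to a frame position 1..N: multiply by N and round up."""
    return math.ceil(u * N)

print(unit_from_uniform(0.61663, N))   # the slide's worked number

rng = random.Random(1992)              # assumed seed
selected = set()
while len(selected) < n:
    u = 1.0 - rng.random()             # in (0, 1], so ceil never returns 0
    selected.add(unit_from_uniform(u, N))   # a repeated county is simply ignored

counties = sorted(selected)
```

Because duplicates are discarded, more than n random numbers are usually needed, which is the point of the "n or more" bullet.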
Estimating population mean under SRS
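• Target population mean: $\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
• Estimator of $\bar{y}_U$ for SRS sample of size n is the sample mean: $\bar{y} = \frac{1}{n}\sum_{i \in S} y_i$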
• Note
• “Estimator” refers to the formula
• “Estimate” refers to the value obtained from using the formula with data
Class example - 7
• Estimate the average household size for our classroom
Estimating population total
• Target population total: $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$
• Estimator of t for SRS sample of size n: $\hat{t} = N\bar{y}$
Class example - 8
• Estimate the total number of people living in the households of people in this classroom
Estimating population proportion
• Target population proportion: $p = \frac{1}{N}\sum_{i=1}^{N} y_i$
• Y takes on values 0 or 1, where 1 means the unit has the characteristic of interest
• Estimator of p for SRS sample of size n: $\hat{p} = \bar{y} = \frac{1}{n}\sum_{i \in S} y_i$
Class example - 9
• Estimate the proportion of people with cell phones in this class room
Estimating population variance
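• Target population variance: $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2$
• Estimator of $S^2$ for SRS sample of size n is the sample variance: $s^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \bar{y})^2$

(note lower case s)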
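A small sketch of the SRS estimators for the mean, total, proportion, and variance. The function names and the data are hypothetical: a sample of two values (5 and 1) from a population of N = 4, and four made-up 0/1 responses.

```python
def srs_estimates(sample_y, N):
    """Sample mean, estimated total N * ybar, and sample variance s^2."""
    n = len(sample_y)
    ybar = sum(sample_y) / n
    t_hat = N * ybar
    s2 = sum((y - ybar) ** 2 for y in sample_y) / (n - 1)   # n - 1 divisor
    return {"ybar": ybar, "t_hat": t_hat, "s2": s2}

def srs_proportion(sample_flags):
    """Sample proportion = sample mean of a 0/1 variable."""
    return sum(sample_flags) / len(sample_flags)

est = srs_estimates([5, 1], N=4)        # hypothetical sample values
p_hat = srs_proportion([1, 0, 1, 1])    # hypothetical 0/1 responses
print(est, p_hat)
```

The estimator is the function; the estimate is the number it returns for one particular sample.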

Class example - 10
• Estimate the variance of number of people in households of people in this class room
Estimating population standard deviation and CV
• Standard deviation of Y, S ?
• Estimator of standard deviation of Y?
• CV of population distribution?
• Estimator of CV?
What would happen if we took another sample?
• S =
• Data =
• Estimates
• Mean
• Total
• Proportion
• Standard deviation
• CV
Sampling distribution
• Need to assess the quality of our estimates
• Is $\bar{y}$ a good estimator of $\bar{y}_U$ ?
• Is $\hat{p}$ a good estimator of p ?
• Is s2 a good estimator of S2 ?
• Use the sampling distribution to assess the quality of the estimator
• Distribution of estimator over all possible samples
• EX: distribution of $\bar{y}$ over all possible SRS samples of size n from a population of size N
Measures of quality
• Denote
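• Population parameter as $\theta$ [think pop mean $\bar{y}_U$]
• Estimator of $\theta$ as $\hat{\theta}$ [think sample mean $\bar{y}$]
• Mean of the sampling distribution is the expected value of the estimator, $E\{\hat{\theta}\}$
• An estimator is unbiased if $E\{\hat{\theta}\} = \theta$
• Variance of the sampling distribution, $V\{\hat{\theta}\} = E\{(\hat{\theta} - E\{\hat{\theta}\})^2\}$
• Precision: want variance of estimator to be small
• Coefficient of variation, $CV\{\hat{\theta}\} = \sqrt{V\{\hat{\theta}\}} \,/\, E\{\hat{\theta}\}$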
• Relative precision: want CV to be small
Sampling distribution of estimator
• Basic pop unit: sample selected using a specific design, S
• Number of units or size of pop: number of possible samples
• Need probability of selecting sample !
• Random variable: estimator of parameter, $\hat{\theta}$
• Parameters: characterize the quality of the estimator
• Mean (assesses bias of the estimator), $E\{\hat{\theta}\}$
• Variance, SE, CV (assesses precision of estimator)
• DEPENDS on population parameter, estimator of population parameter, sample design
Population distribution vs. sampling distribution
• Population distribution
• Basic unit: OU (i)
• Total number of units: N
• Random variable: characteristic of interest, Y
• Parameters: characterize the target population
• Mean $\bar{y}_U$, proportion p (central tendency)
• Total t
• Variance $S^2$, std dev S, CV (spread of distn)
• STATIC: once you identify Y, the population distribution is the object of inference and never changes with design or estimator
• Sampling distribution
• Basic unit: sample selected using a specific design, S
• Total number of units: number of possible samples
• Random variable: estimator of parameter, $\hat{\theta}$
• Parameters: characterize the quality of the estimator
• Mean (used to assess bias of the estimator)
• Variance, SE, CV (precision of estimator)
• DEPENDS on population parameter, estimator of population parameter, sample design
Conceptual framework for a sampling distribution - 1
• List out all possible samples of size n from the population of size N
• A sample is the BASIC UNIT for the population of all possible samples
• We determine the probability of selecting the sample
• Unequal probability sample (now)
• Simple random sample
• NOTE: sampling distribution depends on the design selected
Simple example from earlier lecture (not SRS!)
• All possible samples

S1 = {1, 2} S3 = {1, 4} S5 = {2, 4}

S2 = {1, 3} S4 = {2, 3} S6 = {3, 4}

• Design is determined by assigning a selection probability to each possible sample

P(S1) = 1/3 P(S3) = 1/2 P(S5) = 0

P(S2) = 1/6 P(S4) = 0 P(S6) = 0

Conceptual framework for a sampling distribution - 2
• List
• Using the n data values associated with each sample, calculate the value of the estimator for each sample
• The estimator is the random variable of our distribution
• Example: sample mean is calculated for each of the possible samples
• NOTE: the sampling distribution depends on the estimator selected
Simple example from earlier lecture - 2
• Population values for Y
• i: 1 2 3 4
• yi: 3 5 1 3
• All possible samples of size n = 2

S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4},S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}

• Values of $\bar{y}$ corresponding to each sample: 4, 2, 3, 3, 4, 2 (for S1 through S6)
Conceptual framework for a sampling distribution - 3
• List
• Using
• Sampling distribution is described by pairs of values for estimator from the sample and relative frequency of obtaining that value
• We are using the steps we used before for creating a discrete distribution
Representing the sampling distribution
• Probability distribution: pairs of { c , P($\bar{y}$ = c) }
• $\bar{y}$ is a random variable, c is a value of $\bar{y}$
Simple example from previous lecture - 3
• Number of possible samples
• Probability of selecting sample
• Probability distribution: unique values of $\bar{y}$ and relative frequency

c 2.0 3.0 4.0
P($\bar{y}$ = c) 1/6 1/2 1/3
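The steps for building this sampling distribution (list the samples, compute the sample mean for each, add up P(S) for each distinct value) can be sketched with the population values and the unequal-probability design given on the earlier slides. The variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

y = {1: 3, 2: 5, 3: 1, 4: 3}            # population values y_i from the slides

# Unequal-probability design from the slides (not SRS).
design = {
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (1, 4): Fraction(1, 2),
    (2, 3): Fraction(0),
    (2, 4): Fraction(0),
    (3, 4): Fraction(0),
}

# P(ybar = c) = sum of P(S) over samples whose mean equals c.
dist = defaultdict(Fraction)
for s, p in design.items():
    ybar = Fraction(sum(y[i] for i in s), len(s))
    dist[ybar] += p

for c in sorted(dist):
    print(f"c = {float(c):.1f}  P(ybar = c) = {dist[c]}")
```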

Conceptual framework for a sampling distribution - 4
• List
• Using
• Sampling distribution
• Parameters summarize sampling distribution
• Mean of sampling distribution
• Variance, std dev (SE) of sampling distribution
• CV of sampling distribution
Ex: mean and variance of sampling distribution for $\bar{y}$ - 4
• Mean of sampling distribution
• Same concept of expected value used with population distribution
• Variance of sampling distribution
• Use more general formula for variance
• Later, we’ll use reductions that are easier to calculate
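For the unequal-probability design from the earlier slides, the general formulas for the mean and variance of a discrete distribution can be applied directly. This sketch recomputes them from the design and the population values (samples with P(S) = 0 are omitted because they contribute nothing); the names are illustrative.

```python
from fractions import Fraction

y = {1: 3, 2: 5, 3: 1, 4: 3}
design = {                       # samples with P(S) = 0 omitted
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (1, 4): Fraction(1, 2),
}

def ybar(s):
    """Sample mean for the sample with labels s."""
    return Fraction(sum(y[i] for i in s), len(s))

# General formulas: E = sum of value * P(S), V = sum of (value - E)^2 * P(S).
e_ybar = sum(p * ybar(s) for s, p in design.items())
v_ybar = sum(p * (ybar(s) - e_ybar) ** 2 for s, p in design.items())

ybar_U = Fraction(sum(y.values()), len(y))   # population mean
bias = e_ybar - ybar_U

print(e_ybar, v_ybar, ybar_U, bias)
```

The mean of the sampling distribution (19/6) differs from the population mean (3), so the sample mean is biased under this particular design.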
What if we took a SRS of size n from N units?
• List out all possible samples
• # possible samples: $\binom{N}{n}$
• Determine the probability of a sample
• Calculate estimator for each sample
• Examples:
• Create a discrete probability distribution
• Calculate summary parameters
Back to example with SRS
• Number of possible samples: 6
• Probability of selecting sample: P(S) = 1/6 for every sample
• Probability distribution: unique values of $\bar{y}$ and relative frequency

c 2.0 3.0 4.0
P($\bar{y}$ = c) 1/3 1/3 1/3

Example: mean of sampling distribution for $\bar{y}$ under SRS
• Mean of sampling distribution: $E\{\bar{y}\} = 2(1/3) + 3(1/3) + 4(1/3) = 3$
• Mean of population distribution: $\bar{y}_U = (3 + 5 + 1 + 3)/4 = 3$
Bias of an estimator
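• Estimation bias of $\hat{\theta}$: $Bias\{\hat{\theta}\} = E\{\hat{\theta}\} - \theta$
• Note that this is the mean of the estimator (from sampling distribution) minus the population parameter (from population distribution)
• If $Bias\{\hat{\theta}\} = 0$ then $\hat{\theta}$ is said to be an unbiased estimator of $\theta$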
Variance of sample mean under SRS
• Don’t have to use the general formula
• Variance of sample mean (derived using theory): $V\{\bar{y}\} = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
• Similar to infinite population formula
• Has an extra factor called the finite population correction factor (FPC), $1 - n/N$
Example
• Variance of sampling distribution for $\bar{y}$
• Other measures of dispersion for sampling distribution
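For the four-unit example population, the SRS sampling distribution of the sample mean is small enough to enumerate, which gives a direct check of unbiasedness and of the theoretical variance with the FPC. The formula used for comparison, (1 − n/N)·S²/n with S² computed using an N − 1 divisor, is the standard SRS result and is stated here as an assumption since the slide's own equation did not survive extraction.

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt

y = [3, 5, 1, 3]                 # population values from the slides
N, n = len(y), 2

samples = list(combinations(y, n))        # 6 equally likely SRS samples
p_s = Fraction(1, len(samples))
means = [Fraction(sum(s), n) for s in samples]

# General (enumeration) formulas for the sampling distribution.
e_ybar = sum(m * p_s for m in means)
v_ybar = sum((m - e_ybar) ** 2 * p_s for m in means)

# Shortcut formula with the finite population correction.
ybar_U = Fraction(sum(y), N)
S2 = sum((yi - ybar_U) ** 2 for yi in y) / (N - 1)
v_formula = (1 - Fraction(n, N)) * S2 / n

print(e_ybar, ybar_U, v_ybar, v_formula, sqrt(v_ybar))
```

The last printed value is the standard deviation of the sampling distribution, one of the "other measures of dispersion".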
Finite population correction factor (FPC)
• Sampling fraction is the proportion of the population sampled, or n/N; the FPC is 1 − n/N
• Larger sample ⇒
• Larger fraction of population
• Smaller FPC
• Smaller variance of sample mean
Impact of FPC on estimated variance of parameter estimate
• Often FPC is very close to 1
• Sample of 3000 households from total of 1,200,000 households
• In cases where sampling fraction is very small and FPC is very close to 1, FPC has no practical effect on the SE or estimated variance of the param estimate
• Sampling fraction n/N is not a good measure of whether your estimate will be precise
• For a given population variance, the sample size n is the most important part of the variance or SE formulas
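To see how little the FPC matters in the household example, the sketch below evaluates 1 − n/N and its square root (the factor applied to the SE). The function name is hypothetical.

```python
from math import sqrt

def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N

f = fpc(3000, 1_200_000)     # households example from the slide
se_factor = sqrt(f)          # multiplier applied to the SE

print(f, se_factor)
```

Dropping the FPC here would change the SE by only about a tenth of a percent.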
Estimating population variance under SRS
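• Do not know variance of population distribution, $S^2$
• Unbiased estimator for $V\{\bar{y}\}$: $\hat{V}\{\bar{y}\} = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$
• Estimator for $\sqrt{V\{\bar{y}\}}$: $SE\{\bar{y}\} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}$
• Note that $SE\{\bar{y}\}$ is the standard error of the sample mean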
Ag example
• Interested in average number of acres per county devoted to farms
• Sample 300 counties from list of 3078
• Collect data and get following summary statistics
• What are estimated mean and standard error?
Rounding rules
• Always keep all of the digits while you are doing calculations
• Round only when you get ready to report the result at the end of the calculation …
• Round the estimated SE to 2 significant digits
• 107,789 is rounded to 110,000
• 0.0325329 is rounded to 0.033
• Round estimate to precision of the SE
• If SE is 110,000, round estimate to nearest 10,000 (xx0,000)
• If SE is 0.033, round estimate to nearest 1/1000 (x.xxx)
• Estimated variances are usually reported to 5 significant digits
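The rounding rules can be expressed as a small helper. This is a sketch with hypothetical function names; the estimates passed in (1,234,567 and 0.51234) are made-up values, while the two SEs are the ones quoted on the slide.

```python
from math import floor, log10

def se_digits(se, sig=2):
    """Decimal places (possibly negative) that keep `sig` significant digits of se."""
    return sig - 1 - floor(log10(abs(se)))

def round_with_se(estimate, se, sig=2):
    """Round the SE to `sig` significant digits and the estimate to the same place."""
    d = se_digits(se, sig)
    return round(estimate, d), round(se, d)

print(round_with_se(1234567, 107789))       # hypothetical estimate, SE from the slide
print(round_with_se(0.51234, 0.0325329))    # hypothetical estimate, SE from the slide
```

Rounding is applied only at the reporting step; intermediate calculations should keep full precision.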
Sampling distribution for $\bar{y}$ using SRS of size n from N
• $\bar{y}$ is an unbiased estimator of $\bar{y}_U$
• Mean of sampling distribution is always equal to population mean under SRS
• Variance of $\bar{y}$ is $V\{\bar{y}\} = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
• Estimate the variance of $\bar{y}$ using sample variance s2
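Sampling distribution of $\hat{t}$ under SRS
• Mean of estimator $\hat{t}$ for population total t under SRS
• Expectation of a linear function of a random variable

If a, b are constants and Y is a random variable, then $E\{aY + b\} = aE\{Y\} + b$, so $E\{\hat{t}\} = E\{N\bar{y}\} = N\,E\{\bar{y}\} = N\bar{y}_U$

• Is $\hat{t}$ an unbiased estimator of t ?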
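Sampling distribution of $\hat{t}$ under SRS - 2
• Variance of estimator of total under SRS: $V\{\hat{t}\} = V\{N\bar{y}\} = N^2\,V\{\bar{y}\} = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
• Variance of a linear function of a random variable

If a, b are constants and Y is a random variable, then $V\{aY + b\} = a^2\,V\{Y\}$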

Sampling distribution of $\hat{t}$ under SRS - 3
• Estimator for variance of $\hat{t}$ under SRS: $\hat{V}\{\hat{t}\} = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}$
Ag example - 2
• Estimated total acres devoted to farms in the US in 1992?
• Estimated Variance of estimated total?
• Other measures of dispersion for sampling distribution?
• Estimated SE
Sampling distribution of $\hat{p}$ under SRS
• Mean of estimator $\hat{p}$ for population proportion p under SRS, $E\{\hat{p}\}$
• Is $\hat{p}$ unbiased for p ?
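Sampling distribution of $\hat{p}$ under SRS - 2
• Variance of sample proportion (derived using theory): $V\{\hat{p}\} = \frac{N-n}{N-1}\,\frac{p(1-p)}{n}$
• Very similar to infinite population formula
• Extra factor $\frac{N-n}{N-1}$ arises from finite pop and is NOT the same as the FPC
• Estimator does have the FPC in the formula: $\hat{V}\{\hat{p}\} = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}$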
Ag example - 3
• Suppose we are interested in the proportion of counties with fewer than 200,000 acres devoted to farms in 1992
• Data from our sample of 300 indicate that 153 counties have less than 200,000 acres devoted to farms
• Estimated population proportion?
• Estimated SE of estimated proportion?
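A worked sketch for this example. It assumes the standard SRS variance estimator for a proportion, (1 − n/N)·p̂(1 − p̂)/(n − 1); the slide's own formula did not survive extraction, so treat the estimator choice as an assumption.

```python
from math import sqrt

N, n = 3078, 300      # counties in the frame and in the sample
count = 153           # sampled counties with fewer than 200,000 acres in farms

p_hat = count / n
v_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)   # assumed SRS estimator
se = sqrt(v_hat)

print(round(p_hat, 2), round(se, 4))
```

Following the rounding rules, this would be reported as an estimated proportion of 0.51 with an SE of about 0.027.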
Quality of estimates (Fig 2.2, p. 29)
• Estimator under a given design is unbiased
• On average over a large number of samples, the mean of the estimates “hit” the target population parameter (centered on the bull’s eye)
• Estimator under a given design is precise
• Over a large number of samples, estimates will tend to be close to one another, indicating that the variance of the sampling distribution for the estimator is small
• Clump pattern, but may not be centered on bull’s eye (precise but biased)
• Estimator under a given design is accurate
• Estimator comes close to hitting target and is precise
• Assess this with the mean squared error (MSE)
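Mean Squared Error of an Estimator
• Mean squared error (MSE) of $\hat{\theta}$: $MSE\{\hat{\theta}\} = E\{(\hat{\theta} - \theta)^2\} = V\{\hat{\theta}\} + \left(Bias\{\hat{\theta}\}\right)^2$
• Combines measures of bias and precision to provide an index of the accuracy of an estimator under a given design
• Sometimes we are willing to accept a little bias to get a more precise estimator, provided the MSE is improved
• If $Bias\{\hat{\theta}\} = 0$, then $MSE\{\hat{\theta}\} = V\{\hat{\theta}\}$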
MSE of SRS estimators
• All of these estimators are unbiased under SRS (Bias = 0)
• So under SRS, $MSE\{\hat{\theta}\} = V\{\hat{\theta}\}$ for each of these estimators
Confidence intervals
• Estimate variance, SE, CV, MSE of estimator under a design to provide indication of quality of estimate
• Another approach
• Estimate a confidence interval to express precision of estimate
Book example 2.7, p. 35-6
• True parameter value: t = 40
• CI of interest:
• List 70 possible samples of size n = 4
• Each sample has a probability of selection P(S)
• For each sample, record value of a variable u that indicates whether CI from sample S includes t = 40
• Confidence coefficient: $\sum_{S} P(S)\,u(S)$
Ex – 2: Assume SRSWOR
• If 60 of the 70 SRSWOR samples resulted in CIs that included the true total, what is the confidence coefficient?
• What is alpha?
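A worked version of this calculation, assuming (as the slide title states) SRSWOR, so that each of the 70 samples has P(S) = 1/70 and u(S) = 1 for the 60 samples whose CI covers the true total:

```latex
\[
1-\alpha \;=\; \sum_{S} P(S)\,u(S) \;=\; \frac{60}{70} \;\approx\; 0.857,
\qquad
\alpha \;=\; 1-\frac{60}{70} \;=\; \frac{10}{70} \;\approx\; 0.143 .
\]
```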
What is a 95% confidence interval (CI) under SRS?
• Heuristic definition
• Take repeated samples of size n from population of size N
• Collect data on Y
• Calculate an estimate of a population parameter using data from n observations
• Calculate 95% CI for parameter estimate using data from n observations
• Expect 95% of the CIs to contain the true value of the parameter
Interpreting CIs in general
• More generally (for any design), a (1-α)100% CI has the interpretation
• There is a (1-α)100% chance of selecting a sample for which the CI will include the true population parameter
• Note
• The upper and lower limits of the CI are random variables, calculated from the sample data
• The true parameter value is either included or not included in a single CI
• Confidence coefficient of a CI has a relative frequency interpretation across samples
Confidence interval definition
• Standard estimator for a (1-α)100% confidence interval (CI): $\hat{\theta} \pm z_{\alpha/2}\,SE\{\hat{\theta}\}$
Standard normal distribution
• Z ~ N(0, 1)
• Z is the random variable
• Mean E{Z} = 0 and variance V{Z} = 1
• Two-sided (1-α)100% confidence interval
• Use critical value $z_{\alpha/2}$
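A minimal sketch of the normal-theory interval, estimate ± z × SE, using Python's standard library for the critical value. The function name is hypothetical, and the inputs (0.51 and 0.0275) are illustrative numbers rather than results taken from the slides.

```python
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Two-sided (1 - alpha)100% CI: estimate +/- z_{alpha/2} * SE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    return estimate - z * se, estimate + z * se

z_975 = NormalDist().inv_cdf(0.975)           # critical value for alpha = 0.05
lo, hi = normal_ci(0.51, 0.0275)              # illustrative estimate and SE

print(round(z_975, 2), round(lo, 3), round(hi, 3))
```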
Infinite vs. finite populations
• In other stat classes …
• Assume SRS with replacement from infinite pop
• Justify CI by applying the Central Limit Theorem (CLT)
• In sample surveys, we have a finite number of possible samples
• Can calculate exact confidence coefficient 1-α for a stated interval (see previous example)
• In practice, it is not possible to list all possible samples, so we have a special CLT that relies on a “superpopulation” framework
Superpopulation framework
• Asymptotic framework for SRSWOR in finite populations
• Population is part of a larger superpopulation
• There is a series of increasingly larger superpopulations
• Use superpopulation concept to derive a Central Limit Theorem for SRSWOR
• Bottom line
• We will use the standard CI estimator with a different theoretical justification
When is CLT justified?
• Confidence coefficient is approximate
• Quality of approximation depends on n and the distribution of the underlying random variable, Y
• “n is large enough for CLT” is less clear for finite populations
• n = 30 rule in other stat classes does NOT apply
• Rules of thumb
• If distribution of Y is close to normal, n = 50
• Need larger n if distribution of Y deviates from normal, e.g., skewed
• Y categorical: if p is proportion with characteristic of interest, np ≥ 5 and n(1-p) ≥ 5
Determining sample size – a general approach
• Specify tolerable error (level of precision, level of confidence)
• Identify appropriate equation relating tolerable error (e, α) to sample size (n)
• Estimate unknown parameters in equation
• Solve for n
• Can you afford sample size?
• What expectations can be altered?
Specify tolerable error
• Two parameters
• e : margin of error or half-width of CI
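• α : [1-α]100% is confidence level
• Absolute expression (half-width of CI): estimate within e of true pop parameter
• Relative expression: estimate within 100e% of true pop parameter
Equation linking e, α, and n
• Most common equation is half-width of CI
• Example: sample mean under SRSWOR: $e = z_{\alpha/2}\sqrt{\left(1 - \frac{n}{N}\right)\frac{S^2}{n}}$, which solves to $n = \frac{n_0}{1 + n_0/N}$ with $n_0 = \left(\frac{z_{\alpha/2}\,S}{e}\right)^2$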

Note for

• For p , use $S^2 \approx p(1-p)$
• For α = 0.05, use $z_{\alpha/2} = 1.96$
• $n_0$ is sample size under SRSWR (ignoring FPC)
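A sketch of the sample-size calculation, assuming the usual half-width equation e = z·sqrt((1 − n/N)·S²/n), which solves to n = n0 / (1 + n0/N) with n0 = (z·S/e)². The function name and the example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist

def srs_sample_size(e, S2, N=None, alpha=0.05):
    """Sample size for margin of error e; pass N to apply the FPC adjustment."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z ** 2 * S2 / e ** 2                 # SRSWR size, ignoring the FPC
    n = n0 if N is None else n0 / (1 + n0 / N)
    return ceil(n)                            # round up to a whole unit

# Proportion with p = 0.5, so S^2 is taken as p(1 - p) = 0.25, and e = 0.05.
print(srs_sample_size(0.05, 0.25))            # large population
print(srs_sample_size(0.05, 0.25, N=3078))    # frame of 3078 units
```

The adjustment only matters when n0 is a noticeable fraction of N, as in the second call.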
Estimate unknowns: population variance of y, S2
• Use estimator for variance, s2
• Pilot study
• Previous study
• Use CV from previous study
• Guess variance under normality
• estimate of S = range for 95% of values / 4
• estimate of S = range for 99% of values / 6
Estimating unknowns: population proportion, p
• Use estimates from pilot or previous study
• If know nothing of true proportion
• Use p = 0.5
• Max possible variance for estimated proportion under SRS, so this is conservative
• Commonly used
Practicalities for determining n
• Sampling fraction rarely important
• Most populations are large enough that sampling fraction n/N is small for practical values of n
• Subpopulations should influence sample size
• 95% CI for a proportion (α = 0.05, p = 0.5)
• Implies $e \approx 1/\sqrt{n}$, i.e. $n \approx 1/e^2$, when the FPC is ignored
• n = 400 for e ≈ 0.05 (whole sample)
• n = 100 for e ≈ 0.10 (subpopulation)
• n = 50 for e ≈ 0.14 (subpopulation)
• n = 500 for e ≈ 0.045 (little gain over 400)
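The sample sizes listed above can be checked by computing the half-width of a 95% CI for a proportion at p = 0.5, ignoring the FPC. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)     # critical value for alpha = 0.05

def half_width(n, p=0.5):
    """Margin of error e = z * sqrt(p(1 - p) / n), with no FPC."""
    return Z * sqrt(p * (1 - p) / n)

for n in (50, 100, 400, 500):
    print(n, round(half_width(n), 3))
```

Going from 400 to 500 respondents shrinks the margin of error by only about half a percentage point, while a subpopulation with 50 respondents has a margin nearly three times as wide as the full sample of 400.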
SRS: pros and cons
• Cons
• SRS is rarely the “best” design
• May not have list of all OUs ⇒ need different design
• May have additional info on pop to create a more efficient design (improve precision)
• Pros / uses
• Standard stat procedures can be used with little or no bias
• Mainly interested in regression rather than estimating pop params (ignore sample design – but could still get a better sample)