## PowerPoint Slideshow about 'Ch 2: probability sampling, SRS' - wendi


Ch 2: probability sampling, SRS

- Overview of probability sampling
- Establish basic notation and concepts
- Population distribution of Y : object of inference
- Sampling distribution of an estimator under a design: assessing the quality of the estimate used to make inference
- Apply these to SRS
- Selecting a SRS sample
- Estimating population parameters (means, totals, proportions)
- Estimating standard errors and confidence intervals
- Determining the sample size

Assume ideal setting

- Sampled population = target population
- Sampling frame is complete and does not contain any OUs beyond the target pop
- No unit nonresponse
- Measurement process is perfect
- All measurements are accurate
- No missing data (no item nonresponse)
- That is, nonsampling error is absent

Survey error model

- Total Survey Error = Sampling Error + Nonsampling Error
- Sampling error: due to the sampling process (i.e., we observe only part of the population)
- Nonsampling error: measurement error, nonresponse error, frame error
- Assessed via bias and variance

Probability sample

- DEFN: A sample in which each unit in the population has a known, nonzero probability of being included in the sample
- Known probability: we can quantify the probability that an SU is included in the sample
- Assign during design, use in estimation
- Nonzero probability: every SU has a positive chance of being included in the sample
- Proper survey estimates represent entire target population (under our ideal setting)

Probability sampling relies on random selection methods

- Random sampling is NOT a haphazard method of selection
- Involves very specific rules that include an element of chance as to which unit is selected
- Only the outcome of the probability sampling process (i.e., the resulting sample) is random
- More complicated than non-random samples, but provides important advantages
- Avoid bias that can be induced by selector
- Required to calculate valid statistical estimates (e.g., mean) and measures of the quality of the estimates (e.g., standard error of mean)

Representative sample

- Goal is to have a “representative sample”
- Probability sampling is used to achieve this by giving each OU in target population an explicit chance to be included in the sample
- Sample reflects variability in the population
- Applies to the sample, but does not apply to the OU/SU (don’t expect each observation to be a “typical” pop unit)
- Can create legitimate sample designs that deliberately skew the sample to include adequate numbers of important parts of the variation
- Common example: oversampling minorities, women
- MUST use estimation procedures that take into account the sample design to make inferences about the target population (e.g., sample weights)

Basic sampling designs

- Simple selection methods
- Simple random sampling (Ch 2 & 3)
- Select the sample using, e.g., a random number table
- Systematic sampling (2.6, 5.6)
- Random start, take every k-th SU
- Probability proportional to size (6.2.3)
- “Larger” SU’s have a higher chance of being included in sample
- Selection methods with explicit structure
- Stratified sampling (Ch 4)
- Divide population into groups (strata)
- Take sample in every stratum
- Cluster sampling (Ch 5 & 6)
- OUs aggregated into larger units called clusters
- SU is a cluster

Examples

- Select a sample of n faculty from the 1500 UNL faculty on campus
- Goal: estimate total (or average) number of hours faculty spend per week teaching courses
- Simple random sampling (SRS)
- Number faculty from 1 to 1500
- Select a set of n random numbers (integers) between 1 and 1500
- Faculty with ids that match the random numbers are included in the sample

Examples - 2

- Systematic sampling (SYS)
- Choose a random start between 1 and k, where the sampling interval is k = 1500/n
- Select the faculty member with that id, and then take every k-th faculty member in the list
- SRS / SYS
- Each faculty member has an equal chance of being included in sample
- Under SRS, each possible sample of n faculty is equally likely; under SYS, only the k systematic samples are possible, each with probability 1/k
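
The two selection rules for the faculty example can be sketched in Python. This is a minimal illustration, not the course's own procedure: the sample size n = 50, the seed, and the function names `srs_ids` and `sys_ids` are assumptions, and the systematic version assumes N is divisible by n.

```python
import random

def srs_ids(N, n, rng):
    """SRS without replacement: n distinct ids drawn from 1..N."""
    return sorted(rng.sample(range(1, N + 1), n))

def sys_ids(N, n, rng):
    """Systematic sample: random start in 1..k, then every k-th id.

    Assumes N is divisible by n so that k = N / n is an integer.
    """
    k = N // n
    start = rng.randint(1, k)          # inclusive on both ends
    return list(range(start, N + 1, k))

rng = random.Random(2024)              # arbitrary seed, for reproducibility only
N, n = 1500, 50                        # n = 50 is a hypothetical sample size
srs = srs_ids(N, n, rng)
sys_sample = sys_ids(N, n, rng)
```

With these settings the systematic sample can only be one of k = 30 possible sets of ids, whereas the SRS can be any subset of 50 ids.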

Examples - 3

- Probability proportional to size (PPS)
- With pps design, we assign a selection probability to each faculty member that is proportional to the number of courses taught by a faculty member that semester
- “Size” measure = # of courses taught by faculty member
- Faculty who teach more courses are more likely to be included in the sample, but those who teach fewer courses still have a positive chance of being included
- Motivation: faculty that spend more hours on courses are more critical to getting good estimate of total hours spent
- Data from faculty with higher inclusion probabilities will be “down weighted” relative to those with lower probabilities during the estimation process
- Typically accomplished using weights for each observation in the dataset

Examples - 4

- Stratified random sampling (STS)
- Organize list of faculty by college
- Stratum = college
- Allocate n (divide sample size) among colleges so that we select $n_h$ faculty in the h-th college
- Sum of $n_h$ over strata equals n
- Use SRS, e.g., to select sample in each of the college strata
- Could use SYS or PPS rather than SRS
- Could have different selection methods in each stratum

Examples - 5

- Cluster sampling (CS)
- Aggregate faculty into departments
- OU = faculty member, SU = dept
- Select a sample of departments, e.g., using SRS
- Very common to use PPS for selecting clusters
- “Size” measure = number of OUs in the cluster SU
- Many variants for cluster sampling
- After selecting clusters, may want to select a sample of OUs in the cluster rather than taking data on every OU
- E.g., select 15 depts in the first stage of sampling, then select 10 faculty in each dept in a second stage of sampling
- This is called 2-stage sampling

Examples - 6

- Complex sample designs (Ch 7)
- Combine basic selection methods (SRS, SYS, PPS) with different methods of organizing the population for sampling (strata, clusters)
- Typically have more than one stage of sampling (multi-stage design)
- Often can not create a frame of all OUs in the population
- Need to select larger units first and then construct a frame
- Stratification and systematic sampling are often used to encourage spread across the population
- This improves chances of obtaining a representative sample
- Costs are often reduced by selecting clusters of OUs, although cluster sampling may lead to less precision in estimates

Notation for target population

- The total number of OUs in the population (also called the universe) is denoted by N
- Note UPPER CASE
- Ideally for SRS, sampling frame is list of N OUs in the pop
- EX: there are N = 4 households in our class
- Index set (labels) for all OUs in the population (or universe) is called U
- U = {1, 2, …, N}
- A different index set could be our names, or our SSNs
- Each person has a value for the characteristic of interest or random variable Y , the number of people in the household
- The value of Y for household i is denoted by yi
- Values in the population are y1, y2, …, yN

Notation for sample

- Sample size is denoted by n
- Note lower case
- n is always less than or equal to N (n = N is a census)
- Index set (labels) for OUs in the sample is denoted by S
- To select a sample, we are selecting n indices (labels) from the universe U , consisting of N indices for the population
- U is our sampling frame in this simple setting
- Labels in S may not be sequential because we are selecting a subset of U

Class example

- Suppose n = 2 households are selected from a population of N = 4 households in the class
- U = {1, 2, 3, 4}
- Randomly select sample using SRS and get 2 and 3
- S = {2, 3}
- The data collected on OUs in the sample are values for Y = number of people in the household
- Data: $y_2$, $y_3$

Summary of probability sampling framework

- Assumptions (for now)
- Observation unit = sampling unit
- Target population = sampling universe = sampling frame
- N = finite number of OUs in the population
- U = {1, 2, …, N} is the index set for the OUs in the population
- Sample
- n = sample size (n is less than or equal to N )
- S = index set for n elements selected from population of N units (S is a subset of U)

Conceptual basis for probability sampling

- Conceptual framework for selecting samples
- Enumerate all possible samples of size n from the population of size N
- Each sample has a known probability of being selected
- P(S) = probability of selecting sample S
- Use this probability scheme to randomly choose the sample
- Using the probability scheme for the samples, can determine the inclusion probability for each SU
- $\pi_i$ = probability that a sample is selected that includes unit i

Simple example

- Population of 4 students in study group, take a random sample of 2 students
- Setting
- U = {1, 2, 3, 4}
- N = 4
- n = 2
- All possible samples of size n = 2 from N = 4 elements
- Note: n < N and S U

Simple example - 2

- All possible samples

S1 = {1, 2} S3 = {1, 4} S5 = {2, 4}

S2 = {1, 3} S4 = {2, 3} S6 = {3, 4}

- Design is determined by assigning a selection probability to each possible sample

P(S1) = 1/3 P(S3) = 1/2 P(S5) = 0

P(S2) = 1/6 P(S4) = 0 P(S6) = 0

Simple example - 3

- Inclusion probability definition?
- What is the probability that student 1 is included in the sample?
- $\pi_1$ =
- Inclusion probability for student 2, 3, 4?
- $\pi_2$ =
- $\pi_3$ =
- $\pi_4$ =
- Is this a probability sample?
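
The inclusion probabilities asked for above can be checked by summing P(S) over the samples that contain each unit. The sketch below uses the six sample probabilities listed on the previous slide; the function name `inclusion_probabilities` is an illustrative choice.

```python
from fractions import Fraction as F

# Design from the simple example: selection probability for each possible sample
design = {
    (1, 2): F(1, 3), (1, 3): F(1, 6), (1, 4): F(1, 2),
    (2, 3): F(0),    (2, 4): F(0),    (3, 4): F(0),
}

def inclusion_probabilities(design, units):
    """pi_i = sum of P(S) over all samples S that contain unit i."""
    return {i: sum(p for s, p in design.items() if i in s) for i in units}

pi = inclusion_probabilities(design, [1, 2, 3, 4])
is_probability_sample = all(p > 0 for p in pi.values())
```

Every unit ends up with a known, nonzero inclusion probability, so the design qualifies as a probability sample even though half of the possible samples can never be drawn.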

Population distribution

- Response variables represent values associated with a characteristic of interest for i-th OU
- Y is the random variable for the characteristic of interest (CAP Y)
- $y_i$ = value of characteristic for OU i (small y)
- The population distribution is the distribution of Y for the target population
- Y is a discrete random variable with a finite number of possible values (<= N values)
- Use discrete probability distribution to represent the distribution of Y

Population distribution - 2

- A discrete probability distribution is denoted by a series of pairs corresponding to
- Value of the random variable Y, denoted by y
- Relative frequency of the value y for the random variable Y in the population, denoted by P(Y = y)
- Pair is { y , P(Y = y) }
- Constructing a probability distribution
- List all unique values y of random variable Y
- Record the relative frequency of y in the population, P(Y = y)

Class example - 2

- Back to # of people in household for each class member
- What are the unique values in the pop?
- What is the frequency of each value?
- What is the relative frequency of each value?
- Construct a histogram depicting the variation in values

Summarizing the population distribution

- Use population parameters to summarize population distribution
- Mean or expected value of y (parameter: $\bar{y}_U$)
- Proportion of population having a particular characteristic = mean of a binary (0, 1) variable (parameter: p)
- For finite populations, population total of y is often of interest (parameter: t)
- Variance of y (parameter: $S^2$)

Mean of Y for population

- Expected value, or population mean, of Y: $\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
- Mean is in y-units per OU-unit
- Measure of central tendency (middle of distn)
- Related to population total (t) and proportion (p)
- Examples
- Average number of miles driven per week by adults in the US
- Average number of phone lines per household

Class example - 3

- What is the mean household size for people in this classroom?

Total of Y in population

- Population total of Y: $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$
- Total number of y-units in the population
- Examples
- Number of households in market area with DSL
- yi =1 if household i has DSL, yi = 0 if not
- N = number of households in market area
- Number of deer in Iowa
- yi =number of deer observed in area i
- N = number of observation areas in Iowa

Class example - 4

- What is the total number of people living in households of people in the classroom?

Proportion

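- Proportion (p) of population having a particular characteristic: $p = \frac{1}{N}\sum_{i=1}^{N} y_i$, where $y_i$ = 1 if OU i has the characteristic and 0 otherwise
- Mean of binary variable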

Class example - 5

- What proportion of people in the classroom have a cell phone?

Population variance of Y

- Population variance of Y: $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2$
- Measure of spread or variability in population’s response values
- Analogous to $\sigma^2$ in other stat classes
- Not the standard error of an estimate
- Note this is capital $S^2$

Coefficient of variation for Y

- Variation relative to mean (unitless): $CV = S / \bar{y}_U$

Class example - 6

- What is the population variance for number of people in households of people in the classroom?
- What is the CV?

Summary of population distribution of Y

- Basic pop unit: OU (i)
- Number of units or size of pop: N
- Random variable: Y
- Parameters: characterize the target population
- Mean $\bar{y}_U$
- Total t
- Proportion (mean) p
- Variance $S^2$
- Coefficient of variation CV = $S / \bar{y}_U$
- STATIC: it is the object of inference and never changes with design or estimator
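
As a small numerical illustration of these parameters, the sketch below uses the four population values (3, 5, 1, 3) that appear in the simple example later in this transcript. Treating them as the whole population is an assumption made only for this illustration.

```python
from statistics import mean, variance

y = [3, 5, 1, 3]            # population values y_1..y_N (assumed population)
N = len(y)

y_bar_U = mean(y)           # population mean
t = sum(y)                  # population total, equal to N * y_bar_U
S2 = variance(y)            # statistics.variance divides by N - 1, matching S^2
CV = S2 ** 0.5 / y_bar_U    # coefficient of variation, unitless
```

None of these quantities depends on how a sample is later drawn, which is the sense in which the population distribution is static.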

What’s next

- Population distribution of Y is object of inference
- Use SRS to select a sample and estimate the parameters of the population distribution
- How to select a sample
- Estimators for population parameters of Y under SRS
- Sample mean estimates population mean
- N x sample mean estimates population total
- Sample variance estimates population variance
- Assessing the quality of an estimator of a population parameter under SRS
- Sampling distribution
- Bias, standard error, confidence intervals for the estimator

Simple random sample (SRS)

- DEFN: A SRS is a sample in which every possible subset of n SUs has an equal chance of being selected as the sample
- every sampling unit has equal chance of being included in the sample
- Example of an “equal probability” sample
- Does not imply that a sample in which each SU has the same inclusion probability is a SRS
- Other non-SRS designs can generate equal probability samples

Simple random sampling (SRS)

- Two types
- SRSWR (SRS with replacement)
- Return SU after each step in the selection process
- SRSWOR (SRS without replacement)
- Do not return SU after it has been selected
- Selection probability
- Probability that a unit is selected in a single draw
- Constant throughout SRSWR process
- Changes with each draw in the SRSWOR process
- NOT an inclusion probability, which considers the probability of drawing a sample that includes unit i

SRSWR (SRS with replacement)

- Selection procedure
- Select one OU with probability 1/N from N OUs
- This is the selection probability for each draw
- Returning selected OU to universe
- Repeat n times
- Procedure is like drawing n independent samples of size 1
- Can draw a sampling unit twice – duplicate units
- Unappealing for finite populations – no additional info in having a duplicate unit
- Useful in theoretical development for large populations

Focus: SRSWOR (SRS without replacement)

- Selection procedure
- Select one OU from universe of size N with probability 1/N
- DON’T return selected unit to universe
- Select 2nd OU from remaining units in universe with probability 1/(N - 1)
- DON’T return selected unit to universe
- Repeat until n sampling units have been selected
- Selection probabilities change with each draw
- 1/N, then 1/(N -1), then 1/(N -2), …, 1/(N – n +1)

SRSWOR (SRS without replacement)

- Probability of selecting a sampling unit in a single draw depends on number of SUs already selected (conditional probability)
- On the c-th step of the process, c-1 s.u.s have already been selected for a sample of size n
- Probability of selecting any of the remaining N – c + 1 s.u.s in the next draw is $1/(N - c + 1)$
- Inclusion probability for SU i (unconditional probability): $\pi_i = n/N$
- (see p. 44 in text)

SRSWOR (SRS without replacement)

- Number of possible SRSWOR samples of size n from universe of size N: $\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$
- Probability of selecting a sample S: $P(S) = 1 / \binom{N}{n} = \frac{n!\,(N-n)!}{N!}$

(Probability is the same for all samples)
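
For a population small enough to enumerate, the count of SRSWOR samples, the common sample probability, and the inclusion probability n/N can be verified directly. The values N = 4 and n = 2 below are taken from the earlier simple example; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N, n = 4, 2
samples = list(combinations(range(1, N + 1), n))   # all possible SRSWOR samples

p_sample = Fraction(1, comb(N, n))                 # same probability for every sample
pi_1 = sum(p_sample for s in samples if 1 in s)    # inclusion probability of unit 1
```

Here `pi_1` agrees with n/N, as the formula for the inclusion probability under SRSWOR says it should.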

Selecting a SRS using SRSWOR

- Create a sampling frame
- List of sampling units in the universe or population
- Assigns an index to each sampling unit
- Determine a selection procedure that performs SRSWOR
- Procedure must generate n unique sampling units such that each SU has an equal chance of being included in the sample
- Random number generator or table is common basis
- Need rules to identify when the selected unit is included in the sample or tossed
- Select random numbers and determine sampled units

Using random numbers to select a SRSWOR sample

- Determine a rule to assign random numbers to the sampling universe index set U
- Rule must give each unit an equal chance of being included in the sample
- Select the set of random numbers, e.g., using computer or printed random number table
- Apply the rule to each random number to determine the sampled OU
- Check to see if this OU has already been selected
- If already selected, ignore it
- Keep going until you have n SUs in the sample

Census of Agriculture example

Select 300 counties from 3078 counties in the US

- N = 3078
- n = 300
- Sampling frame = ?
- Generate random numbers between 0 and 1 on the computer
- Need n or more random numbers depending on rule
- Multiply each random number by N = 3078 and round up to the nearest integer
- Random number = .61663
- Multiply random # by N = 3078 x .61663 = 1897.98714
- Round up to 1898
- Take 1898th county in the frame
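
The round-up rule and the "ignore duplicates" step can be sketched as follows. This is one possible implementation under stated assumptions: the function names `unit_from_random` and `srswor_by_rejection` and the seed are hypothetical, and random numbers are taken in (0, 1] so that the rounded-up index is never 0.

```python
import math
import random

def unit_from_random(u, N):
    """Map a random number u in (0, 1] to a frame index in 1..N by rounding u * N up."""
    return math.ceil(u * N)

def srswor_by_rejection(N, n, rng):
    """Keep drawing indices, ignoring repeats, until n distinct units are selected."""
    chosen, seen = [], set()
    while len(chosen) < n:
        u = 1.0 - rng.random()          # rng.random() is in [0, 1), so u is in (0, 1]
        i = unit_from_random(u, N)
        if i not in seen:               # already selected: ignore and draw again
            seen.add(i)
            chosen.append(i)
    return chosen

county = unit_from_random(0.61663, 3078)              # worked example from the slide
sample = srswor_by_rejection(3078, 300, random.Random(1))
```

Because repeats are discarded, more than n random numbers are usually needed, which matches the remark that the count depends on the rule.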

Estimating population mean under SRS

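- Target population mean: $\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
- Estimator of $\bar{y}_U$ for SRS sample of size n is the sample mean: $\bar{y} = \frac{1}{n}\sum_{i \in S} y_i$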
- Note
- “Estimator” refers to the formula
- “Estimate” refers to the value obtained from using the formula with data

Class example - 7

- Estimate the average household size for our classroom

Estimating population total

- Target population total: $t = \sum_{i=1}^{N} y_i = N\bar{y}_U$
- Estimator of t for SRS sample of size n: $\hat{t} = N\bar{y} = \frac{N}{n}\sum_{i \in S} y_i$

Class example - 8

- Estimate the total number of people living in the households of people in this classroom

Estimating population proportion

- Target population proportion: $p = \frac{1}{N}\sum_{i=1}^{N} y_i$
- Y takes on values 0 or 1, where 1 means the unit has the characteristic of interest
- Estimator of p for SRS sample of size n: $\hat{p} = \bar{y} = \frac{1}{n}\sum_{i \in S} y_i$

Class example - 9

- Estimate the proportion of people with cell phones in this class room

Estimating population variance

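- Target population variance: $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2$
- Estimator of $S^2$ for SRS sample of size n is the sample variance: $s^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \bar{y})^2$

(note lower case s)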
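
The SRS estimators of the mean, total, and variance can be collected in one small helper. The sample below is hypothetical: it assumes units 2 and 3 were drawn from the four-unit population with values (3, 5, 1, 3), so the observed data are 5 and 1. The function name `srs_estimates` is illustrative.

```python
from statistics import mean, variance

def srs_estimates(sample, N):
    """Point estimates under SRS from the sampled y-values and population size N."""
    y_bar = mean(sample)                 # estimates the population mean
    return {
        "mean": y_bar,
        "total": N * y_bar,              # t-hat = N * y-bar
        "s2": variance(sample),          # sample variance, divisor n - 1
    }

est = srs_estimates([5, 1], N=4)         # hypothetical data for S = {2, 3}
```

For a 0/1 variable the same `mean` call gives the estimated proportion, since the proportion is the mean of a binary variable.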

Class example - 10

- Estimate the variance of number of people in households of people in this class room

Estimating population standard deviation and CV

- Standard deviation of Y, S ?
- Estimator of standard deviation of Y?
- CV of population distribution?
- Estimator of CV?

What would happen if we took another sample?

- S =
- Data =
- Estimates
- Mean
- Total
- Proportion
- Standard deviation
- CV

Sampling distribution

- Need to assess the quality of our estimates
- Is $\bar{y}$ a good estimator of $\bar{y}_U$ ?
- Is $\hat{p}$ a good estimator of p ?
- Is $s^2$ a good estimator of $S^2$ ?
- Use the sampling distribution to assess the quality of the estimator
- Distribution of estimator over all possible samples
- EX: distribution of $\bar{y}$ over all possible SRS samples of size n from a population of size N

Sampling distribution

- Simulation

Measures of quality

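- Denote
- Population parameter as $\theta$ [think pop mean $\bar{y}_U$]
- Estimator of $\theta$ as $\hat{\theta}$ [think sample mean $\bar{y}$]
- Mean of the sampling distribution is the expected value of the estimator: $E[\hat{\theta}] = \sum_{S} P(S)\,\hat{\theta}_S$
- An estimator is unbiased if $E[\hat{\theta}] = \theta$
- Variance of the sampling distribution: $V[\hat{\theta}] = E[(\hat{\theta} - E[\hat{\theta}])^2]$
- Precision: want variance of estimator to be small
- Coefficient of variation: $CV[\hat{\theta}] = \sqrt{V[\hat{\theta}]} / E[\hat{\theta}]$
- Relative precision: want CV to be small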

Sampling distribution of estimator

- Basic pop unit: sample selected using a specific design, S
- Number of units or size of pop: number of possible samples
- Need probability of selecting sample !
- Random variable: estimator of parameter, $\hat{\theta}$
- Parameters: characterize the quality of the estimator
- Mean (assesses bias of the estimator), $E[\hat{\theta}]$
- Variance, SE, CV (assesses precision of estimator)
- DEPENDS on population parameter, estimator of population parameter, sample design

Population distribution vs. sampling distribution

- Population distribution
- Basic unit: OU (i)
- Total number of units: N
- Random variable: characteristic of interest, Y
- Parameters: characterize the target population
- Mean $\bar{y}_U$, proportion p (central tendency)
- Total t
- Variance $S^2$, std dev S, CV (spread of distn)
- STATIC: once you identify Y, the population distribution is the object of inference and never changes with design or estimator
- Sampling distribution
- Basic unit: sample selected using a specific design, S
- Total number of units: number of possible samples
- Random variable: estimator of parameter, $\hat{\theta}$
- Parameters: characterize the quality of the estimator
- Mean $E[\hat{\theta}]$ (used to assess bias of the estimator)
- Variance $V[\hat{\theta}]$, SE, CV (precision of estimator)
- DEPENDS on population parameter, estimator of population parameter, sample design

Conceptual framework for a sampling distribution - 1

- List out all possible samples of size n from the population of size N
- A sample is the BASIC UNIT for the population of all possible samples
- We determine the probability of selecting the sample
- Unequal probability sample (now)
- Simple random sample
- NOTE: sampling distribution depends on the design selected

Simple example from earlier lecture (not SRS!)

- All possible samples

S1 = {1, 2} S3 = {1, 4} S5 = {2, 4}

S2 = {1, 3} S4 = {2, 3} S6 = {3, 4}

- Design is determined by assigning a selection probability to each possible sample

P(S1) = 1/3 P(S3) = 1/2 P(S5) = 0

P(S2) = 1/6 P(S4) = 0 P(S6) = 0

Conceptual framework for a sampling distribution - 2

- List
- Using the n data values associated with each sample, calculate the value of the estimator for each sample
- The estimator is the random variable of our distribution
- Example: sample mean is calculated for each of the possible samples
- NOTE: the sampling distribution depends on the estimator selected

Simple example from earlier lecture - 2

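- Population values for Y
- i: 1, 2, 3, 4
- $y_i$: 3, 5, 1, 3
- All possible samples of size n = 2

S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4}, S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}

- Values of $\bar{y}$ corresponding to each sample

$\bar{y}_{S_1} = 4$, $\bar{y}_{S_2} = 2$, $\bar{y}_{S_3} = 3$, $\bar{y}_{S_4} = 3$, $\bar{y}_{S_5} = 4$, $\bar{y}_{S_6} = 2$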

Conceptual framework for a sampling distribution - 3

- List
- Using
- Sampling distribution is described by pairs of values for estimator from the sample and relative frequency of obtaining that value
- We are using the steps we used before for creating a discrete distribution

Representing the sampling distribution

- Probability distribution: pairs of { c , P($\hat{\theta}$ = c) }
- $\hat{\theta}$ is a random variable, c is a value of $\hat{\theta}$

Simple example from previous lecture - 3

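- Number of possible samples: 6
- Probability of selecting sample: P(S) as assigned by the design (not equal across samples)
- Probability distribution: unique values of $\bar{y}$ and relative frequency

c: 2.0, 3.0, 4.0

P($\bar{y}$ = c): 1/6, 1/2, 1/3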

Conceptual framework for a sampling distribution - 4

- List
- Using
- Sampling distribution
- Parameters summarize sampling distribution
- Mean of sampling distribution
- Variance, std dev (SE) of sampling distribution
- CV of sampling distribution

Ex: mean and variance of sampling distribution for $\bar{y}$ - 4

- Mean of sampling distribution: $E[\bar{y}] = \sum_{c} c\,P(\bar{y} = c)$
- Same concept of expected value used with population distribution
- Variance of sampling distribution: $V[\bar{y}] = \sum_{c} (c - E[\bar{y}])^2\,P(\bar{y} = c)$
- Use more general formula for variance
- Later, we’ll use reductions that are easier to calculate
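
The general formulas can be applied to the unequal-probability example by brute force: compute the sample mean for each possible sample, accumulate the probability of each distinct value, then take the mean and variance of that discrete distribution. The sketch uses the population values and sample probabilities given in the example; the function name `sampling_distribution` is illustrative.

```python
from fractions import Fraction as F

y = {1: 3, 2: 5, 3: 1, 4: 3}                      # population values y_i
design = {                                        # P(S) for each possible sample
    (1, 2): F(1, 3), (1, 3): F(1, 6), (1, 4): F(1, 2),
    (2, 3): F(0),    (2, 4): F(0),    (3, 4): F(0),
}

def sampling_distribution(design, y):
    """Return {c: P(y-bar = c)} by summing P(S) over samples with sample mean c."""
    dist = {}
    for s, p in design.items():
        c = F(sum(y[i] for i in s), len(s))
        dist[c] = dist.get(c, F(0)) + p
    return dist

dist = sampling_distribution(design, y)
e_ybar = sum(c * p for c, p in dist.items())                  # mean of sampling distn
v_ybar = sum((c - e_ybar) ** 2 * p for c, p in dist.items())  # variance of sampling distn
bias = e_ybar - F(sum(y.values()), len(y))                    # E[y-bar] minus pop mean
```

Under this design the expected value of the sample mean is 19/6 while the population mean is 3, so the sample mean is biased by 1/6.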

What if we took a SRS of size n from N units?

- List out all possible samples
- # possible samples: $\binom{N}{n}$
- Determine the probability of a sample: $P(S) = 1/\binom{N}{n}$
- Calculate estimator for each sample
- Examples: $\bar{y}$, $\hat{t}$, $\hat{p}$
- Create a discrete probability distribution
- Calculate summary parameters

Back to example with SRS

- Number of possible samples: $\binom{4}{2} = 6$
- Probability of selecting sample: P(S) = 1/6 for every sample
- Probability distribution: unique values of $\bar{y}$ and relative frequency

c: 2.0, 3.0, 4.0

P($\bar{y}$ = c): 1/3, 1/3, 1/3

Example: mean of sampling distribution for $\bar{y}$ under SRS

- Mean of sampling distribution: $E[\bar{y}] = 2(1/3) + 3(1/3) + 4(1/3) = 3$
- Mean of population distribution: $\bar{y}_U = (3 + 5 + 1 + 3)/4 = 3$

Bias of an estimator

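- Estimation bias of $\hat{\theta}$: $\text{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta$
- Note that this is the mean of the estimator (from sampling distribution) minus the population parameter (from population distribution)
- If $E[\hat{\theta}] = \theta$, then $\hat{\theta}$ is said to be an unbiased estimator of $\theta$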

Variance of sample mean under SRS

- Don’t have to use the general formula
- Variance of sample mean (derived stat using theory): $V[\bar{y}] = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
- Similar to infinite population formula
- Has an extra factor called the finite population correction factor (FPC)

Example

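- Variance of sampling distribution for $\bar{y}$: with N = 4, n = 2 and $S^2 = 8/3$, $V[\bar{y}] = (1 - 2/4)(8/3)/2 = 2/3$
- Other measures of dispersion for sampling distribution: $SE[\bar{y}] = \sqrt{V[\bar{y}]}$ and $CV[\bar{y}] = SE[\bar{y}] / E[\bar{y}]$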
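
One way to see that the SRS variance formula agrees with the general definition is to enumerate every SRS sample of size 2 from the four-unit population (3, 5, 1, 3) and compare. This is a sketch for a toy population only; enumeration is not feasible for realistic N.

```python
from fractions import Fraction as F
from itertools import combinations

y = [3, 5, 1, 3]
N, n = len(y), 2

samples = list(combinations(y, n))                 # 6 equally likely SRS samples
p = F(1, len(samples))
means = [F(sum(s), n) for s in samples]

e_ybar = sum(m * p for m in means)                 # mean of the sampling distribution
v_enum = sum((m - e_ybar) ** 2 * p for m in means) # variance by the general formula

y_bar_U = F(sum(y), N)
S2 = sum((yi - y_bar_U) ** 2 for yi in y) / (N - 1)
v_formula = (1 - F(n, N)) * S2 / n                 # (1 - n/N) * S^2 / n
```

Both routes give 2/3 here, and the mean of the sampling distribution equals the population mean of 3, consistent with the sample mean being unbiased under SRS.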

Finite population correction factor (FPC)

- Sampling fraction is the proportion of the population sampled, or n/N; the FPC is $1 - n/N$
- Larger sample
- Larger fraction of population
- Smaller FPC
- Smaller variance of sample mean

Impact of FPC on estimated variance of parameter estimate

- Often FPC is very close to 1
- Sample of 3000 households from total of 1,200,000 households
- In cases where sampling fraction is very small and FPC is very close to 1, FPC has no practical effect on the SE or estimated variance of the param estimate
- Sampling fraction n/N is not a good measure of whether your estimate will be precise
- For a given population variance, the sample size n is the most important part of the variance or SE formulas
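
The household example can be made concrete with a short calculation. The helper names `fpc` and `se_mean` are illustrative, and the population variance is set to 1 only because it cancels in the ratio.

```python
import math

def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N

def se_mean(S2, n, N=None):
    """SE of the sample mean; the FPC is applied only when N is given."""
    base = S2 / n
    return math.sqrt(base if N is None else fpc(n, N) * base)

f = fpc(3000, 1_200_000)                                        # households example
ratio = se_mean(1.0, 3000, 1_200_000) / se_mean(1.0, 3000)      # SE with FPC / SE without
```

The FPC is 0.9975 in this case, so including it shrinks the standard error by only about a tenth of a percent.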

Estimating population variance under SRS

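- Do not know variance of population distribution, $S^2$
- Unbiased estimator for $S^2$: the sample variance $s^2$
- Estimator for $V[\bar{y}]$: $\hat{V}[\bar{y}] = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$
- Note that $SE[\bar{y}] = \sqrt{\hat{V}[\bar{y}]}$ is the standard error of the sample mean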

Ag example

- Interested in average number of acres per county devoted to farms
- Sample 300 counties from list of 3078
- Collect data and get following summary statistics
- What are estimated mean and standard error?

Rounding rules

- Always keep all of the digits while you are doing calculations
- Round only when you get ready to report the result at the end of the calculation …
- Round the estimated SE to 2 significant digits
- 107,789 is rounded to 110,000
- 0.0325329 is rounded to 0.033
- Round estimate to precision of the SE
- If SE is 110,000, round estimate to nearest 10,000 (xx0,000)
- If SE is 0.033, round estimate to nearest 1/1000 (x.xxx)
- Estimated variances are usually reported to 5 significant digits

Sampling distribution for $\bar{y}$ using SRS of size n from N

- $\bar{y}$ is an unbiased estimator of $\bar{y}_U$
- Mean of sampling distribution is always equal to population mean under SRS
- Variance of $\bar{y}$ is $V[\bar{y}] = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
- Estimate the variance of $\bar{y}$ using sample variance $s^2$

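Sampling distribution of $\hat{t}$ under SRS

- Mean of $\hat{t} = N\bar{y}$ for population total t under SRS: $E[\hat{t}] = E[N\bar{y}]$
- Expectation of a linear function of a random variable

If a, b are constants and Y is a random variable, then $E[aY + b] = aE[Y] + b$, so $E[\hat{t}] = N E[\bar{y}] = N\bar{y}_U = t$

- Is $\hat{t}$ an unbiased estimator of t ?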

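Sampling distribution of $\hat{t}$ under SRS - 2

- Variance of estimator of total under SRS: $V[\hat{t}] = V[N\bar{y}] = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$
- Variance of a linear function of a random variable

If a, b are constants and Y is a random variable, then $V[aY + b] = a^2 V[Y]$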

Sampling distribution of $\hat{t}$ under SRS - 3

- Estimator for variance of $\hat{t}$ under SRS: $\hat{V}[\hat{t}] = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}$

Ag example - 2

- Estimated total acres devoted to farms in the US in 1992?
- Estimated Variance of estimated total?
- Other measures of dispersion for sampling distribution?
- Estimated SE

Sampling distribution of $\hat{p}$ under SRS

- Mean of estimator for population proportion p under SRS: $E[\hat{p}] = E[\bar{y}]$ for the binary variable Y
- Is $\hat{p}$ unbiased for p ?

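Sampling distribution of $\hat{p}$ under SRS - 2

- Variance of sample proportion (derived stat using theory): $V[\hat{p}] = \left(\frac{N - n}{N - 1}\right)\frac{p(1 - p)}{n}$
- Very similar to infinite population formula
- Extra factor arises from finite pop and is NOT the same as the FPC
- Estimator does have the FPC in the formula: $\hat{V}[\hat{p}] = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}$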

Ag example - 3

- Suppose we are interested in the proportion of counties with fewer than 200,000 acres devoted to farms in 1992
- Data from our sample of 300 indicate that 153 counties have less than 200,000 acres devoted to farms
- Estimated population proportion?
- Estimated SE of estimated proportion?
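
The two questions can be worked through numerically with the counts given above (153 of 300 sampled counties, N = 3078). The sketch assumes the usual SRS variance estimator for a proportion, with the FPC and divisor n - 1.

```python
import math

N, n, count = 3078, 300, 153

p_hat = count / n                                      # estimated proportion
v_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)    # estimated variance of p_hat
se = math.sqrt(v_hat)                                  # estimated standard error
```

This gives an estimated proportion of 0.51 with a standard error of roughly 0.027.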

Quality of estimates (Fig 2.2, p. 29)

- Estimator under a given design is unbiased
- On average over a large number of samples, the mean of the estimates “hit” the target population parameter (centered on the bull’s eye)
- Estimator under a given design is precise
- Over a large number of samples, estimates will tend to be close to one another, indicating that the variance of the sampling distribution for the estimator is small
- Clump pattern, but may not be centered on bull’s eye (precise but biased)
- Estimator under a given design is accurate
- Estimator comes close to hitting target and is precise
- Assess this with the mean squared error (MSE)

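Mean Squared Error of an Estimator

- Mean squared error (MSE) of $\hat{\theta}$: $MSE[\hat{\theta}] = E[(\hat{\theta} - \theta)^2] = V[\hat{\theta}] + (\text{Bias}[\hat{\theta}])^2$
- Combines measures of bias and precision to provide an index of the accuracy of an estimator under a given design
- Sometimes we are willing to accept a little bias to get a more precise estimator, provided the MSE is improved
- If $\hat{\theta}$ is unbiased, then $MSE[\hat{\theta}] = V[\hat{\theta}]$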

MSE of SRS estimators

- All of these estimators are unbiased under SRS (Bias = 0)
- So under SRS, $MSE[\hat{\theta}] = V[\hat{\theta}]$ for $\bar{y}$, $\hat{t}$, and $\hat{p}$

Confidence intervals

- Estimate variance, SE, CV, MSE of estimator under a design to provide indication of quality of estimate
- Another approach
- Estimate a confidence interval to express precision of estimate

Book example 2.7, p. 35-6

- True parameter value: t = 40
- CI of interest:
- List 70 possible samples of size n = 4
- Each sample has a probability of selection P(S)
- For each sample, record value of a variable u that indicates whether CI from sample S includes t = 40
- Confidence coefficient: $\sum_{S} P(S)\,u(S)$, the total probability of the samples whose CI includes t

Ex – 2: Assume SRSWOR

- If 60 of the 70 SRSWOR samples resulted in CIs that included the true total, what is the confidence coefficient?
- What is alpha?

What is a 95% confidence interval (CI) under SRS?

- Heuristic definition
- Take repeated samples of size n from population of size N
- Collect data on Y
- Calculate an estimate of a population parameter using data from n observations
- Calculate 95% CI for parameter estimate using data from n observations
- Expect 95% of the CIs to contain the true value of the parameter

Interpreting CIs in general

- More generally (for any design), a $(1-\alpha)100\%$ CI has the interpretation
- There is a $(1-\alpha)100\%$ chance of selecting a sample for which the CI will include the true population parameter
- Note
- The upper and lower limits of the CI are random variables, calculated from the sample data
- The true parameter value is either included or not included in a single CI
- Confidence coefficient of a CI has a relative frequency interpretation across samples

Confidence interval definition

- Standard estimator for a $(1-\alpha)100\%$ confidence interval (CI): $\hat{\theta} \pm z_{\alpha/2}\,SE[\hat{\theta}]$

Standard normal distribution

- Z ~ N(0, 1)
- Z is the random variable
- Mean E{Z} = 0 and variance V{Z} = 1
- Two-sided $(1-\alpha)100\%$ confidence interval
- Use critical value $z_{\alpha/2}$
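
A normal-theory interval of the form estimate ± critical value × SE can be sketched with the standard library. The function name `normal_ci` is hypothetical, and the inputs 0.51 and 0.0275 are illustrative values for an estimated proportion and its SE.

```python
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Two-sided (1 - alpha)100% CI using the standard normal critical value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return estimate - z * se, estimate + z * se

z_975 = NormalDist().inv_cdf(0.975)
lo, hi = normal_ci(0.51, 0.0275)              # illustrative estimate and SE
```

The interval is only as trustworthy as the normal approximation behind it, which is the subject of the slides that follow.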

Infinite vs. finite populations

- In other stat classes …
- Assume SRS with replacement from infinite pop
- Justify CI by applying the Central Limit Theorem (CLT)
- In sample surveys, we have a finite number of possible samples
- Can calculate exact confidence coefficient $1-\alpha$ for a stated interval (see previous example)
- In practice, it is not possible to list all possible samples, so we have a special CLT that relies on a “superpopulation” framework

Superpopulation framework

- Asymptotic framework for SRSWOR in finite populations
- Population is part of a larger superpopulation
- There is a series of increasingly larger superpopulations
- Use superpopulation concept to derive a Central Limit Theorem for SRSWOR
- Bottom line
- We will use the standard CI estimator with a different theoretical justification

When is CLT justified?

- Confidence coefficient is approximate
- Quality of approximation depends on n and the distribution of the underlying random variable, Y
- “n is large enough for CLT” is less clear for finite populations
- n = 30 rule in other stat classes does NOT apply
- Rules of thumb
- If distribution of Y is close to normal, n = 50
- Need larger n if distribution of Y deviates from normal, e.g., skewed
- Y categorical: if p is proportion with characteristic of interest, $np \ge 5$ and $n(1-p) \ge 5$

Determining sample size – a general approach

- Specify tolerable error (level of precision, level of confidence)
- Identify appropriate equation relating tolerable error (e, $\alpha$) to sample size (n)
- Estimate unknown parameters in equation
- Solve for n
- Evaluate (and return to first step)
- Can you afford sample size?
- What expectations can be altered?

Specify tolerable error

- Two parameters
- e : margin of error or half-width of CI
- $\alpha$ : $[1-\alpha]100\%$ is confidence level
- Absolute expression (half-width of CI): estimate within e of true pop parameter
- Relative expression: within 100e% of $\bar{y}_U$

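Equation linking e, $\alpha$, and n

- Most common equation is half-width of CI
- Example: sample mean under SRSWOR

$$e = z_{\alpha/2}\sqrt{\left(1 - \frac{n}{N}\right)\frac{S^2}{n}} \quad\Rightarrow\quad n = \frac{n_0}{1 + n_0/N}, \qquad n_0 = \left(\frac{z_{\alpha/2}\,S}{e}\right)^2$$

Note for $n_0$

- For p , use $S^2 \approx p(1-p)$
- For $\alpha$ = 0.05, use $z_{\alpha/2} = 1.96$
- $n_0$ is sample size under SRSWR (ignoring FPC)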
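
Solving the half-width equation for n can be sketched as a small function. The name `srs_sample_size`, the choice to round up to the next integer, and the example inputs (p = 0.5, e = 0.05, N = 3078) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def srs_sample_size(e, S2, N=None, alpha=0.05):
    """Sample size for margin of error e; applies the FPC adjustment when N is given."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z ** 2 * S2 / e ** 2                 # SRSWR size, ignoring the FPC
    n = n0 if N is None else n0 / (1 + n0 / N)
    return math.ceil(n)                       # round up so the target e is met

n_srswr = srs_sample_size(0.05, 0.5 * 0.5)            # proportion, p = 0.5
n_srswor = srs_sample_size(0.05, 0.5 * 0.5, N=3078)   # same target with N = 3078
```

The adjusted size is never larger than the unadjusted one, and the two differ noticeably only when the unadjusted size is an appreciable fraction of N.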

Estimate unknowns: population variance of y, S2

- Use estimator for variance, s2
- Pilot study
- Previous study
- Careful about comparability
- Use CV from previous study
- Careful about comparability
- Guess variance under normality
- estimate of S = range for 95% of values / 4
- estimate of S = range for 99% of values / 6

Estimating unknowns: population proportion, p

- Use estimates from pilot or previous study
- If know nothing of true proportion
- Use p = 0.5
- Max possible variance for estimated proportion under SRS, so this is conservative
- Commonly used

Practicalities for determining n

- Sampling fraction rarely important
- Most populations are large enough that sampling fraction n/N is small for practical values of n
- Subpopulations should influence sample size
- 95% CI for a proportion ($\alpha$ = 0.05, p = 0.5)
- Implies $e \approx 1.96\sqrt{p(1-p)/n} \approx 1/\sqrt{n}$ (ignoring the FPC)
- n = 400 for e ≈ 0.05 (whole sample)
- n = 100 for e ≈ 0.10 (subpopulation)
- n = 50 for e ≈ 0.15 (subpopulation)
- n = 500 for e ≈ 0.04 (little gain over 400)
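
The rounded figures in this list can be compared with the half-width computed directly, ignoring the FPC. The function name `half_width` is illustrative.

```python
import math

def half_width(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion, ignoring the FPC."""
    return z * math.sqrt(p * (1 - p) / n)

widths = {n: round(half_width(n), 3) for n in (50, 100, 400, 500)}
```

Going from 400 to 500 observations moves the margin of error only from about 0.049 to about 0.044, while a subpopulation of 50 has a margin near 0.14.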

SRS: pros and cons

- Cons
- SRS is rarely the “best” design
- May not have a list of all OUs, so a different design is needed
- May have additional info on pop to create a more efficient design (improve precision)
- Pros / uses
- Standard stat procedures can be used with little or no bias
- Mainly interested in regression rather than estimating pop params (ignore sample design – but could still get a better sample)
