Chapter 3

Chapter 3 Producing data

Observation versus Experiment • An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. • An experiment deliberately imposes some treatment on individuals to observe their responses.

Confounding • Two variables (explanatory variables or lurking variables) are confounded when their effects on a response variable cannot be distinguished from each other.

Population, Sample • The population in a statistical study is the entire group of individuals about which we want information. • A sample is a part of the population from which we actually collect information, which we use to draw conclusions about the whole.

Population, Sample • For the remainder of the course, you must be able to differentiate between a population of interest and a sample that gives information about that population. • Work through the scenarios in your text; practicing with as many problems as possible will aid your understanding. • Populations are not always groups of people. They can be groups of objects (e.g., items inspected for quality control).

Example • All students at ISU, all citizens of Ames, and all consumers driving Mercedes are examples of populations. • Students who study on the second floor of Parks Library are a sample from the population of all ISU students. • Mercedes drivers in Ames are a sample from all consumers driving Mercedes. • Trees in front of Curtis Hall are a sample.

Example • Each week, the Gallup Poll questions a sample of about 1500 adult U.S. residents to determine national opinion on a wide variety of issues, such as the approval rating of the president. • What is the population of interest? • All U.S. adults • What is the sample? • 1500 sampled U.S. adults

Example • A social scientist wants to know the opinions of employed adult women about government funding for day care. She obtains a list of the 520 members of a local business and professional women’s club and mails a questionnaire to 100 of these women selected at random. Only 48 questionnaires are returned. • What is the population in this study? • What is the sample from whom information is actually obtained? • What is the rate (percent) of response?

Voluntary Response Sample • A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.

Bias • The design of a study is biased if it systematically favors certain outcomes. • Example: a scale that always shows objects as 2 kg too heavy. • Example: a survey whose wording skews the responses away from people's true opinions: “Given the myriad of health problems alcohol can cause, do you still support lowering the legal age of drinking to 18?” How will most people respond after such a leading question?

Simple Random Sample • A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
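A minimal sketch of drawing an SRS in Python, assuming the population is available as a complete list (a sampling frame). The function name `simple_random_sample`, the frame of 520 ID numbers, and the seed are illustrative assumptions, not part of the slides. The standard-library call `random.Random.sample` draws without replacement, so every set of n individuals is equally likely to be chosen.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: ID numbers 1..520 (illustrative only)
frame = list(range(1, 521))
srs = simple_random_sample(frame, 100, seed=1)
print(sorted(srs)[:10])                # first few selected IDs
```

Changing or omitting the seed gives a different, equally valid SRS.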

Undercoverage and Nonresponse • Undercoverage occurs when some groups in the population are left out of the process of choosing the sample. • Nonresponse occurs when an individual chosen for the sample can’t be contacted or refuses to cooperate.

Subjects, Factors, Treatments • The individuals studied in an experiment are often called subjects, especially if they are people. • The explanatory variables in an experiment are often called factors. • A treatment is any specific experimental condition applied to the subjects. If an experiment has several factors, a treatment is a combination of a specific value (often called a level) of each of the factors.

Completely Randomized Design • In a completely randomized experimental design, all the subjects are allocated at random among all the treatments.
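One way to carry out a completely randomized design is to shuffle the subjects and deal them out to the treatments in turn. The sketch below assumes 20 hypothetical subjects and two hypothetical treatments labeled "A" and "B"; the function name and labels are illustrative.

```python
import random

def completely_randomized(subjects, treatments, seed=None):
    """Allocate all subjects at random among all treatments (near-equal group sizes)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)                       # impersonal chance decides the order
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        # deal shuffled subjects to treatments in rotation
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subjects and treatments (illustrative only)
subjects = [f"S{i:02d}" for i in range(1, 21)]
groups = completely_randomized(subjects, ["A", "B"], seed=3)
for treatment, members in groups.items():
    print(treatment, sorted(members))
```

Because the shuffle is random, no characteristic of the subjects can systematically influence which treatment they receive.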

Principles of Experimental Design • Control the effects of lurking variables on the response, most simply by comparing two or more treatments. • Randomize: use impersonal chance to assign subjects to treatments. • Replicate each treatment on enough subjects to reduce chance variation in the results.

Statistical Significance • An observed effect so large that it would rarely occur by chance is called statistically significant.

Section 3.3 Statistical Inference

Example • A market research firm interviews a random sample of 2500 adults. • Result: 66% find shopping for clothes frustrating and time-consuming. • We want to know the opinion of almost 210 million adult Americans who make up the population. • Because the sample was chosen at random, it is reasonable to think that these 2500 people represent the entire population pretty well.

Example • 2500 adults were asked whether they agree or disagree that “I like buying new clothes, but shopping is often frustrating and time-consuming”. • 1650 said they agreed. • The sample proportion p̂ = 1650/2500 = 0.66 is a statistic. • The corresponding parameter is the proportion (call it p) of all adult U.S. residents who would have said “Agreed” if asked the same question. • What’s the truth about the almost 210 million American adults who make up the population? • A basic move in statistics is to use a fact about a sample to estimate the truth about the whole population.

Statistical Inference • Statistical Inference is when we infer conclusions about the wider population from data on selected individuals • To think about inference, we must keep straight whether a number describes a sample or a population • Definitions time!

Parameters and Statistics • A parameter is a number that describes the population • a fixed number • in practice, we don’t know its value • A statistic is a number that describes a sample • its value is known when we have taken a sample • value can change from sample to sample • often used to estimate an unknown parameter • In the Gallup Polls the parameter is the proportion of adult U.S. residents who approve of FEMA’s response to Katrina • The statistic is the proportion of people sampled who approve of FEMA’s response to Katrina

Parameter, Statistic • Example: • We denote a Normal distribution with mean μ and standard deviation σ as N(μ, σ). • Formally, we call μ and σ parameters. • When we describe a reasonably symmetric histogram, we can use the sample mean x̄ and the sample standard deviation s to describe its center and spread. • x̄ and s are called statistics.
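As a small illustration, the sample mean and sample standard deviation (commonly written x̄ and s) can be computed from data with Python's standard `statistics` module. The seven measurements below are made up for the example. The population mean and standard deviation stay unknown; only the statistics are computed.

```python
import statistics

# Hypothetical sample of seven measurements (made-up numbers)
sample = [4.1, 5.0, 5.2, 4.8, 5.5, 4.9, 5.1]

x_bar = statistics.mean(sample)   # statistic: describes the center of the sample
s = statistics.stdev(sample)      # statistic: sample standard deviation (n - 1 divisor)

print(f"x-bar = {x_bar:.3f}, s = {s:.3f}")
```

A different sample from the same population would give different values of both statistics, while the parameters would not change.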

Keep In Mind the Big Picture • [Diagram: a sample is drawn from the population; the sample statistic is used for inference about the population parameter.]

Sampling Variability • If we took a second sample of 2500 adults, the new sample would have different people. • It is almost certain that there would not be exactly 1650 positive responses. • The value of p̂ will vary from sample to sample! • If we choose different samples from the same population, we will end up with different values of the statistic.

Sampling Variability • Random samples eliminate bias from the act of choosing a sample, but they can still be “wrong” because of the variability that results when we choose at random. • If the variation when we take repeated samples from the same population is too great, we can’t trust the results of any one sample. • If we take lots of random samples of the same size from the same population, the variation from sample to sample will follow a predictable pattern.

Sampling Variability • All of statistical inference is based on one idea: to see how trustworthy a procedure is, ask what would happen if we repeated it many times • Definition: The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population. • What would happen if we took many samples? • take a large number of samples from the same population • calculate the sample statistic for each sample • make a histogram of the values of the sample statistic • examine the distribution displayed in the histogram for shape, center, and spread, as well as outliers or other deviations
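The "many samples" recipe above can be sketched as a short simulation. The code assumes a population proportion of p = 0.6 and treats each response as an independent draw, which is a reasonable approximation when the population is far larger than the sample. The function names, the seed, and the text histogram are illustrative choices.

```python
import random
import statistics
from collections import Counter

def sample_proportion(p, n, rng):
    """Proportion of 'yes' responses in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

def simulate_sampling_distribution(p, n, num_samples, seed=0):
    """Take many samples and record the statistic from each one."""
    rng = random.Random(seed)
    return [sample_proportion(p, n, rng) for _ in range(num_samples)]

# Assumed values: p = 0.6, samples of size 100, 1000 samples
p_hats = simulate_sampling_distribution(p=0.6, n=100, num_samples=1000)

print("center:", round(statistics.mean(p_hats), 3))
print("spread:", round(statistics.stdev(p_hats), 3))

# Crude text histogram: 5-percentage-point bins, one '#' per 10 samples
counts = Counter(round(ph * 100) // 5 * 5 for ph in p_hats)
for lower in sorted(counts):
    print(f"{lower:3d}-{lower + 4:3d}%  {'#' * (counts[lower] // 10)}")
```

A simulation only approximates the sampling distribution, which is defined over all possible samples; more simulated samples give a closer approximation.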

Sampling Distribution • The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population • The sampling distribution is the ideal pattern that would emerge if we looked at all possible samples of the same size from our population • One of the uses of probability theory in statistics is to obtain sampling distributions without simulation

Example: Opinion about shopping. • Case 1: take 1000 random samples, each of size 100. • The sample statistic is the sample proportion p̂. • We use the sample proportion p̂ to estimate the unknown value of the population proportion p. • [Figure: histogram of the 1000 values of p̂ from samples of size 100.]

Example: Opinion about shopping. • Case 2: take 1000 random samples, each of size 2500. • We use the sample proportion p̂ to estimate the unknown value of the population proportion p. • [Figure: histogram of the 1000 values of p̂ from samples of size 2500.]

Sampling Distribution • Shape: both histograms look approximately Normal. • Center: in both cases, the values of the sample proportion p̂ vary from sample to sample, but they are centered at 0.6 (the mean of the 1000 values of p̂ is 0.598 for samples of size 100 and 0.6002 for samples of size 2500). • Spread: the values of p̂ from samples of size 2500 are much less spread out than the values from samples of size 100. The standard deviations are about 0.01 and 0.05 respectively, in line with sqrt(p(1 − p)/n) for p = 0.6. • As the sample size gets larger, the variation in the sampling distribution gets smaller.
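The effect of sample size on spread can be checked against the standard formula for the standard deviation of a sample proportion, sqrt(p(1 − p)/n). The formula is a textbook result rather than something stated on the slides. The sketch below assumes p = 0.6 and compares the formula with a simulation of 1000 samples at each size; function names and the seed are illustrative.

```python
import math
import random
import statistics

def theoretical_sd(p, n):
    """Standard deviation of the sample proportion for independent draws."""
    return math.sqrt(p * (1 - p) / n)

def simulated_sd(p, n, num_samples, rng):
    """Standard deviation of p-hat across many simulated samples of size n."""
    p_hats = []
    for _ in range(num_samples):
        yes = sum(rng.random() < p for _ in range(n))
        p_hats.append(yes / n)
    return statistics.stdev(p_hats)

rng = random.Random(42)
p = 0.6                      # assumed population proportion
results = {}
for n in (100, 2500):
    results[n] = (simulated_sd(p, n, 1000, rng), theoretical_sd(p, n))
    print(f"n = {n:4d}: simulated sd = {results[n][0]:.4f}, "
          f"formula sd = {results[n][1]:.4f}")
```

With p = 0.6 the formula gives roughly 0.049 for n = 100 and 0.0098 for n = 2500. Making the sample 25 times larger shrinks the spread by a factor of 5, the square root of 25.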

Want more details? Ex. 3.20 • Our text contains a much more detailed discussion of the shopping question. • See pages 208-209 and figures 3.7 and 3.8.

Bias and Variability • Bias concerns the center of the sampling distribution • a statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated • The variability of a statistic is described by the spread of its sampling distribution • this spread is determined by the sampling design and the sample size n • statistics from larger samples have smaller spreads

Bias and Variability • We can think of the true value of the population parameter as the bull’s-eye on a target, and we can think of the sample statistic as an arrow fired at the bull’s-eye • bias and variability describe what happens when an archer fires many arrows at the target (page 213) • bias means that the aim is off, and the arrows land consistently off the bull’s-eye in the same direction • large variability means that repeated shots are widely scattered on the target

Bias and Variability

Managing Bias and Variability • To reduce bias, use random sampling. When we start with a list of the entire population, simple random sampling produces unbiased estimates: the values of a statistic computed from an SRS neither consistently overestimate nor consistently underestimate the value of the population parameter. • To reduce the variability of a statistic from an SRS, use a larger sample. You can make the variability as small as you want by taking a large enough sample.

Sampling from Large Populations • Population Size Does Not Matter • The variability of a statistic from a random sample does not depend on the size of the population, as long as the population is at least 100 times larger than the sample. • If we denote the population size by N and the sample size by n, then we want N ≥ 100n.
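A short calculation shows why population size hardly matters. For an SRS drawn without replacement, the standard deviation of a sample proportion is sqrt(p(1 − p)/n) multiplied by the finite population correction sqrt((N − n)/(N − 1)). This is a standard result that the slides do not state. The sketch assumes p = 0.6 and n = 2500; the population sizes are illustrative.

```python
import math

def sd_p_hat_srs(p, n, N):
    """SD of the sample proportion for an SRS of size n from a population of size N."""
    fpc = math.sqrt((N - n) / (N - 1))          # finite population correction
    return math.sqrt(p * (1 - p) / n) * fpc

p, n = 0.6, 2500                                # assumed values
for N in (5_000, 25_000, 250_000, 210_000_000):
    print(f"N = {N:>11,}: sd = {sd_p_hat_srs(p, n, N):.5f}")
```

Once N reaches 100n (here 250,000), the correction factor is about 0.995, so the spread is within about half a percent of its value for an effectively infinite population. Only when the sample is a large share of the population does N make a noticeable difference.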

Why Randomize? • The act of randomizing guarantees that our data are subject to the laws of probability • The behavior of statistics is described by a sampling distribution. • The form of the distribution is known, and in many cases is approximately Normal • Usually, the center of the distribution lies at the true parameter value • The spread of the distribution describes the variability of the statistic

Cautions • The proper statistical design is not the only aspect of a good sample or experiment • The sampling distribution shows only how a statistic varies due to the operation of chance in randomization • The sampling distribution reveals nothing about possible bias due to undercoverage or nonresponse in a sample or to lack of realism in an experiment • The true distance of a statistic from the parameter it is estimating can be much larger than the sampling distribution suggests (random chance!) • We cannot actually gauge the added error
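The caution about undercoverage can be illustrated with a simulation. The scenario is entirely hypothetical: 70% of the population is on the sampling frame and agrees with a statement at a rate of 0.7, while the 30% left off the frame agrees at a rate of 0.3. All constants and names are made up for the illustration.

```python
import random
import statistics

# Hypothetical population (all values are assumptions for this sketch)
COVERED_SHARE = 0.7    # fraction of the population on the sampling frame
P_COVERED = 0.7        # agreement rate among people on the frame
P_UNCOVERED = 0.3      # agreement rate among people missing from the frame

true_p = COVERED_SHARE * P_COVERED + (1 - COVERED_SHARE) * P_UNCOVERED

def p_hat_from_frame(n, rng):
    """Sample proportion when only people on the frame can be selected."""
    return sum(rng.random() < P_COVERED for _ in range(n)) / n

rng = random.Random(7)
p_hats = [p_hat_from_frame(1000, rng) for _ in range(500)]
center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)

print(f"true p = {true_p:.2f}")
print(f"center of sampling distribution = {center:.3f}")
print(f"spread of sampling distribution = {spread:.3f}")
```

The simulated sampling distribution is tight, yet it is centered near 0.7 rather than the true value 0.58. The small spread gives no hint of the error, which is why the sampling distribution alone cannot reveal bias from undercoverage.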

Problems • 3.66 • Statistic, it describes the sample • 3.68 • a) High Variability (HV), High Bias (HB) • b) LV,LB • c) HV,LB • d) LV,HB • 3.70 • a) It won’t vary. Population size does not impact variability. • b) It will vary. The sample size changed!

Section 3.3 Summary • A number that describes a population is a parameter. A number that can be computed from the data is a statistic. The purpose of sampling or experimentation is usually to use statistics to make statements about unknown parameters.

Section 3.3 Summary • A statistic from a probability sample or randomized experiment has a sampling distribution that describes how the statistic varies in repeated data production. The sampling distribution answers the question, “What would happen if we repeated the sample or experiment many times?” Formal statistical inference is based on the sampling distributions of statistics.

Section 3.3 Summary • A statistic as an estimator of a parameter may suffer from bias or from high variability. Bias means that the center of the sampling distribution is not equal to the true value of the parameter. The variability of the statistic is described by the spread of its sampling distribution.

Section 3.3 Summary • Properly chosen statistics from randomized data production designs have no bias resulting from the way the sample is selected or the way the subjects are assigned to treatments. We can reduce the variability of the statistic by increasing the size of the sample or the size of the experimental groups.
