CHAPTER 12. SAMPLES AND SURVEYS
Mrs. Padilla, WHS AP STATS
Sample Surveys: Producing Valid Data
“If you don’t believe in random sampling,
the next time you have a blood test tell
the doctor to take it all.”
Beyond the Data at Hand to the World at Large.
We have learned ways to display, describe, and summarize data, but have been limited to examining the particular batch of data we have.
We’d like (and often need) to stretch beyond the data at hand to the world at large.
Let’s investigate three major ideas that will allow us to make this stretch…
Examine a Part of the Whole:
The first idea is to draw a sample.
We’d like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible.
We settle for examining a smaller group of individuals—a sample—selected from the population. We must consider the following:
A sampling frame is a list of individuals from the population of interest from which the sample is drawn.
Sampling variability is the natural tendency of randomly drawn samples to differ, one from another. Sampling variability is not an error, just the natural result of random sampling.
1. Think about sampling something you are cooking—you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.
2. Opinion polls are examples of sample surveys, designed to ask questions of a small group of people in the hope of learning something about the entire population.
Convenience sampling: just ask whoever is around.
EXAMPLE: Manufacturers and advertising agencies often use interviews at shopping malls to gather information about the habits of consumers and the effectiveness of ads. A sample of mall shoppers is fast and cheap. “Mall interviewing is being propelled primarily as a budget issue,” one expert told the New York Times. But people contacted at shopping malls are not representative of the entire US population. They are richer, for example, and more likely to be teenagers or retirees. Moreover, mall interviewers tend to select neat, safe-looking individuals from the stream of customers. Decisions based on mall interviews may not reflect the preferences of all consumers.
Bias: Opinions limited to individuals present.
Voluntary response sampling: individuals choose to be involved.
These samples are very susceptible to being biased because different people are motivated to respond or not. Often called “public opinion polls.” These are not considered valid or scientific.
Bias: Sample design systematically favors a particular outcome.
Ann Landers, summarizing responses from readers: 70% of the (10,000) parents who wrote in said that having kids was not worth it; if they had to do it over again, they wouldn’t.
Bias: Most letters to newspapers are written by disgruntled people. A random sample showed that 91% of parents WOULD have kids again.
CNN on-line surveys:
Bias: People have to care enough about an issue to bother replying. This sample is probably a combination of people who are interested in their family’s history and who may or may not know how to go about it.
Note: They are sure to let you know that this is not a scientific poll.
Administrators at a hospital are concerned about the possibility of drug abuse by the individuals who work there. They decide to look into the extent of the problem by having a random sample of the employees undergo a drug test. The administrators randomly select a department (let’s say, radiology) and test all those who work in this particular department, including doctors, nurses, technicians, clerks, custodians, etc.
Why might this result in a biased sample?
- The department might not represent the full range of employee types, experiences, stress levels, or access to the hospital’s drug supply.
Name the kind of bias that might be present if the administration decides that instead of subjecting people to random testing they’ll just…
a. interview employees about possible drug abuse
Response bias: people will feel threatened, won’t answer truthfully
b. ask people to volunteer to be tested
Voluntary response bias; only those who are “clean” would volunteer
Bias is the annoyance of sampling—the one thing above all to avoid.
There is usually no way to fix a biased sample and no way to salvage useful information from it.
The best way to avoid bias is to select individuals for the sample at random.
The value of deliberately introducing randomness is one of the great insights of Statistics
Note: Never trust the results of a sample survey until you have read the exact questions posed. The sampling design, the amount of nonresponse, and the date of the survey are also important. Good statistical design is a part, but only a part, of a trustworthy survey.
Which leads us to – IDEA 2
Nonresponse bias – This occurs in a sample design when individuals selected for the sample fail to respond, cannot be contacted, or decline to participate.
Response bias – Anything in a survey that influences responses falls under this category of bias. Examples are biased wording of survey questions, lack of privacy while being surveyed, and appearance of the interviewer.
EXAMPLE: Should We Ban Disposable Diapers? A survey paid for by makers of disposable diapers found that 84% of the sample opposed banning disposable diapers. Here is the actual question:
“It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard waste are estimated to account for about 21% of the trash in landfills. Given this, in your opinion, would it be fair to ban disposable diapers?”
This question gives information on only one side of an issue, then asks an opinion. That’s a sure way to bias the responses. A different question that described how long disposable diapers take to decay and how many tons they contribute to landfills each year would draw a quite different response.
Undercoverage – A sampling scheme that fails to sample from some part of the population or that gives a part of the population less representation than it has in the population suffers from undercoverage. Obtaining a sampling frame is not always possible and can result in undercoverage.
EXAMPLE: The Census Undercount: Even the US Census, backed by the resources of the federal government, suffers from undercoverage and nonresponse. The census begins by mailing forms to every household in the country. The Census Bureau’s list of addresses is incomplete, resulting in undercoverage. Despite special efforts to count homeless people (who can’t be reached by any address), homelessness causes more undercoverage.
In 1990, about 35% of households that were mailed census forms did not mail them back. In New York City, 47% did not return the form. That’s nonresponse. The Census Bureau sent interviewers to these households. In inner-city areas, the interviewers could not contact about one in five of the nonresponders, even after six tries.
The Census Bureau estimates that the 1990 census missed about 1.8% of the total population due to undercoverage and nonresponse. Because the undercount was greater in the poorer sections of large cities, the Census Bureau estimates that it failed to count 4.4% of blacks and 5.0% of Hispanics.
For the 2000 Census, the Bureau planned to replace follow-up of all nonresponders with more intense pursuit of a probability sample of nonresponding households plus a national sample of 750,000 households. The final counts would be based on comparing the national sample with the original responses. This idea was politically controversial. The Supreme Court ruled that the sampling could be used for most purposes, but not for dividing seats in Congress among the states.
Randomization can protect you against factors that you know are in the data
It can also help protect against factors you are not even aware of
Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.
Randomizing makes sure that on the average the sample looks like the rest of the population
Individuals are randomly selected. No one group should be over-represented.
Sampling randomly removes bias from the selection of individuals.
Random samples rely on the absolute objectivity of random numbers.
There are tables and books of random digits available for random sampling.
Statistical software can generate random digits (e.g., Excel “=RAND()”, ran# button on calculator).
Not only does randomizing protect us from bias, it actually makes it possible for us to draw inferences about the population when we see only a sample.
EXAMPLE: Consider the 20 pharmacists on the hospital staff. Use the random numbers listed below to select three of them to be in the sample.
04905 83852 29350
91397 19994 65142
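The pharmacist exercise can be worked by hand or sketched in code. The snippet below is one possible approach, not the only one: it assumes the pharmacists are labeled 01 through 20 (the labeling is our assumption, since the slide does not give one) and reads the digits above as successive two-digit groups, skipping labels that are out of range or already chosen. The function name select_with_digit_table is made up for this illustration.

```python
def select_with_digit_table(digits, population_size, sample_size):
    """Read fixed-width labels from a string of random digits.

    Assumes individuals are labeled 1..population_size, written with
    leading zeros (01, 02, ..., 20 for twenty individuals).
    """
    stream = "".join(ch for ch in digits if ch.isdigit())
    width = len(str(population_size))
    chosen = []
    for start in range(0, len(stream) - width + 1, width):
        label = int(stream[start:start + width])
        # Skip labels outside the range and labels already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


random_digits = "04905 83852 29350 91397 19994 65142"
pharmacist_sample = select_with_digit_table(random_digits, 20, 3)
print(pharmacist_sample)  # [4, 9, 13]
```

Under this labeling, the two-digit groups are 04, 90, 58, 38, 52, 29, 35, 09, 13, ... so pharmacists 04, 09, and 13 are selected. A different labeling (say 00 through 19) would give a different, equally valid sample.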
How large a random sample do we need for the sample to be reasonably representative of the population?
It’s the size of the sample, not the size of the population, that makes the difference in sampling.
Exception: If the population is small enough and the sample is more than 10% of the whole population, the population size can matter.
The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important.
In the city of Chicago, Illinois, 1,000 likely voters are randomly selected and asked who they are going to vote for in the Chicago mayoral race.
In the state of Illinois, 1,000 likely voters are randomly selected and asked who they are going to vote for in the Illinois
In the United States, 1,000 likely voters are randomly selected and asked who they are going to vote for in the presidential election.
Which survey has more accuracy?
Answer: All three surveys have about the same accuracy, because each is based on a random sample of 1,000.
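To see why the answer holds, here is a rough numerical sketch. It uses the standard formula for the standard error of a sample proportion from an SRS, including the finite population correction; that formula is not on these slides and is brought in only for illustration. The electorate sizes are hypothetical round numbers, not real voter counts.

```python
import math


def standard_error(p, n, population_size):
    """Standard error of a sample proportion from an SRS of size n,
    including the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc


# Hypothetical numbers of likely voters (illustrative only).
electorates = {"city": 1_500_000, "state": 8_000_000, "nation": 150_000_000}
errors = {name: standard_error(0.5, 1000, size) for name, size in electorates.items()}
for name, se in errors.items():
    print(f"{name}: {se:.4f}")

# The exception: a sample that is more than 10% of a small population.
small_population_error = standard_error(0.5, 1000, 5000)
print(f"population of 5,000: {small_population_error:.4f}")
```

The three electorates give standard errors that agree to about four decimal places, while the population of 5,000 (where the sample is 20% of the whole) gives a noticeably smaller one. That is the 10% exception at work.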
Ask yourself: Why bother worrying about the sample size? Wouldn’t it be better to just include everyone and “sample” the entire population?
Such a special sample is called a census.
What problems are there with taking a census?
Practicality: It can be difficult to complete a census—there always seem to be some individuals who are hard to locate or hard to measure.
Timeliness: populations rarely stand still. Even if you could take a census, the population changes while you work, so it’s never possible to get a perfect measure.
Expense: taking a census may be more complex and more expensive than sampling.
Accuracy: a census may not be as accurate as a good sample due to data entry error, inaccurate (made-up?) data, tedium.
POPULATION VS. SAMPLE
Population: The entire group of individuals in which we are interested but cannot usually access directly (e.g., humans, all working-age people in CA, …).
Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design.
A parameter is a number describing a characteristic of the population.
A statistic is a number describing a characteristic of a sample.
Values of population parameters are unknown; in addition, they are unknowable.
Example: The distribution of heights of adult females in the US is approximately symmetric and mound-shaped with mean µ. µ is a population parameter whose value is unknown or unknowable.
The heights of 1500 females are obtained from a sample of government records. The sample mean x̅ of the 1500 heights is calculated to be 64.5 inches.
The sample mean x̅ is a sample statistic that we use to estimate the unknown population mean µ.
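As you know, we typically use Greek letters to denote parameters and Latin letters to denote statistics. The following is our guide:
Mean: statistic x̅, parameter µ
Standard deviation: statistic s, parameter σ
Correlation: statistic r, parameter ρ
Regression coefficient: statistic b, parameter β
Proportion: statistic p̂, parameter p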
Various claims are often made. Why is each of the following claims incorrect?
It is always better to take a census than a sample
Timeliness, expense, complexity and accuracy
Stopping students on their way out of the cafeteria is a good way to sample if we want to know the quality of the food in the cafeteria.
Bias; the students are those who chose to eat at the cafeteria
We drew a sample of 100 from 3,000 students at a Junior College. To get the same level of precision for a town of 30,000 residents, we’ll need a sample of 1,000 residents.
It’s the sample size, not the size of the population or the fraction of the population that we sample, that is important.
An internet poll taken at the web site www.statsisfun.org garnered 12,357 responses. The majority said they enjoy doing statistics homework. With a sample size that large, we can be pretty sure that most Statistics students feel this way, too.
Voluntary response bias; a large sample size does not remove the bias.
The true percentage of all Statistics students who enjoy the homework is called a “population statistic.”
The true percentage is a population parameter
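The point about the internet poll, that a large sample does not cure voluntary response bias, can be illustrated with a small simulation. Everything below is made up for the sketch: the population size, the 30% true share, and the response rates (30% for students who enjoy the homework, 5% for everyone else) are assumptions, not data.

```python
import random

rng = random.Random(12)

# Hypothetical population: 100,000 students, 30% of whom enjoy the homework.
population = [True] * 30_000 + [False] * 70_000
true_share = sum(population) / len(population)

# Simple random sample of 1,000 students.
srs = rng.sample(population, 1000)
srs_estimate = sum(srs) / len(srs)

# Voluntary response: assume fans reply 30% of the time, everyone else 5%.
respondents = [
    enjoys for enjoys in population
    if rng.random() < (0.30 if enjoys else 0.05)
]
voluntary_estimate = sum(respondents) / len(respondents)

print(f"true share:          {true_share:.2f}")
print(f"SRS of 1,000:        {srs_estimate:.2f}")
print(f"{len(respondents):,} volunteers: {voluntary_estimate:.2f}")
```

Under these assumptions the voluntary poll collects well over ten times as many responses as the SRS, yet its estimate lands far from the true share, while the much smaller random sample lands close to it.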
A simple random sample (SRS) of size n is one in which each possible sample of n individuals has an equal chance of selection. (Remember: a sample of size n consists of n units chosen from the population.)
To select a random sample, we must first define where the sample will come from.
The Sampling Frame is a list of individuals from which the sample is drawn.
E.g., to select a random sample of students from a college, we might obtain a list of all registered full-time students.
When defining the sampling frame, we must deal with the details of defining the population: are part-time students included? How about current study-abroad students?
Once we have our sampling frame, the easiest way to choose an SRS is with random numbers.
If some members of the population are not included in the sampling frame, they cannot be part of the sample! (E.g., using a telephone book as the sampling frame leaves out everyone without a listed number.)
Population: WalMart shoppers
EXAMPLE: Joan’s small accounting firm serves 30 business clients. Joan wants to interview a sample of 5 clients in detail to find ways to improve client satisfaction. To avoid bias, she chooses an SRS of size 5.
Step 1: Label: Give each client a numerical label, using as few digits as possible. Two digits are needed to label 30 clients, so we use labels 01, 02, 03, 04, 05, …, 29, 30.
Step 2: Table: Enter your Random Digits table anywhere and read two-digit groups.
Each successive two-digit group is a label. The labels 00 and 31 to 99 are not used in this example, so we ignore them. The first 5 distinct labels between 01 and 30 that we encounter in the table make up our sample. Of the first 10 labels in line 130, we ignore 5 because they are too high (over 30). The others are 05, 16, 17, 17, and 17. Ignore the second and third 17s because that client is already in the sample. Now run your finger across line 130 (and continue to line 131 if needed) until 5 clients are chosen.
The sample is the clients labeled 05, 16, 17, 20, and 19, which are Bailey Trucking, JL Records, Johnson Commodities, Magic Tan, and Liu’s Chinese Restaurant.
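Software can stand in for the random digits table. The sketch below uses Python's standard random module to draw an SRS of 5 from labels 1 to 30. The seed value is arbitrary and the labels are placeholders for Joan's client list, so the output will not match the table-based sample above.

```python
import random

# Labels 1..30 stand in for Joan's 30 clients.
client_labels = list(range(1, 31))

rng = random.Random(2024)  # arbitrary seed so the draw can be repeated
srs = rng.sample(client_labels, 5)  # 5 distinct labels, every subset equally likely
print(sorted(srs))
```

Because random.sample draws without replacement, there is no need to skip repeated labels the way we do when reading a table.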
Sampling variability is the natural tendency of randomly drawn samples to differ, one from another. Sampling variability is not an error, just the natural result of random sampling.
Samples drawn at random generally differ from one another
Each draw of random numbers selects different people for our sample
These differences lead to different values for the variables we measure
We call these sample-to-sample differences sampling variability
Variability is OK; Bias is BAD!!!
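Sampling variability is easy to see in a quick simulation. The population below is simulated (10,000 made-up heights centered near 64.5 inches), so its mean is knowable only because we generated it ourselves. With real populations the parameter stays unknown.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population of 10,000 heights in inches (simulated, not real data).
population = [rng.gauss(64.5, 2.5) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter, known here only because we simulated

# Five separate simple random samples of 100, each with its own statistic.
sample_means = [statistics.mean(rng.sample(population, 100)) for _ in range(5)]

print(f"population mean: {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {x_bar:.2f}")
```

Each sample gives a slightly different mean, but all of them sit close to the population mean. That is variability without bias.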
A stratified random sample is a sampling method in which the population is first broken up into homogeneous groups (mutually exclusive sets) called strata. These groups are made up of individuals similar in some way that may affect the response variable. SRS is then used within each stratum before the results are combined.
With this procedure we can acquire information about:
- the whole population
- the relationships among the strata
EXAMPLE: Suppose a television station is interested in obtaining information from its viewers regarding the events they are most likely to watch during their coverage of the Olympics. Since men and women may differ significantly in their choice of events, a sample that stratifies by gender can help reduce variation in the results.
There are several ways to build the stratified sample. For example, keep the proportion of each stratum in the population:
A sample of size 1,000 is to be drawn.
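Proportional allocation for a stratified sample of 1,000 can be sketched as follows. The strata sizes (6,000 men and 4,000 women) and the viewer IDs are invented for the illustration. The slide's own breakdown is not available here.

```python
import random

rng = random.Random(3)

# Hypothetical sampling frame: viewer IDs grouped by stratum (made-up sizes).
strata = {
    "men": [f"M{i}" for i in range(6_000)],
    "women": [f"W{i}" for i in range(4_000)],
}
total = sum(len(members) for members in strata.values())
sample_size = 1000

stratified_sample = {}
for name, members in strata.items():
    # Proportional allocation: the stratum's share of the sample
    # matches its share of the population.
    quota = round(sample_size * len(members) / total)
    stratified_sample[name] = rng.sample(members, quota)  # SRS within the stratum

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```

With these made-up sizes the quotas come out to 600 and 400. When strata sizes do not divide evenly, rounded quotas may not add up to exactly 1,000 and would need a small adjustment.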
Cluster sampling divides the population into heterogeneous groups called clusters and then takes an SRS of some of the clusters. This method is usually used to reduce the cost of obtaining a sample.
Sometimes stratifying isn’t practical and simple random sampling is difficult.
Splitting the population into similar parts, or clusters, can therefore make sampling more practical.
We could then select one or a few clusters at random and perform a census within each selected cluster.
This sampling design is called cluster sampling
If each cluster fairly represents the full population, cluster sampling will give us an unbiased sample.
When is this a useful way of sampling?
When it is difficult and costly to develop a complete list of the population members (making it difficult to draw an SRS)
e.g., all items sold in a grocery store
When the population members are widely dispersed geographically
e.g., all Toyota dealerships in California
NOTE: Cluster sampling is not the same as stratified sampling!
We stratify to ensure that our sample represents different groups in the population, and sample randomly within each stratum.
That is, strata are internally homogeneous (e.g., male, female) but differ from one another.
Clusters are more or less alike, each heterogeneous and resembling the overall population.
We select clusters to make sampling more practical and affordable.
We conduct a census of, or select an SRS from, each selected cluster.
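For contrast with stratified sampling, here is a minimal cluster sampling sketch. The frame is invented (40 dealerships with 25 customers each), and it takes the census-within-cluster route. Selecting an SRS inside each chosen cluster would be the other option mentioned above.

```python
import random

rng = random.Random(5)

# Hypothetical frame: 40 dealerships (clusters), each with 25 customers.
clusters = {
    f"dealership_{d}": [f"d{d}_customer_{c}" for c in range(25)]
    for d in range(40)
}

# Stage 1: SRS of clusters. Stage 2: census within each selected cluster.
selected = rng.sample(sorted(clusters), 4)
cluster_sample = [person for name in selected for person in clusters[name]]

print(selected)
print(len(cluster_sample))
```

Only the list of clusters has to exist up front. Nobody needs a complete list of every customer, which is where the cost savings come from.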
Multistage sampling is a sampling scheme that combines several methods.
For Example: Most surveys conducted by professional polling organizations and government agencies use some combination of stratified and cluster sampling as well as SRS.
Systematic sampling is a method of sampling in which the sample is selected in some predetermined way. For example, we may obtain a list of our population of interest and from that list choose every tenth individual to be part of the sample, starting from a randomly chosen position among the first ten. Although each individual then has an equal chance of being chosen, this method is not an SRS, because not every possible sample is equally likely.
EXAMPLE: If we are choosing a sample of 30 students from the 300 students in the senior class by selecting every 10th student from the alphabetical directory, the first 30 students on the list will never all be chosen as the sample group.
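The senior class example can be sketched in a few lines. The student names are placeholders for the alphabetical directory, and the random starting position among the first ten is an assumption about how the list would be entered.

```python
import random

rng = random.Random(11)

# Stand-in for the alphabetical directory of 300 seniors.
directory = [f"student_{i:03d}" for i in range(1, 301)]
step = len(directory) // 30  # every 10th student

start = rng.randrange(step)  # random starting position among the first 10
systematic_sample = directory[start::step]

print(len(systematic_sample))
print(systematic_sample[:3])
```

Only 10 different samples are possible here, one for each starting position, which is why a systematic sample is not an SRS even though every student has a 1 in 10 chance of being chosen.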