STAT 111 Introductory Statistics

STAT 111 Introductory Statistics Lecture 4: Collecting Data May 24, 2004

Today’s Topics • Relationships between categorical variables • Collecting Data • Designing experiments • Choosing a sample • Sampling distributions

Categorical Variables • Recall that categorical variables separate individuals into groups. • We’ve seen that to see relationships between quantitative variables, we use scatterplots. • Similarly, to see relationships between categorical explanatory variables and quantitative responses, side-by-side boxplots are quite useful. • What do we use to see the relationship between two categorical variables, though?

Contingency Table • The contingency table is a two-way table with one variable as the row variable and the other as the column variable. • The row totals and column totals in a two-way table give the marginal distributions of the two variables separately. • The conditional distribution of the response variable for each category of the explanatory variable can be used to describe the association between the two variables.

Contingency Table Example 1 • Titanic data – 2201 passengers, only the counts

Row variable: SEX; column variable: SURVIVED

Count      no     yes    Total
female    126     344      470
male     1364     367     1731
Total    1490     711     2201

Joint and Marginal Distributions (figure) • Joint distribution • Marginal distribution of SURVIVED • Marginal distribution of SEX

Conditional Distributions (figure) • Conditional distribution of survival given gender • Conditional distribution of gender given survival

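Example from Contingency Table 1 • Joint distribution: • P( Male and surviving ) = 367/2201 = 16.67% • P( Female and surviving ) = 344/2201 = 15.63% • Marginal distribution: • P( Surviving ) = 711/2201 = 32.30% • P( Male ) = 1731/2201 = 78.65% • Conditional distribution: • Given a female: P( Surviving ) = 344/470 = 73.19% • Given a male: P( Surviving ) = 367/1731 = 21.20%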
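The percentages on this slide can be reproduced directly from the cell counts. The sketch below is one way to do it, assuming the counts are read so that they agree with the quoted percentages (344 female and 367 male survivors); the variable names are illustrative only.

```python
# Cell counts from the Titanic table: counts[sex][survived]
counts = {
    "female": {"yes": 344, "no": 126},
    "male": {"yes": 367, "no": 1364},
}

total = sum(sum(row.values()) for row in counts.values())  # grand total

# Joint distribution: each cell divided by the grand total
joint = {
    (sex, s): n / total
    for sex, row in counts.items()
    for s, n in row.items()
}

# Marginal distributions: row totals and column totals over the grand total
marginal_sex = {sex: sum(row.values()) / total for sex, row in counts.items()}
marginal_survived = {
    s: sum(row[s] for row in counts.values()) / total for s in ("yes", "no")
}

# Conditional distribution of survival given sex: each row over its row total
survival_given_sex = {
    sex: {s: n / sum(row.values()) for s, n in row.items()}
    for sex, row in counts.items()
}

print(f"P(male and surviving)   = {joint[('male', 'yes')]:.2%}")
print(f"P(female and surviving) = {joint[('female', 'yes')]:.2%}")
print(f"P(surviving)            = {marginal_survived['yes']:.2%}")
print(f"P(male)                 = {marginal_sex['male']:.2%}")
print(f"P(surviving | female)   = {survival_given_sex['female']['yes']:.2%}")
print(f"P(surviving | male)     = {survival_given_sex['male']['yes']:.2%}")
```

The joint percentages for female and male survivors are close, but the conditional percentages differ sharply, which is what suggests an association between survival and gender.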

Example from Contingency Table 1 • We see that of the people on board the ship, female survivors and male survivors made up roughly the same percentage. • But the number of females on board was substantially smaller than the number of males. • Looking at each category, we see that the percentage of females that survived is higher than the percentage of males that survived. • Survival and gender seem to be associated.

Lurking Variables • We know that lurking variables can produce nonsensical relationships between two quantitative variables. • Does the same hold true for relationships between categorical variables? • Example – We have the number of delayed and on-time flights for two airlines, Alaska Airlines (AA) and America West (AW). Which one has more flights that leave on-time?

Lurking Variables (cont.) • Looking at the contingency table below, it looks like America West has a larger percentage of on-time flights. But…

Row variable: Airline; column variable: Status (count, with row % in parentheses)

Airline   delay             on-time            Total
AA         501 (13.27%)     3274 (86.73%)       3775
AW         787 (10.89%)     6438 (89.11%)       7225
Total     1288              9712               11000

Lurking Variables (cont.) • Let’s look at the data for the individual cities: Los Angeles, Phoenix, San Diego, Seattle, and San Francisco (one table per city).

Lurking Variables (cont.) • For each individual city, the percentage of flights that are on-time is higher for Alaska Airlines than it is for America West. • On the other hand, the percentage of flights that are on-time is higher for America West than for Alaska Airlines when we look at the aggregate. • What’s going on here?

Lurking Variables (cont.) • An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is Simpson’s paradox. • Simpson’s paradox is an extreme form of the fact that observed associations can be misleading in the presence of lurking variables. • Our case is an example of Simpson’s paradox, so what is the lurking variable here?

Lurking Variables (cont.) • The lurking variable here is the city, and in particular, the weather of that city. • Of the five cities listed, Seattle has the worst weather, so flights are delayed more often at this airport. Phoenix, on the other hand, is not plagued with bad weather, so flights there are on time more often. • Most of Alaska Airlines’ flights involve Seattle, whereas America West’s flights mostly involve Phoenix!
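The reversal can be checked numerically. The per-city counts are not in the slide text; the figures below are the ones commonly published with this airline data set and are included here as an assumption. They do add up to the airline totals in the aggregate table (501 and 3274 for Alaska Airlines, 787 and 6438 for America West).

```python
# (delayed, on_time) counts per airline and city.
# ASSUMPTION: per-city counts are not given in the slides; these are the
# commonly published values, and they sum to the aggregate table's totals.
flights = {
    "Alaska Airlines": {
        "Los Angeles": (62, 497),
        "Phoenix": (12, 221),
        "San Diego": (20, 212),
        "San Francisco": (102, 503),
        "Seattle": (305, 1841),
    },
    "America West": {
        "Los Angeles": (117, 694),
        "Phoenix": (415, 4840),
        "San Diego": (65, 383),
        "San Francisco": (129, 320),
        "Seattle": (61, 201),
    },
}


def on_time_rate(delayed, on_time):
    """Share of flights that left on time."""
    return on_time / (delayed + on_time)


def overall_rate(airline):
    """On-time rate after pooling all cities for one airline."""
    delayed = sum(d for d, _ in flights[airline].values())
    on_time = sum(o for _, o in flights[airline].values())
    return on_time_rate(delayed, on_time)


# City by city, Alaska Airlines has the higher on-time rate
per_city = {}
for city in flights["Alaska Airlines"]:
    aa = on_time_rate(*flights["Alaska Airlines"][city])
    aw = on_time_rate(*flights["America West"][city])
    per_city[city] = (aa, aw)
    print(f"{city:14s} AA {aa:.1%}  AW {aw:.1%}")

# Pooled over cities, the comparison reverses
print(f"{'All cities':14s} AA {overall_rate('Alaska Airlines'):.1%}  "
      f"AW {overall_rate('America West'):.1%}")
```

America West's pooled rate is dominated by its many Phoenix flights, while Alaska Airlines' pooled rate is dominated by Seattle, so the pooled comparison mostly compares the two cities rather than the two airlines.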

Contingency Tables – Wrap-up • Most often, the contingency tables you’ll see will be of categorical variables with two levels each. • Naturally, we can extend this to categorical variables with more than two levels. • Also, we can consider a contingency table involving three variables; what we do in this case is create a series of contingency tables involving only the first two variables, one table for each of the levels of the third variable.

Collecting Data • We’ve discussed previously the idea of exploratory data analysis. • “What do we see in our data?” • Formal statistical inference is another type of data analysis. • Here, we are more interested in answering specific questions with a known degree of confidence. • Either way, successful statistical analysis requires our data to be both reliable and accurate.

Collecting Data (cont.) • The reliability and accuracy of our data depend on the method we use to collect our data. This method is known as a design. • Some popular sources of data are • Available data from libraries and the internet (Available data are data that were produced in the past for some other purpose but that may help answer a present question.) • Observational studies • Experimental studies

Observational vs Experimental Studies • In an observational study, we observe individuals and measure variables of interest, but we do not attempt to influence the responses. • In an experiment, we deliberately impose some treatment on individuals in order to observe their responses. • An observational study is generally poor at gauging the effect of an intervention, but in many situations, we have to use an observational study.

Sample Surveys • The sample survey is one specific type of observational study. • Why is it preferred to a census? • Financial constraints • Time • A sample survey can be conducted using • Personal interviews • Telephone interviews • Self-administered questionnaires

Experiments • Experimental units: individuals on which our experiment is conducted • Subjects: human experimental units • Treatment: specific experimental condition applied to our units • In principle, experiments can give good evidence of causation.

Principles in Designing Experiments • Control the effects of lurking variables on the response; easiest way to do this is by comparing two or more treatments. This can help reduce the bias in a study. • Randomize – use chance to assign experimental units to treatments. • Replicate each treatment on many units to reduce chance variation in the results.

More on Experiments • In an experiment, we hope to see a difference in the responses so large that it is unlikely to happen because of chance variation alone. • In other words, we are looking for a statistically significant effect. • This term frequently appears in reports of studies and tells you that the investigators found good evidence for the effect they were seeking. • The most serious weakness of experiments, though, is their lack of realism.

Types of Experimental Designs • Completely randomized design: experimental units are allocated at random among treatments. Simplest design for experiments. • Block design: blocks of experimental units are formed; random assignment of units to treatments is carried out separately within each block. • Matched pairs design: special type of block design that compares only two treatments by choosing blocks of two units that are as closely matched as possible.
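A minimal sketch of how the random assignment in these designs might be carried out. The function names, the subject labels, and the blocking variable are all hypothetical; the point is only that a block design repeats the completely randomized assignment inside each block.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


def randomized_block(blocks, treatments, seed=None):
    """Randomize separately within each block of similar units."""
    rng = random.Random(seed)
    return {
        name: completely_randomized(units, treatments, seed=rng.randrange(2**32))
        for name, units in blocks.items()
    }


treatments = ["treatment", "control"]

# Completely randomized design: 20 hypothetical subjects, 10 per group
groups = completely_randomized(range(1, 21), treatments, seed=4)
print(groups)

# Block design: hypothetical blocks, randomized one block at a time
blocks = {
    "block A": ["A1", "A2", "A3", "A4"],
    "block B": ["B1", "B2", "B3", "B4", "B5", "B6"],
}
block_groups = randomized_block(blocks, treatments, seed=4)
print(block_groups)
```

A matched pairs design is the same idea with blocks of exactly two units, so each block contributes one unit to each treatment.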

Review: Population vs Sample • Population: the entire group of individuals that we want information about • Sample: the part of the population we actually examine in order to gather information • Parameter: a value that describes the population. It is fixed, but generally unknown. • Statistic: a value that describes the sample. It is observed once a sample is obtained and can be used to estimate an unknown parameter. • We generally require that the sample be a good representative of the population.

Sampling Designs • Voluntary response sample • Biased sampling scheme • Simple random sample • Stratified random sample • Cluster sample (one-stage and two-stage)

Sampling Designs • A voluntary response sample consists of people who choose themselves by responding to a general appeal. • This type of sample is invariably biased (contains a systematic error) and is not usually representative of the general population. Why? • The people who are willing to respond are the only ones included in this sample, and usually those are the ones with very strong opinions. • So what we get are the extreme cases.

Sampling Designs (cont.) • Better sampling designs choose individuals by random chance so that the bias is eliminated. • A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected. • How do we select an SRS? • Assign a number to each individual in the population. • Randomly select sample numbers by using a random numbers table or software package.
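The two selection steps above can be sketched with Python's standard `random` module. The population of 500 numbered individuals is hypothetical; `random.sample` draws without replacement, so every set of n individuals is equally likely to be chosen.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Step 1: assign a number to each individual (hypothetical population of 500)
population = range(1, 501)

# Step 2: let software pick the sample numbers
srs = simple_random_sample(population, n=10, seed=111)
print(sorted(srs))
```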

Sampling Designs (cont.) • A probability sample is a sample chosen by chance and is the general framework for designs that use chance to choose a sample. Possible samples and the probability of each possible sample occurring must be known. • The SRS is the simplest type of probability sample; it gives each member of the population an equal chance of selection. • More complex designs are better for sampling from large populations.

Sampling Designs (cont.) • To select a stratified random sample, divide the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample. • Examples of strata: • Age (under 20, 20-30, 31-40, 41-50) • Sex (male, female) • Marital status (married, single)

Sampling Designs (cont.) • We typically choose the strata based on facts we know prior to taking the sample. • Strata for sampling are similar to blocks in experiments. • Overall, using a stratified random sample, we can acquire information about • The whole population • Each stratum • The relationships among the strata
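As a rough illustration of stratified sampling, the sketch below groups a hypothetical population by sex and then draws a separate SRS inside each stratum. The population, the `stratum_of` function, and the per-stratum sample sizes are assumptions made up for the example.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Draw a separate SRS in each stratum and combine the results."""
    rng = random.Random(seed)

    # Divide the population into strata
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)

    # One SRS per stratum, then pool them into the full sample
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum[name]))
    return sample


# Hypothetical population: 100 females and 200 males
population = [
    {"id": i, "sex": "female" if i % 3 == 0 else "male"} for i in range(300)
]

sample = stratified_sample(
    population,
    stratum_of=lambda person: person["sex"],
    n_per_stratum={"female": 10, "male": 20},
    seed=111,
)
n_female = sum(1 for p in sample if p["sex"] == "female")
n_male = sum(1 for p in sample if p["sex"] == "male")
print(n_female, n_male)
```

Because the sample size is fixed inside each stratum, every stratum is guaranteed to be represented, which is what makes per-stratum comparisons possible.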

Sampling Designs (cont.) • The SRS and stratified random sample both select individuals from the population. • On the other hand, the cluster sample selects groups or clusters of individuals from the population. A cluster is also referred to as a primary sampling unit (PSU). • In a one-stage cluster sample, all individuals within the selected clusters are selected. • In a two-stage cluster sample, an SRS of the individuals within each selected cluster is drawn.
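The difference between one-stage and two-stage cluster sampling can be sketched as follows. The clusters (ten hypothetical schools of 30 students each) and the function name are invented for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, n_within=None, seed=None):
    """Select whole clusters first; optionally subsample within each one.

    n_within=None  -> one-stage: keep every individual in the chosen clusters
    n_within=k     -> two-stage: draw an SRS of k individuals per chosen cluster
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # SRS of cluster names
    sample = []
    for name in chosen:
        members = clusters[name]
        if n_within is None:
            sample.extend(members)
        else:
            sample.extend(rng.sample(members, n_within))
    return chosen, sample


# Hypothetical clusters: 10 schools with 30 students each
clusters = {
    f"school_{k}": [f"school_{k}/student_{i}" for i in range(30)]
    for k in range(10)
}

chosen_one, one_stage = cluster_sample(clusters, n_clusters=3, seed=111)
chosen_two, two_stage = cluster_sample(clusters, n_clusters=3, n_within=5, seed=111)
print(len(one_stage), len(two_stage))
```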

Sampling Designs (cont.) • A two-stage cluster sample is an example of a multistage sampling design. • This is a more complex design in which, as the name suggests, a sample is obtained by sampling in multiple stages. • Basically, any sort of combination of an SRS, stratified random sample, and cluster sample can create a multistage sample.

Errors – Non-sampling vs Sampling • Non-sampling errors occur due to mistakes made during the process of data acquisition. • Increasing sample size will not reduce this type of error. • There are three types of non-sampling errors: • Errors in data acquisition, e.g., response bias • Nonresponse errors • Selection bias, such as undercoverage
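To see why a larger sample does not help with this kind of error, here is a small simulation of undercoverage. Everything in it is hypothetical: a population in which the part that cannot be reached has systematically lower values than the part that can.

```python
import random
import statistics

rng = random.Random(69)  # fixed seed so the sketch is repeatable

# Hypothetical population: the sampling frame covers only the first group
covered = [rng.gauss(60, 10) for _ in range(8000)]
not_covered = [rng.gauss(40, 10) for _ in range(2000)]
population_mean = statistics.mean(covered + not_covered)

# Draw SRSs of increasing size, but only from the covered part
errors = {}
for n in (50, 500, 4000):
    sample = rng.sample(covered, n)
    errors[n] = statistics.mean(sample) - population_mean
    print(f"n = {n:4d}: sample mean - population mean = {errors[n]:+.2f}")
```

The chance variation shrinks as n grows, but the estimate settles near the mean of the covered group rather than the mean of the whole population, so the error from undercoverage remains.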

Error in Data Acquisition (figure) • If an observation in the population is wrongly recorded in the sample, the sample carries data acquisition error on top of sampling error.

Nonresponse Error (figure) • No response from part of the population may lead to biased results in the sample.

Selection Bias (figure) • When parts of the population cannot be selected, the sample cannot represent the whole population.

Sampling Error • Sampling error refers to differences between the sample and the population, because of the specific observations that happen to be selected. • Sampling error is expected to occur when making a statement about the population based on the sample taken.

Sampling Error (figure) • The sampling error is the difference between the population mean and the sample mean.

Sampling Distributions • The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population. • The bias of a statistic is the difference between the mean of its sampling distribution and the population parameter; no bias = unbiased. • The variability is described by the spread of its sampling distribution; determined by the design and size of the sample.
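Listing all possible samples is rarely feasible, so a sampling distribution is often approximated by drawing many samples. The sketch below does this for the sample mean; the population, the sample size of 25, and the 2000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(111)  # fixed seed so the sketch is repeatable

# Hypothetical population; its mean is the parameter we want to estimate
population = [rng.gauss(50, 10) for _ in range(5000)]
parameter = statistics.mean(population)


def sample_means(population, n, reps, rng):
    """Approximate the sampling distribution of the mean with repeated SRSs."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(reps)]


means = sample_means(population, n=25, reps=2000, rng=rng)

# Bias: mean of the sampling distribution minus the parameter
bias = statistics.mean(means) - parameter

# Variability: spread (standard deviation) of the sampling distribution
spread = statistics.stdev(means)

print(f"parameter = {parameter:.2f}")
print(f"bias      = {bias:+.3f}")
print(f"spread    = {spread:.3f}")
```

The estimated bias should come out close to zero, while the spread is clearly positive: individual samples miss the parameter, but they do not miss it systematically in one direction.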

Bias and Variability (figure, four panels) • High bias, low variability • Low bias, high variability • High bias, high variability • Low bias, low variability

More on Sampling Errors • We are often concerned with how to manage the bias and variability of a statistic. • To reduce the bias, we use random sampling. • Generally speaking, estimates drawn from an SRS are unbiased (which is why the SRS is so attractive). • To reduce the variability of a statistic from an SRS, increase the sample size. • There is a trade-off between bias and variability, however (i.e., we cannot make both very small).
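The effect of sample size on variability can be checked with a short simulation. The population and the sample sizes are hypothetical; each sample size is four times the previous one, so the spread of the sample mean is expected to roughly halve at each step.

```python
import random
import statistics

rng = random.Random(2004)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values
population = [rng.uniform(0, 100) for _ in range(10000)]


def spread_of_sample_mean(n, reps=1000):
    """Standard deviation of the sample mean over repeated SRSs of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)


spreads = {n: spread_of_sample_mean(n) for n in (10, 40, 160)}
for n, s in spreads.items():
    print(f"n = {n:3d}: spread of the sample mean = {s:.2f}")
```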
