Statistics for the Physical Sciences STAT 229

Chapter 1

Statistics: The Art and Science of Learning from Data

Homework 1

- Problems 1.1 to 1.36 (even numbered)
- Complete the survey on pages 22-23

Fall 2008 STAT 229

1.1 Overview

- Statistics is the art and science of learning from data. It is a collection of methods for
  - Planning experiments (Design)
  - Obtaining data (data are collected observations, such as measurements and survey responses)
  - Organizing data
  - Summarizing data (Description)
  - Analyzing data
  - Interpreting results, and
  - Making decisions and predictions (Inference)

- Statistics is a branch of mathematics.

- Statistics was invented to study randomness: a lack of order, purpose, cause, or predictability (per Wikipedia), without which the world would be of no interest.
- Examples of random phenomena:
- Phelps won 8 gold medals
- A six-sided die is rolled and lands on 4
- It’s going to rain tomorrow

- Randomness, Fuzziness and Uncertainty
- Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used? Why use it?

1.2 We Learn about Population Using Samples

- In the previous example, all US adults form a population while the 1000 surveyed adults form a sample.
- In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores.
- A sample is a sub-collection of items selected from a population.

More about Samples

- A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.
- How large should a sample be?
- What are those appropriate ways to generate a sample?
- Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.

Parameter and Statistic

- A parameter is a numeric summary of the population
- A statistic is a numeric summary of a sample taken from the population

- Problem: Number of Good Friends
One year the General Social Survey asked, “About how many good friends do you have?” Of the 819 people who responded, 6% reported having only one good friend. Identify

(a) the sample

(b) the population, and

(c) the parameter or statistic

- Try Problem 1.3 on page 8 of the textbook.
Go to the General Social Survey website

http://sda.berkeley.edu/GSS

By entering HEAVEN as the “row variable” name, find the percentages of people who said “yes, definitely,” “yes, probably,” “no, probably not,” and “no, definitely not” when asked whether they believed in heaven.

1.3 What Role Do Computers Play in Statistics?

- Save (large) data files
- Create databases
- Do analysis with software: SAS, Minitab, SPSS, R, S-PLUS, C, MATLAB, Excel, ...
- Simulation – use of computers to mimic reality.

Simulation of Coin Tossing in Microsoft Excel

NOTES:

1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers.

2. Excel has an Analysis ToolPak by which one can do statistical analysis, including simulation.

When a balanced coin is tossed 20 times, we have a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability, or sample proportion, of tossing Heads (1) is computed as the number of 1s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5.

1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process using 1000 pseudo-random numbers.

2. Compute the sample proportion for each of the 10 samples.

Follow this:

Tools > Data Analysis > Random Number Generation > Bernoulli

More questions:

- Where does randomness play a role?
- Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000?
- Comment on the effect of sample size.
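
For readers without Excel, the same experiment can be sketched in R. This is only an illustration of the idea, not the course's prescribed procedure: the function name `simulate_proportions` and the seed are arbitrary choices made here, and `rbinom` with `size = 1` is used as the Bernoulli(0.5) generator.

```r
set.seed(229)  # arbitrary seed so the run is reproducible

# Draw `n_samples` samples of `n_tosses` Bernoulli(p) values each and
# return the sample proportion of Heads (1s) for every sample.
simulate_proportions <- function(n_samples, n_tosses, p = 0.5) {
  tosses <- matrix(rbinom(n_samples * n_tosses, size = 1, prob = p),
                   nrow = n_samples)
  rowMeans(tosses)
}

props_10   <- simulate_proportions(5, 10)    # 5 samples of size 10
props_1000 <- simulate_proportions(5, 1000)  # 5 samples of size 1000

print(props_10)
print(props_1000)
```

Comparing the two printed vectors gives a feel for the sample-size questions above: the proportions from samples of size 1000 typically sit much closer to 0.5 than those from samples of size 10.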

If You Are Using Excel 2007…

- Excel 2007 no longer has a Tools menu.
- To use the Analysis ToolPak, go to the Office button at the upper left corner, click Excel Options, then click Add-ins and highlight Analysis ToolPak. Click the Go button to open the Add-ins window. Check the Analysis ToolPak box and click OK.
- Now go to the Data menu, click Data Analysis, and choose Random Number Generation.

Chapter 2

Exploring Data with Graphs and Numerical Summaries

Homework #2

- 2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8
- 2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22
- 2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44
- 2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60
- 2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82
- 2-6 (p80): Problems 2.84

2.1 What Are the Types of Data?

- A characteristic observed for the subjects in a study is called a variable.
- Examples of variables: major, GPA, religious affiliation, smoking status,...
- Variables can be quantitative (numerical) or qualitative (categorical).
- A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.

- Quantitative variables can be discrete or continuous.
- A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance.
- The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.

Key Features of a Variable

- A quantitative variable usually takes different values in a study. Studying the spread (variability) of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values.
- For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.

Frequency Tables

- For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table.
- The following table shows the frequency of shark attacks in various regions for 1990-2006.

Frequency of shark attacks in various regions for 1990-2006

Questions: What is the variable? Is it categorical?

The mode of categorical data is the category with the highest frequency. Find the mode of the data.

Frequency Tables (cont’d)

- In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table.
- A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
- For a quantitative variable, a frequency table is constructed by first categorizing the data into a set of adjacent intervals, then finding the frequency for each interval.

- Example
Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

| Score | Frequency | Proportion | Percentage |
|---|---|---|---|
| [0,2] | 1 | 0.05 | 5 |
| (2,4] | 1 | 0.05 | 5 |
| (4,6] | 7 | 0.35 | 35 |
| (6,8] | 8 | 0.40 | 40 |
| (8,10] | 3 | 0.15 | 15 |
| Total | 20 | 1.00 | 100 |
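
As a rough sketch, the quiz-score frequency table can be reproduced in R with `cut()` and `table()`. The variable names are illustrative; the intervals are taken to be closed on the right (with 0 included in the first one), matching the interval labels of the example.

```r
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Categorize the scores into [0,2], (2,4], (4,6], (6,8], (8,10]
bins <- cut(scores, breaks = seq(0, 10, by = 2), include.lowest = TRUE)
freq <- table(bins)

freq_table <- data.frame(
  Score      = names(freq),
  Frequency  = as.vector(freq),
  Proportion = as.vector(freq) / length(scores),
  Percentage = 100 * as.vector(freq) / length(scores)
)
print(freq_table)
```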

2.2 How Can We Describe Data Using Graphical Summaries?

Preliminary results of the election for the European Parliament in 2004

Pie Charts and Bar Graphs for Categorical Variables

- Pie chart: A circle having a “slice of a pie” for each category. The size of slice corresponds to the percentage of observations in the category.
- Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

Bar Graph for European Parliament in 2004

Pareto Chart: Bar Graph with categories Ordered by Their Frequency from the Tallest Bar to Shortest

Graphs for Quantitative Variables

- Dot plot: shows a dot for each observation, placed just above the value on the number line for that observation.
- Stem-and-leaf plot: similar to a dot plot. Each observation is represented by a stem and a leaf.
- Histogram: a graph that uses bars to portray the frequencies or relative frequencies.

Graphs for Quantitative Variables

Example Dot plot

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

[Figure: dot plot of the quiz scores on a number line from 1 to 10]

Graphs for Quantitative Variables

Example Stem-and-Leaf Plot

Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75,74, 87, 76

Step 1: Sorted test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100

Step 2: Place the scores in the corresponding stems and leaves. (usually the last digit will be the leaf)

| Stem | Leaves |
|---|---|
| 4 | 5 |
| 5 | |
| 6 | 2 |
| 7 | 4 5 6 6 |
| 8 | 0 4 7 7 |
| 9 | 6 |
| 10 | 0 |
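
The two steps above can be sketched in R as follows. This builds the plot by hand, using the tens part of each score as the stem and the last digit as the leaf; R also has a built-in `stem()` function, but it may choose a different scale, so it is not used here.

```r
scores <- c(80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76)

sorted <- sort(scores)   # Step 1: sort the scores
stems  <- sorted %/% 10  # tens part (4, 6, 7, ..., 10)
leaves <- sorted %% 10   # last digit

# Step 2: group the leaves by stem, keeping empty stems such as 5
by_stem <- split(leaves, factor(stems, levels = 4:10))

for (s in names(by_stem)) {
  cat(sprintf("%2s | %s\n", s, paste(by_stem[[s]], collapse = " ")))
}
```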

Graphs for Quantitative Variables

Histogram

Step 1: Divide the range of data into intervals of equal width.

Step 2: Count the frequency and construct a frequency table (or relative frequency table).

Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), values of which are marked on the y-axis.

Graphs for Quantitative Variables

Example Histogram

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

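| Score | Freq |
|---|---|
| [0,2] | 1 |
| (2,4] | 1 |
| (4,6] | 7 |
| (6,8] | 8 |
| (8,10] | 3 |
| (10,12] | 0 |

The intervals are closed on the right, as in the frequency table of Section 2.1; the listed frequencies correspond to this convention.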
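
A minimal R sketch of the three histogram steps, assuming intervals of width 2 that are closed on the right (so a score of 6 is counted in the interval ending at 6), which is the convention that reproduces the frequencies 1, 1, 7, 8, 3, 0. The plot title and axis labels are illustrative.

```r
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Steps 1 and 2: equal-width intervals and their frequencies
h <- hist(scores, breaks = seq(0, 12, by = 2),
          right = TRUE, include.lowest = TRUE, plot = FALSE)
print(h$counts)

# Step 3: draw a bar over each interval with height equal to its frequency
plot(h, main = "Quiz scores", xlab = "Score", ylab = "Frequency")
```

Setting `right = FALSE` instead would close the intervals on the left and give different frequencies for the same data, which is why the endpoint convention should always be stated.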

The Shape of a Distribution

- When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, and histogram), look for
- the overall pattern: Do the data cluster together?
- the outliers
- modes: unimodal, bimodal,…
- skew: skewed to the left or right
- the underlying smooth curve

[Figure: examples of unimodal, bimodal, and multimodal distributions]

Outliers

Time Plots

- Time series: a data set collected over time.
- Time plot: a graph displaying time-series data.
- Look for patterns over time.

Time Plots: Example

Gasoline price

2.3 How can we describe the center of quantitative data?

- Measures of center: mean and median
- Mean: the sum of the observations divided by the number of observations.
- Median: The midpoint of the observations.

Mean Formula

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

- Example: Travel times to work
- How long does it take to get from home to work?
- Here are the travel times in minutes of 15 workers in North Carolina, chosen at random by the Census Bureau:
- 30 20 10 40 25 20 10 60 15 40 5 30 12 10 10
- Find the mean travel time.

How to Determine the Median

Step 1: Sort your data from the smallest to the largest.

Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.

Example: Find the median for the travel times

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

Since n = 15 is odd, Median = 20, the middle value.

Example Find the median for the scores

60 80 87 73 95 92

Arrange the data in order: 60 73 80 87 92 95

Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the

average of the two middle values.
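
As a quick check of the two median examples, and of the mean travel time, here is a short R sketch. It uses the 15 travel times and the 6 scores exactly as listed in the median examples; the variable names are illustrative.

```r
travel <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
scores <- c(60, 80, 87, 73, 95, 92)

mean_travel   <- mean(travel)    # sum of the observations divided by n
median_travel <- median(travel)  # n = 15 is odd: the middle value
median_scores <- median(scores)  # n = 6 is even: average of the two middle values

print(c(mean_travel, median_travel, median_scores))
```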

Properties of the Mean and the Median

- The mean is the balance point of the data.
- In a symmetric distribution, the mean and median are the same.
- In a skewed distribution, the mean is usually farther out in the long tail than the median.
- Skewed to the right, mean > median
- Skewed to the left, mean < median

- The mean is less resistant to outliers than the median.

Mean, Median, and Mode

- The mean is the balance point.
- The median is the midpoint.
- The mode is the value that occurs most frequently.

Mean and Median: Applications

- City data
- St Cloud, MN
- New Orleans, LA

2.4 How can we describe the spread of quantitative data?

- Measures of spread:
- The Range
- The Standard Deviation
- The Interquartile Range (Sec. 2.5)

Measuring spread: The Range

- Range = largest value - smallest value
- Example: Find the range of the quiz scores : 2, 5, 0, 7, 9, 1, 7, 6, 10, 9, 3, 9, 9, 7, 0, 6, 9, 10, 8,1, 4, 6, 8, 9, 4, 2, 9, 0, 5, 7
Range = largest value - smallest value

= 10 - 0

= 10

The Range

- Simple to compute
- Easy to understand
But

- Uses only extreme values
- Affected severely by outliers

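Measuring Spread: Variance and Standard Deviation

- The standard deviation and variance measure spread by looking at how far the observations are from their mean.
- The variance of a set of observations is an average of the squared deviations from the mean:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

- The standard deviation is the square root of the variance:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The standard deviation: Example

- Example (Calculating the standard deviation s)
Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.

1792 1666 1362 1614 1460 1867 1439

Find the mean first:

$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600$$

Cont’d

| Observations | Deviations | Squared deviations |
|---|---|---|
| 1792 | 192 | 36864 |
| 1666 | 66 | 4356 |
| 1362 | -238 | 56644 |
| 1614 | 14 | 196 |
| 1460 | -140 | 19600 |
| 1867 | 267 | 71289 |
| 1439 | -161 | 25921 |
| | sum = 0 | sum = 214870 |

The variance: $s^2 = 214870 / 6 \approx 35811.67$

The standard deviation: $s = \sqrt{35811.67} \approx 189.24$ calories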
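
The hand calculation for the metabolic rates can be checked with a few lines of R. This sketch computes the deviations explicitly and then compares the result with R's built-in `var()` and `sd()`, both of which divide by n - 1; the variable names are illustrative.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

x_bar      <- mean(rates)     # mean of the observations
deviations <- rates - x_bar   # deviations from the mean (they sum to 0)
sum_sq     <- sum(deviations^2)

variance <- sum_sq / (length(rates) - 1)  # average squared deviation, dividing by n - 1
s        <- sqrt(variance)                # standard deviation

print(c(mean = x_bar, sum_sq = sum_sq, variance = variance, s = s))
print(c(var(rates), sd(rates)))           # built-in equivalents
```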

Properties of the Standard Deviation

- The greater the spread, the larger the s.
- s ≥ 0.
- s = 0 when all the observations take the same value.
- s can be influenced by outliers.

Interpreting the Magnitude of s: The Empirical Rule

If a distribution of data is bell shaped, then approximately:

68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$.

95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$.

99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{x} - 3s$ and $\bar{x} + 3s$.

Sample Statistics and Population Parameters

- Population: The collection of all individuals or items under consideration.
- Sample: That part of the population from which we actually collect information.
- We use a sample to draw conclusions about the entire population.
- Parameter: Numerical summary of the population.
- Statistic: Numerical summary of a sample.
- Notations:
Population Mean $\mu$

Population Standard Deviation $\sigma$

Sample Mean $\bar{x}$

Sample Standard Deviation $s$

2.5 How Can Measures of Position Describe Spread?

- Measures of position:
- Quartiles
- Percentiles

- Percentiles:
- pth percentile: a value such that p percent of the observations fall at or below that value.

- Quartiles
- First quartile Q1, the same as the 25th percentile (p=25)
- Second quartile Q2, the same as the 50th percentile (p=50)
- Third quartile Q3, the same as the 75th percentile (p=75)

Calculating Quartiles

- To calculate the quartiles:
1. Arrange the observations in increasing order.

2. The second quartile Q2 is the median M (= 50th percentile).

3. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median (= 25th percentile).

4. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median (= 75th percentile).

Quartiles: Example

- Example 2.17 Travel times to work. Find Q1 and Q3.
Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

The observations to the left of the location of the overall median 20 are:

5 10 10 10 10 12 15

Q1 = 10

The observations to the right of the location of the overall median 20 are:

20 25 30 30 40 40 60

Q3 = 30

Quartiles: Example

- Example 2.5 Travel times to work. Find Q1 and Q3.
Travel times in minutes of 20 randomly chosen New York workers: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

Arrange the data in order:

5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

The overall median = 22.5 minutes

The observations to the left of the location of the overall median are: 5 10 10 15 15 15 15 20 20 20

Q1 = 15 minutes

The observations to the right of the location of the overall median are: 25 30 30 40 40 45 60 60 65 85

Q3 = 42.5 minutes
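
The quartile procedure of this section can be sketched as a small R function. The function name `quartiles` is a hypothetical helper written for this illustration. Note that R's built-in `quantile()` uses a different interpolation rule by default and can give slightly different values, so it is deliberately not used here.

```r
# Quartiles by the "median of each half" method: when n is odd, the
# overall median itself is left out of both halves.
quartiles <- function(x) {
  x <- sort(x)
  half <- length(x) %/% 2
  c(Q1 = median(head(x, half)),
    Q2 = median(x),
    Q3 = median(tail(x, half)))
}

nc <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
ny <- c(10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
        15, 20, 85, 15, 65, 15, 60, 60, 40, 45)

print(quartiles(nc))  # North Carolina travel times
print(quartiles(ny))  # New York travel times
```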

Another Measure of Spread: The Interquartile Range

- The Interquartile Range (IQR)
The Interquartile Range = Q3 - Q1

- Example (Travel times to work) Find IQR.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

With Q1 = 10 and Q3 = 30 from Example 2.17, IQR = 30 - 10 = 20 minutes.

Detecting Potential Outliers: The 1.5*IQR Criterion

- The 1.5*IQR Criterion for Identifying Potential Outliers.
An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.

Detecting Potential Outliers: Example

- Example 2.18 Travel times to work (in minutes). Detecting potential outliers.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
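
A short R sketch of the 1.5*IQR criterion applied to the 15 travel times of this example. The quartiles are computed with the "median of each half" method of Section 2.5; the variable names are illustrative.

```r
travel <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

x    <- sort(travel)
half <- length(x) %/% 2
q1   <- median(head(x, half))  # median of the lower half
q3   <- median(tail(x, half))  # median of the upper half
iqr  <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations beyond either fence are potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

Here the fences are -20 and 60, so the 80-minute travel time is flagged as a potential outlier.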

The Five-Number Summary and the Box Plot

- The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.
Minimum Q1 Median Q3 Maximum

- Example 2.19 The five-number summary of travel times to work.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

The Box Plot

Constructing a box plot

- A box goes from Q1 to Q3.
- A line is drawn inside the box at the median.
- A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called whiskers.
- The potential outliers are shown separately.

[Figure: box plot with a potential outlier marked separately]

Example

(Constructing a boxplot)

Travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

Steps:

- Find Q1, Q2, and Q3:
- Find IQR:
- Determine two fences:
lower fence = Q1 – 1.5*IQR

upper fence = Q3 + 1.5*IQR

- Identify potential outliers
- Determine whiskers: one from Q1 to the smallest observation within fences, and the other from Q3 to the largest within fences.
- Draw the boxplot.

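[Figure: box plot annotated, from top to bottom, with the largest observation within the fences, Q3 = 30, Q2 = 20, Q1 = 10, and the smallest observation within the fences]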
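
The quantities needed for this box plot can also be obtained from R's `boxplot.stats()`, sketched below. One caveat: R computes the box ends as Tukey hinges rather than with the quartile method of Section 2.5. For these 15 travel times the two methods agree, but that is not guaranteed for other data sets.

```r
travel <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

bs <- boxplot.stats(travel, coef = 1.5)

# stats: lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bs$stats)
# out: observations beyond the fences (potential outliers)
print(bs$out)

boxplot(travel, horizontal = TRUE)  # draw the box plot itself
```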

(Text, page 67)

Sodium values for 20 breakfast cereals:

0 70 125 125

140 150 170 170

180 200 200 210

210 220 220 230

250 260 290 290

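R code:

# Sodium values for the 20 breakfast cereals listed above
x <- c(0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290)

# Horizontal box plot; col = 3 picks the third color of the current palette
boxplot(x, col = 3, horizontal = TRUE)

Example (Boxplot)

Interpretation of Boxplots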

- IQR measures the sample variability (or spread).
- A box plot indicates skew. The side with the larger part of the box and the longer whisker usually has skew in that direction.

Interpretation of Box Plots

In terms of symmetry, median, spread, …

Side-by-Side Box Plots

- Help to compare groups (in terms of symmetry, median, spread,…).
- Example: (College student heights) See the “Heights” data on the text CD.

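R code (copy and paste into R):

# Read the comma-separated heights data (heights.csv must be in the working directory)
heights <- read.table("heights.csv", sep = ",", header = TRUE)

# Side-by-side box plots of HEIGHT, one box per GENDER group
boxplot(HEIGHT ~ GENDER, data = heights, col = 3:4)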

The z-Score

- The z-score for an observation is the number of standard deviations that it falls from the mean, and in which direction:

$$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$$

- An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean;
that is, z > 3 or z < -3. (Recall the empirical rule: 99.7% of values are within 3 standard deviations of the mean.)

The z-Score: Example

2.6 How Can Graphical Summaries Be Misused?

- Self reading

Chapter 3

Association: Contingency, Correlation, and Regression

Homework #3

- 3-1: Problems 3.2, 3.4, 3.6, 3.8, 3.10
- 3-2: Problems 3.12, 3.14, 3.16, 3.18, 3.22
- 3-3: Problems 3.26, 3.30, 3.36, 3.38, 3.40
- 3-4: Problems 3.48, 3.50, 3.52, 3.54, 3.56, 3.58, 3.60

Response Variables and Explanatory Variables

- In this chapter, we discuss statistical methods for data on two variables.
- Sometimes, one of the two variables may be termed the response variable and the other the explanatory variable.
- The response variable is the outcome variable on which comparisons are made.
- The explanatory variable defines the groups to be compared with respect to values on the response variable.

Is Smoking Actually Beneficial to Your Health?

- This is Example 1 on page 93 of text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

It’s natural to treat the variable “Survival Status” as a response variable and “Smoker” as an explanatory variable.

Associations

- The main purpose of a data analysis with two variables is to investigate whether there is an association and to describe the nature of that association.
- An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
- Is the variable “Survival Status” associated with the variable “Smoker”? Does smoking lead to cancer?

Other Examples of Association

- Smoking and BMI
- Smoking and lung cancer
- Irrigation and plant growth
- Traffic and air pollution
- Gender and height

3.1 Explore the Association between Two Categorical Variables

- A contingency table is used to explore the association between two categorical variables:
- Rows list the categories of one variable.
- Columns list the categories of the other variable.
- Each cell in the table holds the number of observations (frequency) in the sample with certain outcomes on the two variables.

- Cross-tabulation: The process of finding the frequencies for the cells of a contingency table.
- The previous table is an example of a contingency table.

Construct Contingency Tables From Raw Data

- Excel Data: Two Variables
- Cancer Treatment: treatments given to the cancer patients (Surgery and Radiation therapy).
- Cancer Controlled: whether cancer has been controlled (Yes and No).

Contingency table (Example)

Questions: (1) What proportion of the patients who had surgery had their cancer controlled?

(2) What proportion of all cancer patients had their cancer controlled?

Answer

(1) 21 / 23 = 91% of the patients who had surgery had their cancer controlled.

(2) 36 / 41 = 88% of all cancer patients had cancer controlled.
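
The two answers can be reproduced in R with `prop.table()`. The cell counts below do not appear explicitly in this transcript; they are inferred from the stated answers (21 of 23 surgery patients and 36 of 41 patients overall had their cancer controlled), so treat them as an assumption.

```r
# Counts inferred from the answers above (assumption, see lead-in)
counts <- matrix(c(21, 2,
                   15, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment  = c("Surgery", "Radiation therapy"),
                                 Controlled = c("Yes", "No")))

# Proportions within each treatment (each row sums to 1)
row_props <- prop.table(counts, margin = 1)

# Proportions of all patients in each "Controlled" category
col_props <- colSums(counts) / sum(counts)

print(round(row_props, 2))
print(round(col_props, 2))
```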

Conditional Proportions

A conditional proportion is the proportion of one variable at a given level of the other variable.

Marginal Proportion

A marginal proportion is the proportion of a row or column variable.

Side-by-side bars

- Display conditional proportions.
- Useful for making comparisons.

Side-by-side bars: Example

- The proportion of patients who had their cancer controlled is slightly higher for the patients who had surgery than for those who had radiation therapy.

Is There an Association?

Examples

- Ex 3.8 page 101
- Ex 3.3 page 100

3.2 How Can We Explore the Association between Two Quantitative Variables

- An association can be studied between
- two categorical variables
- two quantitative variables
- a categorical variable and a quantitative variable.

- In this section, we explore the association between two quantitative variables.
- That is, we will study how a response variable tends to change as the value of an explanatory variable changes.

Scatterplots

- A scatterplot is a graphical display of the relationship between two quantitative variables. It portrays the two variables simultaneously:
- horizontal axis: the explanatory variable
- vertical axis: the response variable
- point in the display: the observation corresponding to a subject

Example: Worldwide Use of Internet

- See the data (text, page 103).
- Data dictionary -
GDP: Gross domestic product, per capita, in thousands of US dollars

CO2: Carbon dioxide emissions, per capita, in tons

Cellular: Percentage of adults who are cellular-phone subscribers

Fertility: Mean number of children per adult woman

- Questions to explore
(1) Describe the center and spread of the data distribution.

(2) Portray the relationship with a scatterplot for Internet use and GDP.

(3) What do you learn about the association by inspecting the scatterplot?

Interpreting Scatterplots

- You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables
- Trend: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the trend

- Also look for outliers from the overall trend

Positive Association

- Two quantitative variables x and y are
- Positively associated when
- High values of x tend to occur with high values of y
- Low values of x tend to occur with low values of y

- Negatively associated when high values of one variable tend to pair with low values of the other variable

Would you expect a positive association, a negative association, or no association between the age of the car and the mileage on the odometer?

- Positive association
- Negative association
- No association

Moving Graphics

http://www.gapminder.org

Linear Correlation, r

- Measures the strength and direction of the linear association between x and y
- A positive r value indicates a positive association
- A negative r value indicates a negative association
- An r value close to +1 or -1 indicates a strong linear association
- An r value close to 0 indicates a weak association

Properties of Correlation

- Always falls between -1 and +1
- Sign of correlation denotes direction
- (-) indicates negative linear association
- (+) indicates positive linear association

- Correlation is a unitless measure: it does not depend on the variables’ units
- Two variables have the same correlation no matter which is treated as the response variable
- Correlation is sensitive to outliers
- Correlation only measures strength of linear relationship

Calculating the Correlation Coefficient

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe

Calculating the Correlation Coefficient

$$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right) = \frac{1}{n-1} \sum z_x z_y$$

The standardized values $z_x = (x - \bar{x})/s_x$ and $z_y = (y - \bar{y})/s_y$ are called z-scores.

Divide a Scatterplot into Quadrants

In quadrant I, both z-scores are positive;

In quadrant II, z-scores of Internet are positive, while z-scores of GDP are negative;

In quadrant III, both z-scores are negative;

In quadrant IV, z-scores of GDP are positive, while z-scores of Internet are negative.
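
To make the z-score view of correlation concrete, here is a small R sketch. The six (x, y) pairs are hypothetical numbers invented for this illustration; they are not the GDP or Internet-use data from the text. The sketch standardizes both variables, averages the products of the z-scores with divisor n - 1, and compares the result with R's built-in `cor()`.

```r
# Hypothetical data, for illustration only
x <- c(2, 4, 5, 7, 9, 12)
y <- c(10, 15, 14, 22, 25, 33)
n <- length(x)

z_x <- (x - mean(x)) / sd(x)  # z-scores of x
z_y <- (y - mean(y)) / sd(y)  # z-scores of y

# Points in quadrants I and III contribute positive products,
# points in quadrants II and IV contribute negative products.
r <- sum(z_x * z_y) / (n - 1)

print(r)
print(cor(x, y))  # built-in Pearson correlation, should agree with r
```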

3.3 How Can We Predict the Outcome of a Variable?

- When a scatterplot indicates a relationship between two variables, we can start fitting a curve to the data.
- The procedure of fitting a curve to the data, along with inferences about parameters of interest and prediction of the response value, is called regression analysis.

Regression Analysis

- The first step of a regression analysis is to identify the response and explanatory variables
- We use y to denote the response variable
- We use x to denote the explanatory variable

Regression Line

- A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes
- A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x)
- The y-intercept of the regression line is denoted by a
- The slope of the regression line is denoted by b

Example: How Can Anthropologists Predict Height Using Human Remains?

- Regression Equation:
- $\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters

- Use the regression equation to predict the height of a person whose femur length was 50 centimeters

Interpreting the y-Intercept

- y-Intercept:
- The predicted value for y when x = 0
- Helps in plotting the line
- May not have any interpretative value if no observations had x values near 0

Interpreting the Slope

- Slope: measures the change in the predicted variable (y) for a 1-unit increase in the explanatory variable (x)
- Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height

Slope Values: Positive, Negative, Equal to 0

Regression Line

- At a given value of x, the equation $\hat{y} = a + bx$:
- Predicts a single value of the response variable
- But… we should not expect all subjects at that value of x to have the same value of y
- Variability occurs in the y values!

Residuals

- A residual measures the size of a prediction error: the vertical distance between the point and the regression line
- Each observation has a residual
- Calculation for each residual: $y - \hat{y}$
- A large residual indicates an unusual observation

“Least Squares Method” Yields the Regression Line

- Residual sum of squares: $\sum (y - \hat{y})^2$
- The least squares regression line is the line that minimizes the sum of the squared vertical distances between the points and their predictions, i.e., it minimizes the residual sum of squares
- Note: the sum of the residuals about the regression line will always be zero

Regression Formulas for y-Intercept and Slope

- Slope: $b = r \left( \dfrac{s_y}{s_x} \right)$
- Y-Intercept: $a = \bar{y} - b\bar{x}$

Regression line always passes through $(\bar{x}, \bar{y})$
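
The formulas on this slide can be checked numerically. The following is a minimal Python sketch, assuming the standard least squares forms (slope b = r·s_y/s_x, intercept a = ȳ − b·x̄); the data values and the function name `least_squares` are hypothetical and chosen only for illustration. It also checks two facts stated on the slides: the residuals sum to zero and the line passes through (x̄, ȳ).

```python
import math

def least_squares(x, y):
    """Return (a, b) for y_hat = a + b*x, using b = r*(s_y/s_x) and a = y_bar - b*x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Sample correlation
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, not taken from the slides
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

print("intercept a:", a)
print("slope b:", b)
print("sum of residuals:", sum(residuals))
print("prediction at x_bar:", a + b * x_bar, "y_bar:", y_bar)
```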

Calculating the slope and y intercept for the regression line

Slope = 26.4

Find a and b.

y-intercept = -2.28

Internet Usage and Gross Domestic Product (GDP)

Using TI-83

- Enter x data into L1
- Enter y data into L2
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number = x variable
- 2nd number = y variable
- Enter

The Slope and the Correlation

- Correlation:
- Describes the strength of the linear association between 2 variables
- Does not change when the units of measurement change
- Does not depend upon which variable is the response and which is the explanatory

- Slope:
- Numerical value depends on the units used to measure the variables
- Does not tell us whether the association is strong or weak
- The two variables must be identified as response and explanatory variables
- The regression equation can be used to predict values of the response variable for given values of the explanatory variable

The Squared Correlation

- When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$
- We measure the proportional reduction in error and call it $r^2$; it is the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
- A correlation of 0.9 means that 81% of the variation in the y-values can be explained by the explanatory variable, x
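
To make the "proportional reduction in error" reading concrete, here is a small Python sketch. It assumes the usual definition, (total sum of squares − residual sum of squares) / total sum of squares, and compares it with the square of the correlation. The data and function names are hypothetical.

```python
import math

def correlation(x, y):
    """Sample correlation r between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def proportional_reduction_in_error(x, y):
    """(error using y_bar alone - error using the regression line) / error using y_bar alone."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)                      # predicting with y_bar only
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # predicting with the line
    return (ss_total - ss_resid) / ss_total

# Hypothetical data, not taken from the slides
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 3.7, 4.1, 6.6, 6.9, 9.2]

r = correlation(x, y)
print("r squared:", r ** 2)
print("proportional reduction in error:", proportional_reduction_in_error(x, y))
print("a correlation of 0.9 gives r squared =", round(0.9 ** 2, 2))
```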

3.4 What Are Some Cautions in Analyzing Association?

- Be cautious of
- Extrapolation
- Influential outliers
- Interpretation of correlation or association
- Lurking variables
- Confounding

Extrapolation

- Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data
- It’s riskier as we move farther from the range of the given x-values
- There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values

Outliers and Influential Points

- A regression outlier is an observation/point that lies far away from the trend that the rest of the data follows
- An observation is influential if
- Its x value is relatively low or high compared to the remainder of the data, and
- The observation is a regression outlier.

- An influential observation tends to pull the regression line toward that data point and away from the rest of the data.

Interpretation of Correlation and Association

- Correlation does not imply causation.
- In general, it’s also true that association does not imply causation. This warning holds whether we analyze associations between qualitative variables or between quantitative variables.
- Create a scatterplot for “Crime rate” against “Education” in the “FL crime” data on the text CD.

Lurking Variables

- A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.

- Example: A reporter studied house fires and established a high positive correlation between the damages (in dollars) and the number of firefighters at the scene. Which of the following could be a lurking variable that is responsible for the association?
- (a) Firefighter
- (b) Weather
- (c) Size of the house
- (d) Size of the blaze

STAT 319 Biometrics Fall 2008

- Example: An economist noticed that nations with more TV sets have higher life expectancies. He established a high positive correlation between length of life and number of TV sets. Find the lurking variable, if there is one.
- (a) TV sets brands
- (b) Popcorn
- (c) Wealth of the nation
- (d) Sofa
- (e) No confounding variable


Simpson’s Paradox

- Simpson’s Paradox refers to the phenomenon that the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. (Book)
- Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined. (Wiki)
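
A small Python sketch can show how the direction of an association reverses when a third variable is ignored. The counts below are hypothetical, made up for illustration, and are not the data from the smoking study in the text; the names `groups`, `within`, and `combined` are likewise assumptions of this sketch.

```python
# Hypothetical (deaths, total) counts by age group and smoking status.
# These are NOT the counts from the study discussed in the text.
groups = {
    "younger": {"smoker": (5, 100), "nonsmoker": (2, 50)},
    "older": {"smoker": (12, 20), "nonsmoker": (55, 100)},
}

# Conditional proportion of deaths within each age group
within = {
    age: {status: deaths / total for status, (deaths, total) in table.items()}
    for age, table in groups.items()
}

# Proportion of deaths after combining the age groups
combined = {}
for status in ("smoker", "nonsmoker"):
    deaths = sum(groups[age][status][0] for age in groups)
    total = sum(groups[age][status][1] for age in groups)
    combined[status] = deaths / total

for age, rates in within.items():
    print(age, rates)
print("combined", combined)
```

In this made-up table, smokers have the higher death proportion within each age group, yet the lower death proportion once the groups are combined, because most smokers fall in the younger group.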

Is Smoking Actually Beneficial to Your Health?

The data indicate that smoking could apparently be beneficial to your health. Could a lurking variable be responsible for the association?

This is Example 1 on page 93 of text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

- There was also age information about the 1314 women involved in the study. These women can be stratified into 4 different age groups, creating 4 contingency tables.

Question: For each age group, find conditional proportions of deaths for smokers and nonsmokers.

More Simpson Paradoxes

- http://en.wikipedia.org/wiki/Simpson's_paradox

Simpson's paradox for continuous data: a positive trend appears for two separate groups (blue and red), a negative trend (black, dashed) appears when the data are combined.

Confounding

- When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.
- Age is a confounding variable in the study of the association between smoking and survival status.

Difference between a Confounding Variable and a Lurking Variable

- A confounding variable is already included in the study. It is associated both with the response variable and the explanatory variable.
- A lurking variable is not measured in the study. It has the potential for confounding.
- The effect of an explanatory variable can be analyzed by adjusting for confounding variables.
- Ignoring lurking variables results in misleading conclusions. (age in smoking-survival association).

Section 4.1

Should We Experiment or Should We Merely Observe?

Statistics for the Physical Sciences (STAT 229-02)

Homework #4

- 4-1: Problems 4.2, 4.4, 4.6, 4.8, 4.10
- 4-2: Problems 4.14, 4.18, 4.20, 4.22, 4.28, 4.30
- 4-3: Problems 4.34, 4.36, 4.38, 4.40, 4.42
- 4-4: Problems 4.44, 4.46, 4.48, 4.50, 4.52, 4.54

Learning Objectives:

- Population versus Sample
- Types of Studies: Experimental and Observational
- Comparing Experimental and Observational Studies

Learning Objective 1: Population and Sample

- Population: all the subjects of interest
- We use statistics to learn about the population, the entire group of interest

- Sample: subset of the population
- Data is collected for the sample because we cannot typically measure all subjects in the population

Population

Sample

Learning Objective 2: Type of Study: Observational Study

- In an observational study, the researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects (such as imposing a treatment)

Learning Objective 2: Observational Study – Sample Survey

- A sample survey selects a sample of people from a population and interviews them to collect data.
- A sample survey is a type of observational study.
- A census is a survey that attempts to count the number of people in the population and to measure certain characteristics about them

Learning Objective 2: Type of Study: Experiment

- A researcher conducts an experiment by assigning subjects to certain experimental conditions and then observing outcomes on the response variable
- The experimental conditions, which correspond to assigned values of the explanatory variable, are called treatments

Learning Objective 2: Example

- Headline: “Student Drug Testing Not Effective in Reducing Drug Use”
- Facts about the study:
- 76,000 students nationwide
- Schools selected for the study included schools that tested for drugs and schools that did not test for drugs
- Each student filled out a questionnaire asking about his/her drug use

Learning Objective 2: Example

- Conclusion: Drug use was similar in schools that tested for drugs and schools that did not test for drugs

Learning Objective 2: Example

This study was an observational study.

For it to be an experiment, the researcher would have had to assign each school to use or not use drug testing rather than leaving this decision to the school.

Learning Objective 3: Comparing Experiments and Observational Studies

- An experiment reduces the potential for lurking variables to affect the result. Thus, an experiment gives the researcher more control over outside influences.
- Only an experiment can establish cause and effect. Observational studies cannot.
- Experiments are not always possible due to ethical reasons, time considerations and other factors.

Learning Objectives:

- Sampling Frame & Sampling Design
- Simple Random Sample (SRS)
- Random number table
- Margin of Error
- Convenience Samples
- Types of Bias in Sample Surveys
- Key Parts of a Sample Survey

Learning Objective 1: Sampling Frame & Sampling Design

- The sampling frame is the list of subjects in the population from which the sample is taken; ideally, it lists the entire population of interest
- The sampling design determines how the sample is selected. Ideally, it should give each subject an equal chance of being selected to be in the sample

Learning Objective 2: Simple Random Sampling, SRS

- Random Sampling is the best way of obtaining a sample that is representative of the population
- A simple random sample of ‘n’ subjects from a population is one in which each possible sample of that size has the same chance of being selected

Learning Objective 2: SRS Example

- Two club officers are to be chosen for a New Orleans trip
- There are 5 officers: President, Vice-President, Secretary, Treasurer and Activity Coordinator
- The 10 possible samples are:
(P,V) (P,S) (P,T) (P,A) (V,S)

(V,T) (V,A) (S,T) (S,A) (T,A)

- For a SRS, each of the ten possible samples has an equal chance of being selected. Thus, each sample has a 1 in 10 chance of being selected and each officer has a 4 in 10 chance of being selected.
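
The count of possible samples can be verified by enumeration. This is a short Python sketch using the single-letter officer labels from the slide; the variable names are illustrative.

```python
from itertools import combinations

# P = President, V = Vice-President, S = Secretary, T = Treasurer, A = Activity Coordinator
officers = ["P", "V", "S", "T", "A"]

# Every possible sample of 2 officers (order does not matter)
samples = list(combinations(officers, 2))

# Number of samples in which each officer appears
appearances = {o: sum(o in s for s in samples) for o in officers}

print("number of possible samples:", len(samples))
print("samples:", samples)
print("appearances per officer:", appearances)
```

Each officer appears in 4 of the 10 equally likely samples, which is where the 4 in 10 chance comes from.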

Learning Objective 3: SRS: Table of Random Numbers

Table of Random Numbers

- Table E on pg. A6 of text

Learning Objective 3: Using Random Numbers to Select a SRS

- To select a simple random sample
- Number the subjects in the sampling frame using numbers of the same length (number of digits)
- Select numbers of that length from a table of random numbers or using a random number generator
- Include in the sample those subjects having numbers equal to the random numbers selected

Learning Objective 3: Choosing a simple random sample

We need to select a random sample of 5 from a class of 20 students.

- List and number all members of the population, which is the class of 20.
- The number 20 is two-digits long.
- Parse the list of random digits into numbers that are two digits long. Here we choose to start with line 2, for no particular reason.

22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30

1 Alison

2 Amy

3 Brigitte

4 Darwin

5 Emily

6 Fernando

7 George

8 Harry

9 Henry

10 John

11 Kate

12 Max

13 Moe

14 Nancy

15 Ned

16 Paul

17 Ramon

18 Rupert

19 Tom

20 Victoria

- Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting with line 2 and on.
- The first five random numbers matching numbers assigned to people make the SRS.

The first individual selected is Amy, number 02. That’s it from line 2. Move to line 3.

Then Moe (13), Darwin (04), Henry (09), and Ned (15).

- Remember that 1 is 01, 2 is 02, etc.
- If you were to hit 09 again before getting five people, don’t sample Henry twice; just keep going.
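
The selection procedure above can be sketched in Python. The digit strings are the two lines of random numbers shown on the slide, read as two-digit numbers (the split of "1304" and "0915" into pairs is an assumption of this sketch), and the function name `select_srs` is hypothetical.

```python
names = [
    "Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando", "George",
    "Harry", "Henry", "John", "Kate", "Max", "Moe", "Nancy", "Ned",
    "Paul", "Ramon", "Rupert", "Tom", "Victoria",
]

# Two-digit numbers from lines 2 and 3 of the random number table, as shown on the slide
line2 = "22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02"
line3 = "24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30"

def select_srs(digit_lines, population_size, sample_size):
    """Read two-digit numbers in order; keep those that label a subject, skipping repeats."""
    chosen = []
    for line in digit_lines:
        for token in line.split():
            number = int(token)
            if 1 <= number <= population_size and number not in chosen:
                chosen.append(number)
                if len(chosen) == sample_size:
                    return chosen
    return chosen

selected = select_srs([line2, line3], population_size=20, sample_size=5)
selected_names = [names[i - 1] for i in selected]
print(selected)
print(selected_names)
```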

Learning Objective 4: Margin of Error

- Sample surveys are commonly used to estimate population percentages
- These estimates include a margin of error which tells us how well the sample estimate predicts the population percentage
- When a SRS of n subjects is used, the margin of error is approximately $\dfrac{1}{\sqrt{n}} \times 100\%$

Learning Objective 4: Example: Margin of Error

- A survey result states: “The margin of error is plus or minus 3 percentage points”
- This means: “It is very likely that the reported sample percentage is no more than 3% lower or 3% higher than the population percentage”
- Click here to see a Gallup example. Read the “Survey Methods” part and justify the margin of error in the survey.
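
As a rough check on a reported margin of error, the following Python sketch assumes the common approximation for a simple random sample of n subjects, 1/√n expressed as a percentage. The sample sizes and the function name are hypothetical.

```python
import math

def approx_margin_of_error(n):
    """Approximate margin of error, in percentage points, for a SRS of n subjects (assumes 1/sqrt(n))."""
    return 100 / math.sqrt(n)

# Hypothetical sample sizes
for n in (100, 400, 1000, 2500):
    print(n, round(approx_margin_of_error(n), 1))
```

Under this approximation, a survey of about 1000 people has a margin of error of roughly 3 percentage points.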

Learning Objective 5: Convenience Samples: Poor Ways to Sample

- Convenience Sample: a type of survey sample that is easy to obtain
- Unlikely to be representative of the population
- Often severe biases result from such a sample
- Results apply ONLY to the observed subjects; that is, they are descriptive.

Learning Objective 5: Convenience Samples: Poor Ways to Sample

- Volunteer Sample: most common form of convenience sample
- Subjects volunteer for the sample
- Volunteers do not tend to be representative of the entire population

Learning Objective 6: Types of Bias in Sample Surveys

Bias: Tendency to systematically favor certain parts of the population over others

- Sampling Bias: Occurs when using biased samples, which are based on sampling methods such as using nonrandom samples or having undercoverage
- Nonresponse bias: occurs when some sampled subjects cannot be reached or refuse to participate or fail to answer some questions
- Response bias: occurs when the subject gives an incorrect response or the question is misleading
A Large Sample Does Not Guarantee An Unbiased Sample!

Learning Objective 7: Key Parts of a Sample Survey

- Identify the population of all subjects of interest
- Construct a sampling frame which attempts to list all subjects in the population
- Use a random sampling design to select n subjects from the sampling frame
- Be cautious of sampling bias due to nonrandom samples
We can make inferences about the population of interest when sample surveys that use random sampling are employed.

Learning Objectives:

- Identify the elements of an experiment
- Experiments
- 3 Components of a good experiment
- Blinding the Study
- Define Statistical Significance
- Generalizing Results of the Study

Learning Objective 1: Elements of an Experiment

- Experimental units: the subjects of an experiment; the entities that we measure in an experiment
- Treatment: A specific experimental condition imposed on the subjects of the study; the treatments correspond to assigned values of the explanatory variable
- Explanatory variable: Defines the groups to be compared with respect to values on the response variable
- Response variable: The outcome measured on the subjects to reveal the effect of the treatment(s).

Learning Objective 2: Experiments

- An experiment deliberately imposes treatments on the experimental units in order to observe their responses.
- The goal of an experiment is to compare the effect of the treatment on the response.
- Experiments that are randomized occur when the subjects are randomly assigned to the treatments; randomization helps to eliminate the effects of lurking variables

Learning Objective 3: 3 Components of a Good Experiment

- Control/Comparison group: allows the researcher to analyze the effectiveness of the primary treatment
- Randomization: eliminates possible researcher bias, balances the comparison groups on known as well as on lurking variables (so that the observed difference among subjects is attributed to treatments)
- Replication: allows us to attribute observed effects to the treatments rather than ordinary variability

Learning Objective 3: Principle 1: Control or Comparison Group

- A placebo is a dummy treatment, e.g., a sugar pill. Many subjects respond favorably to any treatment, even a placebo.
- A control group typically receives a placebo. A control group allows us to analyze the effectiveness of the primary treatment.
- A control group need not receive a placebo. Clinical trials often compare a new treatment for a medical condition, not with a placebo, but with a treatment that is already on the market.

Learning Objective 3: Principle 1: Control or Comparison Group

- Experiments should compare treatments rather than attempt to assess the effect of a single treatment in isolation
- Is the treatment group better, worse, or no different than the control group?

- Example: 400 volunteers are asked to quit smoking and each start taking an antidepressant. In 1 year, how many have relapsed? Without a control group (individuals who are not on the antidepressant), it is not possible to gauge the effectiveness of the antidepressant.

Learning Objective 3: Placebo effect

- Placebo effect (power of suggestion) : The “placebo effect” is an improvement in health due not to any treatment but only to the patient’s belief that he or she will improve.

Learning Objective 3: Principle 2: Randomization

- To have confidence in our results we should randomly assign subjects to the treatments. In doing so, we
- Eliminate bias that may result from the researcher assigning the subjects
- Balance the groups on variables known to affect the response
- Balance the groups on lurking variables that may be unknown to the researcher
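
Random assignment can be sketched in a few lines of Python. The subject labels, group names, and the function name `randomize` are hypothetical, and the fixed seed is there only so the sketch is reproducible.

```python
import random

def randomize(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subjects and treatment labels
subjects = [f"subject_{i}" for i in range(1, 21)]
groups = randomize(subjects, ["treatment", "control"], seed=1)

for name, members in groups.items():
    print(name, len(members), members)
```

Because chance alone decides who lands in which group, the researcher cannot steer particular subjects toward a treatment, and lurking variables tend to balance out across the groups.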

Learning Objective 3: Principle 3: Replication

- Replication is the process of assigning several experimental units to each treatment
- The difference due to ordinary variation is smaller with larger samples
- We have more confidence that the sample results reflect a true difference due to treatments when the sample size is large
- Since it is always possible that the observed effects were due to chance alone, replicating the experiment also builds confidence in our conclusions

Learning Objective 4: Blinding the Experiment

- Ideally, subjects are unaware, or blind, to the treatment they are receiving
- If an experiment is conducted in such a way that neither the subjects nor the investigators working with them know which treatment each subject is receiving, then the experiment is double-blinded
- A double-blinded experiment controls response bias from the respondent and experimenter

Learning Objective 5: Define Statistical Significance

- If an experiment (or other study) finds a difference in two (or more) groups, is this difference really important?
- If the observed difference is larger than what would be expected just by chance, then it is labeled statistically significant.
- Rather than relying solely on the label of statistical significance, also look at the actual results to determine if they are practically significant.

Learning Objective 6: Generalizing Results

- Recall that the goal of experimentation is to analyze the association between the treatment and the response for the population, not just the sample
- However, care should be taken to generalize the results of a study only to the population that is represented by the study.

Section 4.4

What are Other Ways to Conduct Experimental and Observational Studies

Learning Objectives

- Sample Surveys: Other Random Sampling Designs
- Types of Observational Studies: Prospective and Retrospective
- Multifactor Experiment
- Matched pairs design
- Randomized block design

Learning Objective 1: Sample Surveys: Random Sampling Designs

- It is not always possible to conduct an experiment, so it is necessary to have well designed, informative studies that are not experimental, e.g., sample surveys that use randomization
- Simple Random Sampling
- Cluster Sampling
- Stratified Random Sampling

Learning Objective 1: Sample Surveys: Cluster Random Sample

Steps

- Divide the population into a large number of clusters, such as city blocks
- Select a simple random sample of the clusters
- Use the subjects in those clusters as the sample

Learning Objective 1: Sample Surveys: Cluster Random Sample

- Preferable when
- A reliable sampling frame is unavailable
- The cost of selecting a SRS is excessive

- Disadvantage
- Usually need a larger sample size than with a SRS in order to achieve a particular margin of error

Learning Objective 1: Sample Surveys: Stratified Random Sample

Steps

- Divide the population into separate groups, called strata
- Select a simple random sample from each stratum
- Combine the samples from all strata to form complete sample

Learning Objective 1: Sample Surveys: Stratified Random Sample

- Advantage is that you can include in your sample enough subjects in each stratum you want to evaluate
- Disadvantage is that you must have a sampling frame and know the stratum into which each subject belongs

Learning Objective 1: Stratified Random Sample - Example

Suppose a university has the following student demographics:

- Undergraduate: 55%
- Graduate: 20%
- First Professional: 5%
- Special: 20%

To ensure proper coverage of each demographic, a stratified random sample of 100 students could be chosen as follows: select a SRS of 55 undergraduates, a SRS of 20 graduates, a SRS of 5 first professional students, and a SRS of 20 special students; combine these 100 students.
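
The allocation in this example can be sketched in Python. The population of 2000 students, the student labels, and the function name `stratified_sample` are hypothetical; only the percentages (55%, 20%, 5%, 20%) and the total sample size of 100 come from the example.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Take a SRS from each stratum, sized in proportion to the stratum's share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, k)  # SRS without replacement within the stratum
    return sample

# Hypothetical sampling frame of 2000 students matching the stated percentages
strata = {
    "undergraduate": [f"U{i}" for i in range(1100)],      # 55%
    "graduate": [f"G{i}" for i in range(400)],            # 20%
    "first professional": [f"F{i}" for i in range(100)],  # 5%
    "special": [f"S{i}" for i in range(400)],             # 20%
}

sample = stratified_sample(strata, total_n=100, seed=0)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```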

Learning Objective 1: Comparing Random Sampling Methods

Learning Objective 2: Types of Observational Studies

An observational study can yield useful information when an experiment is not practical.

- Types of observational studies:
- Sample Survey: attempts to take a cross section of a population at the current time
- Retrospective study: looks into the past
- Prospective study: follows its subjects into the future

- Causation can never be definitively established with an observational study, but well designed studies can provide supporting evidence for the researcher’s beliefs

Learning Objective 2: Retrospective Case-Control Study

- A case-control study is a retrospective observational study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable

Learning Objective 2: Example: Case-Control Study

- Response outcome of interest: Lung cancer
- The cases have lung cancer
- The controls do not have lung cancer

- The two groups were compared on the explanatory variable: smoker/nonsmoker

Learning Objective 2: Example: Prospective Study

Nurses’ Health Study:

- Began in 1976 with 121,700 female nurses aged 30 to 55; questionnaires are filled out every two years
- Purpose was to explore the relationships among diet, hormonal factors, smoking habits and exercise habits and the risk of coronary heart disease, pulmonary disease and stroke
- Nurses are followed into the future to determine whether they eventually develop an outcome such as lung cancer and whether certain explanatory variables are associated with it

Learning Objective 3: Multifactor Experiments

- A Multifactor experiment uses a single experiment to analyze the effects of two or more explanatory variables on the response
- Categorical explanatory variables in an experiment are often called factors
- We are often able to learn more from a multifactor experiment than from separate one-factor experiments since the response may vary for different factor combinations

Learning Objective 3: Example: Multifactor experiment

- Examine the effectiveness of both Zyban and nicotine patches on quitting smoking
- Two factor experiment
- 4 treatments

Learning Objective 3: Example: Multifactor experiment

- subjects: a certain number of undergraduate students
- all subjects viewed a 40-minute television program that included ads for a digital camera
- some subjects saw a 30-second commercial; others saw a 90-second version
- same commercial was shown either 1, 3, or 5 times during the program
- there were two factors: length of the commercial (2 values), and number of repetitions (3 values)

subjects assigned to Treatment 3 see a 30-second ad five times during the program

Learning Objective 3: Example: Multifactor experiment

- the 6 combinations of one value of each factor form six treatments

- after viewing, all subjects answered questions about: recall of the ad, their attitude toward the commercial, and their intention to purchase the product – these were the response variables.
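
The six treatments in the commercial example are the combinations of one value from each factor, which can be listed with a short Python sketch. The numbering of the treatments is an assumption of this sketch, chosen so that Treatment 3 is the 30-second ad shown five times, as on the slide.

```python
from itertools import product

lengths = [30, 90]       # commercial length in seconds (factor 1, 2 values)
repetitions = [1, 3, 5]  # times shown during the program (factor 2, 3 values)

# Number the factor combinations 1..6 (numbering order is an assumption)
treatments = {
    number: combo
    for number, combo in enumerate(product(lengths, repetitions), start=1)
}

for number, (length, reps) in treatments.items():
    print(f"Treatment {number}: {length}-second ad shown {reps} time(s)")
```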

Learning Objective 4: Matched Pairs Design

- In a matched pairs design, the subjects receiving the two treatments are somehow matched (same person, husband/wife, two plots in the same field, etc.)
- In a crossover design, the same individual is used for the two treatments
- Randomly:
- assign the two treatments to the two matched subjects, or
- randomize the order of applying the treatments in a crossover design

- The number of replicates equals the number of pairs
- Helps to reduce effects of lurking variables

Learning Objective 5: Randomized Block Design

- A block is a set of experimental units that are matched with respect to one or more characteristics
- A Randomized Block Design, RBD, is when the random assignment of experimental units to treatments is carried out separately within each block

Learning Objective 5: Example: Randomized Block Design

- Block = gender; 3 treatments = 3 types of therapy
- The men (as well as the women) are randomly assigned to the 3 treatments; differences can be compared with respect to gender as well as therapy type

Learning Objective 5: Randomized Block Design

- RBD eliminates variability in the response due to the blocking variable; allows for better comparisons to be made among the treatments of interest
- A matched pairs design is a special case of a RBD with two observations in each block
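
A minimal Python sketch of a randomized block design follows: the random assignment is carried out separately within each block. The subject labels, therapy names, block sizes, and function name are hypothetical.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Within each block, shuffle the units and deal them out to the treatments in turn."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # randomization happens inside the block only
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical experimental units: block = gender, 3 treatments = 3 types of therapy
blocks = {
    "men": [f"M{i}" for i in range(1, 7)],
    "women": [f"W{i}" for i in range(1, 7)],
}
treatments = ["therapy A", "therapy B", "therapy C"]

assignment = randomized_block_assignment(blocks, treatments, seed=2)

# Count how many units of each block received each treatment
counts = {}
for block_name, treatment in assignment.values():
    counts[(block_name, treatment)] = counts.get((block_name, treatment), 0) + 1
print(counts)
```

Every therapy is used equally often among the men and among the women, so comparisons among therapies are not distorted by gender.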

Section 5.1: How can Probability

Quantify Randomness?

Homework #5

- Section 5.1: 5.2, 5.4, 5.6, 5.8
- Section 5.2: all even
- Section 5.3: all even
- Section 5.4: 5.48, 5.50, 5.56, 5.58, 5.60, 5.62

Learning Objectives

- Random Phenomena
- Law of Large Numbers
- Probability
- Independent Trials
- Finding probabilities
- Types of Probabilities: Relative Frequency and Subjective

Learning Objective 1: Random Phenomena

- For random phenomena, the outcome is uncertain
- In the short-run, the proportion of times that something happens is highly random
- In the long-run, the proportion of times that something happens becomes very predictable
Probability quantifies long-run randomness

Learning Objective 2: Law of Large Numbers

- As the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number “in the long run”
- For example, as one tosses a die, in the long run 1/6 of the observations will be a 3.

Learning Objective 3: Probability

- With random phenomena, the probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations
- Example:
- When rolling a die, the outcome of “3” has probability = 1/6. In other words, the proportion of times that a 3 would occur in a long run of observations is 1/6.
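
The long-run behavior can be illustrated with a simulation. This Python sketch assumes a fair six-sided die simulated with the standard library's pseudorandom generator; the function name and the numbers of rolls are illustrative choices.

```python
import random

def proportion_of_threes(n_rolls, seed=0):
    """Simulate n_rolls of a fair die and return the proportion of rolls that show a 3."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 3)
    return hits / n_rolls

# The proportion is erratic for few rolls and settles near 1/6 for many rolls
for n in (10, 100, 1000, 100000):
    print(n, proportion_of_threes(n))
print("1/6 =", 1 / 6)
```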

Learning Objective 4: Independent Trials

- Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial.
- Example:
- If you have 20 flips of a coin in a row that are “heads”, you are not “due” a “tail” - the probability of a tail on your next flip is still 1/2. The trial of flipping a coin is independent of previous flips.

Learning Objective 5: How can we find Probabilities?

- Calculate theoretical probabilities based on assumptions about the random phenomena. For example, it is often reasonable to assume that outcomes are equally likely, such as when flipping a coin or rolling a die.
- Observe many trials of the random phenomenon and use the sample proportion of the number of times the outcome occurs as its probability. This is merely an estimate of the actual probability.

Learning Objective 6: Types of Probability: Relative Frequency vs. Subjective

- The relative frequency definition of probability is the long run proportion of times that the outcome occurs in a very large number of trials - not always helpful/possible.
- When a long run of trials is not feasible, you must rely on subjective information. In this case, the subjective definition of the probability of an outcome is your degree of belief that the outcome will occur based on the information available.
- Bayesian statistics is a branch of statistics that uses subjective probability as its foundation

Section 5.2: How Can We Find Probabilities?

Learning Objectives

- Sample Space
- Event
- Probabilities for a sample space
- Probability of an event
- Basic rules for finding probabilities about a pair of events

Learning Objectives

- Probability of the union of two events
- Probability of the intersection of two events

Learning Objective 1: Sample Space

- For a random phenomenon, the sample space is the set of all possible outcomes.

Learning Objective 2: Event

- An event is a subset of the sample space
- An event corresponds to a particular outcome or a group of possible outcomes.
- For example:
- Event A = “student answers all 3 questions correctly” = (CCC)
- Event B = “student passes (at least 2 correct)” = (CCI, CIC, ICC, CCC)

Learning Objective 3: Probabilities for a sample space

Each outcome in a sample space has a probability

- The probability of each individual outcome is between 0 and 1
- The total of all the individual probabilities equals 1.

Learning Objective 4: Probability of an Event

- The probability of an event A, denoted by P(A), is obtained by adding the probabilities of the individual outcomes in the event.
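- When all the possible outcomes are equally likely: $P(A) = \dfrac{\text{number of outcomes in event } A}{\text{number of outcomes in the sample space}}$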

Learning Objective 4: Example: What are the Chances of being Audited?

- What is the sample space for selecting a taxpayer?
{(under $25,000, Yes), (under $25,000, No),

($25,000 - $49,000, Yes) …}

Learning Objective 4: Example: What are the Chances of being Audited?

For a randomly selected taxpayer in 2002,

- What is the probability of an audit?
- 310/80200=0.004

- What is the probability of an income of $100,000 or more?
- 10700/80200=0.133

- What income level has the greatest probability of being audited?
- $100,000 or more = 80/10700= 0.007
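
The three answers above are ratios of counts, which the following Python sketch reproduces. The counts (in thousands of taxpayers) are the ones quoted on the slide; the variable names are illustrative.

```python
# Counts in thousands of taxpayers, as quoted on the slide
total_taxpayers = 80200
audited = 310
income_100k_or_more = 10700
audited_and_100k_or_more = 80

p_audit = audited / total_taxpayers
p_high_income = income_100k_or_more / total_taxpayers
# Proportion audited among taxpayers with income of $100,000 or more
p_audit_within_high_income = audited_and_100k_or_more / income_100k_or_more

print("P(audit) =", round(p_audit, 3))
print("P(income of $100,000 or more) =", round(p_high_income, 3))
print("Proportion audited among $100,000 or more =", round(p_audit_within_high_income, 3))
```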

Learning Objective 5: Basic rules for finding probabilities about a pair of events

- Some events are expressed as the outcomes that
- Are not in some other event (complement of the event)
- Are in one event and in another event (intersection of two events)
- Are in one event or in another event (union of two events)

Learning Objective 5: Complement of an event

- The complement of an event A, written A^c, consists of all outcomes in the sample space that are not in A.
- The probabilities of A and of A^c add to 1
- P(A^c) = 1 – P(A)

Learning Objective 5: Disjoint events

- Two events, A and B, are disjoint if they do not have any common outcomes

Learning Objective 5: Intersection of two events

- The intersection of A and B consists of outcomes that are in both A and B

Learning Objective 5: Union of two events

- The union of A and B consists of outcomes that are in A or B or in both A and B.

Learning Objective 6: Probability of the Union of Two Events

- Addition Rule:
- For the union of two events,
- P(A or B) = P(A) + P(B) – P(A and B)
- If the events are disjoint, P(A and B) = 0, so
- P(A or B) = P(A) + P(B)

Learning Objective 6: Example

- 80.2 million tax payers (80,200 thousand)
- Event A = being audited
- Event B = income greater than $100,000
- P(A and B) = 80/80200=.001
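
The slide stops at P(A and B). A minimal R sketch of how the addition rule could be applied here, assuming the counts quoted in the audit example (310 thousand audited, 10,700 thousand with income of $100,000 or more, 80 thousand in both groups); the variable names are illustrative only:

```r
# Counts in thousands of taxpayers, taken from the audit example
total    <- 80200   # all taxpayers
audited  <- 310     # event A: audited
high_inc <- 10700   # event B: income of $100,000 or more
both     <- 80      # A and B: audited with income of $100,000 or more

p_a       <- audited / total
p_b       <- high_inc / total
p_a_and_b <- both / total

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b <- p_a + p_b - p_a_and_b
print(round(c(P_A = p_a, P_B = p_b, P_A_and_B = p_a_and_b, P_A_or_B = p_a_or_b), 3))
```

Subtracting P(A and B) avoids counting the audited high-income taxpayers twice.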

Learning Objective 7: Probability of the Intersection of Two Events

Multiplication Rule:

For the intersection of two independent events, A and B, P(A and B) = P(A) x P(B)

Learning Objective 7: Example

- What is the probability of getting 3 questions correct by guessing?
- Probability of guessing correctly is .2, so P(CCC) = (.2)(.2)(.2) = 0.008

- What is the probability that a student answers at least 2 questions correctly?
- P(CCC) + P(CCI) + P(CIC) + P(ICC) =
- 0.008 + 3(0.032) = 0.104
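
The same numbers can be checked by listing all eight outcomes. This is a rough sketch that assumes independent guesses with success probability 0.2, as in the slide; the object names are made up for illustration:

```r
p <- 0.2  # probability of guessing one question correctly

# All 2^3 = 8 outcomes; 1 = correct (C), 0 = incorrect (I)
outcomes  <- expand.grid(q1 = 1:0, q2 = 1:0, q3 = 1:0)
n_correct <- rowSums(outcomes)

# Multiplication rule for independent guesses
prob <- p^n_correct * (1 - p)^(3 - n_correct)

p_all_three    <- sum(prob[n_correct == 3])  # P(CCC)
p_at_least_two <- sum(prob[n_correct >= 2])  # P(CCC) + P(CCI) + P(CIC) + P(ICC)
print(c(p_all_three, p_at_least_two))
```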

Learning Objective 7: Assuming independence

- Don’t assume that events are independent unless you have given this assumption careful thought and it seems plausible.

Learning Objective 7: Events Often Are Not Independent

- Example: A Pop Quiz with 2 Multiple Choice Questions
- Data giving the proportions for the actual responses of students in a class (C = correct, I = incorrect):

Outcome:      II    IC    CI    CC
Probability:  0.26  0.11  0.05  0.58

Learning Objective 7: Events Often Are Not Independent

- Define the events A and B as follows:
- A: {first question is answered correctly}
- B: {second question is answered correctly}

- P(A) = P{(CI), (CC)} = 0.05 + 0.58 = 0.63
- P(B) = P{(IC), (CC)} = 0.11 + 0.58 = 0.69
- P(A and B) = P{(CC)} = 0.58
- If A and B were independent, we would have P(A and B) = P(A) x P(B) = 0.63 x 0.69 = 0.43
- The observed value is P(A and B) = 0.58, not 0.43

Thus, in this case, A and B are not independent!
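
A small R sketch of this check, using the four proportions from the pop quiz table (the variable names are illustrative):

```r
# Proportions for the four outcomes (first answer, second answer)
prob <- c(II = 0.26, IC = 0.11, CI = 0.05, CC = 0.58)

p_a       <- prob[["CI"]] + prob[["CC"]]  # A: first question correct
p_b       <- prob[["IC"]] + prob[["CC"]]  # B: second question correct
p_a_and_b <- prob[["CC"]]                 # both correct

# Under independence P(A and B) would equal P(A) x P(B)
p_if_independent <- p_a * p_b
independent <- isTRUE(all.equal(p_a_and_b, p_if_independent))
print(c(observed = p_a_and_b, if_independent = p_if_independent))
print(independent)
```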

Section 5.3: Conditional Probability: What’s the Probability of A, Given B?

Learning Objectives

- Conditional probability
- Multiplication rule for finding P(A and B)
- Independent events defined using conditional probability

Learning Objective 1: Conditional Probability

- For events A and B, the conditional probability of event A, given that event B has occurred, is: P(A|B) = P(A and B) / P(B)
- P(A|B) is read as “the probability of event A, given event B.” The vertical slash represents the word “given”. Of the times that B occurs, P(A|B) is the proportion of times that A also occurs

Learning Objective 1: Example 1

- What was the probability of being audited, given that the income was ≥ $100,000?
- Event A: Taxpayer is audited
- Event B: Taxpayer’s income ≥ $100,000

Learning Objective 1: Example 1

- What is the probability of being audited given that the income level is < $25,000
- Let A =Event being Audited
- Let B = Income < $25,000
- P(A and B) = .0011
- P(B)=.1758
- .0011/.1758=.0063
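
Both parts of Example 1 use the same ratio. The helper below is a hypothetical sketch (not from the slides); the second call assumes the counts from the earlier audit example (80 thousand audited among 10,700 thousand taxpayers with income ≥ $100,000, out of 80,200 thousand):

```r
# P(A | B) = P(A and B) / P(B); cond_prob is an illustrative helper name
cond_prob <- function(p_a_and_b, p_b) {
  stopifnot(p_b > 0)
  p_a_and_b / p_b
}

# Audited, given income < $25,000
p_low <- cond_prob(0.0011, 0.1758)

# Audited, given income >= $100,000
p_high <- cond_prob(80 / 80200, 10700 / 80200)

print(round(c(low_income = p_low, high_income = p_high), 4))
```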

Learning Objective 1: Example 2

- A study of 5282 women aged 35 or over analyzed the Triple Blood Test to test its accuracy

Learning Objective 1: Example 2

- A positive test result states that the condition is present
- A negative test result states that the condition is not present
- False Positive: Test states the condition is present, but it is actually absent
- False Negative: Test states the condition is absent, but it is actually present

Learning Objective 1: Example 2

- Assuming the sample is representative of the population, find the estimated probability of a positive test for a randomly chosen pregnant woman 35 years or older
- P(POS) = 1355/5282 = 0.257

Learning Objective 1: Example 2

- Given that the diagnostic test result is positive, find the estimated probability that Down syndrome truly is present
- Summary: Of the women who tested positive, fewer than 4% actually had fetuses with Down syndrome

Learning Objective 2: Multiplication Rule for Finding P(A and B)

- For events A and B, the probability that A and B both occur equals:
- P(A and B) = P(A|B) x P(B)
- also, P(A and B) = P(B|A) x P(A)

Learning Objective 2: Example

- Roger Federer – 2006 men’s champion in the Wimbledon tennis tournament
- He made 56% of his first serves
- He faulted on the first serve 44% of the time
- Given that he made a fault with his first serve, he made a fault on his second serve only 2% of the time

Learning Objective 2: Example

- Assuming these are typical of his serving performance, when he serves, what is the probability that he makes a double fault?
- P(F1) = 0.44
- P(F2|F1) = 0.02
- P(F1 and F2) = P(F2|F1) x P(F1)
= 0.02 x 0.44 = 0.009

Learning Objective 3: Independent Events Defined Using Conditional Probabilities

- Two events A and B are independent if the probability that one occurs is not affected by whether or not the other event occurs
- Events A and B are independent if:
P(A|B) = P(A), or equivalently, P(B|A) = P(B)

- If events A and B are independent,
P(A and B) = P(A) x P(B)

Learning Objective 3: Checking for Independence

- To determine whether events A and B are independent:
- Is P(A|B) = P(A)?
- Is P(B|A) = P(B)?
- Is P(A and B) = P(A) x P(B)?

- If any of these is true, the others are also true and the events A and B are independent

Learning Objective 3: Example

- The diagnostic blood test for Down syndrome:
POS = positive result
NEG = negative result
D = Down syndrome
D^c = Unaffected

Learning Objective 3: Example: Checking for Independence

- Are the events POS and D independent or dependent? Is P(POS|D) = P(POS)?
- P(POS|D) = P(POS and D)/P(D) = 0.009/0.010 = 0.90

- P(POS) = 0.257
- The events POS and D are dependent

Section 5.4: Applying the Probability Rules

Learning Objectives

- Is a “Coincidence” Truly an Unusual Event?
- Probability Model
- Probabilities and Diagnostic Testing
- Simulation

Learning Objective 1: Is a “Coincidence” Truly an Unusual Event?

- The law of very large numbers states that if something has a very large number of opportunities to happen, occasionally it will happen, even if it seems highly unusual

Learning Objective 1: Example: Is a Matching Birthday Surprising?

- What is the probability that at least two students in a group of 25 students have the same birthday?

Learning Objective 1: Example: Is a Matching Birthday Surprising?

- P(at least one match) = 1 – P(no matches)

Learning Objective 1: Example: Is a Matching Birthday Surprising?

- P(no matches) = P(students 1 and 2 and 3 …and 25 have different birthdays)

Learning Objective 1: Example: Is a Matching Birthday Surprising?

- P(no matches) = (365/365) x (364/365) x (363/365) x … x (341/365)
- P(no matches) = 0.43

Learning Objective 1: Example: Is a Matching Birthday Surprising?

- P(at least one match) =
1 – P(no matches) = 1 – 0.43 = 0.57

Not so surprising when you consider that there are 300 pairs of students who can share the same birthday!
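
The product above is easy to evaluate directly. A minimal R sketch, assuming 365 equally likely birthdays and independent students as in the slides:

```r
n <- 25  # group size

# P(no matches) = (365/365) x (364/365) x ... x ((365 - n + 1)/365)
p_no_match <- prod((365 - 0:(n - 1)) / 365)
p_match    <- 1 - p_no_match

# Number of distinct pairs of students who could share a birthday
n_pairs <- choose(n, 2)

print(round(c(no_match = p_no_match, match = p_match), 2))
print(n_pairs)
```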

Learning Objective 2: Probability Model

- We’ve dealt with finding probabilities in many idealized situations
- In practice, it’s difficult to tell when outcomes are equally likely or events are independent
- In most cases, we must specify a probability model that approximates reality

Learning Objective 2: Probability Model

- A probability model specifies the possible outcomes for a sample space and provides assumptions on which the probability calculations for events composed of these outcomes are based
- Probability models merely approximate reality

Learning Objective 2: Example: Probability Model

- Out of the first 113 space shuttle missions there were two failures
- What is the probability of at least one failure in a total of 100 missions?
- P(at least 1 failure) = 1 - P(0 failures)
= 1 - P(S1 and S2 and S3 … and S100)
= 1 - P(S1) x P(S2) x … x P(S100)
= 1 - [P(S)]^100 = 1 - [0.971]^100 = 0.947
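
As a quick check of the arithmetic, here is a sketch in R that takes the per-flight success probability 0.971 from the slide as given and assumes independent flights:

```r
p_success  <- 0.971  # success probability on each flight, as used in the slide
n_missions <- 100

# P(at least 1 failure) = 1 - P(all missions succeed) under independence
p_at_least_one_failure <- 1 - p_success^n_missions
print(round(p_at_least_one_failure, 3))
```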

Learning Objective 2: Example: Probability Model

- This answer relies on the assumptions of
- Same success probability on each flight
- Independence

These assumptions are suspect since other variables (temperature at launch, crew experience, age of craft, etc.) could affect the probability

Learning Objective 3: Probabilities and Diagnostic Testing

- Sensitivity = P(POS|S)
- Specificity = P(NEG|S^c)

Learning Objective 3: Example: Probabilities and Diagnostic Testing

Random Drug Testing of Air Traffic Controllers

- Sensitivity of test = 0.96
- Specificity of test = 0.93
- Probability of drug use at a given time ≈ 0.007 (prevalence of the drug)

Learning Objective 3: Example: Probabilities and Diagnostic Testing

What is the probability of a positive test result?

P(POS) = P(S and POS) + P(S^c and POS)

- P(S and POS) = P(S)P(POS|S) = 0.007 x 0.96 = 0.0067
- P(S^c and POS) = P(S^c)P(POS|S^c) = 0.993 x 0.07 = 0.0695
- P(POS) = 0.0067 + 0.0695 = 0.0762

Even though the prevalence is < 1%, there is an almost 8% chance of the test suggesting drug use!
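
The calculation can be sketched in R as follows. The last line goes one step beyond the slide: it applies the conditional probability definition from Section 5.3 to get P(S|POS), which the slides do not report, so treat it as an illustration only:

```r
sensitivity <- 0.96   # P(POS | S)
specificity <- 0.93   # P(NEG | S^c)
prevalence  <- 0.007  # P(S)

p_s_and_pos  <- prevalence * sensitivity              # P(S and POS)
p_sc_and_pos <- (1 - prevalence) * (1 - specificity)  # P(S^c and POS)
p_pos        <- p_s_and_pos + p_sc_and_pos            # P(POS)

# Not in the slides: P(S | POS) = P(S and POS) / P(POS)
p_s_given_pos <- p_s_and_pos / p_pos

print(round(c(P_POS = p_pos, P_S_given_POS = p_s_given_pos), 4))
```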

Learning Objective 4: Simulation

Some probabilities are very difficult to find with ordinary reasoning. In such cases, we can approximate an answer by simulation.

Learning Objective 4: Simulation

Carrying out a Simulation:

- Identify the random phenomenon to be simulated
- Describe how to simulate observations
- Carry out the simulation many times (at least 1000 times)
- Summarize results and state the conclusion

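Textbook: P265 Question 5.113

R code

# Estimate P(at least one shared birthday among n people) by simulation
birthdayMatch = function(n = 25, rep = 1000){
  match = 0
  for (i in 1:rep){
    x = sample(1:365, n, replace = TRUE)  # n random birthdays (use n, not a hard-coded 25)
    if (any(duplicated(x))) match = match + 1
  }
  match/rep  # proportion of repetitions with at least one match
}

birthdayMatch(rep = 1000)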

Textbook: P265 Question 5.114

In table tennis, the first person to get at least 21 points while being ahead of the opponent by at least 2 points wins the game. In games between you and an opponent, suppose successive points are independent, and suppose the probability of your winning any given point is 0.40. Simulate the table tennis process and find your chance of winning a game.

The chance of winning a game is approximately the proportion of games you win in 1000 games.

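# Simulate sim games; p is the probability that you win any given point
tableTennis = function(p = 0.4, sim = 1000){
  win = 0
  score = matrix(0, sim, 2)
  for (i in 1:sim){
    A = B = 0 # your points are A and your opponent's points are B
    cat("The No.", i, "game is: ")
    # keep playing until someone has at least 21 points and leads by at least 2
    while (max(A, B) < 21 || abs(A - B) < 2){
      if (runif(1) < p) A = A + 1 else B = B + 1
      print(c(A, B))
    }
    score[i, ] = c(A, B)
    win = win + (A > B)
  }
  list(score, prob = win / sim)  # final scores and the proportion of games you won
}

tableTennis(p = 0.4, sim = 10)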

Section 6.1: How Can We Summarize Possible Outcomes and Their Probabilities?

Homework #6

- Page 277: 6.1, 6.4, 6.6, 6.8, 6.10, 6.12
- Page 290: 6.16, 6.20, 6.22, 6.24, 6.26, 6.27, 6.28, 6.30
- Page 299: 6.36, 6.38, 6.40, 6.42, 6.46, 6.48

Hawkes Homework

- Section 5.1, 5.2
- Section 6.1, 6.2, 6.3, 6.4
- Due with Homework #6

Learning Objectives

- Random variable
- Probability distributions for discrete random variables
- Mean of a probability distribution
- Summarizing the spread of a probability distribution
- Probability distribution for continuous random variables

Learning Objective 1: Random Variable

- A random variable is a numerical measurement of the outcome of a random phenomenon.

Learning Objective 1: Random Variable

- Use a capital letter, such as X, to refer to the random variable itself.
- Use a lowercase letter, such as x, to refer to a particular value of the random variable X.
- Example: Flip a coin three times
- X = number of heads in the 3 flips; defines the random variable
- x = 2; represents a realized value of the random variable X.

Learning Objective 2: Probability Distribution

- The probability distribution of a discrete random variable specifies its possible values and their probabilities.
- The probability distribution of a continuous random variable specifies the intervals where the random variable falls and their probabilities.

Learning Objective 2: Probability Distribution of a Discrete Random Variable

- A discrete random variable X has separate values (such as 0, 1, 2, …) as its possible outcomes
- Its probability distribution assigns a probability P(x) to each possible value x:
- The sum of the probabilities for all the possible x values equals 1

Learning Objective 2: Example

- What is the estimated probability of at least three home runs?
P(3)+P(4)+P(5)=0.13+0.03+0.01=0.17

Learning Objective 3: The Mean of a Discrete Probability Distribution

- The mean of a probability distribution for a discrete random variable is µ = Σ x P(x), where the sum is taken over all possible values of x.

- The mean of a probability distribution is denoted by the parameter, µ.
- The mean is a weighted average; values of x that are more likely receive greater weight P(x)

Learning Objective 3: Expected Value of X

- The mean of a probability distribution of a random variable X is also called the expected value of X.
- The expected value reflects not what we’ll observe in a single observation, but rather what we expect for the average in a long run of observations.

Learning Objective 3: Example

- Find the mean of this probability distribution.

The mean: µ = 0(0.23) + 1(0.38) + 2(0.22) + 3(0.13) + 4(0.03) + 5(0.01) = 1.38
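
A short R sketch of the weighted average, using the probabilities that appear in the calculation above for 0 to 5 home runs. The standard deviation line uses the usual definition σ = sqrt(Σ (x − µ)² P(x)), which is not spelled out in the slides and is included here as an assumption:

```r
x <- 0:5                                      # possible numbers of home runs
p <- c(0.23, 0.38, 0.22, 0.13, 0.03, 0.01)    # P(x) from the example

stopifnot(isTRUE(all.equal(sum(p), 1)))       # probabilities must sum to 1

mu    <- sum(x * p)                 # mean: weighted average of the x values
sigma <- sqrt(sum((x - mu)^2 * p))  # standard deviation (standard definition, assumed)

p_at_least_3 <- sum(p[x >= 3])      # P(3) + P(4) + P(5)
print(c(mean = mu, sd = round(sigma, 2), at_least_3 = p_at_least_3))
```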

Learning Objective 4: The Standard Deviation of a Probability Distribution

The standard deviation of a probability distribution, denoted by the parameter, σ, measures its spread.

- Larger values of σ correspond to greater spread.
- Roughly, 0.8σ describes how far the random variable falls, on the average, from the mean of its distribution

Learning Objective 5: Continuous Random Variable

- A continuous random variable has an infinite continuum of possible values in an interval.
- Examples are: time, age and size measures such as height and weight.

Learning Objective 5: Probability Distribution of a Continuous Random Variable

- A continuous random variable has possible values that form an interval.
- Its probability distribution is specified by a curve.
- Each interval has probability between 0 and 1.
- The interval containing all possible values has probability equal to 1.

Pr(0.51 ≤ X ≤ 1.48)

Section 6.2: How Can We Find Probabilities for Bell-Shaped Distributions?

Learning Objectives

- Normal Distribution
- 68-95-99.7 Rule for normal distributions
- Z-Scores and the Standard Normal Distribution
- The Standard Normal Table: Finding Probabilities
- Using the TI-calculator: find probabilities

Learning Objectives

- Using the Standard Normal Table in Reverse
- Using the TI-calculator: find z-scores
- Probabilities for Normally Distributed Random Variables
- Percentiles for Normally Distributed Random Variables
- Using Z-scores to Compare Distributions

Learning Objective 1: Normal Distribution

The normal distribution is symmetric, bell-shaped and characterized by its mean µ and standard deviation σ.

- The normal distribution is the most important distribution in statistics
- Many distributions have an approximate normal distribution
- Approximates many discrete distributions well when there are a large number of possible outcomes
- Many statistical methods use it even when the data are not bell shaped

Learning Objective 1: Normal Distribution

- Normal distributions are
- Bell shaped
- Symmetric around the mean

- The mean (µ) and the standard deviation (σ) completely describe the density curve
- Increasing/decreasing µ moves the curve along the horizontal axis
- Increasing/decreasing σ controls the spread of the curve

The bigger the variance, the wider and flatter the curve.

Learning Objective 1: Normal Distribution

- Within what interval do almost all of the men’s heights fall? Women’s height?

Learning Objective 2: 68-95-99.7 Rule for Any Normal Curve

- 68% of the observations fall within one standard deviation of the mean
- 95% of the observations fall within two standard deviations of the mean
- 99.7% of the observations fall within three standard deviations of the mean

Learning Objective 2: Example: 68-95-99.7% Rule

- Heights of adult women
- can be approximated by a normal distribution
- µ = 65 inches; σ = 3.5 inches

- 68-95-99.7 Rule for women’s heights
- 68% are between 61.5 and 68.5 inches [ µ ± σ = 65 ± 3.5 ]
- 95% are between 58 and 72 inches [ µ ± 2σ = 65 ± 2(3.5) = 65 ± 7 ]
- 99.7% are between 54.5 and 75.5 inches [ µ ± 3σ = 65 ± 3(3.5) = 65 ± 10.5 ]

Learning Objective 3: Z-Scores and the Standard Normal Distribution

- The z-score for a value x of a random variable is the number of standard deviations that x falls from the mean: z = (x - µ)/σ
- A negative (positive) z-score indicates that the value is below (above) the mean
- z-scores can be used to calculate the probabilities of a normal random variable using the normal tables in the back of the book

Learning Objective 3: Z-Scores and the Standard Normal Distribution

- A standard normal distribution has mean µ=0 and standard deviation σ=1
- When a random variable has a normal distribution and its values are converted to z-scores by subtracting the mean and dividing by the standard deviation, the z-scores have the standard normal distribution.

Standard normal curve

Learning Objective 4: Table A: Standard Normal Probabilities

Table A enables us to find normal probabilities

- It tabulates the normal cumulative probabilities falling below the point +z

To use the table:

- Look up the closest value in the table to the z score.
- First column gives z to the first decimal place
- First row gives the second decimal place of z

- The corresponding probability found in the body of the table gives the probability of falling below the z-score

Learning Objective 4: Example: Using Table A

- Find the probability that a normal random variable takes a value less than 1.43 standard deviations above µ: P(z < 1.43) = .9236

TI Calculator = normalcdf(-1e99, 1.43, 0, 1) = .9236

Learning Objective 4: Example: Using Table A

- Find the probability that a normal random variable takes a value greater than 1.43 standard deviations above µ: P(z > 1.43) = 1 - .9236 = .0764

TI Calculator = normalcdf(1.43, 1e99, 0, 1) = .0764

Learning Objective 4: Example

- Find the probability that a normal random variable assumes a value within 1.43 standard deviations of µ
- Probability below 1.43σ = .9236
- Probability below -1.43σ = .0764 (1 - .9236)
- P(-1.43 < z < 1.43) = .9236 - .0764 = .8472

TI Calculator = normalcdf(-1.43, 1.43, 0, 1) = .8472

Learning Objective 5: Using the TI Calculator

To calculate the cumulative probability

- 2nd DISTR; 2:normalcdf(lower bound, upper bound, mean, sd)
- Use –1E99 for negative infinity and 1E99 for positive infinity

Learning Objective 5: Find Probabilities Using TI Calculator

- Find probability to the left of -1.64
- P(z < -1.64) = normalcdf(-1e99, -1.64, 0, 1) = .0505

- Find probability to the right of 1.56
- P(z > 1.56) = normalcdf(1.56, 1e99, 0, 1) = .0594

- Find probability between -.50 and 2.25
- P(-.5 < z < 2.25) = normalcdf(-.5, 2.25, 0, 1) = .6792
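
For readers working in R rather than on a TI calculator, the same probabilities come from `pnorm`. The wrapper below is a hypothetical helper written to mirror the calculator's argument order; it is not a built-in R function:

```r
# Illustrative stand-in for the TI normalcdf(lower, upper, mean, sd)
normalcdf <- function(lower, upper, mean = 0, sd = 1) {
  pnorm(upper, mean, sd) - pnorm(lower, mean, sd)
}

# R accepts -Inf and Inf, so no 1e99 placeholder is needed
left_of_143   <- normalcdf(-Inf, 1.43)   # P(z < 1.43)
right_of_143  <- normalcdf(1.43, Inf)    # P(z > 1.43)
within_143    <- normalcdf(-1.43, 1.43)  # P(-1.43 < z < 1.43)
left_of_m164  <- normalcdf(-Inf, -1.64)  # P(z < -1.64)
right_of_156  <- normalcdf(1.56, Inf)    # P(z > 1.56)
between       <- normalcdf(-0.5, 2.25)   # P(-0.5 < z < 2.25)

print(round(c(left_of_143, right_of_143, within_143,
              left_of_m164, right_of_156, between), 4))
```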

More Examples: Using Normal Tables

- http://www.math.unb.ca/~knight/utility/NormTble.htm
- From the standard normal distribution table, we can find
probabilities such as

Find Normal Probabilities in Excel

In Excel, use NORMDIST(x, 0, 1, true). For example, to find P(Z < 0.62), in Excel, type

=NORMDIST(0.62, 0, 1, TRUE)

And press the ENTER key. The answer is 0.732.

Learning Objective 6: How Can We Find the Value of z for a Certain Cumulative Probability?

- To solve some of our problems, we will need to find the value of z that corresponds to a certain normal cumulative probability
- To do so, we use Table A in reverse
- Rather than starting from z (first column for z up to one decimal, first row for the second decimal) to look up a probability:
- Find the probability in the body of the table
- The z-score is given by the corresponding values in the first column and row

Learning Objective 6: How Can We Find the Value of z for a Certain Cumulative Probability?

- Example: Find the value of z for a cumulative probability of 0.025.
- Look up the cumulative probability of 0.025 in the body of Table A.
- A cumulative probability of 0.025 corresponds to z = -1.96.
- Thus, the probability that a normal random variable falls at least 1.96 standard deviations below the mean is 0.025.

Learning Objective 6: How Can We Find the Value of z for a Certain Cumulative Probability?

- Example: Find the value of z for a cumulative probability of 0.975.
- Look up the cumulative probability of 0.975 in the body of Table A.
- A cumulative probability of 0.975 corresponds to z = 1.96.
- Thus, the probability that a normal random variable takes a value no more than 1.96 standard deviations above the mean is 0.975.

Learning Objective 7: Using the TI Calculator to Find Z-Scores for a Given Probability

- 2nd DISTR 3:invNorm; Enter
- invNorm(percentile,mean,sd)
- Percentile is the probability under the curve from negative infinity to the z-score
- Enter

Learning Objective 7: Examples

- The probability that a standard normal random variable assumes a value that is ≤ z is 0.975. What is z? invNorm(.975,0,1) = 1.96
- The probability that a standard normal random variable assumes a value that is > z is 0.025. What is z? invNorm(1-.025,0,1) = invNorm(.975,0,1) = 1.96
- The probability that a standard normal random variable assumes a value that is ≥ z is 0.881. What is z? invNorm(1-.881,0,1) = -1.18
- The probability that a standard normal random variable assumes a value that is < z is 0.119. What is z? invNorm(.119,0,1) = -1.18

Learning Objective 7: Example

- Find the z-score z such that the probability within z standard deviations of the mean is 0.50.
- Invnorm(.75,0,1)= .67
- Invnorm(.25,0,1)= -.67

- Probability = P(-.67<Z<.67)=.5
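
In R, `qnorm` plays the role of invNorm. A brief sketch of the lookups above (variable names are illustrative):

```r
z_975 <- qnorm(0.975)      # cumulative probability 0.975
z_119 <- qnorm(1 - 0.881)  # P(Z >= z) = 0.881, so the area to the left is 0.119

# z such that the central area within z standard deviations is 0.50:
# the area to the left of +z is 0.50 + (1 - 0.50)/2 = 0.75
central <- 0.50
z_mid   <- qnorm(1 - (1 - central) / 2)
check   <- pnorm(z_mid) - pnorm(-z_mid)  # recovers the central area

print(round(c(z_975, z_119, z_mid, check), 2))
```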

Learning Objective 8: Finding Probabilities for General Normally Distributed Random Variables

- State the problem in terms of the observed random variable X, i.e., P(X<x)
- Standardize X to restate the problem in terms of a standard normal variable Z
- Draw a picture to show the desired probability under the standard normal curve
- Find the area under the standard normal curve using Table A

Standard normal

The shaded areas are kept the same.

Learning Objective 8: P(X<x)

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure less than 100?
- P(X<100) = P(Z < (100 - 120)/20) = P(Z < -1.00) = .1587
- normalcdf(-1E99,100,120,20) = .1587
- 15.9% of adults have systolic blood pressure less than 100

Learning Objective 8: P(X>x)

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure greater than 100?
- P(X>100) = 1 – P(X<100)
- P(X>100) = 1 - .1587 = .8413
- normalcdf(100,1e99,120,20) = .8413
- 84.1% of adults have systolic blood pressure greater than 100

Learning Objective 8: P(X>x)

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure greater than 133?
- P(X>133) = 1 – P(X<133)
- normalcdf(133,1E99,120,20) = .2578
- 25.8% of adults have systolic blood pressure greater than 133

Learning Objective 8: P(a<X<b)

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure between 100 and 133?
- P(100<X<133) = P(X<133) - P(X<100)
- normalcdf(100,133,120,20) = .5835
- 58% of adults have systolic blood pressure between 100 and 133

Learning Objective 9: Find X Value Given Area to Left

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What is the 1st quartile?
- Translation: Given P(X < x) = .25, find x.
- Look up .25 in the body of Table A to find z = -0.67
- Solve equation to find x: x = µ + zσ = 120 + (-0.67)(20) = 106.6

- Check:
- P(X<106.6) = P(Z<-0.67) = 0.25
- TI Calculator = invNorm(.25,120,20) = 106.5 (slightly below 106.6 because the table value z = -0.67 is rounded)

Learning Objective 9: Find X Value Given Area to Right

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. 10% of adults have systolic blood pressure above what level?
- P(X>x) = .10, find x.
- P(X>x) = 1 - P(X<x)
- Look up 1 - 0.1 = 0.9 in the body of Table A to find z = 1.28
- Solve equation to find x: x = µ + zσ = 120 + (1.28)(20) = 145.6

- Check:
- P(X>145.6) = P(Z>1.28) = 0.10
- TI Calculator = invNorm(.9,120,20) = 145.6
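
The blood pressure examples can be reproduced in R with `pnorm` and `qnorm`, which accept the mean and standard deviation directly. A minimal sketch, assuming µ = 120 and σ = 20 as stated:

```r
mu    <- 120
sigma <- 20

p_below_100 <- pnorm(100, mu, sigma)                          # P(X < 100)
p_above_100 <- pnorm(100, mu, sigma, lower.tail = FALSE)      # P(X > 100)
p_above_133 <- pnorm(133, mu, sigma, lower.tail = FALSE)      # P(X > 133)
p_between   <- pnorm(133, mu, sigma) - pnorm(100, mu, sigma)  # P(100 < X < 133)

q1    <- qnorm(0.25, mu, sigma)  # 1st quartile: P(X < x) = 0.25
top10 <- qnorm(0.90, mu, sigma)  # level exceeded by 10% of adults

print(round(c(p_below_100, p_above_100, p_above_133, p_between), 4))
print(round(c(q1 = q1, top10 = top10), 1))
```

Because `qnorm` uses the unrounded z-score, the quartile comes out a little below the table-based 106.6.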

Learning Objective 10: Using Z-scores to Compare Distributions

Z-scores can be used to compare observations from different normal distributions

- Example:
- You score 650 on the SAT, which has µ = 500 and σ = 100, and 30 on the ACT, which has µ = 21.0 and σ = 4.7. On which test did you perform better?
- Compare z-scores
SAT: z = (650 - 500)/100 = 1.50    ACT: z = (30 - 21.0)/4.7 = 1.91
- Since your z-score is greater for the ACT, you performed better on this exam
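
The comparison can be sketched in R with a one-line helper (`z_score` is an illustrative name, not a built-in function), using the SAT and ACT means and standard deviations given in the example:

```r
# z = (x - mean) / sd: number of standard deviations x lies from the mean
z_score <- function(x, mean, sd) (x - mean) / sd

z_sat <- z_score(650, mean = 500, sd = 100)
z_act <- z_score(30, mean = 21.0, sd = 4.7)

print(round(c(SAT = z_sat, ACT = z_act), 2))
better <- if (z_act > z_sat) "ACT" else "SAT"
print(better)
```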

Section 6.3: How Can We Find Probabilities When Each Observation Has Two Possible Outcomes?

Learning Objectives

- The Binomial Distribution
- Conditions for a Binomial Distribution
- Probabilities for a Binomial Distribution
- Factorials
- Examples using Binomial Distribution
- Do the Binomial Conditions Apply?
- Mean and Standard Deviation of the Binomial Distribution
- Normal Approximation to the Binomial

Learning Objective 1: The Binomial Distribution

- Each observation is binary: it has one of two possible outcomes.
- Examples:
- Accept, or decline an offer from a bank for a credit card.
- Have, or do not have, health insurance.
- Vote yes or no on a referendum.

Learning Objective 2: Conditions for the Binomial Distribution

- Each of n trials has two possible outcomes: “success” or “failure”.
- Each trial has the same probability of success, denoted by p.
- The n trials are independent.
- Let X be the number of successes in the n trials. Then, X has a Binomial Distribution.

Learning Objective 3: Probabilities for a Binomial Distribution

- Denote the probability of success on a trial by p.
- For n independent trials, the probability of x successes equals: P(x) = [n! / (x!(n - x)!)] p^x (1 - p)^(n - x), for x = 0, 1, 2, …, n

See def. of n!

Learning Objective 4: Factorials

Rules for factorials:

- n!=n*(n-1)*(n-2)…2*1
- 1!=1
- 0!=1
For example,

- 4!=4*3*2*1=24

Learning Objective 5: Example: Finding Binomial Probabilities

- John Doe claims to possess ESP, extrasensory perception.
- An experiment is conducted:
- A person in one room picks one of the integers 1, 2, 3, 4, 5 at random.
- In another room, John Doe identifies the number he believes was picked.
- Three trials are performed for the experiment.
- Doe got the correct answer twice.

Learning Objective 5: Example 1

If John Doe does not actually have ESP and is actually guessing the number, what is the probability that he’d make a correct guess on two of the three trials?

- Let S = Success and F = Failure. All possible answers for the three trials are: SSS, …, FFF. These outcomes are not equally likely, because P(S) = 0.2 and P(F) = 0.8 on each trial. The three ways John Doe could make two correct guesses in three trials are: SSF, SFS, and FSS, each with probability (0.2)(0.2)(0.8) = 0.032.
- The probability of two correct guesses is then P(SSF, SFS, or FSS), an “or” probability.

Learning Objective 5: Example 1

- The probability of exactly 2 correct guesses is the binomial probability with n = 3 trials, x = 2 correct guesses and p = 0.2 probability of a correct guess: P(2) = [3!/(2!1!)] (0.2)^2 (0.8)^1 = 3(0.04)(0.8) = 0.096

TI calculator:

2nd Vars

0:binompdf(n,p,x)

binompdf(3,.2,2)=0.096
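
In R the binomial probability is available as `dbinom`. The sketch below also writes out the formula by hand in a hypothetical helper, `binom_prob`, to show that the two agree for the ESP example (n = 3, x = 2, p = 0.2):

```r
# Binomial formula written out: n!/(x!(n-x)!) * p^x * (1-p)^(n-x)
binom_prob <- function(x, n, p) {
  factorial(n) / (factorial(x) * factorial(n - x)) * p^x * (1 - p)^(n - x)
}

by_formula <- binom_prob(2, n = 3, p = 0.2)
by_dbinom  <- dbinom(2, size = 3, prob = 0.2)  # built-in equivalent

print(c(by_formula, by_dbinom))
```

Note the argument order: `dbinom(x, size, prob)` puts x first, whereas the TI `binompdf(n,p,x)` puts it last.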

Learning Objective 5: Binomial Example 2

- 1000 employees, 50% Female
- 10 employees were chosen for management training. None of these were female. Is the selection procedure random?

- The probability that no females are chosen is: P(0) = (0.5)^10 = 0.00098
- TI calculator: binompdf(10,.5,0)=9.765625E-4=0.0009765625
- It is very unlikely (one chance in a thousand) that none of the 10 selected for management training would be female if the employees were chosen randomly

Learning Objective 6: Do the Binomial Conditions Apply?

- Before using the binomial distribution, check that its three conditions apply:
- Binary data (success or failure).
- The same probability of success for each trial (denoted by p).
- Independent trials.

Learning Objective 6: Do the Binomial Conditions Apply to Example 2?

- The data are binary (male, female).
- If employees are selected randomly, the probability of selecting a female on a given trial is 0.50.
- With random sampling of 10 employees from a large population, the outcome of one trial does not depend on the outcome of another trial

Learning Objective 7: Binomial Mean and Standard Deviation

- The binomial probability distribution for n trials with probability p of success on each trial has mean µ and standard deviation σ given by: µ = np, σ = sqrt(np(1 - p))

Learning Objective 7: Example: Racial Profiling?

- Data:
- 262 police car stops in Philadelphia in 1997.
- 207 of the drivers stopped were African-American.
- In 1997, Philadelphia’s population was 42.2% African-American.
- Does the number of African-Americans stopped suggest possible bias, being higher than we would expect (other things being equal, such as the rate of violating traffic laws)? Use the 68-95-99.7 Empirical Rule.

Learning Objective 7: Example: Racial Profiling?

- Assume:
- 262 car stops represent n = 262 trials.
- Successive police car stops are independent.
- P(driver is African-American) is p = 0.422.

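- Calculate the mean and standard deviation of this binomial distribution: $\mu = np = 262(0.422) \approx 111$ and $\sigma = \sqrt{np(1-p)} = \sqrt{262(0.422)(0.578)} \approx 8$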

Learning Objective 7: Example: Racial Profiling?

- Recall: Empirical Rule
- When a distribution is bell-shaped, close to 100% of the observations fall within 3 standard deviations of the mean.

Learning Objective 7: Example: Racial Profiling?

- If there is no racial profiling, we would not be surprised if between about 87 and 135 of the 262 drivers stopped were African-American.
- The actual number stopped (207) is well above these values.
- The number of African-Americans stopped is too high, even taking into account random variation.

- Limitation of the analysis:
- Different people do different amounts of driving, so we don’t really know that 42.2% of the potential stops were African-American.
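
The arithmetic behind the interval of about 87 to 135 can be reproduced in R. This is a sketch using only the numbers stated in the example; the variable names are illustrative.

```r
# Binomial mean and standard deviation for the car-stop example
n <- 262          # number of stops
p <- 0.422        # assumed probability that a stopped driver is African-American
observed <- 207   # observed count

mu <- n * p
sigma <- sqrt(n * p * (1 - p))

# Empirical Rule: nearly all outcomes fall within 3 standard deviations of the mean
lower <- mu - 3 * sigma
upper <- mu + 3 * sigma
z <- (observed - mu) / sigma   # how many standard deviations 207 lies above the mean

print(round(c(mu = mu, sigma = sigma, lower = lower, upper = upper, z = z), 2))
```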

Unusually High (Low)?

- An observed value may be larger than expected. How can we identify such values?
- An observed value x of a random variable X is said to be unusually high if P(X ≥ x) is very small, say < 0.05.
- Example: Toss a balanced coin 10 times and 8 are heads. Let X = # of heads. Is X = 8 unusually high?
- We calculate
P(X ≥ 8) = P(X = 8) + P(X = 9) + P(X = 10) = 0.055 > 0.05,
so X = 8 is not unusually high.

- An observed value x of a random variable X is said to be unusually low if P(X ≤ x) is very small, say < 0.05.
- Is X = 2 unusually low? P(X ≤ 2) = 0.055 > 0.05, so no.
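
The two tail probabilities in the coin example can be computed directly in R. This is a minimal sketch; the 0.05 cutoff is the one suggested on the slide.

```r
# Tail probabilities for X = number of heads in 10 tosses of a balanced coin
n <- 10
p <- 0.5

p_high <- sum(dbinom(8:10, size = n, prob = p))               # P(X >= 8), term by term
p_high_alt <- pbinom(7, size = n, prob = p, lower.tail = FALSE)  # same value: P(X > 7)
p_low <- pbinom(2, size = n, prob = p)                        # P(X <= 2)

print(round(c(p_high = p_high, p_low = p_low), 4))
print(c(unusually_high = p_high < 0.05, unusually_low = p_low < 0.05))
```

By the symmetry of a balanced coin the two tails are equal, and neither falls below 0.05.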

Learning Objective 8: Approximating the Binomial Distribution with the Normal Distribution

- The binomial distribution can be well approximated by the normal distribution when the expected number of successes, np, and the expected number of failures, n(1-p) are both at least 15.
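
As a small illustration, the rule of thumb can be wrapped in a helper. The function name `normal_approx_ok` is hypothetical, and the two calls reuse numbers from the car-stop and coin-toss examples.

```r
# TRUE when both the expected successes and expected failures are at least 15
normal_approx_ok <- function(n, p) {
  (n * p >= 15) && (n * (1 - p) >= 15)
}

stops_ok <- normal_approx_ok(262, 0.422)  # expected counts are about 110.6 and 151.4
coin_ok <- normal_approx_ok(10, 0.5)      # expected counts are 5 and 5

print(c(car_stops = stops_ok, coin_tosses = coin_ok))
```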

Section 7.1

How Likely Are the Possible Values of a Statistic? The Sampling Distribution

Homework #7

- Problems 7.1 to 7.34 (Even)
- Skip problems that need simulation.
- Hawkes: 7-2 and 7-3

Learning Objectives

- Statistic vs. Parameter
- Sampling Distributions
- Mean and Standard Deviation of the Sampling Distribution of a Proportion
- Standard Error
- Sampling Distribution Example
- Population, Data, and Sampling Distributions

Learning Objective 1: Statistic and Parameter

- A statistic is a numerical summary of sample data such as a sample proportion or sample mean
- A parameter is a numerical summary of a population such as a population proportion or population mean.
- In practice, we seldom know the values of parameters.
- Parameters are estimated using sample data.
- We use sample statistics to estimate the corresponding population parameters.

Learning Objective 2: Sampling Distributions

Example:

- Prior to counting the votes, the proportion in favor of recalling Governor Gray Davis was an unknown parameter.
- An exit poll of 3160 voters reported that the sample proportion in favor of a recall was 0.54.
- If a different random sample of about 3000 voters were selected, a different sample proportion would occur.
The sampling distribution of the sample proportion shows all possible values and the probabilities for those values.

Learning Objective 2: Sampling Distributions

- The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take.
- Sampling distributions describe the variability that occurs from study to study using statistics to estimate population parameters
- Sampling distributions help to predict how close a statistic falls to the parameter it estimates

Learning Objective 3: Mean and SD of the Sampling Distribution of a Proportion

- For a random sample of size n from a population with proportion p of outcomes in a particular category, the sampling distribution of the proportion of the sample in that category has mean $p$ and standard deviation $\sqrt{p(1-p)/n}$

Learning Objective 4: The Standard Error

- To distinguish the standard deviation of a sampling distribution from the standard deviation of an ordinary probability distribution, we refer to it as a standard error.

Learning Objective 5: Example: 2006 California Election

- If the population proportion supporting the re-election of Schwarzenegger was 0.50, would it have been unlikely to observe the exit-poll sample proportion of 0.565?
- Based on your answer, would you be willing to predict that Schwarzenegger would win the election?

Learning Objective 5: Example: 2006 California Election

- Given that the exit poll had 2705 people and assuming 50% support the reelection of Schwarzenegger,
- Find the estimate of the population proportion and the standard error: mean $= p = 0.50$ and standard error $= \sqrt{p(1-p)/n} = \sqrt{0.50(0.50)/2705} \approx 0.0096$

Learning Objective 5: Example: 2006 California Election

- The sample proportion of 0.565 is more than six standard errors from the expected value of 0.50.
- The sample proportion of 0.565 voting for reelection of Schwarzenegger would be very unlikely if the population proportion were p = 0.50 or p < 0.50
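
The "more than six standard errors" statement can be verified with a few lines of R. This is a sketch under the example's assumption that the true proportion is 0.50; the variable names are illustrative.

```r
# Exit-poll example: how far is the sample proportion from the assumed p = 0.50?
n <- 2705       # exit-poll sample size
p0 <- 0.50      # assumed population proportion
phat <- 0.565   # observed sample proportion

se <- sqrt(p0 * (1 - p0) / n)              # standard error of the sample proportion
z <- (phat - p0) / se                      # distance in standard errors
tail_prob <- pnorm(z, lower.tail = FALSE)  # normal-approximation chance of a value this large

print(c(se = se, z = z, tail_prob = tail_prob))
```

The tail probability is tiny, which matches the slide's conclusion that a sample proportion of 0.565 would be very unlikely if p were 0.50.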

Learning Objective 6: Population Distribution

- Population distribution: This is the probability distribution from which we take the sample.
- Values of its parameters are usually unknown. They’re what we’d like to learn about.

Learning Objective 6: Data Distribution

- This is the distribution of the sample data. It’s the distribution we actually see in practice.
- It’s described by statistics
- With random sampling, the larger the sample size n, the more closely the data distribution resembles the population distribution

Learning Objective 6: Example

- In the 2006 U.S. Senate election in NY
- An exit poll of 1336 voters showed
- 67% (895) voted for Clinton
- 33% (441) voted for Spencer

- When all 4.1 million votes were tallied
- 68% voted for Clinton
- 32% voted for Spencer

- Let X= vote outcome, with x=1 for Clinton and x=0 for Spencer

Learning Objective 6: Example

- The population distribution is the 4.1 million values of the x vote variable, 32% of which are 0 and 68% of which are 1.
- The data distribution is the 1336 values of the x vote for the exit poll, 33% of which are 0 and 67% of which are 1.
- The sampling distribution of the sample proportion is approximately a normal distribution with mean p = 0.68 and standard error $\sqrt{0.68(0.32)/1336} \approx 0.013$
- Only the sampling distribution is bell-shaped; the others are discrete and concentrated at the two values 0 and 1.

Section 7.2

How Close Are Sample Means to Population Means?

Learning Objectives

- The Sampling Distribution of the Sample Mean
- Effect of n on the Standard Error
- Central Limit Theorem (CLT)
- Calculating Probabilities of Sample Means

Learning Objective 1: The Sampling Distribution of the Sample Mean

- The sample mean, $\bar{x}$, is a random variable.
- The sample mean varies from sample to sample.
- By contrast, the population mean, µ, is a single fixed number.

Learning Objective 1: The Sampling Distribution of the Sample Mean

- For a random sample of size n from a population having mean µ and standard deviation σ, the sampling distribution of the sample mean has:
- Center described by the mean µ (the same as the mean of the population).
- Spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
- standard error of $\bar{x} = \sigma/\sqrt{n}$

Learning Objective 1: Example 1: Pizza Sales

- Daily sales at a pizza restaurant vary from day to day.
- The sales figures fluctuate around a mean µ = $900 with a standard deviation σ = $300.
- What are the center and spread of the sampling distribution of the average sales in a week?
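
The question above can be answered with the standard error formula $\sigma/\sqrt{n}$. The sketch below assumes a week means n = 7 days and that the 7 daily sales can be treated as a random sample; both assumptions are ours, not stated on the slide.

```r
# Center and spread of the sampling distribution of mean weekly sales
mu <- 900     # mean daily sales in dollars
sigma <- 300  # standard deviation of daily sales in dollars
n <- 7        # assumed: 7 days in the week

center <- mu               # the sampling distribution is centered at the population mean
se <- sigma / sqrt(n)      # standard error of the sample mean

print(round(c(center = center, standard_error = se), 2))
```

The weekly average varies much less from week to week (about $113) than a single day's sales do ($300).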

Learning Objective 2: Effect of n on the Standard Error

- Knowing how to find a standard error gives us a mechanism for understanding how much variability to expect in sample statistics “just by chance.”
- The standard error of the sample mean = $\sigma/\sqrt{n}$
- As the sample size n increases, the denominator increases, so the standard error decreases.
- With larger samples, the sample mean is more likely to fall closer to the population mean.
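
A quick numerical illustration of this effect, as a sketch: the value σ = 300 is borrowed from the pizza example, and the sample sizes are arbitrary choices in which each one is four times the previous one.

```r
# Standard error of the sample mean for increasing sample sizes
sigma <- 300                 # assumed population standard deviation
n <- c(7, 28, 112, 448)      # each sample size is 4 times the previous one
se <- sigma / sqrt(n)

print(data.frame(n = n, se = round(se, 2)))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.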

Learning Objective 3: Central Limit Theorem (CLT)

- Question: How does the sampling distribution of the sample mean relate, in shape, center, and spread, to the population distribution from which the samples were taken?
- For random sampling with a large sample size n, the sampling distribution of the sample mean is approximately a normal distribution.
- This result applies no matter what the shape of the probability distribution from which the samples are taken.

Learning Objective 3: CLT: How Large a Sample?

- The sampling distribution of the sample mean takes more of a bell shape as the random sample size n increases.
- The more skewed the population distribution, the larger n must be for CLT to work.
- In practice, the sampling distribution is usually close to normal when the sample size n is at least 30.
- If the population distribution is normal, then the sampling distribution is normal for all sample sizes.
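
The points above can be explored with a short simulation. This is a sketch only: the Exponential(1) population, the seed, the number of replications, and the helper names `sample_means` and `skewness` are all our own choices, not part of the course material.

```r
set.seed(229)  # arbitrary seed so the run is reproducible

# Draw `reps` samples of size n from a right-skewed Exponential(1) population
# (mean 1, SD 1) and return the sample mean of each
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness: near 0 for a symmetric, bell-shaped distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_n2 <- sample_means(2)
means_n30 <- sample_means(30)

print(c(skew_n2 = skewness(means_n2), skew_n30 = skewness(means_n30)))
print(c(mean = mean(means_n30), se = sd(means_n30), theory_se = 1 / sqrt(30)))
```

Calling `hist(means_n2)` and `hist(means_n30)` afterwards shows the shape becoming more bell-like as n grows, while the spread shrinks toward $\sigma/\sqrt{n}$.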

Learning Objective 3: CLT: Impact of increasing n

Learning Objective 3: CLT: Making Inferences

- CLT: For large n, the sampling distribution is approximately normal even if the population distribution is not.
- This enables us to make inferences about population means regardless of the shape of the population distribution.

Learning Objective 4: Calculating Probabilities of Sample Means

- The distribution of weights of milk bottles is normally distributed with a mean of 1.1 lbs and a standard deviation (σ) of 0.20 lbs.
- What is the probability that the mean of a random sample of 5 bottles will be greater than 0.99 lbs?
- Calculate the mean and standard error for the sampling distribution of a random sample of 5 milk bottles
- Because the population distribution is normal, $\bar{x}$ is normally distributed with mean = 1.1 and standard error = $0.20/\sqrt{5}$ = 0.0894
- $P(\bar{x} > 0.99) = P(z > (0.99 - 1.1)/0.0894) = P(z > -1.23) \approx 0.89$
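
The milk-bottle probability can be computed in R with `pnorm`. This is a minimal sketch using the numbers given in the example; the variable names are illustrative.

```r
# P(sample mean of 5 bottles > 0.99 lbs) when weights are Normal(1.1, 0.20)
mu <- 1.1
sigma <- 0.20
n <- 5

se <- sigma / sqrt(n)   # standard error of the sample mean
p_above <- pnorm(0.99, mean = mu, sd = se, lower.tail = FALSE)

print(round(c(se = se, p_above = p_above), 4))
```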

Learning Objective 4: Calculating Probabilities of Sample Means

- Closing prices of stocks have a right skewed distribution with a mean (µ) of $25 and σ= $20.
- What is the probability that the mean of a random sample of 40 stocks will be less than $20?
- Calculate the mean and standard error for the sampling distribution of a random sample of 40 stocks
- By the CLT, $\bar{x}$ is approximately normal with mean = 25 and standard error = $20/\sqrt{40}$ = 3.1623
- $P(\bar{x} < 20) = P(z < (20 - 25)/3.1623) = P(z < -1.58) \approx 0.057$

Learning Objective 4: Calculating Probabilities of Sample Means

- An automobile insurer has found that repair claims have a mean of $920 and a standard deviation of $870. Suppose that the next 100 claims can be regarded as a random sample from the long-run claims process.
- What is the probability that the average of the 100 claims is larger than $900?
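
One way to work this example in R is sketched below. It relies on the CLT to treat the mean of 100 claims as approximately normal, which is an approximation because claim amounts are likely right-skewed; the variable names are illustrative.

```r
# P(average of 100 claims > $900) using the normal approximation from the CLT
mu <- 920      # mean claim in dollars
sigma <- 870   # standard deviation of claims in dollars
n <- 100

se <- sigma / sqrt(n)   # standard error of the average claim
p_above_900 <- pnorm(900, mean = mu, sd = se, lower.tail = FALSE)

print(round(c(se = se, p_above_900 = p_above_900), 3))
```

Since $900 is only about 0.23 standard errors below the mean, the probability is a little over one half.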

Learning Objective 4: Calculating Probabilities of Sample Means

Example: the distribution of actual weights of 8 oz. wedges of cheddar cheese produced by a certain company is normal with mean µ = 8.1 oz. and standard deviation σ = 0.1 oz.

- Find the value x such that there is only a 10% chance that the average weight of a sample of five wedges will be above x.

Learning Objective 4: Calculating Probabilities of Sample Means

Example: the distribution of actual weights of 8 oz. wedges of cheddar cheese produced by a certain company is normal with mean µ = 8.1 oz. and standard deviation σ = 0.1 oz.

- Find the value x such that there is only a 5% chance that the average weight of a sample of five wedges will be below x.
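
Both cheese-wedge questions ask for a percentile of the sampling distribution of the mean, which `qnorm` returns directly. This is a sketch using the stated mean of 8.1 oz. and standard deviation of 0.1 oz.; the variable names are illustrative.

```r
# Percentiles of the mean weight of 5 wedges when weights are Normal(8.1, 0.1)
mu <- 8.1
sigma <- 0.1
n <- 5
se <- sigma / sqrt(n)

x_upper <- qnorm(0.90, mean = mu, sd = se)  # 10% chance the sample mean is above this
x_lower <- qnorm(0.05, mean = mu, sd = se)  # 5% chance the sample mean is below this

print(round(c(x_upper = x_upper, x_lower = x_lower), 3))
```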

Section 7.3

How Can We Make Inferences About a Population?

Learning Objectives

- Using the CLT to Make Inferences
- Standard Errors in Practice
- Sampling Distribution for a Proportion

Learning Objective 1: Using the CLT to Make Inferences

Implications of the CLT

- When the sampling distribution of the sample mean is approximately normal, $\bar{x}$ falls within 2 standard errors of µ with probability close to 0.95 and almost certainly falls within 3 standard errors of µ. (Empirical Rule)
- For large n, the sampling distribution of $\bar{x}$ is approximately normal no matter what the shape of the underlying population distribution.

Learning Objective 2: Standard Errors in Practice

In practice, standard errors are estimated

- Standard errors have exact values depending on parameter values, e.g.,
- $\sqrt{p(1-p)/n}$ for a sample proportion
- $\sigma/\sqrt{n}$ for a sample mean

- In practice, these parameter values are unknown. Inference methods use standard errors that substitute sample values for the parameters in the exact formulas above.
These estimated standard errors are the numbers we use in practice.

Learning Objective 3: Sampling Distribution for a Proportion

- The binomial probability distribution is the sampling distribution for the number of successes in n independent trials
- In practice, the sample proportion of successes is the statistic usually reported
- Since the sample proportion is simply the number of successes divided by the number of trials, the formulas for the mean and standard deviation of the sampling distribution of the proportion of successes are the formulas for the mean and standard deviation of the number of successes divided by n.

Learning Objective 3: Sampling Distribution for a Proportion

- For a binomial random variable with n trials and probability p of success for each, the sampling distribution of the proportion of successes has
- Mean = p
- Standard error = $\sqrt{p(1-p)/n}$

- For large n, by CLT, the sampling distribution can be approximated by a normal distribution with the same mean and the same standard error.

Section 8.1

What are Point and Interval Estimates of Population Parameters?

Homework #8

- Strong suggestion to visit: http://www.studio4learning.tv
- Try Math -> Statistics
- 8.1 to 8.60 All even-numbered questions
- Note: If a problem needs a simulation, the problem is optional.
- Hawkes: 8-1, 8-2, 8-3, 8-4

Learning Objectives

- Point Estimate and Interval Estimate
- Properties of Point Estimators
- Confidence Intervals
- Logic of Confidence Intervals
- Margin of Error
- Example

Learning Objective 1: Point Estimate and Interval Estimate

- A point estimate is a single number that is our “best guess” for the parameter
- An interval estimate is an interval of numbers within which the parameter value is believed to fall.

Learning Objective 1: Point Estimate vs. Interval Estimate

- A point estimate doesn’t tell us how close the estimate is likely to be to the parameter
- An interval estimate is more useful
- It incorporates a margin of error which helps us to gauge the accuracy of the point estimate

Learning Objective 2: Properties of Point Estimators

- Property 1: A good estimator has a sampling distribution that is centered at the parameter
- An estimator with this property is unbiased
- The sample mean is an unbiased estimator of the population mean
- The sample proportion is an unbiased estimator of the population proportion


Learning Objective 2: Properties of Point Estimators

- Property 2: A good estimator has a small standard error compared to other estimators
- This means it tends to fall closer than other estimates to the parameter
- The sample mean has a smaller standard error than the sample median when estimating the population mean of a normal distribution


Learning Objective 3: Confidence Interval

- A confidence interval is an interval containing the most believable values for a parameter
- The probability that this method produces an interval that contains the parameter is called the confidence level
- This is a number chosen to be close to 1, most commonly 0.95.

Learning Objective 4: Logic of Confidence Intervals

- To construct a confidence interval for a population proportion, start with the sampling distribution of a sample proportion, which
- Gives the possible values for the sample proportion and their probabilities
- Is approximately a normal distribution for large random samples by the CLT
- Has mean equal to the population proportion
- Has standard deviation called the standard error

Learning Objective 4: Logic of Confidence Intervals

- Fact: Approximately 95% of a normal distribution falls within 1.96 standard deviations of the mean
- With probability 0.95, the sample proportion falls within about 1.96 standard errors of the population proportion
- The distance of 1.96 standard errors is the margin of error in calculating a 95% confidence interval for the population proportion

Learning Objective 5: Margin of Error

- The margin of error measures how accurate the point estimate is likely to be in estimating a parameter
- It is a multiple of the standard error of the sampling distribution of the estimate when the sampling distribution is a normal distribution.
- The distance of 1.96 standard errors is the margin of error for a 95% confidence interval for a parameter when the sampling distribution is normal

Learning Objective 6: Example: CI for a Proportion

Example: The GSS asked 1823 respondents whether they agreed with the statement “It is more important for a wife to help her husband’s career than to have one herself”. 19% agreed. Assuming the standard error is 0.01, calculate a 95% confidence interval for the population proportion who agreed with the statement

- Margin of error = 1.96*se=1.96*0.01=0.02
- 95% CI = 0.19±0.02 or (0.17 to 0.21)
We predict that the population proportion who agreed is somewhere between 0.17 and 0.21.
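
The calculation can be reproduced in R, along with a check that the stated standard error of 0.01 is a rounded version of the usual estimate $\sqrt{\hat{p}(1-\hat{p})/n}$. This is a sketch; the variable names are illustrative.

```r
# 95% confidence interval for the GSS proportion, using the slide's rounded se
phat <- 0.19
n <- 1823
se_rounded <- 0.01                           # standard error as given in the example
se_computed <- sqrt(phat * (1 - phat) / n)   # estimated standard error from the data

z <- qnorm(0.975)                            # about 1.96
ci <- phat + c(-1, 1) * z * se_rounded

print(round(se_computed, 4))
print(round(ci, 2))
```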

Section 8.2

How Can We Construct a Confidence Interval to Estimate a Population Proportion?

Learning Objectives

- Finding the 95% Confidence Interval for a Population Proportion
- Sample Size Needed for Large-Sample Confidence Interval for a Proportion
- How Can We Use Confidence Levels Other than 95%?
- What is the Error Probability for the Confidence Interval Method?
- Summary
- Effect of the Sample Size
- Interpretation of the Confidence Level

Learning Objective 1: Finding the 95% Confidence Interval for a Population Proportion

- We symbolize a population proportion by p
- The point estimate of the population proportion is the sample proportion
- We symbolize the sample proportion by $\hat{p}$

Learning Objective 1: Finding the 95% Confidence Interval for a Population Proportion

- A 95% confidence interval uses a margin of error = 1.96(standard errors)
- CI = [point estimate ± margin of error] = $\hat{p} \pm 1.96(se)$ for a 95% confidence interval

Learning Objective 1: Finding the 95% Confidence Interval for a Population Proportion

- The exact standard error of a sample proportion equals: $\sqrt{p(1-p)/n}$
- This formula depends on the unknown population proportion, p
- In practice, we don’t know p, and we need to estimate the standard error as $se = \sqrt{\hat{p}(1-\hat{p})/n}$

Learning Objective 1: Finding the 95% Confidence Interval for a Population Proportion

- A 95% confidence interval for a population proportion p is: $\hat{p} \pm 1.96(se)$, where $se = \sqrt{\hat{p}(1-\hat{p})/n}$

Learning Objective 1: Example 1

- In 2000, the GSS asked: “Are you willing to pay much higher prices in order to protect the environment?”
- Of n = 1154 respondents, 518 were willing to do so

- Find and interpret a 95% confidence interval for the population proportion of adult Americans willing to do so at the time of the survey
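
A worked version of this example in R is sketched below. The helper name `prop_ci` is hypothetical; it implements the large-sample interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$.

```r
# Large-sample confidence interval for a population proportion
prop_ci <- function(successes, n, level = 0.95) {
  phat <- successes / n
  se <- sqrt(phat * (1 - phat) / n)    # estimated standard error
  z <- qnorm(1 - (1 - level) / 2)      # 1.96 for 95% confidence
  c(estimate = phat, lower = phat - z * se, upper = phat + z * se)
}

# GSS example: 518 of 1154 respondents willing to pay much higher prices
ci_env <- prop_ci(518, 1154)
print(round(ci_env, 3))
```

Read this way, the interval says we can be 95% confident that somewhere between about 42% and 48% of adult Americans were willing to pay much higher prices at the time of the survey.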

Learning Objective 2: Sample Size Needed for Large-Sample Confidence Interval for a Proportion

- For the 95% confidence interval for a proportion p to be valid, you should have at least 15 successes and 15 failures: $n\hat{p} \ge 15$ and $n(1-\hat{p}) \ge 15$

Learning Objective 3: How Can We Use Confidence Levels Other than 95%?

- “95% confidence” means that there’s a 95% chance that a sample proportion value occurs such that the confidence interval contains the unknown value of the population proportion, p
- With probability 0.05, the method produces a confidence interval that misses p

Learning Objective 3: How Can We Use Confidence Levels Other than 95%?

- In practice, the confidence level 0.95 is the most common choice
- But, some applications require greater (or less) confidence
- To increase the chance of a correct inference, we use a larger confidence level, such as 0.99

Learning Objective 3: How Can We Use Confidence Levels Other than 95%?

- In using confidence intervals, we must compromise between the desired margin of error and the desired confidence of a correct inference
- As the desired confidence level increases, the margin of error gets larger

Learning Objective 3: Example 2

- A recent GSS asked “If the wife in a family wants children, but the husband decides that he does not want any children, is it all right for the husband to refuse to have children?”
- Of 598 respondents, 366 said yes
- Calculate the 99% confidence interval

Learning Objective 3: Example 3

- Exit poll: Out of 1400 voters, 660 voted for the Democratic candidate.
- Calculate a 95% and a 99% Confidence Interval
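
The two intervals can be computed side by side in R. This is a sketch of the large-sample formula $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ applied to the exit-poll counts; the variable names are illustrative.

```r
# 95% and 99% confidence intervals for the exit-poll proportion
x <- 660    # voters for the Democratic candidate
n <- 1400   # voters polled

phat <- x / n
se <- sqrt(phat * (1 - phat) / n)

conf_levels <- c(0.95, 0.99)
z <- qnorm(1 - (1 - conf_levels) / 2)   # z-scores for each confidence level

intervals <- data.frame(level = conf_levels,
                        lower = phat - z * se,
                        upper = phat + z * se)
print(round(intervals, 3))
```

The 99% interval is wider than the 95% interval, and here only the 99% interval extends above 0.50.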

Learning Objective 4: What is the Error Probability for the Confidence Interval Method?

- The general formula for the confidence interval for a population proportion is:
Sample proportion ± (z-score)(std. error)

which in symbols is $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$

Learning Objective 5: Summary: Confidence Interval for a Population Proportion, p

- A confidence interval for a population proportion p is: $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$
- Assumptions
- Data obtained by randomization
- A large enough sample size n so that the number of successes, $n\hat{p}$, and the number of failures, $n(1-\hat{p})$, are both at least 15

Learning Objective 6: Effects of Confidence Level and Sample Size on Margin of Error

- The margin of error for a confidence interval:
- Increases as the confidence level increases
- Decreases as the sample size increases

Learning Objective 7: Interpretation of the Confidence Level

- If we used the 95% confidence interval method to estimate many population proportions, then in the long run about 95% of those intervals would give correct results, containing the population proportion

Simulation: R code

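# Simulate `rep` confidence intervals for a proportion and plot each one:
# black if the interval covers the true p, red if it misses.
CI = function(p = 0.3, n = 40, level = 0.95, rep = 1000){
  A = matrix(0, rep, 2)   # one row (lower, upper) per simulated interval
  plot(1, 1, type = "n", xlim = c(1, rep), ylim = c(0, 1), xlab = "simulation", ylab = "CI")
  abline(p, 0)            # horizontal reference line at the true p
  legend(1, 1, legend = c(paste("p =", p), paste("n =", n), paste("Level =", level)),
         text.col = 'blue', bty = 'n')
  for (i in 1:rep){
    phat = mean(rbinom(n, 1, p))                       # sample proportion from n Bernoulli(p) trials
    E = qnorm(1-(1-level)/2)*sqrt(phat*(1-phat)/n)     # margin of error
    a = A[i, ] = c(phat-E, phat+E)
    if ((p - a[1])*(p - a[2]) <= 0) lines(c(i, i), a, type = 'l', lwd = 4)
    else lines(c(i, i), a, type = 'l', col = "red", lwd = 4)
    Sys.sleep(.1)                                      # pause so the plot builds up gradually
  }
  print(A)
  # Proportion of simulated intervals that contain the true p
  ActualLevel = mean(apply(A, 1, function(a) (p - a[1])*(p - a[2]) <= 0))
  list("Nominal Confidence Level" = level, 'Actual Level' = ActualLevel)
}

CI(p = 0.3, n = 40, level = 0.95, rep = 80) ## rep = 5000 is suggested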

Section 8.3

How Can We Construct a Confidence Interval to Estimate a Population Mean?

Learning Objectives

- How to Construct a Confidence Interval for a Population Mean
- Properties of the t Distribution
- Formula for 95% Confidence Interval for a Population Mean
- How Do We Find a t Confidence Interval for Other Confidence Levels?
- If the Population is Not Normal, is the Method “Robust”?
- The Standard Normal Distribution is the t Distribution with df = ∞

Learning Objective 1: How to Construct a Confidence Interval for a Population Mean

- Point estimate ± margin of error
- The sample mean is the point estimate of the population mean
- The exact standard error of the sample mean is $\sigma/\sqrt{n}$
- In practice, we estimate σ by the sample standard deviation, s

Learning Objective 1: How to Construct a Confidence Interval for a Population Mean

- For large n from any population, and also
- For small n from an underlying population that is normal,
- The confidence interval for the population mean is: $\bar{x} \pm z(\sigma/\sqrt{n})$

Learning Objective 1: How to Construct a Confidence Interval for a Population Mean

- In practice, we don’t know the population standard deviation
- Substituting the sample standard deviation s for σ to get $se = s/\sqrt{n}$ introduces extra error
- To account for this increased error, we replace the z-score by a slightly larger score, the t-score

Learning Objective 2: Properties of the t Distribution

- The t-distribution is bell shaped and symmetric about 0
- The probabilities depend on the degrees of freedom, df=n-1
- The t-distribution has thicker tails than the standard normal distribution, i.e., it is more spread out

Learning Objective 2: t Distribution

The t-distribution has thicker tails and is more spread out than the standard normal distribution

Learning Objective 3: Formula for 95% Confidence Interval for a Population Mean

- When the standard deviation of the population is unknown, a 95% confidence interval for the population mean µ is: $\bar{x} \pm t_{.025}(se)$, where $se = s/\sqrt{n}$ and df = n - 1
- To use this method, you need:
- Data obtained by randomization
- An approximately normal population distribution

Learning Objective 3: Example: eBay Auctions of Palm Handheld Computers

Do you tend to get a higher, or a lower, price if you give bidders the “buy-it-now” option?

- Consider some data from sales of the Palm M515 PDA (personal digital assistant)
- During the first week of May 2003, 25 of these handheld computers were auctioned off, 7 of which had the “buy-it-now” option

Learning Objective 3: Example: eBay Auctions of Palm Handheld Computers

- Summary of selling prices for the two types of auctions:

Learning Objective 3: Example: eBay Auctions of Palm Handheld Computers

- Let µ denote the population mean for the “buy-it-now” option
- The estimate of µ is the sample mean: $\bar{x} = \$233.57$
- The sample standard deviation: s = $14.64
- Table B df=6, with 95% Confidence: t = 2.447
$233.57 \pm 2.447(14.64/\sqrt{7})$ = 233.57 ± 13.54 or (220.03, 247.11)
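
The "buy-it-now" interval can be reproduced in R, with `qt` taking the place of Table B. This is a sketch based on the summary statistics quoted in the example (7 auctions, mean $233.57, standard deviation $14.64); the variable names are illustrative.

```r
# 95% t confidence interval for the mean "buy-it-now" selling price
n <- 7
xbar <- 233.57
s <- 14.64

t_score <- qt(0.975, df = n - 1)   # t-score with 0.025 in each tail, df = 6
se <- s / sqrt(n)                  # estimated standard error of the sample mean
margin <- t_score * se
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(c(t = t_score, margin = margin), 3))
print(round(ci, 2))
```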

Learning Objective 3: Example: eBay Auctions of Palm Handheld Computers

- The 95% confidence interval for the mean sales price for the bidding only option is:
(220.70, 242.52)

- Notice that the two intervals overlap a great deal:
- “Buy-it-now”:(220.03, 247.11)
- Bidding only: (220.70, 242.52)

- There is not enough information for us to conclude that one probability distribution clearly has a higher mean than the other

Learning Objective 3: Example: Small Sample t Confidence Interval

- A study of 7 American adults from an SRS yields an average height of 67.2 inches and a standard deviation of 3.9 inches. Assuming the heights are normally distributed, a 95% confidence interval for the average height of all American adults (µ) is: $67.2 \pm 2.447(3.9/\sqrt{7}) = 67.2 \pm 3.6$
- “We are 95% confident that the average height of all American adults is between 63.6 and 70.8 inches.”

Learning Objective 3: Example: Small Sample t Confidence Interval

- In a time use study, 20 randomly selected managers spend a mean of 2.4 hours each day on paperwork. The standard deviation of the 20 times is 1.3 hours. Construct the 95% confidence interval for the mean paperwork time of all managers
- 95% CI = (1.79 < µ < 3.01)
Note that our calculation assumes that the distribution of times is normally distributed
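
Both small-sample examples (heights and paperwork time) follow the same recipe, so they can share one helper. The function name `t_ci` is hypothetical; it works from summary statistics and assumes an approximately normal population, as the examples do.

```r
# t confidence interval for a population mean from summary statistics
t_ci <- function(xbar, s, n, level = 0.95) {
  t_score <- qt(1 - (1 - level) / 2, df = n - 1)
  margin <- t_score * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

ci_paperwork <- t_ci(xbar = 2.4, s = 1.3, n = 20)   # managers' paperwork hours
ci_height <- t_ci(xbar = 67.2, s = 3.9, n = 7)      # adult heights in inches

print(round(ci_paperwork, 2))
print(round(ci_height, 1))
```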

Learning Objective 4: How Do We Find a t- Confidence Interval for Other Confidence Levels?

- The 95% confidence interval uses t.025 since 95% of the probability falls between - t.025 and t.025
- For 99% confidence, the error probability is 0.01 with 0.005 in each tail and the appropriate t-score is t.005
- To get other confidence intervals use the appropriate t-value from Table B
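
In R, `qt` can stand in for Table B. The sketch below lists the t-scores for three common confidence levels; the choice of df = 6 is only an illustration (it matches a sample of 7).

```r
# t-scores for several confidence levels at a fixed number of degrees of freedom
deg_free <- 6
conf_level <- c(0.90, 0.95, 0.99)
tail_area <- (1 - conf_level) / 2        # error probability in each tail
t_score <- qt(1 - tail_area, df = deg_free)

print(data.frame(conf_level, tail_area, t_score = round(t_score, 3)))
```

Higher confidence puts less probability in each tail, so the t-score and hence the margin of error grow.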

Learning Objective 5: If the Population is Not Normal, is the Method “Robust”?

- A basic assumption of the confidence interval using the t-distribution is that the population distribution is normal
- Many variables have distributions that are far from normal
- We say the t confidence interval is a robust method with respect to the normality assumption: it performs adequately even when that assumption is violated

Learning Objective 5: If the Population is Not Normal, is the Method “Robust”?

- How problematic is it if we use the t- confidence interval even if the population distribution is not normal?
- For large random samples, it’s not problematic because of the Central Limit Theorem

- What if n is small?
- Confidence intervals using t-scores usually work quite well except for when extreme outliers are present. The method is robust

Learning Objective 6: The Standard Normal Distribution is the t-Distribution with df = ∞
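
The claim in this slide title can be illustrated numerically: as the degrees of freedom grow, the t-score for 95% confidence shrinks toward the z-score 1.96. This is a sketch; the particular df values are arbitrary.

```r
# 97.5th percentile of the t distribution for increasing degrees of freedom
deg_free <- c(1, 5, 30, 100, Inf)
t_975 <- qt(0.975, df = deg_free)
z_975 <- qnorm(0.975)                 # standard normal value, about 1.96

print(data.frame(df = deg_free, t_score = round(t_975, 3)))
print(round(z_975, 3))
```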

Section 8.4

How Do We Choose the Sample Size for a Study?

Learning Objectives

- Sample Size for Estimating a Population Proportion
- Sample Size for Estimating a Population Mean
- What Factors Affect the Choice of the Sample Size?
- What if You Have to Use a Small n?
- Confidence Interval for a Proportion with Small Samples

Learning Objective 1: Sample Size for Estimating a Population Proportion

To determine the sample size,

- First, we must decide on the desired margin of error
- Second, we must choose the confidence level for achieving that margin of error
- In practice, 95% confidence intervals are most common

Learning Objective 1: Sample Size for Estimating a Population Proportion

- The random sample size n for which a confidence interval for a population proportion p has margin of error m (such as m = 0.04) is
$n = \dfrac{\hat{p}(1-\hat{p})\,z^2}{m^2}$
- In the formula for determining n, setting $\hat{p}$ = 0.50 gives the largest value for n out of all the possible values of $\hat{p}$

Learning Objective 1: Example 1: Sample Size For Exit Poll

- A television network plans to predict the outcome of an election between two candidates – Levin and Sanchez
- A poll one week before the election estimates 58% prefer Levin

- What is the sample size for which a 95% confidence interval for the population proportion has margin of error equal to 0.04?

Learning Objective 1: Example 1: Sample Size For Exit Poll

- The z-score is based on the confidence level, such as z = 1.96 for 95% confidence
- The 95% confidence interval for a population proportion p is: $\hat{p} \pm 1.96(se)$, where $se = \sqrt{\hat{p}(1-\hat{p})/n}$
- If the sample size is such that 1.96(se) = 0.04, then the margin of error will be 0.04

Learning Objective 1: Example 1: Sample Size For Exit Poll

- Using 0.58 as an estimate for p:
$n = \dfrac{(1.96)^2(0.58)(0.42)}{(0.04)^2} = 584.9$, or n = 585

- Without guessing, setting $\hat{p}$ = 0.50:
$n = \dfrac{(1.96)^2(0.50)(0.50)}{(0.04)^2} = 600.25$, so n = 601 gives us a more conservative estimate (always round up)

Learning Objective 1: Example 2

- Suppose a soft drink bottler wants to estimate the proportion of its customers that drink another brand of soft drink on a regular basis
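- What sample size will be required to enable us to have a 99% confidence interval with a margin of error of 1%?
- With no prior estimate of the proportion, use $\hat{p}$ = 0.50 and z = 2.58: $n = (2.58)^2(0.50)(0.50)/(0.01)^2 = 16{,}641$
- Thus, we will need to sample at least 16,641 of the soft drink bottler’s customers.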

Learning Objective 1: Example 3

- You want to estimate the proportion of home accident deaths that are caused by falls. How many home accident deaths must you survey in order to be 95% confident that your sample proportion is within 4% of the true population proportion?
- Answer: 601
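The three proportion examples above can be checked with a short calculation. This is a minimal sketch, not part of the original slides: the function name `sample_size_for_proportion` is hypothetical, the z-scores (1.96 for 95%, 2.58 for 99%) are supplied by hand from a normal table, and the small rounding step is an assumption added to keep floating-point noise from pushing an exact answer up by one.

```python
import math

def sample_size_for_proportion(margin, z, p_guess=0.5):
    """Sample size from n = p(1 - p) z^2 / m^2, rounded up to a whole subject."""
    if not 0 < p_guess < 1 or margin <= 0:
        raise ValueError("need 0 < p_guess < 1 and margin > 0")
    raw = p_guess * (1 - p_guess) * z**2 / margin**2
    # Round away floating-point noise first, then always round up.
    return math.ceil(round(raw, 6))

print(sample_size_for_proportion(0.04, 1.96, p_guess=0.58))  # exit poll, prior guess 0.58
print(sample_size_for_proportion(0.04, 1.96))                # conservative p = 0.50
print(sample_size_for_proportion(0.01, 2.58))                # soft drink bottler, 99%
```

Leaving `p_guess` at 0.5 reproduces the conservative choice described above, since p(1 - p) is largest at 0.5.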

Learning Objective 2: Sample Size for Estimating a Population Mean

- The random sample size n for which a confidence interval for a population mean has margin of error approximately equal to m is
$n = \dfrac{\sigma^2 z^2}{m^2}$
where the z-score is based on the confidence level, such as z = 1.96 for 95% confidence.

Learning Objective 2: Sample Size for Estimating a Population Mean

- In practice, you don’t know the value of the standard deviation, σ
- You must substitute an educated guess for σ
- Sometimes you can use the sample standard deviation from a similar study
- When no prior information is known, a crude estimate that can be used is to divide the estimated range of the data by 6 since for a bell-shaped distribution we expect almost all of the data to fall within 3 standard deviations of the mean

Learning Objective 2: Example 1

- A social scientist plans a study of adult South Africans to investigate educational attainment in the black community
- How large a sample size is needed so that a 95% confidence interval for the mean number of years of education has margin of error equal to 1 year? Assume that the education values will fall within a range of 0 to 18 years
- Crude estimate of σ = range/6 = 18/6 = 3, so $n = (1.96)^2(3)^2/(1)^2 = 34.6$; round up to n = 35

Learning Objective 2: Example 2

- Find the sample size necessary to estimate the mean height of all adult males to within 0.5 in. if we want 99% confidence in our results. From previous studies we estimate σ = 2.8.
- Answer: 209 (always round up)
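The two mean examples can be worked the same way. The sketch below is illustrative only: `sample_size_for_mean` and `crude_sigma` are hypothetical helper names, and the z-score of 2.58 for 99% confidence is an assumed table value.

```python
import math

def crude_sigma(low, high):
    """Range/6 guess for sigma, reasonable only for roughly bell-shaped data."""
    return (high - low) / 6

def sample_size_for_mean(sigma_guess, margin, z):
    """Sample size from n = sigma^2 z^2 / m^2, rounded up to a whole subject."""
    raw = (z * sigma_guess / margin) ** 2
    # Round away floating-point noise first, then always round up.
    return math.ceil(round(raw, 6))

# Example 1: education in years, range 0 to 18, margin of 1 year, 95% confidence.
print(sample_size_for_mean(crude_sigma(0, 18), margin=1, z=1.96))
# Example 2: heights, sigma guess 2.8 in., margin 0.5 in., 99% confidence.
print(sample_size_for_mean(2.8, margin=0.5, z=2.58))
```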

Learning Objective 3: What Factors Affect the Choice of the Sample Size?

- The first is the desired precision, as measured by the margin of error, m
- The second is the confidence level
- A third factor is the variability in the data
- If subjects have little variation (that is, σ is small), we need fewer data than if they have substantial variation

- A fourth factor is financial

Learning Objective 4: What if You Have to Use a Small n?

- The t methods for a mean are valid for any n
- However, you need to be extra cautious to look for extreme outliers or great departures from the normal population assumption

- In the case of the confidence interval for a population proportion, the method works poorly for small samples because the normal approximation given by the CLT is no longer reliable

Learning Objective 5: Confidence Interval for a Proportion with Small Samples

- If a random sample does not have at least 15 successes and 15 failures, the confidence interval formula
$\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$
is still valid if we use it after adding 2 to the original number of successes and 2 to the original number of failures. This results in adding 4 to the sample size n
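The adjustment described above can be sketched in a few lines. The function name `plus_four_interval` and the sample of 3 successes in 12 trials are hypothetical and serve only as an illustration. The default z = 1.96 assumes 95% confidence.

```python
from math import sqrt

def plus_four_interval(successes, n, z=1.96):
    """Large-sample formula applied after adding 2 successes and 2 failures."""
    adj_successes = successes + 2
    adj_n = n + 4  # two extra successes plus two extra failures
    p_adj = adj_successes / adj_n
    margin = z * sqrt(p_adj * (1 - p_adj) / adj_n)
    return p_adj - margin, p_adj + margin

# Hypothetical small sample: 3 successes in 12 trials (fewer than 15 of each outcome).
low, high = plus_four_interval(3, 12)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```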

Section 8.5

How Do Computers Make New Estimation Methods Possible?

Learning Objectives

- The Bootstrap

Learning Objective 1: The Bootstrap: Using Simulation to Construct a Confidence Interval

- When it is difficult to derive a standard error or a confidence interval formula that works well, you can use simulation.
- The bootstrap is a simulation method that resamples from the observed data. It treats the data distribution as if it were the population distribution

Learning Objective 1: The Bootstrap: Using Simulation to Construct a Confidence Interval

- To use the bootstrap method
- Resample, with replacement, n observations from the data distribution
- For the new sample of size n, construct the point estimate of the parameter of interest
- Repeat process a very large number of times (e.g., selecting 10,000 separate samples of size n and calculating the 10,000 corresponding parameter estimates)

Learning Objective 1: The Bootstrap: Using Simulation to Construct a Confidence Interval

Example:

- Suppose your data set includes the following:
This data has a mean of 161.44 and standard deviation of 0.63.

- Use the bootstrap method to find a 95% confidence interval for the population standard deviation

Learning Objective 1: The Bootstrap: Using Simulation to Construct a Confidence Interval

- Re-sample with replacement from this sample of size 10 and compute the standard deviation of the new sample
- Repeat this process 100,000 times and draw a histogram of the 100,000 resampled standard deviations (histogram not reproduced in this transcript)

Learning Objective 1: The Bootstrap: Using Simulation to Construct a Confidence Interval

- Now, identify the middle 95% of these 100,000 sample standard deviations (take the 2.5th and 97.5th percentiles).
- For this example, these percentiles are 0.26 and 0.80.
- The 95% bootstrap confidence interval for σ is (0.26, 0.80)
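The resampling steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the slide's own computation: the ten values in `heights` are hypothetical (the original data table did not survive in the transcript), the function name `bootstrap_sd_interval` is made up, and 10,000 resamples are used instead of 100,000 to keep the run short. The resulting interval will therefore not match (0.26, 0.80).

```python
import random
import statistics

def bootstrap_sd_interval(data, reps=10_000, seed=0):
    """Percentile bootstrap interval (2.5th to 97.5th) for the standard deviation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    estimates = []
    for _ in range(reps):
        resample = rng.choices(data, k=n)  # n draws with replacement
        estimates.append(statistics.stdev(resample))
    # quantiles(..., n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(estimates, n=40)
    return cuts[0], cuts[-1]

# Hypothetical measurements standing in for the missing data table.
heights = [160.2, 160.8, 161.1, 161.3, 161.4, 161.5, 161.7, 161.9, 162.0, 162.5]
low, high = bootstrap_sd_interval(heights)
print(f"95% bootstrap interval for sigma: ({low:.2f}, {high:.2f})")
```

The same function works for other statistics if `statistics.stdev` is swapped for another point estimate, which is the main appeal of the bootstrap when no convenient standard error formula exists.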

Bootstrapping Confidence Intervals Using Excel

- Open Excel -> Data Analysis -> Random Number Generation -> Fill in
Number of Variables = 10,000

Number of Random Numbers = the same size

Distribution = Discrete

Section 9.1: What Are the Steps for Performing a Significance Test?

Homework #9 and Hawkes

- HW #9
- Page 412: 2, 4, 8
- Page 426: 12, 14, 16, 18, 22, 24
- Page 439: 28, 30, 32, 34
- Page 445: 42, 44, 46, 48, 50
- Page 452: 52, 54, 56, 58
- Page 458: 60, 62, 64
- Hawkes: 9.1 - 9.4, 9.6

Learning Objectives

- 5 Steps of a Significance Test
- Assumptions
- Hypotheses
- Calculate the test statistic
- P-Value
- Conclusion and Statistical Significance

Learning Objective 1: Significance Test

- A significance test is a method of using data to summarize the evidence about a hypothesis
- A significance test about a hypothesis has five steps
- Assumptions
- Hypotheses
- Test Statistic
- P-value
- Conclusion

Learning Objective 2: Step 1: Assumptions

- A (significance) test assumes that the data production used randomization
- Other assumptions may include:
- Assumptions about the sample size
- Assumptions about the shape of the population distribution

Learning Objective 3: Step 2: Hypothesis

- A hypothesis is a statement about a population, usually of the form that a certain parameter takes a particular numerical value or falls in a certain range of values
- The main goal in many research studies is to check whether the data support certain hypotheses

Learning Objective 3: Step 2: Hypotheses

- Each significance test has two hypotheses:
- The null hypothesis is a statement that the parameter takes a particular value.
- The alternative hypothesis states that the parameter falls in some alternative range of values.

Learning Objective 3: Null and Alternative Hypotheses

- The value in the null hypothesis, called the claimed (or hypothesized) value, usually represents no effect
- The symbol H0 denotes the null hypothesis

- The value in the alternative hypothesis usually represents an effect of some type
- The symbol Ha denotes the alternative hypothesis
- The alternative hypothesis should express what the researcher hopes to show.

- The hypotheses should be formulated before viewing or analyzing the data!

Learning Objective 4: Step 3: Test Statistic

- A test statistic describes how far the point estimate falls from the claimed value (usually in terms of the number of standard errors between the two).
- If the test statistic falls far from the claimed value in the direction specified by the alternative hypothesis, it is good evidence against the null hypothesis and in favor of the alternative hypothesis.
- We use the test statistic to assess the evidence against the null hypothesis by giving a probability, the P-value.

Learning Objective 5: Step 4: P-value

- To interpret a test statistic value, we use a probability summary of the evidence against the null hypothesis, H0
- First, we presume that H0 is true
- Next, we consider the sampling distribution from which the test statistic comes
- We summarize how far out in the tail of this sampling distribution the test statistic falls

Learning Objective 5: Step 4: P-value

- We summarize how far out in the tail the test statistic falls by the tail probability of that value and values even more extreme
- This probability is called a P-value
- The smaller the P-value, the stronger the evidence is against H0

Learning Objective 5: Step 4: P-value

Note: This is just one of 3 stories.

Learning Objective 5: Step 4: P-value

- The P-value is the probability that the test statistic equals the observed value or a value even more extreme in favor of Ha.
- It is calculated by presuming that the null hypothesis H0 is true
The smaller the P-value, the stronger the evidence the data provide against the null hypothesis. That is, a small P-value indicates a small likelihood of observing the sampled results if the null hypothesis were true.

Learning Objective 6: Step 5: Conclusion

- The conclusion of a significance test reports the P-value and interprets what it says about the question that motivated the test

Section 9.2: Significance Tests About Proportions

Learning Objectives

- Steps of a Significance Test about a Population Proportion
- Example: One-Sided Hypothesis Test
- How Do We Interpret the P-value?
- Two-Sided Hypothesis Test for a Population Proportion
- Summary of P-values for Different Alternative Hypotheses
- Significance Level
- One-Sided vs Two-Sided Tests
- The Binomial Test for Small Samples

Learning Objective 1: Steps of a Significance Test about a Population Proportion

Step 1: Assumptions

- The variable is categorical
- The data are obtained using randomization
- The sample size is sufficiently large that the sampling distribution of the sample proportion is approximately normal:
- np0 ≥ 15 and n(1-p0) ≥ 15, where p0 is the null hypothesis value

Learning Objective 1: Steps of a Significance Test about a Population Proportion

Step 2: Hypotheses

- The null hypothesis has the form:
- H0: p = p0

- The alternative hypothesis has the form:
- Ha: p > p0 (one-sided test) or
- Ha: p < p0 (one-sided test) or
- Ha: p ≠ p0 (two-sided test)

Learning Objective 1: Steps of a Significance Test about a Population Proportion

Step 3: Test Statistic

- The test statistic measures how far the sample proportion falls from the null hypothesis value, p0, relative to what we’d expect if H0 were true
- The test statistic is: $z = \dfrac{\hat{p} - p_0}{se_0}$, where $se_0 = \sqrt{p_0(1-p_0)/n}$

Learning Objective 1: Steps of a Significance Test about a Population Proportion

Step 4: P-value

- The P-value summarizes the evidence
- It describes how unusual the observed data would be if H0 were true

Learning Objective 1: Steps of a Significance Test about a Population Proportion

Step 5: Conclusion

- We summarize the test by reporting and interpreting the P-value

Learning Objective 1: Example 1: Are Astrologers’ Predictions Better Than Guessing?

An astrologer prepares horoscopes for 116 adult volunteers. Each subject also filled out a California Personality Index (CPI) survey. For a given adult, his or her horoscope is shown to the astrologer along with their CPI survey as well as the CPI surveys for two other randomly selected adults. The astrologer is asked which survey is the correct one for that adult

- With random guessing, p = 1/3
- The astrologers’ claim: p > 1/3
- The hypotheses for this test:
- H0: p = 1/3
- Ha: p > 1/3

Learning Objective 2: Example 1

Step 1: Assumptions

- The data is categorical – each prediction falls in the category “correct” or “incorrect”
- Subjects were randomly selected
- np=116(1/3) > 15
- n(1-p) = 116(2/3) > 15

Learning Objective 2: Example 1

Step 3: Test Statistic:

In the actual experiment, the astrologers were correct with 40 of their 116 predictions (a success rate of 0.345)

- $se_0 = \sqrt{(1/3)(2/3)/116} = 0.0438$, so $z = (40/116 - 1/3)/0.0438 \approx 0.26$

Learning Objective 2: Example 1

Step 5: Conclusion

- The P-value of 0.40 is not especially small
- It does not provide strong evidence against H0:p = 1/3
- There is not strong evidence that astrologers have special predictive powers
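The astrology example can be reproduced numerically. This is a small sketch rather than the slide's own working: the function name `one_sided_z_test` is hypothetical, and the right-tail probability is taken from the standard normal distribution in Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_z_test(successes, n, p0):
    """z statistic and right-tail P-value for H0: p = p0 against Ha: p > p0."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    z = (p_hat - p0) / se0
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

z, p_value = one_sided_z_test(40, 116, 1 / 3)
print(f"z = {z:.2f}, P-value = {p_value:.2f}")  # z = 0.26, P-value = 0.40
```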

Learning Objective 3: How Do We Interpret the P-value?

- A significance test analyzes the strength of the evidence against the null hypothesis
- The smaller the P-value, the stronger the evidence against the null hypothesis
- That is, if the P-value is small, the data contradict H0 and support Ha

Learning Objective 4: Two-Sided Significance Tests

- A two-sided alternative hypothesis has the form Ha: p ≠ p0
- The P-value is the two-tail probability under the standard normal curve
- We calculate this by finding the tail probability in a single tail and then doubling it

Learning Objective 4: Example 2

- Study: investigate whether dogs can be trained to distinguish a patient with bladder cancer by smelling compounds released in the patient’s urine

Learning Objective 4: Example 2

- Experiment:
- Each of 6 dogs was tested with 9 trials
- In each trial, one urine sample from a bladder cancer patient was randomly placed among 6 control urine samples

Learning Objective 4: Example 2

- Results:
In a total of 54 trials with the six dogs, the dogs made the correct selection 22 times (a success rate of 0.407)

Learning Objective 4: Example 2

- Does this study provide strong evidence that the dogs’ predictions were better or worse than with random guessing?
- With random guessing among the 7 samples in a trial, the probability of a correct selection is 1/7, so we test H0: p = 1/7 against Ha: p ≠ 1/7

Learning Objective 4: Example 2

Step 1: Check the sample size requirement:

- Is the sample size sufficiently large to use the hypothesis test for a population proportion?
- Is np0 ≥ 15 and n(1-p0) ≥ 15?
- 54(1/7) = 7.7 and 54(6/7) = 46.3

- The first, np0, is not large enough
- We will see that the two-sided test is robust when this assumption is not satisfied

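Learning Objective 4: Example 2

Step 3: Test Statistic

- $z = \dfrac{22/54 - 1/7}{\sqrt{(1/7)(6/7)/54}} \approx 5.6$

Learning Objective 4: Example 2

Step 4: P-value

- The P-value is the two-tail probability beyond |z| = 5.6 under the standard normal curve, which is far smaller than 0.0001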

Learning Objective 4: Example 2

Step 5: Conclusion

- Since the P-value is very small and the sample proportion is greater than 1/7, the evidence strongly suggests that the dogs’ selections are better than random guessing
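For the dog study, the two-sided calculation (find one tail, then double it) can be sketched as follows. The function name `two_sided_z_test` is hypothetical, and returning a flag for the "at least 15 expected successes and failures" guideline is an added convenience, not something from the slides.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(successes, n, p0):
    """z statistic, two-sided P-value, and the large-sample check for H0: p = p0."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    one_tail = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in a single tail
    large_enough = n * p0 >= 15 and n * (1 - p0) >= 15
    return z, 2 * one_tail, large_enough

z, p_value, large_enough = two_sided_z_test(22, 54, 1 / 7)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.1e}, guideline met: {large_enough}")
```

The flag comes back false here because 54(1/7) is below 15, which matches the sample size check in Step 1.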

Learning Objective 4: Example 2

- Insight:
- In this study, the subjects were a convenience sample rather than a random sample from some population
- Also, the dogs were not randomly selected
- Any inferential predictions are highly tentative.They are valid only to the extent that the patients and the dogs are representative of their populations
- The predictions become more conclusive if similar results occur in other studies

Learning Objective 4: Example 2

Learning Objective 5: Summary of P-values for Different Alternative Hypotheses

Learning Objective 6: The Significance Level Tells Us How Strong the Evidence Must Be

- Sometimes we need to make a decision about whether the data provide sufficient evidence to reject H0
- Before seeing the data, we decide how small the P-value would need to be to reject H0
- This cutoff point is called the significance level

Learning Objective 6: The Significance Level Tells Us How Strong the Evidence Must Be

Learning Objective 6: Significance Level

- The significance level is a number such that we reject H0 if the P-value is less than or equal to that number
- In practice, the most common significance level is 0.05
- When we reject H0 we say the results are statistically significant

Learning Objective 6: Possible Decisions in a Hypothesis Test

Learning Objective 6: Report the P-value

- Learning the actual P-value is more informative than learning only whether the test is “statistically significant at the 0.05 level”
- The P-values of 0.01 and 0.049 are both statistically significant in this sense, but the first P-value provides much stronger evidence against H0 than the second

Learning Objective 6: “Do Not Reject H0” Is Not the Same as Saying “Accept H0”

- Analogy: Legal trial
- Null Hypothesis: Defendant is Innocent
- Alternative Hypothesis: Defendant is Guilty
- If the jury acquits the defendant, this does not mean that it accepts the defendant’s claim of innocence
- Innocence is plausible, because guilt has not been established beyond a reasonable doubt

Learning Objective 7: One-Sided vs Two-Sided Tests

- Things to consider in deciding on the alternative hypothesis:
- The context of the real problem
- In most research articles, significance tests use two-sided P-values
- Confidence intervals are two-sided

Learning Objective 8: The Binomial Test for Small Samples

- In practice, the large-sample z test still performs quite well for two-sided alternatives even with small samples.
- Warning: For one-sided tests, when p0 differs from 0.50, the large-sample test does not work well for small samples. In that case, we can find the exact P-value using the binomial distribution.
- Example: A coin was flipped 20 times and the coin turned up heads 14 times. Test H0: p = 0.5 against Ha: p ≠ 0.5.
- Solution: Let X be the number of heads among the 20 tosses. The exact p-value equals
P(X ≥ 14) + P(X ≤ 6) = 0.05766 + 0.05766 = 0.115.
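The coin example can be verified by summing binomial probabilities directly. This sketch covers only the symmetric case p0 = 0.5 used in the example; the function name `exact_two_sided_p_value_fair` is hypothetical.

```python
from math import comb

def exact_two_sided_p_value_fair(x, n):
    """Exact two-sided P-value for H0: p = 0.5, summing both equally extreme tails."""
    distance = abs(x - n / 2)
    # Count outcomes at least as far from n/2 as the observed count, in either tail.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2**n

p_value = exact_two_sided_p_value_fair(14, 20)
print(f"exact P-value = {p_value:.3f}")  # exact P-value = 0.115
```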

Learning Objective 9: Class Exercise 1

- In a survey by Media General and the Associated Press, 813 of the 1084 respondents indicated support for a ban on household aerosols. At the 1% significance level, test the claim that more than 70% of the population supports the ban.

Learning Objective 9: Class Exercise 2

- In a Roper Organization poll of 2000 adults, 1280 have money in regular savings accounts. Use this sample data to test the claim that less than 65% of all adults have money in regular savings accounts. Use a 5% level of significance.

Learning Objective 9: Class Exercise 3

- According to a Harris Poll, 71% of Americans believe that the overall cost of lawsuits is too high. If a random sample of 500 people results in 74% who hold that belief, test the claim that the actual percentage is 71%. Use a 10% significance level.

Section 9.3: Significance Tests About Means

Learning Objectives

- Steps of a Significance Test about a Population Mean
- Summary of P-values for Different Alternative Hypotheses
- Example: Significance Test for a Population Mean
- Results of Two-Sided Tests and Results of Confidence Intervals Agree
- What If the Population Does Not Satisfy the Normality Assumption?
- Regardless of Robustness, Look at the Data

Learning Objective 1: Steps of a Significance Test About a Population Mean

- Step 1: Assumptions
- The variable is quantitative
- The data are obtained using randomization
- The population distribution is approximately normal. This is most crucial when n is small and Ha is one-sided.

Learning Objective 1: Steps of a Significance Test About a Population Mean

- Step 2: Hypotheses:
- The null hypothesis has the form:
- H0: µ = µ0

- The alternative hypothesis has the form:
- Ha: µ > µ0 (one-sided test) or
- Ha: µ < µ0 (one-sided test) or
- Ha: µ ≠ µ0 (two-sided test)

Learning Objective 1: Steps of a Significance Test About a Population Mean

- Step 3: Test Statistic
- The test statistic measures how far the sample mean falls from the null hypothesis value µ0, as measured by the number of standard errors between them
- The test statistic is: $t = \dfrac{\bar{x} - \mu_0}{se}$, where $se = s/\sqrt{n}$

Learning Objective 1: Steps of a Significance Test About a Population Mean

- Step 4: P-value
- The P-value summarizes the evidence
- It describes how unusual the data would be if H0 were true

Learning Objective 1: Steps of a Significance Test About a Population Mean

- Step 5: Conclusion
- We summarize the test by reporting and interpreting the P-value

Learning Objective 2: Summary of P-values for Different Alternative Hypotheses

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- A study compared different psychological therapies for teenage girls suffering from anorexia
- The variable of interest was each girl’s weight change: ‘weight at the end of the study’ – ‘weight at the beginning of the study’

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- One of the therapies was cognitive therapy
- In this study, 29 girls received the therapeutic treatment
- The weight changes for the 29 girls had a sample mean of 3.00 pounds and standard deviation of 7.32 pounds

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- How can we frame this investigation in the context of a significance test that can detect whether the therapy was effective?
- Null hypothesis: “no effect”
- Alternative hypothesis: therapy is “effective”

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- Step 1: Assumptions
- The variable (weight change) is quantitative
- The subjects are a good representation of all girls with anorexia.
- The population distribution is approximately normal

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- Step 2: Hypotheses
- H0: µ = 0
- Ha: µ > 0

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- Step 3: Test Statistic
- $se = s/\sqrt{n} = 7.32/\sqrt{29} = 1.36$, so $t = (3.00 - 0)/1.36 = 2.21$

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- Step 4: P-value
The P-value is the area to the right of t = 2.21 for the t sampling distribution with 28 df. This value is 0.018.

- If the treatment had no effect, the probability of obtaining a sample this extreme or more extreme would be 0.018

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- Step 5: Conclusion
- The small P-value of 0.018 provides considerable evidence against the null hypothesis (the hypothesis that the therapy had no effect)

Learning Objective 3: Example: Mean Weight Change in Anorexic Girls

- “The therapy had a statistically significant positive effect on weight (mean change = 3 pounds, n = 29, t = 2.21, P-value = 0.018)”
- The effect, however, may be small in practical terms
- 95% CI for µ: (0.2, 5.8) pounds
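The numbers in this example can be checked without a statistics package. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the right-tail area of the t distribution is approximated by Simpson's rule on the t density (a substitute for a t table, not the method used on the slides), and the critical value 2.048 for 28 df is typed in from a t table.

```python
import math

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_right_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the area on [0, t] (Simpson's rule)."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

def one_sample_t(mean, sd, n, mu0):
    """t statistic and standard error for H0: mu = mu0."""
    se = sd / math.sqrt(n)
    return (mean - mu0) / se, se

t, se = one_sample_t(3.00, 7.32, 29, 0)
p_value = t_right_tail(t, df=28)
t_crit = 2.048  # assumed t-table value for 95% confidence with 28 df
ci = (3.00 - t_crit * se, 3.00 + t_crit * se)
print(f"t = {t:.2f}, P-value = {p_value:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

The interval excludes 0 but its lower end is close to 0, which is the point of the remark that the effect may be small in practical terms.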

Learning Objective 3: Example: Does Low Carbohydrate Diet Work?

- After 16 weeks on a diet, 41 subjects lost an average of 9.7 kg with a standard deviation of 3.4 kg
- Calculate the P-value for testing H0: μ = 0 against Ha: μ < 0, where μ is the mean weight change (so the sample mean is -9.7 kg)

Learning Objective 4: Results of Two-Sided Tests and Results of Confidence Intervals Agree

- Conclusions about means using two-sided significance tests are consistent with conclusions using confidence intervals
- If P-value ≤ 0.05 in a two-sided test, a 95% confidence interval does not contain the value specified by the null hypothesis
- If P-value > 0.05 in a two-sided test, a 95% confidence interval does contain the value specified by the null hypothesis

Learning Objective 5: What If the Population Does Not Satisfy the Normality Assumption?

- For large samples (roughly about 30 or more) this assumption is usually not important
- The sampling distribution of $\bar{x}$ is approximately normal regardless of the population distribution

Learning Objective 5: What If the Population Does Not Satisfy the Normality Assumption?

- In the case of small samples, we cannot assume that the sampling distribution of $\bar{x}$ is approximately normal
- Two-sided inferences using the t distribution are robust against violations of the normal population assumption. They still usually work well if the actual population distribution is not normal
- The test does NOT work well for a one-sided test with small n when the population distribution is highly skewed

Learning Objective 6: Regardless of Robustness, Look at the Data

- Whether n is small or large, you should look at the data to check for severe skew or for severe outliers
- In these cases, the sample mean could be a misleading measure

Section 9.4: Decisions and Types of Errors in Significance Tests

Learning Objectives

- Type I and Type II Errors
- Significance Test Results
- Type I Errors
- Type II Errors
- α, β, and Power

Learning Objective 1: Type I and Type II Errors

- When H0 is true, a Type I Error occurs when H0 is rejected
- When H0 is false, a Type II Error occurs when H0 is not rejected
- As P(Type I Error) goes down, P(Type II Error) goes up
- The two probabilities are inversely related

Learning Objective 2: Significance Test Results

Learning Objective 3: Decision Errors: Type I

- If we reject H0 when in fact H0 is true, this is a Type I error.
- If we decide there is a significant relationship in the population (reject the null hypothesis):
- This is an incorrect decision only if H0 is true.
- The probability of this incorrect decision is equal to α.

- If we reject the null hypothesis when it is true and α = 0.05:
- There really is no relationship and the extremity of the test statistic is due to chance.
- About 5% of all samples from this population will lead us to incorrectly reject the null hypothesis and conclude significance.

Learning Objective 3: P(Type I Error) = Significance Level α

- Suppose H0 is true. The probability of rejecting H0, thereby committing a Type I error, equals the significance level, α, for the test.
- We can control the probability of a Type I error by our choice of the significance level
- The more serious the consequences of a Type I error, the smaller α should be
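The claim that P(Type I error) equals α can be illustrated by simulation. This is a sketch under assumptions not taken from the slides: samples are drawn from a standard normal population (so H0: µ = 0 is true), σ = 1 is treated as known so that a two-sided z test applies, and the function name `type_one_error_rate` is hypothetical.

```python
import random
from math import sqrt

def type_one_error_rate(n=30, reps=20_000, z_crit=1.96, seed=1):
    """Fraction of samples, drawn with H0 true, in which the z test rejects H0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
        z = sample_mean / (1 / sqrt(n))  # (x-bar - 0) / (sigma / sqrt(n)) with sigma = 1
        if abs(z) >= z_crit:  # two-sided test at the 0.05 level
            rejections += 1
    return rejections / reps

rate = type_one_error_rate()
print(f"observed rejection rate when H0 is true: {rate:.3f}")
```

The observed rate should land close to 0.05, the "about 5% of all samples" described above.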

Learning Objective 4: Decision Errors: Type II

- If we fail to reject H0 when in fact Ha is true, this is a Type II error.
- If we decide not to reject the null hypothesis and thus allow for the plausibility of the null hypothesis
- We make an incorrect decision only if Ha is true.
- The probability of this incorrect decision is denoted by β

Learning Objective 5: α, β, and Power

- The probability that a fixed level α significance test will reject H0 when a particular alternative value of the parameter is true is called the power of the test against that specific alternative value. Power = 1 - β.
- While α gives the probability of wrongly rejecting H0 when in fact H0 is true, power gives the probability of correctly rejecting H0 when in fact H0 should be rejected (because the value of the parameter is some specific value satisfying the alternative hypothesis)
- When µ is close to µ0, the test will find it hard to distinguish between the two (low power); however, when µ is far from µ0, the test will find it easier to find a difference (high power).
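The relationship between power and the distance from µ0 can be made concrete with a simplified case. The sketch below assumes a one-sided z test with known σ (an assumption made for illustration; the slides work with t tests), and the function name `power_one_sided_z` and the numbers in the loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu_true, mu0, sigma, n, alpha=0.05):
    """Power of the z test of H0: mu = mu0 against Ha: mu > mu0 when mu = mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # reject H0 when z >= z_crit
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # distance from mu0 in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical setting: mu0 = 0, sigma = 1, n = 25.
for mu_true in (0.0, 0.2, 0.5, 0.8):
    power = power_one_sided_z(mu_true, mu0=0, sigma=1, n=25)
    print(f"mu = {mu_true:.1f}: power = {power:.3f}, beta = {1 - power:.3f}")
```

At µ = µ0 the rejection probability equals α, and it rises toward 1 as µ moves away from µ0.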

Section 9.5: Limitations of Significance Tests

Learning Objectives

- Statistical Significance vs. Practical Significance
- Significance Tests Are Less Useful Than Confidence Intervals
- Misinterpretations of Results of Significance Tests
- Where Did the Data Come From?

Learning Objective 1: Statistical Significance Does Not Mean Practical Significance

- When we conduct a significance test, its main relevance is studying whether the true parameter value is:
- Above, or below, the value in H0 and
- Sufficiently different from the value in H0 and its direction from that value

- The test does not tell us about the practical importance of the results

Learning Objective 1: Statistical Significance vs. Practical Significance

- When the sample size is very large, tiny deviations from the null hypothesis (with little practical consequence) may be found to be statistically significant.
- When the sample size is very small, large deviations from the null hypothesis (of great practical importance) might go undetected (statistically insignificant).
So, statistical significance is not the same thing as practical significance.
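
A short numerical sketch makes the contrast concrete. All numbers here are hypothetical, chosen only for illustration: a one-sided z test of H0: µ = 100 with known σ = 15, once with a tiny deviation and a very large sample, and once with a large deviation and a very small sample.

```python
from statistics import NormalDist

def one_sided_p_value(sample_mean, mu0, sigma, n):
    """P-value for Ha: mu > mu0 using a z test with known sigma."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z)

# Tiny deviation (0.5 units), very large sample: statistically significant.
p_large_n = one_sided_p_value(sample_mean=100.5, mu0=100, sigma=15, n=10_000)

# Large deviation (10 units), very small sample: not statistically significant.
p_small_n = one_sided_p_value(sample_mean=110, mu0=100, sigma=15, n=4)

print(f"n = 10000, deviation 0.5: P-value = {p_large_n:.4f}")
print(f"n = 4,     deviation 10:  P-value = {p_small_n:.4f}")
```

The first P-value falls well below 0.05 and the second stays above it, even though the second deviation is twenty times larger.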

Learning Objective 1: Statistical Significance vs. Practical Significance

- A small P-value, such as 0.001, is highly statistically significant, but it does not imply an important finding in any practical sense
- In particular, whenever the sample size is large, small P-values can occur even when the point estimate is near the parameter value in H0

Learning Objective 2: Significance Tests Are Less Useful Than Confidence Intervals

- A significance test merely indicates whether the particular parameter value in H0 is plausible
- When a P-value is small, the significance test indicates that the hypothesized value is not plausible, but it tells us little about which potential parameter values are plausible
- A Confidence Interval is more informative, because it displays the entire set of believable values
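
As a rough illustration of why the interval says more than the test, the sketch below computes a 95% confidence interval for a mean. The data are hypothetical (sample mean 100.5, known σ = 15, n = 10,000, with H0: µ = 100), and the helper name is an illustrative choice.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean using the z distribution with known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / n ** 0.5
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=100.5, sigma=15, n=10_000)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

With these numbers the interval excludes 100, so a test would reject H0, but the interval also shows that every plausible value of µ lies within about one unit of 100.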

Learning Objective 3: Misinterpretations of Results of Significance Tests

- “Fail to Reject H0” does not mean “Accept H0”
- A P-value above 0.05 when the significance level is 0.05 does not mean that H0 is correct
- A test merely indicates whether a particular parameter value is plausible
- When we fail to reject H0: µ = 10, we do not mean to accept µ = 10, but mean that 10 is a plausible value for µ.

Learning Objective 3: Misinterpretations of Results of Significance Tests

- Statistical significance does not mean practical significance
- A small P-value does not tell us whether the parameter value differs by much in practical terms from the value in H0

- The P-value cannot be interpreted as the probability that H0 is true or not true.

Learning Objective 3: Misinterpretations of Results of Significance Tests

- It is misleading to report results only if they are “statistically significant”
- Some tests may be statistically significant just by chance
- True effects may not be as large as initial estimates reported by the media
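
How easily "significant by chance" results arise can be sketched with one formula. Assuming k independent tests in which every null hypothesis is true, each run at level α, the chance of at least one false rejection is 1 – (1 – α)^k. The independence assumption and the function name are illustrative, not from the slides.

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent tests rejects a true H0."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20, 100):
    p = prob_at_least_one_false_positive(k)
    print(f"{k:3d} tests: P(at least one false positive) = {p:.3f}")
```

With 20 such tests the chance is already well above one half, which is why reporting only the significant results is misleading.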

Learning Objective 4: Where Did the Data Come From?

- When you use statistical inference, you are acting as if your data are a probability sample or come from a randomized experiment.
- Statistical confidence intervals and tests cannot remedy basic flaws in producing data, such as voluntary response samples or uncontrolled experiments.
- If the data do not come from a probability sample or a randomized experiment, the conclusions may be open to challenge. To answer the challenge, ask whether the data can be trusted as a basis for the conclusions of the study.

Section 9.6: How Likely is a Type II Error

Learning Objectives

- Type II Error
- Calculating Type II Error
- Power of a Test

Learning Objective 1: Type II Error

- A Type II error occurs in a hypothesis test when we fail to reject H0 even though it is actually false
- To calculate the probability of a Type II error, we must perform separate calculations for various values of the parameter of interest

Learning Objective 2: Example 1: Calculating a Type II Error

- Scientific “test of astrology” experiment:
- For each of 116 adult volunteers, an astrologer prepared a horoscope based on the positions of the planets and the moon at the moment of the person’s birth
- Each adult subject also filled out a California Personality Index (CPI) survey

Learning Objective 2: Example 1: Calculating a Type II Error

- For a given adult, his or her birth data and horoscope were shown to the astrologer together with the results of the personality survey for that adult and for two other adults randomly selected from the group
- The astrologer was asked which personality chart of the 3 subjects was the correct one for that adult, based on his or her horoscope

Learning Objective 2: Example 1: Calculating a Type II Error

- 28 astrologers were randomly chosen to take part in the experiment
- The National Council for Geocosmic Research claimed that the probability of a correct guess on any given trial in the experiment was larger than 1/3, the value for random guessing

Learning Objective 2: Example 1: Calculating a Type II Error

- The significance level used for the test is 0.05
- With random guessing, p = 1/3
- The astrologers’ claim: p > 1/3
- The hypotheses for this test:
- H0: p = 1/3
- Ha: p > 1/3

Learning Objective 2: Example 1: Calculating a Type II Error

- For what values of the sample proportion can we reject H0?
- A test statistic of z = 1.645 has a P-value of 0.05. So, we reject H0 for z ≥ 1.645 and we fail to reject H0 for z < 1.645.

Learning Objective 2: Example 1: Calculating a Type II Error

- Find the value of the sample proportion that would give us a z of 1.645:
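
The calculation can be sketched as follows, assuming the usual one-proportion z statistic with the standard error computed under H0 (p0 = 1/3) and n = 116 trials, one per volunteer. The symbols p̂, p0, and n are notation introduced here for the sketch.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = 1.645
\quad\Longrightarrow\quad
\hat{p} = \frac{1}{3} + 1.645\sqrt{\frac{(1/3)(2/3)}{116}}
\approx 0.3333 + 1.645\,(0.0438)
\approx 0.405
```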

Learning Objective 2: Example 1: Calculating a Type II Error

- So, we fail to reject H0 if the sample proportion is less than 0.405
- Suppose that in reality astrologers can make the correct prediction 50% of the time (that is, p = 0.50)
- In this case, (p = 0.50), we can now calculate the probability of a Type II error

Learning Objective 2: Example 1: Calculating a Type II Error

- We calculate the probability of a sample proportion < 0.405 assuming that the true proportion is 0.50
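
A sketch of this step, assuming the standard error is now computed at the true proportion p = 0.50 with n = 116, and carrying the cutoff as 0.4053 before rounding:

```latex
P(\hat{p} < 0.405 \mid p = 0.50)
= P\!\left(Z < \frac{0.4053 - 0.50}{\sqrt{0.50(1 - 0.50)/116}}\right)
\approx P\!\left(Z < \frac{-0.0947}{0.0464}\right)
\approx P(Z < -2.04)
```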

Learning Objective 2: Example 1: Calculating a Type II Error

- The area to the left of -2.04 in the standard normal table is 0.02
- The probability of making a Type II error and failing to reject H0: p = 1/3 is only 0.02 in the case for which the true proportion is 0.50
- This is only a small chance of making a Type II error
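
The whole example can be reproduced in a few lines. This is a sketch under the normal approximation used on the slides, with n = 116 trials; the function name and return layout are illustrative choices.

```python
from statistics import NormalDist

def type_ii_error_one_sided(p0, p_true, n, alpha=0.05):
    """P(fail to reject H0: p = p0) against Ha: p > p0 when the true proportion is p_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # about 1.645 for alpha = 0.05
    se_null = (p0 * (1 - p0) / n) ** 0.5        # standard error assuming H0 is true
    cutoff = p0 + z_crit * se_null              # reject H0 when the sample proportion >= cutoff
    se_true = (p_true * (1 - p_true) / n) ** 0.5  # standard error at the true proportion
    z = (cutoff - p_true) / se_true
    beta = std_normal.cdf(z)                    # chance the sample proportion falls below cutoff
    return cutoff, z, beta

cutoff, z, beta = type_ii_error_one_sided(p0=1/3, p_true=0.50, n=116)
print(f"cutoff = {cutoff:.3f}, z = {z:.2f}, "
      f"P(Type II error) = {beta:.3f}, power = {1 - beta:.3f}")
```

Changing `p_true` gives the Type II error probability for other values in Ha, which is why a separate calculation is needed for each one.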

Learning Objective 1: Type II Error

For a fixed significance level α, P(Type II error) decreases

- as the parameter value moves farther into the Ha values and away from the H0 value
- as the sample size increases
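
Both trends can be seen numerically. The sketch below reuses the astrology setup (H0: p = 1/3, Ha: p > 1/3, α = 0.05) under the normal approximation; the particular true proportions and sample sizes tried are hypothetical.

```python
from statistics import NormalDist

def type_ii_error(p0, p_true, n, alpha=0.05):
    """P(Type II error) for a one-sided test of H0: p = p0 against Ha: p > p0."""
    std_normal = NormalDist()
    cutoff = p0 + std_normal.inv_cdf(1 - alpha) * (p0 * (1 - p0) / n) ** 0.5
    se_true = (p_true * (1 - p_true) / n) ** 0.5
    return std_normal.cdf((cutoff - p_true) / se_true)

# Fixed n = 116, true proportion moving farther from 1/3.
by_p = [type_ii_error(1/3, p, 116) for p in (0.35, 0.40, 0.45, 0.50)]

# Fixed true proportion 0.40, growing sample size.
by_n = [type_ii_error(1/3, 0.40, n) for n in (50, 116, 400, 1000)]

print("by true proportion:", [round(b, 3) for b in by_p])
print("by sample size:    ", [round(b, 3) for b in by_n])
```

Each list is decreasing, matching the two bullets above.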

Learning Objective 3: Power of a Test

- Power = 1 – P(Type II error) = probability of rejecting the null hypothesis when it is false
- The higher the power, the better
- In practice, it is ideal for studies to have high power while using a relatively small significance level

Learning Objective 3: Power of a Test

- In this example, the Power of the test when p = 0.50 is: 1 – 0.02 = 0.98
- Since the higher the power the better, a test with power of 0.98 is quite good
