Statistics for the Physical Sciences STAT 229

Chapter 1

Statistics: The Art and Science of Learning from Data

- Problems 1.1 to 1.36 (even numbered)
- Complete the survey on pages 22-23

Fall 2008 STAT 229

- Statistics is the art and science of learning from data. It is a collection of methods for
- Planning experiments (Design)
- Obtaining data (data are collected observations, such as measurements and survey responses)
- Organizing data
- Summarizing data (Description)
- Analyzing data
- Interpreting results, and
- Making decisions and predictions (Inference)

- Statistics is a branch of Mathematics.

- Statistics was invented to study randomness, a lack of order, purpose, cause, or predictability (per Wikipedia), without which the world would be of no interest.
- Examples of random phenomena:
- Phelps won 8 gold medals
- A 6-sided die is rolled and lands on a 4
- It’s going to rain tomorrow

- Randomness, Fuzziness and Uncertainty
- Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used here? Why use it?

- In the previous example, all US adults form a population while the 1000 surveyed adults form a sample.
- In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores.
- A sample is a sub-collection of items selected from a population.

- A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.
- How large should a sample be?
- What are those appropriate ways to generate a sample?
- Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.

- A parameter is a numeric summary of the population
- A statistic is a numeric summary of a sample taken from the population

- Problem: Number of Good Friends
One year the General Social Survey asked, “About how many good friends do you have?” Of the 819 people who responded, 6% reported having only one good friend. Identify

(a) the sample

(b) the population, and

(c) the parameter or statistic

- Try Problem 1.3 on page 8 of the textbook.
Go to the General Social Survey website

http://sda.berkeley.edu/GSS

By entering HEAVEN as the “row variable” name, find the percentages of people who said “yes, definitely,” “yes, probably,” “no, probably not,” and “no, definitely not” when asked whether they believed in heaven.

- Save (large) data files
- Create databases
- Do analysis with software: SAS, Minitab, Spss, R, Splus, C, Matlab, Excel, ...
- Simulation – use of computers to mimic reality.

NOTES:

1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers.

2. Excel has an Analysis ToolPak by which one can do statistical analysis, including simulation.

Tasks:

When a balanced coin is tossed 20 times, we have a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability, or sample proportion, of Heads (1) is the number of 1s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5.

1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process using 1000 pseudo-random numbers.

2. Compute the sample proportion for each of the 10 samples.

Simulation

Follow this menu path in Excel:

Tools > Data Analysis > Random Number Generation > Bernoulli

More questions:

- Where does randomness play a role?
- Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000?
- Comment on the effect of sample size.
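
The coin-tossing tasks can also be tried in R rather than Excel. The following is a minimal sketch, not the course's prescribed method: the helper name `sample_proportions` and the seed value are assumptions made for illustration.

```r
# Simulate coin tosses: 1 = Heads, 0 = Tails, with P(Heads) = 0.5.
set.seed(229)  # arbitrary seed (assumption) so the run can be repeated

# Hypothetical helper: sample proportions of Heads for n_samples
# independent samples, each made of n_tosses Bernoulli(p) draws.
sample_proportions <- function(n_samples, n_tosses, p = 0.5) {
  replicate(n_samples, mean(rbinom(n_tosses, size = 1, prob = p)))
}

props_10   <- sample_proportions(5, 10)    # 5 samples of size 10
props_1000 <- sample_proportions(5, 1000)  # 5 samples of size 1000

print(props_10)
print(props_1000)

# Sample-to-sample variability of the proportions for each sample size
print(sd(props_10))
print(sd(props_1000))
```

Comparing the two printed standard deviations is one way to approach the question about sample size: the proportions from the larger samples would be expected to vary less from sample to sample.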

- Excel 2007 no longer has a Tools menu.
- To use the Analysis ToolPak, click the Office button at the upper left corner, click Excel Options, then click Add-Ins and highlight Analysis ToolPak. Click the Go button to open the Add-Ins window. Check the Analysis ToolPak box and click OK.
- Now go to the Data menu, click Data Analysis, and choose Random Number Generation.

Chapter 2

Exploring Data with Graphs and Numerical Summaries

- 2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8
- 2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22
- 2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44
- 2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60
- 2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82
- 2-6 (p80): Problems 2.84

- A characteristic observed for the subjects in a study is called a variable.
- Examples of variables: major, GPA, religious affiliation, smoking status,...
- Variables can be quantitative (numerical) or qualitative (categorical).
- A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.

- Quantitative variables can be discrete or continuous.
- A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance.
- The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.

- A quantitative variable usually takes different values in a study. Studying the spread (variability) of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values.
- For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.

- For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table.
- The following table shows the frequency of shark attacks in various regions for 1990-2006.

Frequency of shark attacks in various regions for 1990-2006

Questions: What is the variable? Is it categorical?

The mode of categorical data is the category with the highest frequency. Find the mode of the data.

- In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table.
- A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
- For a quantitative variable, a frequency table is constructed by first dividing the range of the data into a set of adjacent intervals, then finding the frequency for each interval.

Frequency Table for Daily TV Watching

- Example
Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score     Frequency   Proportion   Percentage
[0,2]         1          0.05           5
(2,4]         1          0.05           5
(4,6]         7          0.35          35
(6,8]         8          0.40          40
(8,10]        3          0.15          15
Total        20          1.00         100
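
As a cross-check on the quiz-score table, here is a minimal R sketch that bins the scores into the same intervals. The variable names are illustrative only.

```r
# Quiz scores for the twenty students
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Bin into [0,2], (2,4], (4,6], (6,8], (8,10]; each interval includes
# its right endpoint, and include.lowest = TRUE keeps 0 in the first bin.
bins <- cut(scores, breaks = seq(0, 10, by = 2), include.lowest = TRUE)

freq <- table(bins)        # counts per interval
prop <- prop.table(freq)   # relative frequencies

freq_table <- data.frame(
  Score      = names(freq),
  Frequency  = as.vector(freq),
  Proportion = as.vector(prop),
  Percentage = 100 * as.vector(prop)
)
print(freq_table)
```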

Preliminary results of the election for the European Parliament in 2004

Pie Charts and Bar Graphs for Categorical Variables

- Pie chart: A circle having a “slice of a pie” for each category. The size of slice corresponds to the percentage of observations in the category.
- Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

Example: Use the shark attack data from this source link to construct a pie chart of interest.

Pareto Chart: Bar Graph with categories Ordered by Their Frequency from the Tallest Bar to Shortest

Graphs for Quantitative Variables

- Dot plot: shows a dot for each observation, placed just above the value on the number line for that observation.
- Stem-and-leaf plot: similar to a dot plot. Each observation is represented by a stem and a leaf.
- Histogram: a graph that uses bars to portray the frequencies or relative frequencies.

Graphs for Quantitative Variables

Example Dot plot

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

[Dot plot: one dot per quiz score above a number line running from 1 to 10]

Graphs for Quantitative Variables

Example Stem-and-Leaf Plot

Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75,74, 87, 76

Step 1: Sorted test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100

Step 2: Place the scores in the corresponding stems and leaves. (usually the last digit will be the leaf)

Stem | Leaves
   4 | 5
   5 |
   6 | 2
   7 | 4 5 6 6
   8 | 0 4 7 7
   9 | 6
  10 | 0
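
The two steps above can be sketched in R as follows. This is an illustrative hand-rolled version (the variable names are assumptions); base R also provides `stem()`, whose printed layout may differ from the slide.

```r
# Test scores for the 12 students
scores <- c(80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76)

# Step 1: sort the scores
sorted <- sort(scores)

# Step 2: the last digit is the leaf, the remaining digits are the stem
stems  <- sorted %/% 10
leaves <- sorted %% 10

# Group leaves by stem, keeping stems with no observations (here, 5)
plot_rows <- split(leaves, factor(stems, levels = min(stems):max(stems)))

for (s in names(plot_rows)) {
  cat(sprintf("%2s | %s\n", s, paste(plot_rows[[s]], collapse = " ")))
}
```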

Graphs for Quantitative Variables

Histogram

Step 1: Divide the range of data into intervals of equal width.

Step 2: Count the frequency and construct a frequency table (or relative frequency table).

Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), values of which are marked on the y-axis.

Graphs for Quantitative Variables

Example Histogram

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score     Freq
[0,2]      1
(2,4]      1
(4,6]      7
(6,8]      8
(8,10]     3
(10,12]    0

(Each interval includes its right endpoint, matching the frequency table for these scores.)
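
The three histogram steps can be sketched in R. Note that `hist()` by default counts each interval as including its right endpoint, so a score of 6 is counted in the interval ending at 6; the counts 1, 1, 7, 8, 3, 0 correspond to that convention. The titles and labels below are illustrative choices.

```r
# Quiz scores for the twenty students
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Steps 1 and 2: equal-width intervals of width 2 and their frequencies.
# plot = FALSE returns the counts without drawing anything yet.
h <- hist(scores, breaks = seq(0, 12, by = 2), plot = FALSE)
print(h$breaks)
print(h$counts)

# Step 3: draw a bar over each interval with height equal to its frequency
plot(h, main = "Quiz scores", xlab = "Score", ylab = "Frequency")
```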

- When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, and histogram), look for
- the overall pattern: Do the data cluster together?
- the outliers
- modes: unimodal, bimodal,…
- skew: skewed to the left or right
- the underlying smooth curve

[Figures: example histograms with unimodal, bimodal, and multimodal shapes, and a histogram with outliers marked]

These Two Histograms Show Differences in Spread

- Time series: a data set collected over time.
- Time plot: a graph displaying time-series data.
- Look for pattern over time.

Gasoline price

- Measures of center: mean and median
- Mean: the sum of the observations divided by the number of observations.
- Median: The midpoint of the observations.

- Example Travel times to work
- How long does it take to get from home to work?
- Here are the travel times in minutes of 15 workers in North Carolina, chosen at random by the Census Bureau:
- 30 20 10 40 25 20 10 60 15 40 5 30 12 10 10
- Find the mean travel time.

Step 1: Sort your data from the smallest to the largest.

Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.

Example Find median for the travel times

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

Since n = 15 is odd, Median = 20, the middle value.

Example Find the median for the scores

60 80 87 73 95 92

Arrange the data in order: 60 73 80 87 92 95

Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the average of the two middle values.
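
The mean and median examples can be checked in R with the built-in `mean()` and `median()` functions. This sketch uses the 15 travel times and the 6 scores from the median examples; the variable names are illustrative.

```r
# Travel times to work (minutes) and test scores from the examples
travel <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
scores <- c(60, 80, 87, 73, 95, 92)

# Mean: sum of the observations divided by the number of observations
mean_travel <- sum(travel) / length(travel)   # same as mean(travel)

# Median: middle value (odd n) or average of the two middle values (even n)
median_travel <- median(travel)
median_scores <- median(scores)

print(mean_travel)    # 337 / 15, about 22.5 minutes
print(median_travel)  # 20
print(median_scores)  # 83.5
```

The mean (about 22.5) is larger than the median (20), which is what the slides lead us to expect for data skewed to the right.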

- The mean is the balance point of the data.
- In a symmetric distribution, the mean and median are the same.
- In a skewed distribution, the mean is usually farther out in the long tail than the median.
- Skewed to the right, mean > median
- Skewed to the left, mean < median

- The mean is less resistant to outliers than the median.

- The mean is the balance point.
- The median is the midpoint.
- The mode is the value that occurs most frequently.

- City data
- St Cloud, MN
- New Orleans, LA

- Measures of spread:
- The Range
- The Standard Deviation
- The Interquartile Range (Sec. 2.5)

- Range = largest value - smallest value
- Example: Find the range of the quiz scores : 2, 5, 0, 7, 9, 1, 7, 6, 10, 9, 3, 9, 9, 7, 0, 6, 9, 10, 8,1, 4, 6, 8, 9, 4, 2, 9, 0, 5, 7
Range = largest value - smallest value = 10 - 0 = 10

- Simple to compute
- Easy to understand
But

- Uses only extreme values
- Affected severely by outliers

- The standard deviation and variance measure spread by looking at how far the observations are from their mean.
- The variance of a set of observations is an average of the squared deviations from the mean: $s^2 = \sum (x - \bar{x})^2 / (n - 1)$.

- The standard deviation $s$ is the square root of the variance.

The standard deviation: Example

- Example (Calculating the standard deviation s)
Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.

1792 1666 1362 1614 1460 1867 1439

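Find the mean first: $\bar{x} = 11200 / 7 = 1600$ calories.

Observations   Deviations   Squared deviations
    1792           192            36864
    1666            66             4356
    1362          -238            56644
    1614            14              196
    1460          -140            19600
    1867           267            71289
    1439          -161            25921
                sum = 0       sum = 214870

The variance: $s^2 = 214870 / (7 - 1) \approx 35811.67$

The standard deviation: $s = \sqrt{35811.67} \approx 189.24$ calories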
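
The standard deviation example can be reproduced step by step in R. This is a sketch under the assumption that the sample variance divides by n - 1, which is what R's built-in `var()` and `sd()` do; the variable names are illustrative.

```r
# Metabolic rates (calories per 24 hours) of the 7 men
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar       <- mean(rates)        # mean of the observations
deviations <- rates - xbar       # deviations from the mean (sum to 0)
ss         <- sum(deviations^2)  # sum of squared deviations

variance <- ss / (length(rates) - 1)  # divide by n - 1
s        <- sqrt(variance)            # standard deviation

print(xbar)             # 1600
print(sum(deviations))  # 0
print(ss)               # 214870
print(variance)
print(s)
```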

- The greater the spread, the larger the s.
- s ≥ 0.
- s = 0 when all the observations take the same value.
- s can be influenced by outliers.

If a distribution of data is bell shaped, then approximately:

68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$.

95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$.

99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{x} - 3s$ and $\bar{x} + 3s$.

- Population: The collection of all individuals or items under consideration.
- Sample: That part of the population from which we actually collect information.
- We use a sample to draw conclusions about the entire population.

- Parameter: Numerical summary of the population.
- Statistic: Numerical summary of a sample.
- Notation:
Population Mean: $\mu$

Population Standard Deviation: $\sigma$

Sample Mean: $\bar{x}$

Sample Standard Deviation: $s$

- Measures of position:
- Quartiles
- Percentiles.

- Percentiles:
- pth percentile: a value such that p percent of observations fall below or at that value.

- Quartiles
- First quartile, the same as 25th percentile (p=25)
- Second quartile, the same as 50th percentile (p=50)
- Third quartile, the same as 75th percentile (p=75)

Calculating Quartiles

- To calculate the quartiles:
1. Arrange the observations in increasing order.

2. The second quartile Q2 is the median M (= 50th percentile).

3. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median (= 25th percentile).

4. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median (= 75th percentile).

Quartiles: Example

- Example 2.17 Travel times to work. Find Q1 and Q3.
Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

The observations to the left of the location of the overall median, 20, are:

5 10 10 10 10 12 15

Q1 = 10

The observations to the right of the location of the overall median, 20, are:

20 25 30 30 40 40 60

Q3 = 30

Quartiles: Example

- Example 2.5 Travel times to work. Find Q1 and Q3.
Travel times in minutes of 20 randomly chosen New York workers: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

Arrange the data in order:

5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

The overall median = 22.5 minutes

The observations to the left of the location of the overall median are: 5 10 10 15 15 15 15 20 20 20

Q1 = 15 minutes

The observations to the right of the location of the overall median are: 25 30 30 40 40 45 60 60 65 85

Q3 = 42.5 minutes
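
The quartile procedure above (median of each half, leaving out the overall median when n is odd) can be sketched in R. The function name `quartiles` is a hypothetical helper, not a base R function. R's built-in `quantile()` interpolates by default, so its answers can differ slightly from this hand method.

```r
# Hypothetical helper implementing the slide's rule:
# Q1 and Q3 are the medians of the lower and upper halves of the sorted
# data; the overall median itself is excluded when n is odd.
quartiles <- function(x) {
  x    <- sort(x)
  n    <- length(x)
  half <- n %/% 2
  lower <- x[seq_len(half)]        # observations left of the median location
  upper <- x[(n - half + 1):n]     # observations right of the median location
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

nc <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
ny <- c(10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
        15, 20, 85, 15, 65, 15, 60, 60, 40, 45)

q_nc <- quartiles(nc)
q_ny <- quartiles(ny)
print(q_nc)  # Q1 = 10,  Q2 = 20,   Q3 = 30
print(q_ny)  # Q1 = 15,  Q2 = 22.5, Q3 = 42.5

# Interquartile range for the North Carolina data
print(unname(q_nc["Q3"] - q_nc["Q1"]))  # 20
```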

- The Interquartile Range (IQR)
The Interquartile Range = Q3 - Q1

- Example (Travel times to work) Find IQR.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

- The 1.5*IQR Criterion for Identifying Potential Outliers.
An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.

- Example 2.18 Travel times to work (in minutes). Detecting Potential Outliers.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

- The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.
Minimum Q1 Median Q3 Maximum

- Example 2.19 The five-number summary of travel times to work.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

Constructing a box plot

- A box goes from Q1 to Q3.
- A line is drawn inside the box at the median.
- A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called whiskers.
- The potential outliers are shown separately.

Example

(Constructing a boxplot)

Travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

Steps:

- Find Q1, Q2, and Q3:
- Find IQR:
- Determine two fences:
lower fence = Q1 – 1.5*IQR

upper fence = Q3 + 1.5*IQR

- Identify potential outliers
- Determine whiskers: one from Q1 to the smallest observation within fences, and the other from Q3 to the largest within fences.
- Draw the boxplot.
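
The box plot steps above can be followed in R for the travel-time data. This is a sketch with illustrative variable names. The quartiles are computed by the median-of-halves rule used in the slides; R's `boxplot()` computes its box edges with a slightly different rule (hinges), which gives the same values for this data set.

```r
travel <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

# Find Q1, Q2, Q3 (medians of the halves; overall median excluded for odd n)
x    <- sort(travel)
n    <- length(x)
half <- n %/% 2
q1 <- median(x[seq_len(half)])
q2 <- median(x)
q3 <- median(x[(n - half + 1):n])

# Find IQR and the two fences
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Identify potential outliers and the whisker endpoints
outliers <- x[x < lower_fence | x > upper_fence]
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

print(c(min = min(x), Q1 = q1, median = q2, Q3 = q3, max = max(x)))
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
print(whiskers)

# Draw the box plot
boxplot(travel, horizontal = TRUE, col = 3)
```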

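[Box plot of travel times, labeled from top to bottom: largest observation within the fences, Q3 = 30, Q2 = 20, Q1 = 10, smallest observation within the fences]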

(Text Page 67)

Sodium values for 20 breakfast cereals:

0 70 125 125

140 150 170 170

180 200 200 210

210 220 220 230

250 260 290 290

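R code:

# Sodium values for the 20 breakfast cereals listed above
x <- c(0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290)

# Horizontal box plot; col = 3 picks the third color of the current palette
boxplot(x, col = 3, horizontal = TRUE)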

Example (Boxplot)

- IQR measures the sample variability (or spread).
- A box plot indicates skew. The side with the larger part of the box and the longer whisker usually has skew in that direction.

- Help to compare groups (in terms of symmetry, median, spread,…).
- Example: (College student heights) Click here to see the “Heights” data on the text CD.

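R code (copy and paste into R):

# Read the heights data; assumes heights.csv is in the working directory
heights <- read.table("heights.csv", sep = ",", header = TRUE)

# Side-by-side box plots of HEIGHT for each GENDER group
boxplot(HEIGHT ~ GENDER, data = heights, col = 3:4)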

- The z-score for an observation is the number of standard deviations that it falls from the mean, and in which direction: z = (observation - mean) / standard deviation.
- An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean;
that is, z > 3 or z < -3. (Recall the empirical rule: 99.7% of values are within 3 standard deviations of the mean.)
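
As an illustration, z-scores for the travel-time data (the version containing 80) can be computed in R. The variable names are illustrative. The 3-standard-deviation rule is stated for bell-shaped distributions, and these data are skewed, so treat this only as a demonstration of the arithmetic.

```r
travel <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

# z = (observation - mean) / standard deviation
z <- (travel - mean(travel)) / sd(travel)
print(round(z, 2))

# Observations more than 3 standard deviations from the mean
flagged <- travel[abs(z) > 3]
print(flagged)

# z-score of the largest observation, 80
z_80 <- z[travel == 80]
print(z_80)
```

Here the value 80 has a z-score of about 2.9, so the 3-standard-deviation rule does not flag it even though the 1.5*IQR criterion does. The two criteria need not agree.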

- Self reading

Chapter 3

Association: Contingency, Correlation, Regression

- 3-1: Problems 3.2, 3.4, 3.6, 3.8, 3.10
- 3-2: Problems 3.12, 3.14, 3.16, 3.18, 3.22
- 3-3: Problems 3.26, 3.30, 3.36, 3.38, 3.40
- 3-4: Problems 3.48, 3.50, 3.52, 3.54, 3.56, 3.58, 3.60

- In this chapter, we discuss statistical methods for data on two variables.
- Sometimes, one of the two variables may be termed the response variable and the other the explanatory variable.
- The response variable is the outcome variable on which comparisons are made.
- The explanatory variable defines the groups to be compared with respect to values on the response variable.

- This is Example 1 on page 93 of text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

It’s natural to treat the variable “Survival Status” as a response variable and “Smoker” as an explanatory variable.

- The main purpose of a data analysis with two variables is to investigate whether there is an association and to describe the nature of that association.
- An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
- Is the variable “Survival Status” associated with the variable “Smoker”? Does smoking lead to cancer?

- Smoking and BMI
- Smoking and lung cancer
- Irrigation and plant growth
- Traffic and air pollution
- Gender and height

- A contingency table is used to explore the association between two categorical variables:
- Rows list the categories of one variable.
- Columns list the categories of the other variable.
- Each cell in the table holds the number of observations (frequency) in the sample with certain outcomes on the two variables.

- Cross-tabulation: The process of finding the frequencies for the cells of a contingency table.
- The previous table is an example of a contingency table.

- Excel Data: Two Variables
- Cancer Treatment: treatments given to the cancer patients (Surgery and Radiation therapy).
- Cancer Controlled: whether cancer has been controlled (Yes and No).

Questions: (1) What proportion of the patients who had surgery had their cancer controlled?

(2) What proportion of all cancer patients had their cancer controlled?

Answer

(1) 21 / 23 = 91% of the patients who had surgery had their cancer controlled.

(2) 36 / 41 = 88% of all cancer patients had cancer controlled.

A conditional proportion is the proportion of one variable at a given level of the other variable.

A marginal proportion is the proportion of a row or column variable.

- Display conditional proportions.
- Useful for making comparisons.

- The proportion of patients who had their cancer controlled is slightly higher for the patients who had surgery than for those who had radiation therapy.
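
The conditional and marginal proportions can be computed in R from a contingency table. The original table is not reproduced here, so the radiation row below is derived from the stated totals (41 - 23 = 18 radiation patients, 36 - 21 = 15 of them controlled); treat the layout and names as assumptions.

```r
# Contingency table: rows = treatment, columns = cancer controlled
tab <- matrix(c(21, 2,
                15, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Treatment  = c("Surgery", "Radiation"),
                              Controlled = c("Yes", "No")))
print(tab)

# Conditional proportions: proportion controlled within each treatment
cond <- prop.table(tab, margin = 1)
print(round(cond, 2))

# Marginal proportions of the column variable, over all patients
marg <- colSums(tab) / sum(tab)
print(round(marg, 2))
```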

- Ex 3.8 page 101
- Ex 3.3 page 100

- An association can be studied between
- two categorical variables
- two quantitative variables
- a categorical variable and a quantitative variable.

- In this section, we explore the association between two quantitative variables.
- That is, we will study how a response variable tends to change as the value of an explanatory variable changes.

- A scatterplot is a graphical display of the relationship between two quantitative variables. It portrays the two variables simultaneously:
- horizontal axis: the explanatory variable
- vertical axis: response variable.
- point in the display: observation corresponding to a subject.

- Data: see the text, page 103.
- Data dictionary -
GDP: Gross domestic product, per capita, in thousands of US dollars

CO2: Carbon dioxide emissions, per capita, in tons

Cellular: Percentage of adults who are cellular-phone subscribers

Fertility: Mean number of children per adult woman

- Question to explore
(1) Describe the center and spread of the data distribution.

(2) Portray the relationship with a scatterplot for Internet use and GDP.

(3) What do you learn about the association by inspecting the scatterplot?

Mean: 16.00

Standard deviation: 10.60

Mean: 21.14

Standard deviation: 18.47

- You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables
- Trend: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the trend

- Also look for outliers from the overall trend

- Two quantitative variables x and y are
- Positively associated when
- High values of x tend to occur with high values of y
- Low values of x tend to occur with low values of y

- Negatively associated when high values of one variable tend to pair with low values of the other variable

Would you expect a positive association, a negative association, or no association between the age of the car and the mileage on the odometer?

- Positive association
- Negative association
- No association

http://www.gapminder.org

- The correlation r measures the strength and direction of the linear association between x and y
- A positive r value indicates a positive association
- A negative r value indicates a negative association
- An r value close to +1 or -1 indicates a strong linear association
- An r value close to 0 indicates a weak association

- Always falls between -1 and +1
- Sign of correlation denotes direction
- (-) indicates negative linear association
- (+) indicates positive linear association

- Correlation is unitless: it does not depend on the variables’ units
- Two variables have the same correlation no matter which is treated as the response variable
- Correlation is sensitive to outliers
- Correlation only measures strength of linear relationship

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe

Quadrants of the scatterplot of z-scores (standardized values of Internet use against GDP):

- In quadrant I, both z-scores are positive.
- In quadrant II, z-scores of Internet are positive, while z-scores of GDP are negative.
- In quadrant III, both z-scores are negative.
- In quadrant IV, z-scores of GDP are positive, while z-scores of Internet are negative.
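
The link between z-scores and correlation can be sketched in R: r is the sum of the products of paired z-scores divided by n - 1, so points in quadrants I and III push r up and points in quadrants II and IV pull it down. The data below are made-up values for illustration, not the textbook's Internet and GDP data.

```r
# Hypothetical paired observations (NOT the textbook data)
x <- c(2, 5, 9, 14, 20, 27, 33)
y <- c(1, 4, 10, 12, 25, 40, 38)

# z-scores for each variable
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Correlation as the average product of z-scores (dividing by n - 1)
r <- sum(zx * zy) / (length(x) - 1)

print(r)
print(cor(x, y))  # built-in correlation, for comparison
```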

- When a scatterplot indicates a relationship between two variables, we can start fitting a curve to the data.
- The procedure of fitting a curve to the data, along with inferences about parameters of interest and prediction of the response value, is called regression analysis.

- The first step of a regression analysis is to identify the response and explanatory variables
- We use y to denote the response variable
- We use x to denote the explanatory variable

- A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes
- A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x)
- The y-intercept of the regression line is denoted by a
- The slope of the regression line is denoted by b

- Regression Equation: $\hat{y} = a + bx$
- $\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters

- Use the regression equation to predict the height of a person whose femur length was 50 centimeters

- y-Intercept:
- The predicted value for y when x = 0
- Helps in plotting the line
- May not have any interpretative value if no observations had x values near 0

- Slope: measures the change in the predicted value of the response variable (y) for a 1 unit increase in the explanatory variable (x)
- Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height

- At a given value of x, the equation:
- Predicts a single value of the response variable
- But… we should not expect all subjects at that value of x to have the same value of y
- Variability occurs in the y values!

- A residual measures the size of a prediction error: the vertical distance between a point and the regression line
- Each observation has a residual
- Calculation for each residual: $y - \hat{y}$
- A large residual indicates an unusual observation

- Residual sum of squares: $\sum (y - \hat{y})^2$
- The least squares regression line is the line that minimizes the vertical distances between the points and their predictions; that is, it minimizes the residual sum of squares
- Note: the sum of the residuals about the least squares regression line is always zero

- Slope: $b = r \left( s_y / s_x \right)$
- Y-Intercept: $a = \bar{y} - b\bar{x}$

The regression line always passes through $(\bar{x}, \bar{y})$
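
The least squares slope and intercept can be computed from summary statistics as b = r (s_y / s_x) and a = ȳ - b x̄, and then compared with R's `lm()`. The data below are made-up values for illustration only.

```r
# Hypothetical paired observations (NOT the textbook data)
x <- c(2, 5, 9, 14, 20, 27, 33)
y <- c(1, 4, 10, 12, 25, 40, 38)

# Slope and intercept from summary statistics
r <- cor(x, y)
b <- r * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# Predictions and residuals (observed minus predicted)
y_hat     <- a + b * x
residuals <- y - y_hat

print(c(intercept = a, slope = b))
print(sum(residuals))     # zero, up to rounding error
print(sum(residuals^2))   # residual sum of squares

# The same line fitted by R's built-in least squares routine
fit <- lm(y ~ x)
print(coef(fit))
```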

Slope =26.4

Find a and b.

y intercept=-2.28

Using TI-83

- Enter x data into L1
- Enter y data into L2
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number = x variable
- 2nd number = y variable
- Enter

- Correlation:
- Describes the strength of the linear association between 2 variables
- Does not change when the units of measurement change
- Does not depend upon which variable is the response and which is the explanatory

- Slope:
- Numerical value depends on the units used to measure the variables
- Does not tell us whether the association is strong or weak
- The two variables must be identified as response and explanatory variables
- The regression equation can be used to predict values of the response variable for given values of the explanatory variable

- When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$
- We measure the proportional reduction in error and call it $r^2$; it is the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
- A correlation of 0.9 means that $0.9^2 = 0.81$, or 81%, of the variation in the y-values can be explained by the explanatory variable, x
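
The proportional reduction in error can be checked numerically: compare the squared error from predicting every y by the mean of y with the squared error from the regression line. The data are made-up values for illustration only.

```r
# Hypothetical paired observations (NOT the textbook data)
x <- c(2, 5, 9, 14, 20, 27, 33)
y <- c(1, 4, 10, 12, 25, 40, 38)

fit <- lm(y ~ x)

# Error when predicting with the mean of y only
sst <- sum((y - mean(y))^2)

# Error when predicting with the regression line
sse <- sum((y - fitted(fit))^2)

# Proportional reduction in error
pre <- (sst - sse) / sst

print(pre)
print(cor(x, y)^2)  # equals the squared correlation
```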

- Be cautious of
- Extrapolation
- Influential outliers
- Interpretation of correlation or association
- Lurking variables
- Confounding

- Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data
- It’s riskier as we move farther from the range of the given x-values
- There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values

- A regression outlier is an observation/point that lies far away from the trend that the rest of the data follows
- An observation is influential if
- Its x value is relatively low or high compared to the remainder of the data, and
- The observation is a regression outlier.

- An influential observation tends to pull the regression line toward that data point and away from the rest of the data.

- Correlation does not imply causation.
- In general, it’s also true that association does not imply causation. This warning holds whether we analyze associations between qualitative variables or between quantitative variables.
- Create a scatterplot for “Crime rate” against “Education” in the “FL crime” data on the text CD.

- A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.

- Example: A reporter studied house fires and established a high positive correlation between the damages (in dollars) and the number of firefighters at the scene. Which of the following could be a lurking variable that is responsible for the association?
- (a) Firefighter
- (b) Weather
- (c) Size of the house
- (d) Size of the blaze

STAT 319 Biometrics Fall 2008

- Example: An economist noticed that nations with more TV sets have higher life expectancies. He established a high positive correlation between length of life and number of TV sets. Find the lurking variable, if there is one.
- (a) TV sets brands
- (b) Popcorn
- (c) Wealth of the nation
- (d) Sofa
- (e) No confounding variable

- Simpson’s Paradox refers to the phenomenon that the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. (Book)
- Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined. (Wiki)

The data indicate that smoking could apparently be beneficial to your health. Could a lurking variable be responsible for the association?

This is Example 1 on page 93 of text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

- There was also age information about the 1314 women involved in the study. These women can be stratified into 4 different age groups, creating 4 contingency tables.

Question: For each age group, find conditional proportions of deaths for smokers and nonsmokers.
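
The reversal can be illustrated in R. The counts below are entirely hypothetical and use only two age groups; they are not the data from the study of 1314 women, whose tables are not reproduced here. The helper name `death_rate` is also an assumption.

```r
# Hypothetical counts (NOT the study data): rows = smoker, columns = status
young <- matrix(c(20, 380,
                   3,  97), nrow = 2, byrow = TRUE,
                dimnames = list(Smoker = c("Yes", "No"),
                                Status = c("Dead", "Alive")))
old   <- matrix(c( 50,  50,
                  160, 240), nrow = 2, byrow = TRUE,
                dimnames = dimnames(young))

# Hypothetical helper: conditional proportion of deaths in each row
death_rate <- function(tab) tab[, "Dead"] / rowSums(tab)

rate_young    <- death_rate(young)        # within the younger group
rate_old      <- death_rate(old)          # within the older group
rate_combined <- death_rate(young + old)  # ignoring age

print(rate_young)     # smokers 0.05, nonsmokers 0.03
print(rate_old)       # smokers 0.50, nonsmokers 0.40
print(rate_combined)  # smokers 0.14, nonsmokers 0.326
```

In these made-up numbers, smokers have the higher death rate within each age group but the lower death rate when the groups are combined, because most of the nonsmokers are in the older group.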

- http://en.wikipedia.org/wiki/Simpson's_paradox

Simpson's paradox for continuous data: a positive trend appears for two separate groups (blue and red), a negative trend (black, dashed) appears when the data are combined.

- When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.
- Age is a confounding variable in the study of the association between smoking and survival status.

- A confounding variable is already included in the study. It is associated both with the response variable and the explanatory variable.
- A lurking variable is not measured in the study. It has the potential for confounding.
- The effect of an explanatory variable can be analyzed by adjusting for confounding variables.
- Ignoring lurking variables results in misleading conclusions. (age in smoking-survival association).

Chapter 4: Gathering Data

Section 4.1

Should We Experiment or Should We Merely Observe?

Statistics for the Physical Sciences (STAT 229-02)

- 4-1: Problems 4.2, 4.4, 4.6, 4.8, 4.10
- 4-2: Problems 4.14, 4.18, 4.20, 4.22, 4.28, 4.30
- 4-3: Problems 4.34, 4.36, 4.38, 4.40, 4.42
- 4-4: Problems 4.44, 4.46, 4.48, 4.50, 4.52, 4.54

- Population versus Sample
- Types of Studies: Experimental and Observational
- Comparing Experimental and Observational Studies

- Population: all the subjects of interest
- We use statistics to learn about the population, the entire group of interest

- Sample: subset of the population
- Data is collected for the sample because we cannot typically measure all subjects in the population


- In an observational study, the researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects (such as imposing a treatment)

- A sample survey selects a sample of people from a population and interviews them to collect data.
- A sample survey is a type of observational study.
- A census is a survey that attempts to count the number of people in the population and to measure certain characteristics about them

- A researcher conducts an experiment by assigning subjects to certain experimental conditions and then observing outcomes on the response variable
- The experimental conditions, which correspond to assigned values of the explanatory variable, are called treatments

- Headline: “Student Drug Testing Not Effective in Reducing Drug Use”
- Facts about the study:
- 76,000 students nationwide
- Schools selected for the study included schools that tested for drugs and schools that did not test for drugs
- Each student filled out a questionnaire asking about his/her drug use

- Conclusion: Drug use was similar in schools that tested for drugs and schools that did not test for drugs

This study was an observational study.

For it to be an experiment, the researcher would have had to assign each school to use or not use drug testing, rather than leaving this decision to the school.

- An experiment reduces the potential for lurking variables to affect the result. Thus, an experiment gives the researcher more control over outside influences.
- Only an experiment can establish cause and effect. Observational studies cannot.
- Experiments are not always possible due to ethical reasons, time considerations and other factors.

Chapter 4: Gathering Data

Section 4.2

What are Good Ways and Poor Ways to Sample?

- Sampling Frame & Sampling Design
- Simple Random Sample (SRS)
- Random number table
- Margin of Error
- Convenience Samples
- Types of Bias in Sample Surveys
- Key Parts of a Sample Survey

- The sampling frame is the list of subjects in the population from which the sample is taken; ideally, it lists the entire population of interest
- The sampling design determines how the sample is selected. Ideally, it should give each subject an equal chance of being selected to be in the sample

- Random Sampling is the best way of obtaining a sample that is representative of the population
- A simple random sample of ‘n’ subjects from a population is one in which each possible sample of that size has the same chance of being selected

- Two club officers are to be chosen for a New Orleans trip
- There are 5 officers: President, Vice-President, Secretary, Treasurer and Activity Coordinator
- The 10 possible samples are:
(P,V) (P,S) (P,T) (P,A) (V,S)

(V,T) (V,A) (S,T) (S,A) (T,A)

- For a SRS, each of the ten possible samples has an equal chance of being selected. Thus, each sample has a 1 in 10 chance of being selected and each officer has a 4 in 10 chance of being selected.
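A short R sketch of the club officer example: it lists all possible samples of size 2 and checks the 1 in 10 and 4 in 10 chances. The variable names and the seed are illustrative choices.

```r
officers <- c("P", "V", "S", "T", "A")

# All possible samples of 2 officers: one sample per column
samples <- combn(officers, 2)
n_samples <- ncol(samples)
print(n_samples)

# Share of the possible samples that include the President
share_P <- mean(apply(samples, 2, function(s) "P" %in% s))
print(share_P)

# Drawing one simple random sample of size 2
set.seed(1)   # arbitrary seed, only for reproducibility
chosen <- sample(officers, 2)
print(chosen)
```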

Table of Random Numbers

- Table E on pg. A6 of text

- To select a simple random sample
- Number the subjects in the sampling frame using numbers of the same length (number of digits)
- Select numbers of that length from a table of random numbers or using a random number generator
- Include in the sample those subjects having numbers equal to the random numbers selected

We need to select a random sample of 5 from a class of 20 students.

- List and number all members of the population, which is the class of 20.
- The number 20 is two-digits long.
- Parse the list of random digits into numbers that are two digits long. Here we choose to start with line 2, for no particular reason.

Line 2: 22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

Line 3: 24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30

1 Alison

2 Amy

3 Brigitte

4 Darwin

5 Emily

6 Fernando

7 George

8 Harry

9 Henry

10 John

11 Kate

12 Max

13 Moe

14 Nancy

15 Ned

16 Paul

17 Ramon

18 Rupert

19 Tom

20 Victoria

- Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting with line 2 and on.
- The first five random numbers matching numbers assigned to people make the SRS.

The first individual selected is Amy, number 02. That’s it from line 2. Move to line 3.

Then Moe (13), Darwin (04), Henry (09), and Ned (15).

- Remember that 1 is 01, 2 is 02, etc.
- If you were to hit 09 again before getting five people, don’t sample Henry twice; just keep going.
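The table-reading procedure can be mimicked in R. This sketch uses the two-digit numbers from lines 2 and 3 of the random number table as shown on the slide; the variable names are illustrative.

```r
students <- c("Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando",
              "George", "Harry", "Henry", "John", "Kate", "Max", "Moe",
              "Nancy", "Ned", "Paul", "Ramon", "Rupert", "Tom", "Victoria")

# Two-digit numbers read from lines 2 and 3 of the random number table
line2 <- c(22, 36, 84, 65, 73, 25, 59, 58, 53, 93,
           30, 99, 58, 91, 98, 27, 98, 25, 34, 2)
line3 <- c(24, 13, 4, 83, 60, 22, 52, 79, 72, 65,
           76, 39, 36, 48, 9, 15, 17, 92, 48, 30)
draws <- c(line2, line3)

# Keep numbers that match a student (01 to 20), skip repeats, take the first 5
valid  <- unique(draws[draws >= 1 & draws <= 20])
picked <- head(valid, 5)
print(students[picked])
```

With a random number generator instead of a table, `sample(students, 5)` draws a simple random sample directly.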

- Sample surveys are commonly used to estimate population percentages
- These estimates include a margin of error which tells us how well the sample estimate predicts the population percentage
- When a SRS of n subjects is used, the margin of error is approximately (1/√n) x 100%

- A survey result states: “The margin of error is plus or minus 3 percentage points”
- This means: “It is very likely that the reported sample percentage is no more than 3 percentage points lower or higher than the population percentage”
- Exercise: find a Gallup poll report, read its “Survey Methods” part, and justify the margin of error stated in the survey.
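A minimal R sketch of the margin of error, assuming the rule of thumb (1/√n) x 100% for a simple random sample of n subjects. The function name is a hypothetical helper, not part of any package.

```r
# Approximate margin of error, in percentage points, for a SRS of size n
# (assumes the rule of thumb 100 / sqrt(n))
margin_of_error <- function(n) 100 / sqrt(n)

moe_1000 <- margin_of_error(1000)
print(round(moe_1000, 1))

# Sample size needed for a margin of error of about 3 percentage points
n_needed <- ceiling((100 / 3)^2)
print(n_needed)
```

A survey of roughly 1000 people therefore carries a margin of error of about 3 percentage points under this approximation.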

- Convenience Sample: a type of survey sample that is easy to obtain
- Unlikely to be representative of the population
- Often severe biases result from such a sample
- Results apply ONLY to the observed subjects; that is, they are descriptive.

- Volunteer Sample: most common form of convenience sample
- Subjects volunteer for the sample
- Volunteers do not tend to be representative of the entire population

Bias: Tendency to systematically favor certain parts of the population over others

- Sampling Bias: Occurs when using biased samples, which are based on sampling methods such as using nonrandom samples or having undercoverage
- Nonresponse bias: occurs when some sampled subjects cannot be reached or refuse to participate or fail to answer some questions
- Response bias: occurs when the subject gives an incorrect response or the question is misleading
A Large Sample Does Not Guarantee An Unbiased Sample!

- Identify the population of all subjects of interest
- Construct a sampling frame which attempts to list all subjects in the population
- Use a random sampling design to select n subjects from the sampling frame
- Be cautious of sampling bias due to nonrandom samples
We can make inferences about the population of interest when sample surveys that use random sampling are employed.

Chapter 4: Gathering Data

Section 4.3

What Are Good Ways and Poor Ways to Experiment?

- Identify the elements of an experiment
- Experiments
- 3 Components of a good experiment
- Blinding the Study
- Define Statistical Significance
- Generalizing Results of the Study

- Experimental units: the subjects of an experiment; the entities that we measure in an experiment
- Treatment: A specific experimental condition imposed on the subjects of the study; the treatments correspond to assigned values of the explanatory variable
- Explanatory variable: Defines the groups to be compared with respect to values on the response variable
- Response variable: The outcome measured on the subjects to reveal the effect of the treatment(s).

- An experiment deliberately imposes treatments on the experimental units in order to observe their responses.
- The goal of an experiment is to compare the effect of the treatment on the response.
- Experiments that are randomized occur when the subjects are randomly assigned to the treatments; randomization helps to eliminate the effects of lurking variables

- Control/Comparison group: allows the researcher to analyze the effectiveness of the primary treatment
- Randomization: eliminates possible researcher bias, balances the comparison groups on known as well as on lurking variables (so that the observed difference among subjects is attributed to treatments)
- Replication: allows us to attribute observed effects to the treatments rather than ordinary variability

- A placebo is a dummy treatment, e.g., a sugar pill. Many subjects respond favorably to any treatment, even a placebo.
- A control group typically receives a placebo. A control group allows us to analyze the effectiveness of the primary treatment.
- A control group need not receive a placebo. Clinical trials often compare a new treatment for a medical condition, not with a placebo, but with a treatment that is already on the market.

- Experiments should compare treatments rather than attempt to assess the effect of a single treatment in isolation
- Is the treatment group better, worse, or no different than the control group?

- Example: 400 volunteers are asked to quit smoking and each start taking an antidepressant. In 1 year, how many have relapsed? Without a control group (individuals who are not on the antidepressant), it is not possible to gauge the effectiveness of the antidepressant.

- Placebo effect (power of suggestion) : The “placebo effect” is an improvement in health due not to any treatment but only to the patient’s belief that he or she will improve.

- To have confidence in our results we should randomly assign subjects to the treatments. In doing so, we
- Eliminate bias that may result from the researcher assigning the subjects
- Balance the groups on variables known to affect the response
- Balance the groups on lurking variables that may be unknown to the researcher

- Replication is the process of assigning several experimental units to each treatment
- The difference due to ordinary variation is smaller with larger samples
- We have more confidence that the sample results reflect a true difference due to treatments when the sample size is large
- Since it is always possible that the observed effects were due to chance alone, replicating the experiment also builds confidence in our conclusions

- Ideally, subjects are unaware of, or blind to, the treatment they are receiving
- If an experiment is conducted in such a way that neither the subjects nor the investigators working with them know which treatment each subject is receiving, then the experiment is double-blinded
- A double-blinded experiment controls response bias from the respondent and experimenter

- If an experiment (or other study) finds a difference in two (or more) groups, is this difference really important?
- If the observed difference is larger than what would be expected just by chance, then it is labeled statistically significant.
- Rather than relying solely on the label of statistical significance, also look at the actual results to determine if they are practically significant.

- Recall that the goal of experimentation is to analyze the association between the treatment and the response for the population, not just the sample
- However, care should be taken to generalize the results of a study only to the population that is represented by the study.

Chapter 4: Gathering Data

Section 4.4

What are Other Ways to Conduct Experimental and Observational Studies?

- Sample Surveys: Other Random Sampling Designs
- Types of Observational Studies: Prospective and Retrospective
- Multifactor Experiment
- Matched pairs design
- Randomized block design

- It is not always possible to conduct an experiment, so it is necessary to have well designed, informative studies that are not experimental, e.g., sample surveys that use randomization
- Simple Random Sampling
- Cluster Sampling
- Stratified Random Sampling

Steps (cluster sampling)

- Divide the population into a large number of clusters, such as city blocks
- Select a simple random sample of the clusters
- Use the subjects in those clusters as the sample

- Preferable when
- A reliable sampling frame is unavailable
- The cost of selecting a SRS is excessive

- Disadvantage
- Usually need a larger sample size than with a SRS in order to achieve a particular margin of error

Steps (stratified random sampling)

- Divide the population into separate groups, called strata
- Select a simple random sample from each stratum
- Combine the samples from all strata to form the complete sample

- Advantage is that you can include in your sample enough subjects in each stratum you want to evaluate
- Disadvantage is that you must have a sampling frame and know the stratum into which each subject belongs

Suppose a university has the following student demographics:

- Undergraduate: 55%
- Graduate: 20%
- First Professional: 5%
- Special: 20%

In order to ensure proper coverage of each group, a stratified random sample of 100 students could be chosen as follows: select a SRS of 55 undergraduates, a SRS of 20 graduates, a SRS of 5 first professional students, and a SRS of 20 special students; then combine these 100 students.
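The stratified sample of 100 students can be sketched in R. The sampling frame below is hypothetical (2000 made-up student IDs in the stated proportions); a real study would use the university’s actual list of students.

```r
set.seed(229)   # arbitrary seed, only for reproducibility

# Hypothetical sampling frame: 2000 students in the 55/20/5/20 proportions
frame <- data.frame(
  id      = 1:2000,
  stratum = rep(c("Undergraduate", "Graduate", "First Professional", "Special"),
                times = c(1100, 400, 100, 400))
)

# Number of students to draw from each stratum
n_per <- c("Undergraduate" = 55, "Graduate" = 20,
           "First Professional" = 5, "Special" = 20)

# Take a SRS within each stratum, then combine the selected IDs
picked <- unlist(lapply(names(n_per), function(s) {
  ids <- frame$id[frame$stratum == s]
  sample(ids, n_per[[s]])
}))
sampled <- frame[frame$id %in% picked, ]
print(table(sampled$stratum))
```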

An observational study can yield useful information when an experiment is not practical.

- Types of observational studies:
- Sample Survey: attempts to take a cross section of a population at the current time
- Retrospective study: looks into the past
- Prospective study: follows its subjects into the future

- Causation can never be definitively established with an observational study, but well designed studies can provide supporting evidence for the researcher’s beliefs

- A case-control study is a retrospective observational study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable

- Response outcome of interest: Lung cancer
- The cases have lung cancer
- The controls do not have lung cancer

- The two groups were compared on the explanatory variable smoker/nonsmoker

Nurses’ Health Study:

- Began in 1976 with 121,700 female nurses aged 30 to 55; questionnaires are filled out every two years
- Purpose was to explore the relationships among diet, hormonal factors, smoking habits and exercise habits and the risk of coronary heart disease, pulmonary disease and stroke
- Nurses are followed into the future to determine whether they eventually develop an outcome such as lung cancer and whether certain explanatory variables are associated with it

- A Multifactor experiment uses a single experiment to analyze the effects of two or more explanatory variables on the response
- Categorical explanatory variables in an experiment are often called factors
- We are often able to learn more from a multifactor experiment than from separate one-factor experiments since the response may vary for different factor combinations

- Examine the effectiveness of both Zyban and nicotine patches on quitting smoking
- Two factor experiment
- 4 treatments

- subjects: a certain number of undergraduate students
- all subjects viewed a 40-minute television program that included ads for a digital camera
- some subjects saw a 30-second commercial; others saw a 90-second version
- same commercial was shown either 1, 3, or 5 times during the program
- there were two factors: length of the commercial (2 values), and number of repetitions (3 values)

subjects assigned to Treatment 3 see a 30-second ad five times during the program

- the 6 combinations of one value of each factor form six treatments

- after viewing, all subjects answered questions about: recall of the ad, their attitude toward the commercial, and their intention to purchase the product – these were the response variables.
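The six treatments of the commercial experiment can be listed in R, together with one possible random assignment. The number of subjects (60) is a hypothetical choice, since the slide gives no count, and the treatment numbering is arranged so that Treatment 3 is the 30-second ad shown five times.

```r
# All combinations of the two factors; the first factor varies fastest
treatments <- expand.grid(repetitions = c(1, 3, 5), length_sec = c(30, 90))
treatments$treatment <- seq_len(nrow(treatments))
print(treatments)

# Randomly assign 60 hypothetical subjects, 10 to each treatment
set.seed(4)   # arbitrary seed, only for reproducibility
subjects <- data.frame(id = 1:60)
subjects$treatment <- sample(rep(treatments$treatment,
                                 length.out = nrow(subjects)))
print(table(subjects$treatment))
```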

In a matched pairs design, the subjects receiving the two treatments are somehow matched (same person, husband/wife, two plots in the same field, etc.)

- In a crossover design, the same individual is used for the two treatments

- randomly assign the two treatments to the two matched subjects, or
- randomize the order of applying the treatments in a crossover design

- A block is a set of experimental units that are matched with respect to one or more characteristics
- A Randomized Block Design, RBD, is when the random assignment of experimental units to treatments is carried out separately within each block

- Block = gender; 3 treatments = 3 types of therapy
- The men (as well as the women) are randomly assigned to the 3 treatments; differences can be compared with respect to gender as well as therapy type

- RBD eliminates variability in the response due to the blocking variable; allows for better comparisons to be made among the treatments of interest
- A matched pairs design is a special case of a RBD with two observations in each block
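A randomized block design can be sketched in R by randomizing separately within each block. The 12 subjects and the therapy labels are hypothetical.

```r
set.seed(2)   # arbitrary seed, only for reproducibility

# Hypothetical subjects: 6 men and 6 women; gender is the blocking variable
subjects <- data.frame(id = 1:12, gender = rep(c("M", "F"), each = 6))
therapies <- c("therapy1", "therapy2", "therapy3")

# Random assignment carried out separately within each block
subjects$treatment <- NA_character_
for (g in unique(subjects$gender)) {
  idx <- which(subjects$gender == g)
  subjects$treatment[idx] <- sample(rep(therapies, length.out = length(idx)))
}

# Every therapy appears equally often within each gender
design <- table(subjects$gender, subjects$treatment)
print(design)
```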

Chapter 5: Probability in Our Daily Lives

Section 5.1: How Can Probability Quantify Randomness?

- Section 5.1: 5.2, 5.4, 5.6, 5.8
- Section 5.2: all even
- Section 5.3: all even
- Section 5.4: 5.48, 5.50, 5.56, 5.58, 5.60, 5.62

- Random Phenomena
- Law of Large Numbers
- Probability
- Independent Trials
- Finding probabilities
- Types of Probabilities: Relative Frequency and Subjective

- For random phenomena, the outcome is uncertain
- In the short-run, the proportion of times that something happens is highly random
- In the long-run, the proportion of times that something happens becomes very predictable
Probability quantifies long-run randomness

- As the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number “in the long run”
- For example, as one rolls a die, in the long run 1/6 of the observations will be a 3.

- With random phenomena, the probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations
- Example:
- When rolling a die, the outcome of “3” has probability = 1/6. In other words, the proportion of times that a 3 would occur in a long run of observations is 1/6.
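The long-run behavior can be seen in a small R simulation of a fair die. The seed and the number of rolls are arbitrary choices.

```r
set.seed(3)   # arbitrary seed, only for reproducibility

# Roll a fair die 10,000 times
rolls <- sample(1:6, 10000, replace = TRUE)

# Running proportion of 3s after each roll
running <- cumsum(rolls == 3) / seq_along(rolls)

# Proportion of 3s after 10, 100, 1000 and 10,000 rolls, next to 1/6
print(running[c(10, 100, 1000, 10000)])
print(1 / 6)
```

The early proportions bounce around, while the later ones settle close to 1/6.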

- Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial.
- Example:
- If you have 20 flips of a coin in a row that are “heads”, you are not “due” a “tail” - the probability of a tail on your next flip is still 1/2. The trial of flipping a coin is independent of previous flips.

- Calculate theoretical probabilities based on assumptions about the random phenomenon. For example, it is often reasonable to assume that outcomes are equally likely, such as when flipping a coin or rolling a die.
- Observe many trials of the random phenomenon and use the sample proportion of the number of times the outcome occurs as its probability. This is merely an estimate of the actual probability.

- The relative frequency definition of probability is the long run proportion of times that the outcome occurs in a very large number of trials - not always helpful/possible.
- When a long run of trials is not feasible, you must rely on subjective information. In this case, the subjective definition of the probability of an outcome is your degree of belief that the outcome will occur based on the information available.
- Bayesian statistics is a branch of statistics that uses subjective probability as its foundation

Chapter 5: Probability in Our Daily Lives

Section 5.2: How Can We Find Probabilities?

- Sample Space
- Event
- Probabilities for a sample space
- Probability of an event
- Basic rules for finding probabilities about a pair of events

- Probability of the union of two events
- Probability of the intersection of two events

- For a random phenomenon, the sample space is the set of all possible outcomes.

- An event is a subset of the sample space
- An event corresponds to a particular outcome or a group of possible outcomes.
- For example:
- Event A = “student answers all 3 questions correctly” = (CCC)
- Event B = “student passes (at least 2 correct)” = (CCI, CIC, ICC, CCC)

Each outcome in a sample space has a probability

- The probability of each individual outcome is between 0 and 1
- The total of all the individual probabilities equals 1.

- The probability of an event A, denoted by P(A), is obtained by adding the probabilities of the individual outcomes in the event.
- When all the possible outcomes are equally likely: P(A) = (number of outcomes in event A) / (number of outcomes in the sample space)

- What is the sample space for selecting a taxpayer?
{(under $25,000, Yes), (under $25,000, No), ($25,000 - $49,000, Yes), …}

For a randomly selected taxpayer in 2002,

- What is the probability of an audit?
- 310/80200=0.004

- What is the probability of an income of $100,000 or more?
- 10700/80200=0.133

- What income level has the greatest probability of being audited?
- $100,000 or more = 80/10700= 0.007

- Some events are expressed as the outcomes that
- Are not in some other event (complement of the event)
- Are in one event and in another event (intersection of two events)
- Are in one event or in another event (union of two events)

- The complement of an event A consists of all outcomes in the sample space that are not in A.
- The probabilities of A and of its complement A^c add to 1
- P(A^c) = 1 – P(A)

- Two events, A and B, are disjoint if they do not have any common outcomes

- The intersection of A and B consists of outcomes that are in both A and B

- The union of A and B consists of outcomes that are in A or B or in both A and B.

- Addition Rule:
- For the union of two events,
- P(A or B) = P(A) + P(B) – P(A and B)
- If the events are disjoint, P(A and B) = 0, so
- P(A or B) = P(A) + P(B)

- 80.2 million taxpayers (80,200 thousand)
- Event A = being audited
- Event B = income of $100,000 or more
- P(A and B) = 80/80200 = 0.001
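The addition rule can be applied to the taxpayer counts quoted on these slides (in thousands: 80,200 taxpayers, 310 audited, 10,700 with income of $100,000 or more, 80 in both groups). A short R sketch:

```r
total   <- 80200   # all taxpayers (thousands)
audited <- 310     # event A
high    <- 10700   # event B: income of $100,000 or more
both    <- 80      # A and B

p_a  <- audited / total
p_b  <- high / total
p_ab <- both / total

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b <- p_a + p_b - p_ab
print(round(p_a_or_b, 3))
```

Subtracting P(A and B) avoids counting the audited high-income taxpayers twice.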

Multiplication Rule:

For the intersection of two independent events, A and B, P(A and B) = P(A) x P(B)

- What is the probability of getting 3 questions correct by guessing?
- Probability of guessing correctly is 0.2, so P(CCC) = 0.2 x 0.2 x 0.2 = 0.008

- What is the probability that a student answers at least 2 questions correctly?
- P(CCC) + P(CCI) + P(CIC) + P(ICC) =
- 0.008 + 3(0.032) = 0.104
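The pop quiz calculation can be checked in R, assuming independent guesses with probability 0.2 of a correct answer on each of the 3 questions. The cross-check with `dbinom` uses the binomial distribution, which is not introduced on these slides.

```r
p <- 0.2                      # probability of guessing one question correctly

p_ccc  <- p^3                 # all three correct
p_two  <- 3 * p^2 * (1 - p)   # CCI, CIC, ICC: exactly two correct
p_pass <- p_ccc + p_two
print(p_pass)

# Cross-check: probability of 2 or 3 correct out of 3 independent guesses
print(sum(dbinom(2:3, size = 3, prob = p)))
```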

- Don’t assume that events are independent unless you have given this assumption careful thought and it seems plausible.

- Example: A Pop Quiz with 2 Multiple Choice Questions
- Data giving the proportions for the actual responses of students in a class:
Outcome: II IC CI CC

Probability: 0.26 0.11 0.05 0.58

- Define the events A and B as follows:
- A: {first question is answered correctly}
- B: {second question is answered correctly}

- P(A) = P{(CI), (CC)} = 0.05 + 0.58 = 0.63
- P(B) = P{(IC), (CC)} = 0.11 + 0.58 = 0.69
- P(A and B) = P{(CC)} = 0.58
- If A and B were independent,
P(A and B) = P(A) x P(B) = 0.63 x 0.69 = 0.43

Thus, in this case, A and B are not independent!
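The independence check for the two-question quiz can be written in R using the four outcome probabilities from the slide. The variable names are illustrative.

```r
probs <- c(II = 0.26, IC = 0.11, CI = 0.05, CC = 0.58)

p_a  <- probs[["CI"]] + probs[["CC"]]   # first question correct
p_b  <- probs[["IC"]] + probs[["CC"]]   # second question correct
p_ab <- probs[["CC"]]                   # both correct

print(c(p_ab = p_ab, product = p_a * p_b))

# Independence would require P(A and B) = P(A) x P(B)
independent <- isTRUE(all.equal(p_ab, p_a * p_b))
print(independent)
```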

Chapter 5: Probability in Our Daily Lives

Section 5.3: Conditional Probability: What’s the Probability of A, Given B?

- Conditional probability
- Multiplication rule for finding P(A and B)
- Independent events defined using conditional probability

- For events A and B, the conditional probability of event A, given that event B has occurred, is: P(A|B) = P(A and B) / P(B)
- P(A|B) is read as “the probability of event A, given event B.” The vertical slash represents the word “given”. Of the times that B occurs, P(A|B) is the proportion of times that A also occurs

- What was the probability of being audited, given that the income was ≥ $100,000?
- Event A: Taxpayer is audited
- Event B: Taxpayer’s income ≥ $100,000
- P(A|B) = P(A and B)/P(B) = (80/80200)/(10700/80200) = 80/10700 = 0.007

- What is the probability of being audited, given that the income level is < $25,000?
- Let A = event of being audited
- Let B = income < $25,000
- P(A and B) = 0.0011
- P(B) = 0.1758
- P(A|B) = 0.0011/0.1758 = 0.0063
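Both audit questions use the same ratio, P(A|B) = P(A and B)/P(B). A small R sketch with the numbers from the slides; `cond_prob` is a hypothetical helper name.

```r
# Conditional probability of A given B
cond_prob <- function(p_a_and_b, p_b) p_a_and_b / p_b

# Audited, given income < $25,000
low <- cond_prob(0.0011, 0.1758)
print(round(low, 4))

# Audited, given income of $100,000 or more (counts in thousands)
high <- cond_prob(80 / 80200, 10700 / 80200)
print(round(high, 3))
```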

- A study of 5282 women aged 35 or over analyzed the Triple Blood Test to test its accuracy

- A positive test result states that the condition is present
- A negative test result states that the condition is not present
- False Positive: Test states the condition is present, but it is actually absent
- False Negative: Test states the condition is absent, but it is actually present

- Assuming the sample is representative of the population, find the estimated probability of a positive test for a randomly chosen pregnant woman 35 years or older
- P(POS) = 1355/5282 = 0.257

- Given that the diagnostic test result is positive, find the estimated probability that Down syndrome truly is present (D = Down syndrome present)
- P(D|POS) = P(POS and D)/P(POS) = 0.009/0.257 = 0.035
- Summary: Of the women who tested positive, fewer than 4% actually had fetuses with Down syndrome

- For events A and B, the probability that A and B both occur equals:
- P(A and B) = P(A|B) x P(B)
- also, P(A and B) = P(B|A) x P(A)

- Roger Federer – 2006 men’s champion in the Wimbledon tennis tournament
- He made 56% of his first serves
- He faulted on the first serve 44% of the time
- Given that he made a fault with his first serve, he made a fault on his second serve only 2% of the time

- Assuming these are typical of his serving performance, when he serves, what is the probability that he makes a double fault?
- P(F1) = 0.44
- P(F2|F1) = 0.02
- P(F1 and F2) = P(F2|F1) x P(F1) = 0.02 x 0.44 = 0.009

- Two events A and B are independent if the probability that one occurs is not affected by whether or not the other event occurs
- Events A and B are independent if:
P(A|B) = P(A), or equivalently, P(B|A) = P(B)

- If events A and B are independent,
P(A and B) = P(A) x P(B)

- To determine whether events A and B are independent:
- Is P(A|B) = P(A)?
- Is P(B|A) = P(B)?
- Is P(A and B) = P(A) x P(B)?

- If any of these is true, the others are also true and the events A and B are independent

- The diagnostic blood test for Down syndrome:
POS = positive result

NEG = negative result

D = Down Syndrome

D^c = Unaffected (the complement of D)

- Are the events POS and D independent or dependent? Is P(POS|D) = P(POS)?
- P(POS|D) = P(POS and D)/P(D) = 0.009/0.010 = 0.90

- P(POS) = 0.257
- The events POS and D are dependent

Chapter 5: Probability in Our Daily Lives

Section 5.4: Applying the Probability Rules

- Is a “Coincidence” Truly an Unusual Event?
- Probability Model
- Probabilities and Diagnostic Testing
- Simulation

- The law of very large numbers states that if something has a very large number of opportunities to happen, occasionally it will happen, even if it seems highly unusual

- What is the probability that at least two students in a group of 25 students have the same birthday?

- P(at least one match) = 1 – P(no matches)

- P(no matches) = P(students 1 and 2 and 3 …and 25 have different birthdays)

- P(no matches) = (365/365) x (364/365) x (363/365) x … x (341/365)

- P(no matches) = 0.43

- P(at least one match) = 1 – P(no matches) = 1 – 0.43 = 0.57

Not so surprising when you consider that there are 300 pairs of students who can share the same birthday!
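The exact birthday calculation can be reproduced in R, assuming 365 equally likely birthdays and independence between students.

```r
n <- 25

# P(no matches) = (365/365) x (364/365) x ... x (341/365)
p_no_match <- prod((365 - 0:(n - 1)) / 365)
p_match    <- 1 - p_no_match
print(round(c(no_match = p_no_match, match = p_match), 2))

# Number of pairs of students who could share a birthday
n_pairs <- choose(n, 2)
print(n_pairs)
```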

- We’ve dealt with finding probabilities in many idealized situations
- In practice, it’s difficult to tell when outcomes are equally likely or events are independent
- In most cases, we must specify a probability model that approximates reality

- A probability model specifies the possible outcomes for a sample space and provides assumptions on which the probability calculations for events composed of these outcomes are based
- Probability models merely approximate reality

- Out of the first 113 space shuttle missions there were two failures
- What is the probability of at least one failure in a total of 100 missions?
- P(at least 1 failure) = 1 - P(0 failures)
= 1 - P(S1 and S2 and S3 … and S100)
= 1 - P(S1) x P(S2) x … x P(S100)
= 1 - [P(S)]^100 = 1 - [0.971]^100 = 0.947

- This answer relies on the assumptions of
- Same success probability on each flight
- Independence
These assumptions are suspect since other variables (temperature at launch, crew experience, age of craft, etc.) could affect the probability
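The space shuttle calculation in R, under the two assumptions listed above (same success probability on each flight, independent flights). The success probability 0.971 is the value used on the slide; the function name is a hypothetical helper.

```r
# P(at least one failure in n flights) = 1 - P(success)^n
p_at_least_one_failure <- function(p_success, n) 1 - p_success^n

p_fail_100 <- p_at_least_one_failure(0.971, 100)
print(round(p_fail_100, 3))
```

Because the answer depends strongly on the assumed per-flight success probability, it is worth recomputing with other plausible values before trusting it.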

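- Sensitivity = P(POS|S), the probability of a positive test when the condition S is present
- Specificity = P(NEG|S^c), the probability of a negative test when the condition is absent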

Random Drug Testing of Air Traffic Controllers

- Sensitivity of test = 0.96
- Specificity of test = 0.93
- Probability of drug use at a given time ≈ 0.007 (prevalence of the drug)

What is the probability of a positive test result?

P(POS) = P(S and POS) + P(S^c and POS)

- P(S and POS) = P(S) x P(POS|S) = 0.007 x 0.96 = 0.0067

- P(S^c and POS) = P(S^c) x P(POS|S^c) = 0.993 x 0.07 = 0.0695

- P(POS) = 0.0067 + 0.0695 = 0.0762
Even though the prevalence is < 1%, there is an almost 8% chance of the test suggesting drug use!
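The drug testing numbers can be checked in R. The last step, the probability of drug use given a positive test, is not computed on the slide; it is added here as a natural follow-up using the conditional probability formula.

```r
sens <- 0.96    # sensitivity: P(POS | S)
spec <- 0.93    # specificity: P(NEG | S^c)
prev <- 0.007   # prevalence:  P(S)

# P(POS) = P(S and POS) + P(S^c and POS)
p_pos <- prev * sens + (1 - prev) * (1 - spec)
print(round(p_pos, 4))

# Follow-up: P(S | POS) = P(S and POS) / P(POS)
p_use_given_pos <- prev * sens / p_pos
print(round(p_use_given_pos, 3))
```

Under these inputs, fewer than 1 in 10 positive tests would come from an actual drug user, because the condition is so rare.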

Some probabilities are very difficult to find with ordinary reasoning. In such cases, we can approximate an answer by simulation.

Carrying out a Simulation:

- Identify the random phenomenon to be simulated
- Describe how to simulate observations
- Carry out the simulation many times (at least 1000 times)
- Summarize results and state the conclusion

R code

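birthdayMatch = function(n = 25, rep = 1000){
  match = 0                                # number of simulated groups with a shared birthday
  for (i in 1:rep){
    x = sample(1:365, n, replace = TRUE)   # n birthdays, each day equally likely
    if (anyDuplicated(x) > 0) match = match + 1
  }
  match/rep                                # estimated P(at least one match)
}

birthdayMatch(rep = 1000)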

In table tennis, the first person to get at least 21 points while being ahead of the opponent by at least 2 points wins the game. In games between you and an opponent, suppose successive points are independent, and suppose the probability of your winning any given point is 0.40. Simulate the table tennis process and find your chance of winning a game.

The chance of winning a game is approximately the proportion of games you win in 1000 games.

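tableTennis = function(p = 0.4, sim = 1000, verbose = TRUE){
  win = 0                       # number of games you win
  score = matrix(0, sim, 2)     # final score of each game: (you, opponent)
  for (i in 1:sim){
    A = B = 0                   # A = your points, B = your opponent's points
    if (verbose) cat("Game No.", i, ":\n")
    # keep playing until someone has at least 21 points and leads by at least 2
    while (max(A, B) < 21 || abs(A - B) < 2){
      if (runif(1) < p) A = A + 1 else B = B + 1
      if (verbose) print(c(A, B))   # running score after each point
    }
    score[i, ] = c(A, B)
    win = win + (A > B)
  }
  list(score = score, prob = win / sim)   # prob = estimated chance of winning a game
}

tableTennis(p = 0.4, sim = 10)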

Chapter 6: Probability Distributions

Section 6.1: How Can We Summarize Possible Outcomes and Their Probabilities?

- Page 277: 6.1, 6.4, 6.6, 6.8, 6.10, 6.12
- Page 290: 6.16, 6.20, 6.22, 6.24, 6.26, 6.27, 6.28, 6.30
- Page 299: 6.36, 6.38, 6.40, 6.42, 6.46, 6.48

- Section 5.1, 5.2
- Section 6.1, 6.2, 6.3, 6.4
- Due with Homework #6

- Random variable
- Probability distributions for discrete random variables
- Mean of a probability distribution
- Summarizing the spread of a probability distribution
- Probability distribution for continuous random variables

- A random variable is a numerical measurement of the outcome of a random phenomenon.

- Use a capital letter, such as X, to refer to the random variable itself.
- Use a lowercase letter, such as x, to refer to a particular value of the random variable X.
Example: Flip a coin three times

- X = number of heads in the 3 flips; defines the random variable
- x = 2; represents a realized value of the random variable X.

- The probability distribution of a discrete random variable specifies its possible values and their probabilities.
- The probability distribution of a continuous random variable specifies the intervals where the random variable falls and their probabilities.

- A discrete random variable X has separate values (such as 0, 1, 2, …) as its possible outcomes
- Its probability distribution assigns a probability P(x) to each possible value x:
- For each x, the probability P(x) falls between 0 and 1
- The sum of the probabilities for all the possible x values equals 1

- What is the estimated probability of at least three home runs?
P(3)+P(4)+P(5)=0.13+0.03+0.01=0.17

- The mean of a probability distribution for a discrete random variable is µ = Σ x P(x), where the sum is taken over all possible values of x.

- The mean of a probability distribution is denoted by the parameter, µ.
- The mean is a weighted average; values of x that are more likely receive greater weight P(x)

- The mean of a probability distribution of a random variable X is also called the expected value of X.
- The expected value reflects not what we’ll observe in a single observation, but rather what we expect for the average in a long run of observations.

- Find the mean of this probability distribution.

The mean:

µ = Σ x·P(x) = 0(0.23) + 1(0.38) + 2(0.22) + 3(0.13) + 4(0.03) + 5(0.01) = 1.38

The standard deviation of a probability distribution, denoted by the parameter, σ, measures its spread.

- Larger values of σ correspond to greater spread.
- Roughly, 0.8σ describes how far the random variable falls, on the average, from the mean of its distribution
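
The mean, the spread, and the "at least three home runs" probability can be checked numerically. The following is a minimal R sketch, assuming the six probabilities are the ones that appear in the mean calculation; the variable names are illustrative, and the standard deviation uses the usual definition σ = √(Σ (x − µ)²·P(x)), which the slides do not spell out.

```r
# Possible values and their probabilities (taken from the mean calculation)
x <- 0:5
p <- c(0.23, 0.38, 0.22, 0.13, 0.03, 0.01)

# A valid probability distribution must sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

mu <- sum(x * p)                     # mean: weighted average of the x values
sigma <- sqrt(sum((x - mu)^2 * p))   # standard deviation: root of the weighted squared deviations
p_at_least_3 <- sum(p[x >= 3])       # P(3) + P(4) + P(5)

print(c(mean = mu, sd = sigma, at_least_3 = p_at_least_3))
```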

- A continuous random variable has an infinite continuum of possible values in an interval.
- Examples are: time, age and size measures such as height and weight.

- A continuous random variable has possible values that form an interval.
- Its probability distribution is specified by a curve.
- Each interval has probability between 0 and 1.
- The interval containing all possible values has probability equal to 1.

Pr( 0.51≤X ≤1.48)

Chapter 6: Probability Distributions

Section 6.2: How Can We Find Probabilities for Bell-Shaped Distributions?

- Normal Distribution
- 68-95-99.7 Rule for normal distributions
- Z-Scores and the Standard Normal Distribution
- The Standard Normal Table: Finding Probabilities
- Using the TI-calculator: find probabilities

- Using the Standard Normal Table in Reverse
- Using the TI-calculator: find z-scores
- Probabilities for Normally Distributed Random Variables
- Percentiles for Normally Distributed Random Variables
- Using Z-scores to Compare Distributions

The normal distribution is symmetric, bell-shaped and characterized by its mean µ and standard deviation σ.

- The normal distribution is the most important distribution in statistics
- Many variables have approximately normal distributions
- Approximates many discrete distributions well when there are a large number of possible outcomes
- Many statistical methods use it even when the data are not bell shaped

- Normal distributions are
- Bell shaped
- Symmetric around the mean

- The mean (µ) and the standard deviation (σ) completely describe the density curve
- Increasing/decreasing µ moves the curve along the horizontal axis
- Increasing/decreasing σ controls the spread of the curve

The bigger the variance, the wider and flatter the curve.

- Within what interval do almost all of the men’s heights fall? Women’s height?

- 68% of the observations fall within one standard deviation of the mean
- 95% of the observations fall within two standard deviations of the mean
- 99.7% of the observations fall within three standard deviations of the mean

- Heights of adult women
- can be approximated by a normal distribution
- µ = 65 inches; σ = 3.5 inches

- 68-95-99.7 Rule for women’s heights
- 68% are between 61.5 and 68.5 inches
[ µ ± σ = 65 ± 3.5 ]

- 95% are between 58 and 72 inches
[ µ ± 2σ = 65 ± 2(3.5) = 65 ± 7 ]

- 99.7% are between 54.5 and 75.5 inches
[ µ ± 3σ = 65 ± 3(3.5) = 65 ± 10.5 ]
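
The three intervals for women's heights can be reproduced in a few lines of R. This is only a sketch: the data frame and its column names are illustrative, and `pnorm` is used to show the exact normal probabilities that the 68-95-99.7 rule rounds.

```r
mu <- 65       # mean height in inches
sigma <- 3.5   # standard deviation in inches

k <- 1:3       # number of standard deviations from the mean
rule <- data.frame(
  k = k,
  lower = mu - k * sigma,
  upper = mu + k * sigma,
  prob = pnorm(k) - pnorm(-k)   # normal probability within k standard deviations
)
print(rule)
```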

- The z-score for a value x of a random variable is the number of standard deviations that x falls from the mean
- A negative (positive) z-score indicates that the value is below (above) the mean
- z-scores can be used to calculate the probabilities of a normal random variable using the normal tables in the back of the book

- A standard normal distribution has mean µ=0 and standard deviation σ=1
- When a random variable has a normal distribution and its values are converted to z-scores by subtracting the mean and dividing by the standard deviation, the z-scores have the standard normal distribution.

Standard normal curve

Table A enables us to find normal probabilities

- It tabulates the normal cumulative probabilities falling below the point +z
To use the table:

- Look up the closest value in the table to the z score.
- First column gives z to the first decimal place
- First row gives the second decimal place of z

- The corresponding probability found in the body of the table gives the probability of falling below the z-score

- Find the probability that a normal random variable takes a value less than 1.43 standard deviations above µ; P(z < 1.43)=.9236

TI Calculator = Normcdf(-1e99, 1.43 , 0, 1)= .9236

- Find the probability that a normal random variable takes a value greater than 1.43 standard deviations above µ: P(z>1.43)=1-.9236=.0764

TI Calculator = Normcdf(1.43,1e99,0,1)= 0.0764

- Find the probability that a normal random variable assumes a value within 1.43 standard deviations of µ
- Probability below 1.43σ = .9236
- Probability below -1.43σ = .0764 (1-.9236)
- P(-1.43<z<1.43) =.9236-.0764=.8472

TI Calculator = Normcdf(-1.43,1.43,0,1)= .8472

To calculate the cumulative probability

- 2nd DISTR; 2:normalcdf(lower bound, upper bound, mean, sd)
- Use –1E99 for negative infinity and 1E99 for positive infinity

- Find probability to the left of -1.64
- P(z<-1.64)=normcdf(-1e99,-1.64,0,1)=.0505

- Find probability to the right of 1.56
- P(z>1.56)=normcdf(1.56,1e99,0,1)=.0594

- Find probability between -.50 and 2.25
- P(-.5<z<2.25)=normcdf(-.5,2.25,0,1)=.6793
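
For readers working in R rather than on a TI calculator, the same standard normal probabilities can be sketched with `pnorm`, which returns the cumulative probability below z. The variable names are illustrative. Results computed this way can differ from table-based answers in the fourth decimal place, because the table rounds each cumulative probability before subtracting.

```r
# pnorm(z) is the cumulative probability below z for the standard normal
p_left    <- pnorm(-1.64)                # P(Z < -1.64)
p_right   <- 1 - pnorm(1.56)             # P(Z > 1.56)
p_between <- pnorm(2.25) - pnorm(-0.50)  # P(-0.50 < Z < 2.25)
p_within  <- pnorm(1.43) - pnorm(-1.43)  # P(-1.43 < Z < 1.43)

print(round(c(left = p_left, right = p_right,
              between = p_between, within = p_within), 4))
```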

- http://www.math.unb.ca/~knight/utility/NormTble.htm
- From the standard normal distribution table, we can find
probabilities such as

Find Normal Probabilities in Excel

In Excel, use NORMDIST(x, 0, 1, true). For example, to find P(Z < 0.62), in Excel, type

=NORMDIST(0.62, 0, 1, TRUE)

And press the ENTER key. The answer is 0.732.

- To solve some of our problems, we will need to find the value of z that corresponds to a certain normal cumulative probability
- To do so, we use Table A in reverse
- Rather than finding z using the first column (value of z up to one decimal) and the first row (second decimal of z)
- Find the probability in the body of the table
- The z-score is given by the corresponding values in the first column and row

- Example: Find the value of z for a cumulative probability of 0.025.
- Look up the cumulative probability of 0.025 in the body of Table A.
- A cumulative probability of 0.025 corresponds to z = -1.96.
- Thus, the probability that a normal random variable falls at least 1.96 standard deviations below the mean is 0.025.

- Example: Find the value of z for a cumulative probability of 0.975.
- Look up the cumulative probability of 0.975 in the body of Table A.
- A cumulative probability of 0.975 corresponds to z = 1.96.
- Thus, the probability that a normal random variable takes a value no more than 1.96 standard deviations above the mean is 0.975.

- 2nd DISTR 3:invNorm
- invNorm(percentile,mean,sd), then press Enter
- Percentile is the probability under the curve from negative infinity to the z-score

- The probability that a standard normal random variable assumes a value that is ≤ z is 0.975. What is z? Invnorm(.975,0,1)=1.96
- The probability that a standard normal random variable assumes a value that is > z is 0.025.
What is z? Invnorm(1-.025,0,1)=1.96

- The probability that a standard normal random variable assumes a value that is ≥ z is 0.881.
What is z? Invnorm(1-.881,0,1)=-1.18

- The probability that a standard normal random variable assumes a value that is < z is 0.119.
What is z? Invnorm(.119,0,1)= -1.18

- Find the z-score z such that the probability within z standard deviations of the mean is 0.50.
- Invnorm(.75,0,1)= .67
- Invnorm(.25,0,1)= -.67

- Probability = P(-.67<Z<.67)=.5
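
The invNorm examples above have a direct counterpart in R's `qnorm`, which takes a cumulative (left-tail) probability and returns the z-score. A minimal sketch with illustrative variable names; right-tail probabilities are converted to left-tail probabilities first.

```r
z_below_975 <- qnorm(0.975)          # P(Z <= z) = 0.975
z_above_025 <- qnorm(1 - 0.025)      # P(Z > z) = 0.025, same z as above
z_above_881 <- qnorm(1 - 0.881)      # P(Z >= z) = 0.881
z_below_119 <- qnorm(0.119)          # P(Z < z) = 0.119

# Middle 50%: leave 25% in each tail
z_middle_50 <- qnorm(c(0.25, 0.75))

print(round(c(z_below_975, z_above_025, z_above_881, z_below_119, z_middle_50), 2))
```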

- State the problem in terms of the observed random variable X, i.e., P(X<x)
- Standardize X to restate the problem in terms of a standard normal variable Z
- Draw a picture to show the desired probability under the standard normal curve
- Find the area under the standard normal curve using Table A

Standard normal

Shaded areas are kept same.

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure less than 100?
- P(X<100) = P(Z < (100 − 120)/20) = P(Z < −1.00)
- Normcdf(-1E99,100,120,20)=.1587
- 15.9% of adults have systolic blood pressure less than 100

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure greater than 100?
- P(X>100) = 1 – P(X<100)
- P(X>100)= 1-.1587=.8413
- Normcdf(100,1e99,120,20)=.8413
- 84.1% of adults have systolic blood pressure greater than 100

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure greater than 133?
- P(X>133) = 1 – P(X<133)
- Normcdf(133,1E99,120,20)=.2578
- 25.8% of adults have systolic blood pressure greater than 133

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What percentage of adults have systolic blood pressure between 100 and 133?
- P(100<X<133) = P(X<133) - P(X<100)
- Normcdf(100,133,120,20)=.5835
- 58% of adults have systolic blood pressure between 100 and 133
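
The four blood pressure probabilities can be sketched in R with `pnorm`, either by passing the mean and standard deviation directly or by standardizing first. Variable names are illustrative.

```r
mu <- 120     # mean systolic blood pressure
sigma <- 20   # standard deviation

p_below_100  <- pnorm(100, mean = mu, sd = sigma)          # P(X < 100)
p_above_100  <- 1 - p_below_100                            # P(X > 100)
p_above_133  <- 1 - pnorm(133, mean = mu, sd = sigma)      # P(X > 133)
p_100_to_133 <- pnorm(133, mu, sigma) - pnorm(100, mu, sigma)  # P(100 < X < 133)

# Standardizing gives the same answer: z = (x - mu) / sigma
z_100 <- (100 - mu) / sigma
p_below_100_via_z <- pnorm(z_100)

print(round(c(p_below_100, p_above_100, p_above_133, p_100_to_133), 4))
```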

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. What is the 1st quartile?
- Translation: Given P(X < x) = .25, find x.
- Look up .25 in the body of Table A to find z = -0.67
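- Solve the equation z = (x − µ)/σ for x: x = µ + zσ = 120 + (−0.67)(20) = 106.6

- Check:
- P(X<106.6) = P(Z<-0.67)=0.25
- TI Calculator = Invnorm(.25,120,20)=106.5 (the calculator uses the unrounded z = -0.6745)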

- Adult systolic blood pressure is normally distributed with µ = 120 and σ = 20. 10% of adults have systolic blood pressure above what level?
- P(X>x)=.10, find x.
- P(X>x)=1-P(X<x)
- Look up 1-0.1=0.9 in the body of Table A to find z=1.28
- Solve the equation z = (x − µ)/σ for x: x = µ + zσ = 120 + 1.28(20) = 145.6

- Check:
- P(X>145.6) =P(Z>1.28)=0.10
- TI Calculator = Invnorm(.9,120,20)=145.6
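
The two percentile problems can be sketched in R with `qnorm`, which plays the role of invNorm. The sketch also converts a z-score back to the original scale with x = µ + zσ; variable names are illustrative.

```r
mu <- 120
sigma <- 20

# 1st quartile: the value with 25% of the distribution below it
q1 <- qnorm(0.25, mean = mu, sd = sigma)

# Level exceeded by 10% of adults: 90% of the distribution lies below it
top10 <- qnorm(0.90, mean = mu, sd = sigma)

# Same results by finding z first and then un-standardizing
q1_via_z    <- mu + qnorm(0.25) * sigma
top10_via_z <- mu + qnorm(0.90) * sigma

print(round(c(q1 = q1, top10 = top10), 1))
```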

Z-scores can be used to compare observations from different normal distributions

- Example:
- You score 650 on the SAT, which has µ = 500 and σ = 100, and 30 on the ACT, which has µ = 21.0 and σ = 4.7. On which test did you perform better?

- Compare z-scores
SAT: z = (650 − 500)/100 = 1.50    ACT: z = (30 − 21.0)/4.7 = 1.91

- Since your z-score is greater for the ACT, you performed better on this exam

Chapter 6: Probability Distributions

Section 6.3: How Can We Find Probabilities When Each Observation Has Two Possible Outcomes?

- The Binomial Distribution
- Conditions for a Binomial Distribution
- Probabilities for a Binomial Distribution
- Factorials
- Examples using Binomial Distribution
- Do the Binomial Conditions Apply?
- Mean and Standard Deviation of the Binomial Distribution
- Normal Approximation to the Binomial

- Each observation is binary: it has one of two possible outcomes.
- Examples:
- Accept, or decline an offer from a bank for a credit card.
- Have, or do not have, health insurance.
- Vote yes or no on a referendum.

- Each of n trials has two possible outcomes: “success” or “failure”.
- Each trial has the same probability of success, denoted by p.
- The n trials are independent.
- Let X be the number of successes in the n trials. Then, X has a Binomial Distribution.

- Denote the probability of success on a trial by p.
- For n independent trials, the probability of x successes equals:
P(x) = [n!/(x!(n − x)!)]·p^x·(1 − p)^(n − x), for x = 0, 1, 2, …, n

See def. of n!

Rules for factorials:

- n!=n*(n-1)*(n-2)…2*1
- 1!=1
- 0!=1
For example,

- 4!=4*3*2*1=24

- John Doe claims to possess ESP, extrasensory perception.
- An experiment is conducted:
- A person in one room picks one of the integers 1, 2, 3, 4, 5 at random.
- In another room, John Doe identifies the number he believes was picked.
- Three trials are performed for the experiment.
- Doe got the correct answer twice.

If John Doe does not actually have ESP and is actually guessing the number, what is the probability that he’d make a correct guess on two of the three trials?

- Let S = Success and F = Failure. The possible outcomes for the three trials are SSS, SSF, SFS, FSS, SFF, FSF, FFS and FFF. They are not equally likely, because P(S) = 0.2 and P(F) = 0.8 on each trial. The three ways John Doe could make two correct guesses in three trials are SSF, SFS, and FSS, each with probability (0.2)(0.2)(0.8) = 0.032.
- The probability of two correct guesses is then P(SSF, SFS, or FSS) = 3(0.032) = 0.096, an “or” probability.

- The probability of exactly 2 correct guesses is the binomial probability with n = 3 trials, x = 2 correct guesses and p = 0.2 probability of a correct guess: P(2) = [3!/(2!1!)](0.2)^2(0.8)^1 = 0.096

TI calculator:

2nd Vars

0:binompdf(n,p,x)

binompdf(3,.2,2)=0.096
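
The ESP probability can be checked three ways in R: by listing all eight outcomes, by the binomial formula, and with the built-in `dbinom`. This is a sketch with illustrative variable names; successes are coded as 1 and failures as 0.

```r
p <- 0.2   # probability of a correct guess when guessing among 5 integers

# Enumerate all 2^3 outcomes of three trials (1 = success, 0 = failure)
outcomes <- expand.grid(trial1 = 0:1, trial2 = 0:1, trial3 = 0:1)
successes <- rowSums(outcomes)
prob <- p^successes * (1 - p)^(3 - successes)   # probability of each outcome

p_two_enum    <- sum(prob[successes == 2])         # SSF, SFS, FSS
p_two_formula <- choose(3, 2) * p^2 * (1 - p)^1    # binomial formula
p_two_dbinom  <- dbinom(2, size = 3, prob = p)     # built-in binomial probability

print(c(p_two_enum, p_two_formula, p_two_dbinom))
```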

- 1000 employees, 50% Female
- 10 employees were chosen for management training. None of these were female. Is the selection procedure random?

- The probability that no females are chosen is: P(0) = [10!/(0!10!)](0.5)^0(0.5)^10 = 0.001
- TI calculator: binompdf(10,.5,0)=9.765625E-4=0.0009765625
- It is very unlikely (one chance in a thousand) that none of the 10 selected for management training would be female if the employees were chosen randomly

- Before using the binomial distribution, check that its three conditions apply:
- Binary data (success or failure).
- The same probability of success for each trial (denoted by p).
- Independent trials.

- The data are binary (male, female).
- If employees are selected randomly, the probability of selecting a female on a given trial is 0.50.
- With random sampling of 10 employees from a large population, the outcome of one trial does not depend on the outcome of another trial

- The binomial probability distribution for n trials with probability p of success on each trial has mean µ and standard deviation σ given by:
µ = np, σ = √(np(1 − p))

- Data:
- 262 police car stops in Philadelphia in 1997.
- 207 of the drivers stopped were African-American.
- In 1997, Philadelphia’s population was 42.2% African-American.
- Does the number of African-Americans stopped suggest possible bias, being higher than we would expect (other things being equal, such as the rate of violating traffic laws)? Use the 68-95-99.7 Empirical Rule.

- Assume:
- 262 car stops represent n = 262 trials.
- Successive police car stops are independent.
- P(driver is African-American) is p = 0.422.

- Calculate the mean and standard deviation of this binomial distribution:
µ = np = 262(0.422) ≈ 111, σ = √(np(1 − p)) = √(262(0.422)(0.578)) ≈ 8

- Recall: Empirical Rule
- When a distribution is bell-shaped, close to 100% of the observations fall within 3 standard deviations of the mean.

- If there is no racial profiling, we would not be surprised if between about 87 and 135 of the 262 drivers stopped were African-American.
- The actual number stopped (207) is well above these values.
- The number of African-Americans stopped is too high, even taking into account random variation.
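
The arithmetic behind the "87 to 135" range can be sketched in R under the assumptions listed above (independent stops, p = 0.422). Variable names are illustrative.

```r
n <- 262          # number of police car stops
p <- 0.422        # assumed probability that a stopped driver is African-American
observed <- 207   # number actually observed

mu <- n * p                      # binomial mean
sigma <- sqrt(n * p * (1 - p))   # binomial standard deviation

# Nearly all of a bell-shaped distribution lies within 3 standard deviations
interval <- mu + c(-3, 3) * sigma

# How many standard deviations above the mean is the observed count?
z <- (observed - mu) / sigma

print(round(c(mean = mu, sd = sigma, lower = interval[1], upper = interval[2], z = z), 1))
```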

- Limitation of the analysis:
- Different people do different amounts of driving, so we don’t really know that 42.2% of the potential stops were African-American.

- An observed value may be larger than expected. How can we identify such values?
- An observed value x of a random variable X is said to be unusually high if P(X ≥ x) is very small, say < 0.05.
- Example: Toss a balanced coin 10 times and 8 are heads. Let X = # of heads. Is X = 8 unusually high?
- We calculate
P(X ≥ 8) = P(X = 8) + P(X = 9) + P(X = 10) = 0.055 > 0.05,

so X = 8 is not unusually high.

- An observed value x of a random variable X is said to be unusually low if P(X ≤ x) is very small, say < 0.05.
- Is X = 2 unusually low? P(X ≤ 2) = 0.055 > 0.05, so No.
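
Both tail probabilities in the coin example can be computed in R. A minimal sketch, using `dbinom` for individual probabilities and `pbinom` for cumulative ones; the 0.05 cutoff is the one stated above and the variable names are illustrative.

```r
n <- 10    # tosses
p <- 0.5   # balanced coin

# Upper tail: P(X >= 8) = P(8) + P(9) + P(10)
p_high <- sum(dbinom(8:10, size = n, prob = p))
p_high_alt <- 1 - pbinom(7, size = n, prob = p)   # same value from the cumulative distribution

# Lower tail: P(X <= 2)
p_low <- pbinom(2, size = n, prob = p)

unusually_high <- p_high < 0.05
unusually_low  <- p_low < 0.05

print(round(c(p_high = p_high, p_low = p_low), 3))
```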

- The binomial distribution can be well approximated by the normal distribution when the expected number of successes, np, and the expected number of failures, n(1-p) are both at least 15.

Chapter 7: Sampling Distributions

Section 7.1

How Likely Are the Possible Values of a Statistic? The Sampling Distribution

- Problems 7.1 to 7.34 (Even)
- Skip problems that need simulation.
- Hawkes: 7-2 and 7-3

- Statistic vs. Parameter
- Sampling Distributions
- Mean and Standard Deviation of the Sampling Distribution of a Proportion
- Standard Error
- Sampling Distribution Example
- Population, Data, and Sampling Distributions

- A statistic is a numerical summary of sample data such as a sample proportion or sample mean
- A parameter is a numerical summary of a population such as a population proportion or population mean.
- In practice, we seldom know the values of parameters.
- Parameters are estimated using sample data.
- We use sample statistics to estimate the corresponding population parameters.

Example:

- Prior to counting the votes, the proportion in favor of recalling Governor Gray Davis was an unknown parameter.
- An exit poll of 3160 voters reported that the sample proportion in favor of a recall was 0.54.
- If a different random sample of about 3000 voters were selected, a different sample proportion would occur.
The sampling distribution of the sample proportion shows all possible values and the probabilities for those values.

- The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take.
- Sampling distributions describe the variability that occurs from study to study using statistics to estimate population parameters
- Sampling distributions help to predict how close a statistic falls to the parameter it estimates

- For a random sample of size n from a population with proportion p of outcomes in a particular category, the sampling distribution of the proportion of the sample in that category has mean = p and standard deviation = √(p(1 − p)/n)

- To distinguish the standard deviation of a sampling distribution from the standard deviation of an ordinary probability distribution, we refer to it as a standard error.

- If the population proportion supporting the re-election of Schwarzenegger was 0.50, would it have been unlikely to observe the exit-poll sample proportion of 0.565?
- Based on your answer, would you be willing to predict that Schwarzenegger would win the election?

- Given that the exit poll had 2705 people and assuming 50% support the reelection of Schwarzenegger,
- Find the mean and the standard error of the sampling distribution of the sample proportion:
mean = p = 0.50, standard error = √(0.50(0.50)/2705) = 0.0096

- The sample proportion of 0.565 is more than six standard errors from the expected value of 0.50.
- The sample proportion of 0.565 voting for reelection of Schwarzenegger would be very unlikely if the population proportion were p = 0.50 or p < 0.50
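
The "more than six standard errors" statement can be verified with a short R sketch, assuming the hypothesized p = 0.50 and the exit-poll size of 2705. Variable names are illustrative.

```r
p0 <- 0.50      # hypothesized population proportion
n <- 2705       # exit-poll sample size
phat <- 0.565   # observed sample proportion

se <- sqrt(p0 * (1 - p0) / n)   # standard error of the sample proportion under p0
z <- (phat - p0) / se           # number of standard errors above the expected value
p_tail <- 1 - pnorm(z)          # normal-approximation chance of a value this large or larger

print(c(se = se, z = z, tail = p_tail))
```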

- Population distribution: This is the probability distribution from which we take the sample.
- Values of its parameters are usually unknown. They’re what we’d like to learn about.

- Data distribution: This is the distribution of the sample data. It’s the distribution we actually see in practice.
- It’s described by statistics
- With random sampling, the larger the sample size n, the more closely the data distribution resembles the population distribution

- In the 2006 U.S. Senate election in NY
- An exit poll of 1336 voters showed
- 67% (895) voted for Clinton
- 33% (441) voted for Spencer

- When all 4.1 million votes were tallied
- 68% voted for Clinton
- 32% voted for Spencer

- Let X= vote outcome, with x=1 for Clinton and x=0 for Spencer

- The population distribution is the 4.1 million values of the x vote variable, 32% of which are 0 and 68% of which are 1.
- The data distribution is the 1336 values of the x vote for the exit poll, 33% of which are 0 and 67% of which are 1.
- The sampling distribution of the sample proportion is approximately a normal distribution with mean p = 0.68 and standard error √(0.68(0.32)/1336) = 0.013
- Only the sampling distribution is bell-shaped; the others are discrete and concentrated at the two values 0 and 1.

Chapter 7: Sampling Distributions

Section 7.2

How Close Are Sample Means to Population Means?

- The Sampling Distribution of the Sample Mean
- Effect of n on the Standard Error
- Central Limit Theorem (CLT)
- Calculating Probabilities of Sample Means

- The sample mean, x̄, is a random variable.
- The sample mean varies from sample to sample.
- By contrast, the population mean, µ, is a single fixed number.

- For a random sample of size n from a population having mean µ and standard deviation σ, the sampling distribution of the sample mean has:
- Center described by the mean µ (the same as the mean of the population).
- Spread described by the standard error, which equals the population standard deviation divided by the square root of the sample size:
- standard error of x̄ = σ/√n

- Daily sales at a pizza restaurant vary from day to day.
- The sales figures fluctuate around a mean µ = $900 with a standard deviation σ = $300.
- What are the center and spread of the sampling distribution of the average sales in a week?

- Knowing how to find a standard error gives us a mechanism for understanding how much variability to expect in sample statistics “just by chance.”
- The standard error of the sample mean = σ/√n
- As the sample size n increases, the denominator increases, so the standard error decreases.
- With larger samples, the sample mean is more likely to fall closer to the population mean.
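
The effect of n on the standard error is easy to see numerically. The sketch below uses the pizza example (σ = $300) and assumes, as an illustration, that a week means n = 7 daily sales figures; the helper name `std_error` is hypothetical.

```r
# Standard error of the sample mean: sigma / sqrt(n)
std_error <- function(sigma, n) sigma / sqrt(n)

# Pizza example: mean stays at 900, spread shrinks with n (assuming 7 sales days per week)
se_week <- std_error(300, 7)

# The standard error decreases as the sample size grows
ns <- c(1, 7, 30, 100, 400)
se_table <- data.frame(n = ns, se = std_error(300, ns))
print(se_table)
```

Quadrupling the sample size halves the standard error, because n enters through its square root.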

- Question: How does the sampling distribution of the sample mean relate with respect to shape, center, and spread to the population distribution from which the samples were taken?
- For random sampling with a large sample size n, the sampling distribution of the sample mean is approximately a normal distribution.
- This result applies no matter what the shape of the probability distribution from which the samples are taken.

- The sampling distribution of the sample mean takes more of a bell shape as the random sample size n increases.
- The more skewed the population distribution, the larger n must be for CLT to work.
- In practice, the sampling distribution is usually close to normal when the sample size n is at least 30.
- If the population distribution is normal, then the sampling distribution is normal for all sample sizes.

- CLT: For large n, the sampling distribution is approximately normal even if the population distribution is not.
- This enables us to make inferences about population means regardless of the shape of the population distribution.
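
A small simulation can illustrate the CLT. The sketch below is an assumption-laden example, not part of the course material: it draws samples from a right-skewed exponential population with mean 1 and standard deviation 1 (an arbitrary choice), uses an arbitrary seed, and defines hypothetical helpers `sim_means` and `skewness`.

```r
set.seed(229)   # arbitrary seed so the run is reproducible

# Simulate reps sample means, each from a sample of size n
# drawn from an exponential population (mean 1, sd 1, right skewed)
sim_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness: near 0 for a symmetric, bell-shaped distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

results <- sapply(c(2, 30, 100), function(n) {
  m <- sim_means(n)
  c(n = n, mean = mean(m), se = sd(m), theory_se = 1 / sqrt(n), skew = skewness(m))
})
print(round(t(results), 3))
```

The simulated standard errors should track σ/√n, and the skewness of the sample means should shrink toward 0 as n grows, which is the bell shape the CLT describes.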

- The distribution of weights of milk bottles is normally distributed with a mean of 1.1 lbs and a standard deviation σ = 0.20 lbs.
- What is the probability that the mean of a random sample of 5 bottles will be greater than 0.99 lbs?
- Calculate the mean and standard error for the sampling distribution of a random sample of 5 milk bottles
- Because the population is normal, x̄ is normally distributed with mean = 1.1 and standard error = 0.20/√5 = 0.0894

- P(x̄ > 0.99) = P(Z > (0.99 − 1.1)/0.0894) = P(Z > −1.23) = 0.8907

- Closing prices of stocks have a right skewed distribution with a mean (µ) of $25 and σ = $20.
- What is the probability that the mean of a random sample of 40 stocks will be less than $20?
- Calculate the mean and standard error for the sampling distribution of a random sample of 40 stocks
- By the CLT, x̄ is approximately normal with mean = 25 and standard error = 20/√40 = 3.1623

- P(x̄ < 20) = P(Z < (20 − 25)/3.1623) = P(Z < −1.58) = 0.0571

- An automobile insurer has found that repair claims have a mean of $920 and a standard deviation of $870. Suppose that the next 100 claims can be regarded as a random sample from the long-run claims process.
- What is the probability that the average of the 100 claims is larger than $900?

Example: the distribution of actual weights of 8 oz. wedges of cheddar cheese produced by a certain company is normal with mean µ = 8.1 oz. and standard deviation σ = 0.1 oz.

- Find the value x such that there is only a 10% chance that the average weight of a sample of five wedges will be above x.

Example: the distribution of actual weights of 8 oz. wedges of cheddar cheese produced by a certain company is normal with mean µ = 8.1 oz. and standard deviation σ = 0.1 oz.

- Find the value x such that there is only a 5% chance that the average weight of a sample of five wedges will be below x.
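
One way to approach the two cheese questions is to apply `qnorm` to the sampling distribution of x̄ rather than to the population. The following R sketch assumes the stated normal population and a sample of n = 5 wedges; variable names are illustrative.

```r
mu <- 8.1     # population mean weight in oz.
sigma <- 0.1  # population standard deviation in oz.
n <- 5        # wedges per sample

se <- sigma / sqrt(n)   # standard error of the sample mean

# 10% chance that the sample mean is above x: 90% of the distribution lies below x
x_upper_10 <- qnorm(0.90, mean = mu, sd = se)

# 5% chance that the sample mean is below x
x_lower_5 <- qnorm(0.05, mean = mu, sd = se)

print(round(c(se = se, upper_10 = x_upper_10, lower_5 = x_lower_5), 3))
```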

Chapter 7: Sampling Distributions

Section 7.3

How Can We Make Inferences About a Population?

- Using the CLT to Make Inferences
- Standard Errors in Practice
- Sampling Distribution for a Proportion

Implications of the CLT

- When the sampling distribution of the sample mean is approximately normal, x̄ falls within 2 standard errors of µ with probability close to 0.95 and almost certainly falls within 3 standard errors of µ. (Empirical Rule)
- For large n, the sampling distribution of x̄ is approximately normal no matter what the shape of the underlying population distribution.

In practice, standard errors are estimated

- Standard errors have exact values depending on parameter values, e.g.,
- √(p(1 − p)/n) for a sample proportion
- σ/√n for a sample mean

- In practice, these parameter values are unknown. Inference methods use standard errors that substitute sample values for the parameters in the exact formulas above
These estimated standard errors are the numbers we use in practice.

- The binomial probability distribution is the sampling distribution for the number of successes in n independent trials
- In practice, the sample proportion of successes is the statistic usually reported
- Since the sample proportion is simply the number of successes divided by the number of trials, the formulas for the mean and standard deviation of the sampling distribution of the proportion of successes are the formulas for the mean and standard deviation of the number of successes divided by n.

- For a binomial random variable with n trials and probability p of success for each, the sampling distribution of the proportion of successes has
- Mean = p
- Standard error = √(p(1 − p)/n)

- For large n, by CLT, the sampling distribution can be approximated by a normal distribution with the same mean and the same standard error.

Chapter 8: Statistical Inference: Confidence Intervals

Section 8.1

What are Point and Interval Estimates of Population Parameters?

- Strong suggestion to visit: http://www.studio4learning.tv
- Try Math -> Statistics
- 8.1 to 8.60 All even-numbered questions
- Note: If a problem needs a simulation, the problem is optional.
- Hawkes: 8-1, 8-2, 8-3, 8-4

- Point Estimate and Interval Estimate
- Properties of Point Estimators
- Confidence Intervals
- Logic of Confidence Intervals
- Margin of Error
- Example

- A point estimate is a single number that is our “best guess” for the parameter
- An interval estimate is an interval of numbers within which the parameter value is believed to fall.

- A point estimate doesn’t tell us how close the estimate is likely to be to the parameter
- An interval estimate is more useful
- It incorporates a margin of error which helps us to gauge the accuracy of the point estimate

- Property 1: A good estimator has a sampling distribution that is centered at the parameter
- An estimator with this property is unbiased
- The sample mean is an unbiased estimator of the population mean
- The sample proportion is an unbiased estimator of the population proportion

- Property 2: A good estimator has a small standard error compared to other estimators
- This means it tends to fall closer than other estimates to the parameter
- The sample mean has a smaller standard error than the sample median when estimating the population mean of a normal distribution

- A confidence interval is an interval containing the most believable values for a parameter
- The probability that this method produces an interval that contains the parameter is called the confidence level
- This is a number chosen to be close to 1, most commonly 0.95.

- To construct a confidence interval for a population proportion, start with the sampling distribution of a sample proportion, which
- Gives the possible values for the sample proportion and their probabilities
- Is approximately a normal distribution for large random samples by the CLT
- Has mean equal to the population proportion
- Has standard deviation called the standard error

- Fact: Approximately 95% of a normal distribution falls within 1.96 standard deviations of the mean
- With probability 0.95, the sample proportion falls within about 1.96 standard errors of the population proportion
- The distance of 1.96 standard errors is the margin of error in calculating a 95% confidence interval for the population proportion

- The margin of error measures how accurate the point estimate is likely to be in estimating a parameter
- It is a multiple of the standard error of the sampling distribution of the estimate when the sampling distribution is a normal distribution.
- The distance of 1.96 standard errors is the margin of error for a 95% confidence interval for a parameter from a normal distribution

Example: The GSS asked 1823 respondents whether they agreed with the statement “It is more important for a wife to help her husband’s career than to have one herself”. 19% agreed. Assuming the standard error is 0.01, calculate a 95% confidence interval for the population proportion who agreed with the statement

- Margin of error = 1.96*se=1.96*0.01=0.02
- 95% CI = 0.19±0.02 or (0.17 to 0.21)
We predict that the population proportion who agreed is somewhere between 0.17 and 0.21.

Chapter 8: Statistical Inference: Confidence Intervals

Section 8.2

How Can We Construct a Confidence Interval to Estimate a Population Proportion?

- Finding the 95% Confidence Interval for a Population Proportion
- Sample Size Needed for Large-Sample Confidence Interval for a Proportion
- How Can We Use Confidence Levels Other than 95%?
- What is the Error Probability for the Confidence Interval Method?
- Summary
- Effect of the Sample Size
- Interpretation of the Confidence Level

- We symbolize a population proportion by p
- The point estimate of the population proportion is the sample proportion
- We symbolize the sample proportion by p̂

- A 95% confidence interval uses a margin of error = 1.96(standard errors)
- CI = [point estimate ± margin of error] = p̂ ± 1.96(standard errors) for a 95% confidence interval

- The exact standard error of a sample proportion equals √(p(1 − p)/n)
- This formula depends on the unknown population proportion, p
- In practice, we don’t know p, and we need to estimate the standard error as se = √(p̂(1 − p̂)/n)

- A 95% confidence interval for a population proportion p is: p̂ ± 1.96(se), with se = √(p̂(1 − p̂)/n)

- In 2000, the GSS asked: “Are you willing to pay much higher prices in order to protect the environment?”
- Of n = 1154 respondents, 518 were willing to do so

- Find and interpret a 95% confidence interval for the population proportion of adult Americans willing to do so at the time of the survey
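
The interval for the GSS environment question can be worked out step by step in R. This is a sketch of the large-sample formula p̂ ± 1.96·se with illustrative variable names; it first checks that there are at least 15 successes and 15 failures.

```r
n <- 1154   # respondents
x <- 518    # respondents willing to pay much higher prices

# Large-sample method needs at least 15 successes and 15 failures
stopifnot(x >= 15, n - x >= 15)

phat <- x / n                       # point estimate of the population proportion
se <- sqrt(phat * (1 - phat) / n)   # estimated standard error
me <- 1.96 * se                     # margin of error for 95% confidence
ci <- phat + c(-1, 1) * me          # lower and upper limits

print(round(c(phat = phat, se = se, lower = ci[1], upper = ci[2]), 3))
```

The interval is read as a range of believable values for the population proportion at the time of the survey.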

TI Calculator

Press Stats,

- For the 95% confidence interval for a proportion p to be valid, you should have at least 15 successes and 15 failures: n·p̂ ≥ 15 and n(1 − p̂) ≥ 15

- “95% confidence“ means that there’s a 95% chance that a sample proportion value occurs such that the confidence interval contains the unknown value of the population proportion, p
- With probability 0.05, the method produces a confidence interval that misses p

- In practice, the confidence level 0.95 is the most common choice
- But some applications require greater (or less) confidence
- To increase the chance of a correct inference, we use a larger confidence level, such as 0.99

- In using confidence intervals, we must compromise between the desired margin of error and the desired confidence of a correct inference
- As the desired confidence level increases, the margin of error gets larger

- A recent GSS asked “If the wife in a family wants children, but the husband decides that he does not want any children, is it all right for the husband to refuse to have children?”
- Of 598 respondents, 366 said yes
- Calculate the 99% confidence interval

- Exit poll: Out of 1400 voters, 660 voted for the Democratic candidate.
- Calculate a 95% and a 99% Confidence Interval
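
Both exercises follow the same recipe with a different z-score, so a small helper makes the pattern visible. The function name `prop_ci` is hypothetical; it is a sketch of the large-sample interval p̂ ± z·√(p̂(1 − p̂)/n), with z taken from `qnorm`.

```r
# Large-sample confidence interval for a proportion (hypothetical helper)
prop_ci <- function(x, n, level = 0.95) {
  stopifnot(x >= 15, n - x >= 15)     # need at least 15 successes and 15 failures
  phat <- x / n
  z <- qnorm(1 - (1 - level) / 2)     # e.g. 1.96 for 95%, 2.58 for 99%
  me <- z * sqrt(phat * (1 - phat) / n)
  c(lower = phat - me, upper = phat + me)
}

ci_gss_99  <- prop_ci(366, 598, level = 0.99)    # GSS question, 99% confidence
ci_poll_95 <- prop_ci(660, 1400, level = 0.95)   # exit poll, 95% confidence
ci_poll_99 <- prop_ci(660, 1400, level = 0.99)   # exit poll, 99% confidence

print(round(rbind(ci_gss_99, ci_poll_95, ci_poll_99), 3))
```

For the exit poll, the 99% interval is wider than the 95% interval, which is the trade-off between confidence and margin of error described above.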

- The general formula for the confidence interval for a population proportion is:
Sample proportion ± (z-score)(std. error)

which in symbols is p̂ ± z·√(p̂(1 − p̂)/n)

- A confidence interval for a population proportion p is: p̂ ± z·√(p̂(1 − p̂)/n)
- Assumptions
- Data obtained by randomization
- A large enough sample size n so that the number of successes, n·p̂, and the number of failures, n(1 − p̂), are both at least 15

- The margin of error for a confidence interval:
- Increases as the confidence level increases
- Decreases as the sample size increases

- If we used the 95% confidence interval method to estimate many population proportions, then in the long run about 95% of those intervals would give correct results, containing the population proportion

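# Simulate `rep` confidence intervals for a proportion and plot each one.
# Intervals that miss the true p are drawn in red.
CI <- function(p = 0.3, n = 40, level = 0.95, rep = 1000) {
  A <- matrix(0, rep, 2)                 # row i holds the endpoints of interval i
  plot(1, 1, type = "n", xlim = c(1, rep), ylim = c(0, 1),
       xlab = "simulation", ylab = "CI")
  abline(h = p)                          # horizontal line at the true proportion
  legend(1, 1,
         legend = c(paste("p =", p), paste("n =", n), paste("Level =", level)),
         text.col = "blue", bty = "n")
  z <- qnorm(1 - (1 - level) / 2)        # z-score for the chosen level
  for (i in 1:rep) {
    phat <- mean(rbinom(n, 1, p))        # sample proportion from n Bernoulli trials
    E <- z * sqrt(phat * (1 - phat) / n) # margin of error
    a <- A[i, ] <- c(phat - E, phat + E)
    covers <- (p - a[1]) * (p - a[2]) <= 0   # TRUE when the interval contains p
    lines(c(i, i), a, lwd = 4, col = if (covers) "black" else "red")
    Sys.sleep(0.1)                       # pause so the plot builds up visibly
  }
  print(A)
  # Proportion of simulated intervals that contain p
  ActualLevel <- mean(apply(A, 1, function(a) (p - a[1]) * (p - a[2]) <= 0))
  list("Nominal Confidence Level" = level, "Actual Level" = ActualLevel)
}

CI(p = 0.3, n = 40, level = 0.95, rep = 80)  # rep = 5000 is suggested for a stabler estimate of the actual level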

Chapter 8: Statistical Inference: Confidence Intervals

Section 8.3

How Can We Construct a Confidence Interval to Estimate a Population Mean?

- How to Construct a Confidence Interval for a Population Mean
- Properties of the t Distribution
- Formula for 95% Confidence Interval for a Population Mean
- How Do We Find a t Confidence Interval for Other Confidence Levels?
- If the Population is Not Normal, is the Method “Robust”?
- The Standard Normal Distribution is the t Distribution with df = ∞

- Point estimate ± margin of error
- The sample mean is the point estimate of the population mean
- The exact standard error of the sample mean is \(\sigma/\sqrt{n}\)
- In practice, we estimate σ by the sample standard deviation, s

- For large n… from any population
and also

- For small n from an underlying population that is normal…
- The confidence interval for the population mean is: \(\bar{x} \pm z\,\sigma/\sqrt{n}\)

- In practice, we don’t know the population standard deviation
- Substituting the sample standard deviation s for σ to get \(se = s/\sqrt{n}\) introduces extra error
- To account for this increased error, we replace the z-score by a slightly larger score, the t-score

- The t-distribution is bell shaped and symmetric about 0
- The probabilities depend on the degrees of freedom, df=n-1
- The t-distribution has thicker tails than the standard normal distribution, i.e., it is more spread out


- When the standard deviation of the population is unknown, a 95% confidence interval for the population mean µ is: \(\bar{x} \pm t_{.025}\,(se)\), where \(se = s/\sqrt{n}\) and df = n − 1
- To use this method, you need:
- Data obtained by randomization
- An approximately normal population distribution

Do you tend to get a higher, or a lower, price if you give bidders the “buy-it-now” option?

- Consider some data from sales of the Palm M515 PDA (personal digital assistant)
- During the first week of May 2003, 25 of these handheld computers were auctioned off, 7 of which had the “buy-it-now” option

- Summary of selling prices for the two types of auctions:

- Let µ denote the population mean for the “buy-it-now” option
- The estimate of µ is the sample mean: \(\bar{x}\) = $233.57
- The sample standard deviation: s = $14.64
- Table B df=6, with 95% Confidence: t = 2.447
233.57 ± 13.54 or (220.03, 247.11)
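
The “buy-it-now” interval can be reproduced from the summary statistics alone. This is a minimal R sketch; it uses `qt` in place of Table B, and the variable names are illustrative.

```r
# 95% t confidence interval for a mean from summary statistics
xbar <- 233.57   # sample mean of the 7 "buy-it-now" selling prices
s <- 14.64       # sample standard deviation
n <- 7

se <- s / sqrt(n)                   # estimated standard error of the mean
t_score <- qt(0.975, df = n - 1)    # t-score with df = 6 (about 2.447)
margin <- t_score * se              # margin of error
ci <- c(lower = xbar - margin, upper = xbar + margin)
print(round(ci, 2))
```

The margin of error comes out at about 13.54, matching the interval (220.03, 247.11).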

- The 95% confidence interval for the mean sales price for the bidding only option is:
(220.70, 242.52)

- Notice that the two intervals overlap a great deal:
- “Buy-it-now”:(220.03, 247.11)
- Bidding only: (220.70, 242.52)

- There is not enough information for us to conclude that one probability distribution clearly has a higher mean than the other

A study of 7 American adults from an SRS yields an average height of 67.2 inches and a standard deviation of 3.9 inches. Assuming the heights are normally distributed, a 95% confidence interval for the average height of all American adults (µ) is: \(67.2 \pm 2.447\,(3.9/\sqrt{7}) = 67.2 \pm 3.6\)

“We are 95% confident that the average height of all American adults is between 63.6 and 70.8 inches.”

- In a time use study, 20 randomly selected managers spend a mean of 2.4 hours each day on paperwork. The standard deviation of the 20 times is 1.3 hours. Construct the 95% confidence interval for the mean paperwork time of all managers
- 95% CI: \(2.4 \pm 2.093\,(1.3/\sqrt{20}) = 2.4 \pm 0.61\), that is, 1.79 < µ < 3.01
Note that this calculation assumes that the paperwork times are normally distributed

- The 95% confidence interval uses t.025 since 95% of the probability falls between - t.025 and t.025
- For 99% confidence, the error probability is 0.01 with 0.005 in each tail and the appropriate t-score is t.005
- To get other confidence intervals use the appropriate t-value from Table B

- A basic assumption of the confidence interval using the t-distribution is that the population distribution is normal
- Many variables have distributions that are far from normal
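- A method is robust with respect to an assumption if it still performs adequately when that assumption is violated; the t confidence interval is robust with respect to the normality assumption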

- How problematic is it if we use the t- confidence interval even if the population distribution is not normal?
- For large random samples, it’s not problematic because of the Central Limit Theorem

- What if n is small?
- Confidence intervals using t-scores usually work quite well except for when extreme outliers are present. The method is robust

Chapter 8: Statistical Inference: Confidence Intervals

Section 8.4

How Do We Choose the Sample Size for a Study?

- Sample Size for Estimating a Population Proportion
- Sample Size for Estimating a Population Mean
- What Factors Affect the Choice of the Sample Size?
- What if You Have to Use a Small n?
- Confidence Interval for a Proportion with Small Samples

To determine the sample size,

- First, we must decide on the desired margin of error
- Second, we must choose the confidence level for achieving that margin of error
- In practice, 95% confidence intervals are most common

- The random sample size n for which a confidence interval for a population proportion p has margin of error m (such as m = 0.04) is \(n = \dfrac{\hat{p}(1-\hat{p})\,z^2}{m^2}\)
- In the formula for determining n, setting \(\hat{p}\) = 0.50 gives the largest value for n out of all the possible values of \(\hat{p}\)

- A television network plans to predict the outcome of an election between two candidates – Levin and Sanchez
- A poll one week before the election estimates 58% prefer Levin

- What is the sample size for which a 95% confidence interval for the population proportion has margin of error equal to 0.04?

- The z-score is based on the confidence level, such as z = 1.96 for 95% confidence
- The 95% confidence interval for a population proportion p is: \(\hat{p} \pm 1.96\,(se)\)
- If the sample size is such that 1.96(se) = 0.04, then the margin of error will be 0.04

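- Using 0.58 as an estimate for p,
\(n = \dfrac{(1.96)^2(0.58)(0.42)}{(0.04)^2} = 584.9\), or n = 585

- Without guessing, using \(\hat{p}\) = 0.50,
\(n = \dfrac{(1.96)^2(0.50)(0.50)}{(0.04)^2} = 600.25\), so n = 601 gives us a more conservative estimate (always round up)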
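
The two sample sizes for the election poll can be reproduced with a short helper. This is a sketch only; `sample_size_prop` and its argument names are illustrative, and `p_guess` defaults to the conservative value 0.5.

```r
# Sample size so that a confidence interval for a proportion has margin of error m:
# n = p(1 - p) * z^2 / m^2, rounded up
sample_size_prop <- function(m, level = 0.95, p_guess = 0.5) {
  z <- qnorm(1 - (1 - level) / 2)               # z-score for the confidence level
  ceiling(p_guess * (1 - p_guess) * z^2 / m^2)  # always round up
}

n_with_guess <- sample_size_prop(0.04, p_guess = 0.58)  # uses the earlier poll estimate
n_conservative <- sample_size_prop(0.04)                # uses p = 0.5
print(c(n_with_guess, n_conservative))
```

Note that for 99% confidence `qnorm` returns about 2.576 rather than the rounded table value 2.58, so results can differ slightly from hand calculations that use 2.58.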

- Suppose a soft drink bottler wants to estimate the proportion of its customers that drink another brand of soft drink on a regular basis
- What sample size will be required to enable us to have a 99% confidence interval with a margin of error of 1%?
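- Using z = 2.58 and the conservative value \(\hat{p}\) = 0.50: \(n = \dfrac{(2.58)^2(0.50)(0.50)}{(0.01)^2} = 16{,}641\). Thus, we will need to sample at least 16,641 of the soft drink bottler’s customers.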

- You want to estimate the proportion of home accident deaths that are caused by falls. How many home accident deaths must you survey in order to be 95% confident that your sample proportion is within 4% of the true population proportion?
- Answer: 601

- The random sample size n for which a confidence interval for a population mean has margin of error approximately equal to m is
\(n = \dfrac{\sigma^2 z^2}{m^2}\),
where the z-score is based on the confidence level, such as z = 1.96 for 95% confidence.

- In practice, you don’t know the value of the standard deviation, σ
- You must substitute an educated guess for σ
- Sometimes you can use the sample standard deviation from a similar study
- When no prior information is known, a crude estimate that can be used is to divide the estimated range of the data by 6 since for a bell-shaped distribution we expect almost all of the data to fall within 3 standard deviations of the mean

- A social scientist plans a study of adult South Africans to investigate educational attainment in the black community
- How large a sample size is needed so that a 95% confidence interval for the mean number of years of education has margin of error equal to 1 year? Assume that the education values will fall within a range of 0 to 18 years
- Crude estimate of σ = range/6 = 18/6 = 3
- Then \(n = \dfrac{(3)^2(1.96)^2}{(1)^2} \approx 34.6\), so a sample of n = 35 is needed

- Find the sample size necessary to estimate the mean height of all adult males to within 0.5 in. if we want 99% confidence in our results. From previous studies we estimate σ = 2.8.
- Answer: 209 (always round up)
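
Both sample sizes for a mean (the education study and the height exercise) follow from the same formula. The sketch below is one way to compute them; `sample_size_mean` is an illustrative helper name.

```r
# Sample size so that a confidence interval for a mean has margin of error m:
# n = sigma^2 * z^2 / m^2, rounded up
sample_size_mean <- function(m, sigma, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # z-score for the confidence level
  ceiling((z * sigma / m)^2)        # always round up
}

# Education study: sigma guessed as range/6 = 3, margin of error 1 year, 95% confidence
n_education <- sample_size_mean(m = 1, sigma = 3)

# Height exercise: sigma = 2.8, margin of error 0.5 in., 99% confidence
n_height <- sample_size_mean(m = 0.5, sigma = 2.8, level = 0.99)

print(c(n_education, n_height))
```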

- The first is the desired precision, as measured by the margin of error, m
- The second is the confidence level
- A third factor is the variability in the data
- If subjects have little variation (that is, σ is small), we need fewer data than if they have substantial variation

- A fourth factor is financial

- The t methods for a mean are valid for any n
- However, you need to be extra cautious to look for extreme outliers or great departures from the normal population assumption

- In the case of the confidence interval for a population proportion, the method works poorly for small samples because the normal approximation from the CLT is no longer adequate

- If a random sample does not have at least 15 successes and 15 failures, the confidence interval formula \(\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}\)
is still valid if we use it after adding 2 to the original number of successes and 2 to the original number of failures. This results in adding 4 to the sample size n
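
The adjustment described above (add 2 successes and 2 failures, then apply the usual formula) can be sketched as follows. The counts in the usage line are hypothetical, chosen only to illustrate a sample with fewer than 15 successes; `plus_four_ci` is an illustrative name.

```r
# Small-sample 95% confidence interval for a proportion:
# add 2 successes and 2 failures, then use the large-sample formula
plus_four_ci <- function(x, n) {
  x_adj <- x + 2                       # adjusted number of successes
  n_adj <- n + 4                       # adjusted sample size
  phat <- x_adj / n_adj                # adjusted sample proportion
  se <- sqrt(phat * (1 - phat) / n_adj)
  z <- qnorm(0.975)                    # z-score for 95% confidence
  c(lower = phat - z * se, upper = phat + z * se)
}

# Hypothetical data: 3 successes in 20 trials
small_ci <- plus_four_ci(3, 20)
print(round(small_ci, 3))
```

The interval is centered at (x + 2)/(n + 4) rather than at x/n, which pulls the estimate slightly toward 0.5.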

Chapter 8: Statistical Inference: Confidence Intervals

Section 8.5

How Do Computers Make New Estimation Methods Possible?

- The Bootstrap

- When it is difficult to derive a standard error or a confidence interval formula that works well you can use simulation.
- The bootstrap is a simulation method that resamples from the observed data. It treats the data distribution as if it were the population distribution

- To use the bootstrap method
- Resample, with replacement, n observations from the data distribution
- For the new sample of size n, construct the point estimate of the parameter of interest
- Repeat process a very large number of times (e.g., selecting 10,000 separate samples of size n and calculating the 10,000 corresponding parameter estimates)

Example:

- Suppose your data set includes the following:
This data has a mean of 161.44 and standard deviation of 0.63.

- Use the bootstrap method to find a 95% confidence interval for the population standard deviation

- Re-sample with replacement from this sample of size 10 and compute the standard deviation of the new sample
- Repeat this process 100,000 times. [Figure: histogram of the 100,000 resampled standard deviations]

- Now, identify the middle 95% of these 100,000 sample standard deviations (take the 2.5th and 97.5th percentiles).
- For this example, these percentiles are 0.26 and 0.80.
- The 95% bootstrap confidence interval for σ is (0.26, 0.80)
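
The bootstrap steps above can be sketched in a few lines of R. The ten data values below are hypothetical stand-ins, since the original data set is not reproduced here, so the resulting interval will not match (0.26, 0.80). The sketch uses 10,000 resamples rather than 100,000 to keep it fast.

```r
set.seed(229)  # fixed seed so the resampling is reproducible

# Hypothetical sample of size 10 (NOT the data from the slide)
x <- c(160.2, 161.1, 161.3, 161.4, 161.5, 161.6, 161.7, 161.8, 162.0, 162.4)

B <- 10000     # number of bootstrap resamples

# Resample with replacement and compute the standard deviation each time
boot_sd <- replicate(B, sd(sample(x, size = length(x), replace = TRUE)))

# The middle 95% of the resampled standard deviations: 2.5th and 97.5th percentiles
boot_ci <- quantile(boot_sd, c(0.025, 0.975))
print(round(boot_ci, 2))

hist(boot_sd, main = "Bootstrap standard deviations", xlab = "sd")
```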

- Open Excel -> Data Analysis -> Random Number Generation -> Fill in
Number of Variables = 10,000

Number of Random Numbers = the same size

Distribution = Discrete

Chapter 9: Statistical Inference: Significance Tests About Hypotheses

Section 9.1: What Are the Steps for Performing a Significance Test?

- HW #9
- Page 412: 2, 4, 8
- Page 426: 12, 14, 16, 18, 22, 24
- Page 439: 28, 30, 32, 34
- Page 445: 42, 44, 46, 48, 50
- Page 452: 52, 54, 56, 58
- Page 458: 60, 62, 64
- Hawkes: 9.1 - 9.4, 9.6

- 5 Steps of a Significance Test
- Assumptions
- Hypotheses
- Calculate the test statistic
- P-Value
- Conclusion and Statistical Significance

- A significance test is a method of using data to summarize the evidence about a hypothesis
- A significance test about a hypothesis has five steps
- Assumptions
- Hypotheses
- Test Statistic
- P-value
- Conclusion

- A (significance) test assumes that the data production used randomization
- Other assumptions may include:
- Assumptions about the sample size
- Assumptions about the shape of the population distribution

- A hypothesis is a statement about a population, usually of the form that a certain parameter takes a particular numerical value or falls in a certain range of values
- The main goal in many research studies is to check whether the data support certain hypotheses

- Each significance test has two hypotheses:
- The null hypothesis is a statement that the parameter takes a particular value.
- The alternative hypothesis states that the parameter falls in some alternative range of values.

- The value in the null hypothesis, called the claimed (or hypothesized) value, usually represents no effect
- The symbol H0 denotes the null hypothesis

- The value in the alternative hypothesis usually represents an effect of some type
- The symbol Ha denotes the alternative hypothesis
- The alternative hypothesis should express what the researcher hopes to show.

- The hypotheses should be formulated before viewing or analyzing the data!

- A test statistic describes how far the point estimate falls from the claimed value (usually in terms of the number of standard errors between the two).
- If the test statistic falls far from the claimed value in the direction specified by the alternative hypothesis, it is good evidence against the null hypothesis and in favor of the alternative hypothesis.
- We use the test statistic to assess the evidence against the null hypothesis by giving a probability, the P-value.

- To interpret a test statistic value, we use a probability summary of the evidence against the null hypothesis, H0
- First, we presume that H0 is true
- Next, we consider the sampling distribution from which the test statistic comes
- We summarize how far out in the tail of this sampling distribution the test statistic falls

- We summarize how far out in the tail the test statistic falls by the tail probability of that value and values even more extreme
- This probability is called a P-value
- The smaller the P-value, the stronger the evidence is against H0

Note: This is just one of 3 stories.

- The P-value is the probability that the test statistic equals the observed value or a value even more extreme in favor of Ha.
- It is calculated by presuming that the null hypothesis H0 is true
The smaller the P-value, the stronger the evidence the data provide against the null hypothesis. That is, a small P-value indicates a small likelihood of observing the sampled results if the null hypothesis were true.

- The conclusion of a significance test reports the P-value and interprets what it says about the question that motivated the test

Chapter 9: Statistical Inference: Significance Tests About Hypotheses

Section 9.2: Significance Tests About Proportions

- Steps of a Significance Test about a Population Proportion
- Example: One-Sided Hypothesis Test
- How Do We Interpret the P-value?
- Two-Sided Hypothesis Test for a Population Proportion
- Summary of P-values for Different Alternative Hypotheses
- Significance Level
- One-Sided vs Two-Sided Tests
- The Binomial Test for Small Samples

Step 1: Assumptions

- The variable is categorical
- The data are obtained using randomization
- The sample size is sufficiently large that the sampling distribution of the sample proportion is approximately normal:
- np ≥ 15 and n(1-p) ≥ 15

Step 2: Hypotheses

- The null hypothesis has the form:
- H0: p = p0

- The alternative hypothesis has the form:
- Ha: p > p0 (one-sided test) or
- Ha: p < p0 (one-sided test) or
- Ha: p ≠ p0 (two-sided test)

Step 3: Test Statistic

- The test statistic measures how far the sample proportion falls from the null hypothesis value, p0, relative to what we’d expect if H0 were true
- The test statistic is: \(z = \dfrac{\hat{p} - p_0}{se_0}\), where \(se_0 = \sqrt{p_0(1-p_0)/n}\)

Step 4: P-value

- The P-value summarizes the evidence
- It describes how unusual the observed data would be if H0 were true

Step 5: Conclusion

- We summarize the test by reporting and interpreting the P-value

An astrologer prepares horoscopes for 116 adult volunteers. Each subject also filled out a California Personality Index (CPI) survey. For a given adult, his or her horoscope is shown to the astrologer along with their CPI survey as well as the CPI surveys for two other randomly selected adults. The astrologer is asked which survey is the correct one for that adult

- With random guessing, p = 1/3
- The astrologers’ claim: p > 1/3
- The hypotheses for this test:
- H0: p = 1/3
- Ha: p > 1/3

Step 1: Assumptions

- The data is categorical – each prediction falls in the category “correct” or “incorrect”
- Subjects were randomly selected
- np=116(1/3) > 15
- n(1-p) = 116(2/3) > 15

Step 2: Hypotheses

- H0: p = 1/3
- Ha: p > 1/3

Step 3: Test Statistic:

In the actual experiment, the astrologers were correct with 40 of their 116 predictions (a success rate of 0.345)

- The test statistic is \(z = \dfrac{0.345 - 1/3}{\sqrt{(1/3)(2/3)/116}} = 0.26\)

Step 4: P-value

- The P-value is 0.40

Step 5: Conclusion

- The P-value of 0.40 is not especially small
- It does not provide strong evidence against H0:p = 1/3
- There is not strong evidence that astrologers have special predictive powers
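
The five steps for the astrology example can be traced numerically. This is a minimal R sketch of the large-sample z test using the counts from the example; the variable names are illustrative.

```r
# One-sided z test of H0: p = 1/3 against Ha: p > 1/3
n <- 116        # number of predictions
x <- 40         # number of correct predictions
p0 <- 1 / 3     # value of p under random guessing

# Step 1: check the sample size assumption under H0
stopifnot(n * p0 >= 15, n * (1 - p0) >= 15)

# Step 3: test statistic, using the standard error computed under H0
phat <- x / n
se0 <- sqrt(p0 * (1 - p0) / n)
z <- (phat - p0) / se0

# Step 4: P-value is the right-tail probability because Ha is p > p0
p_value <- pnorm(z, lower.tail = FALSE)

print(round(c(z = z, p_value = p_value), 2))
```

The result agrees with the slide: z is about 0.26 and the P-value is about 0.40.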

- A significance test analyzes the strength of the evidence against the null hypothesis
- The smaller the P-value, the stronger the evidence against the null hypothesis
- That is, if the P-value is small, the data contradict H0 and support Ha

- A two-sided alternative hypothesis has the form Ha: p ≠ p0
- The P-value is the two-tail probability under the standard normal curve
- We calculate this by finding the tail probability in a single tail and then doubling it

- Study: investigate whether dogs can be trained to distinguish a patient with bladder cancer by smelling compounds released in the patient’s urine

- Experiment:
- Each of 6 dogs was tested with 9 trials
- In each trial, one urine sample from a bladder cancer patient was randomly placed among 6 control urine samples

- Results:
In a total of 54 trials with the six dogs, the dogs made the correct selection 22 times (a success rate of 0.407)

- Does this study provide strong evidence that the dogs’ predictions were better or worse than with random guessing?

Step 1: Check the sample size requirement:

- Is the sample size sufficiently large to use the hypothesis test for a population proportion?
- Is np0 ≥ 15 and n(1-p0) ≥ 15?
- 54(1/7) = 7.7 and 54(6/7) = 46.3

- The first, np0, is not large enough
- We will see that the two-sided test is robust when this assumption is not satisfied

Step 2: Hypotheses

- H0:p = 1/7
- Ha:p ≠ 1/7

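Step 3: Test Statistic

- \(z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \dfrac{0.407 - 1/7}{\sqrt{(1/7)(6/7)/54}} \approx 5.6\)

Step 4: P-value

- The P-value is the two-tail probability beyond z ≈ 5.6, which is less than 0.0001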

Step 5: Conclusion

- Since the P-value is very small and the sample proportion is greater than 1/7, the evidence strongly suggests that the dogs’ selections are better than random guessing

- Insight:
- In this study, the subjects were a convenience sample rather than a random sample from some population
- Also, the dogs were not randomly selected
- Any inferential predictions are highly tentative. They are valid only to the extent that the patients and the dogs are representative of their populations
- The predictions become more conclusive if similar results occur in other studies
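
The two-sided test for the dog study can be checked numerically. The sketch below computes the large-sample z statistic from the counts in the example and, because np0 is below 15, also runs R's exact `binom.test` as a cross-check; that cross-check is an addition here, not something the slide performs.

```r
# Two-sided test of H0: p = 1/7 against Ha: p != 1/7
n <- 54         # total trials
x <- 22         # correct selections
p0 <- 1 / 7     # chance of a correct selection under random guessing

# Sample size check: n * p0 is about 7.7, which is below 15
expected_successes <- n * p0

# Large-sample z statistic and two-tail P-value
phat <- x / n
se0 <- sqrt(p0 * (1 - p0) / n)
z <- (phat - p0) / se0
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Exact binomial test as a cross-check for the small expected count
p_exact <- binom.test(x, n, p = p0, alternative = "two.sided")$p.value

print(c(z = z, p_value = p_value, p_exact = p_exact))
```

Both P-values are far below any common significance level, which is consistent with the conclusion that the dogs did better than random guessing.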

- Sometimes we need to make a decision about whether the data provide sufficient evidence to reject H0
- Before seeing the data, we decide how small the P-value would need to be to reject H0
- This cutoff point is called the significance level

- The significance level is a number such that we reject H0 if the P-value is less than or equal to that number
- In practice, the most common significance level is 0.05
- When we reject H0 we say the results are statistically significant

- Learning the actual P-value is more informative than learning only whether the test is “statistically significant at the 0.05 level”
- The P-values of 0.01 and 0.049 are both statistically significant in this sense, but the first P-value provides much stronger evidence against H0 than the second

- Analogy: Legal trial
- Null Hypothesis: Defendant is Innocent
- Alternative Hypothesis: Defendant is Guilty
- If the jury acquits the defendant, this does not mean that it accepts the defendant’s claim of innocence
- Innocence is plausible, because guilt has not been established beyond a reasonable doubt

- Things to consider in deciding on the alternative hypothesis:
- The context of the real problem
- In most research articles, significance tests use two-sided P-values
- Confidence intervals are two-sided

- In practice, the large-sample z test still performs quite well in two-sided alternatives even for small samples.
- Warning: For one-sided tests, when p0 differs from 0.50, the large-sample test does not work well for small samples. In fact, we can find the exact p-value using the binomial distribution.
- Example: A coin was flipped 20 times and the coin turned up heads 14 times. Test H0: p = 0.5 against Ha: p ≠ 0.5.
- Solution: Let X be the number of heads among the 20 tosses. The exact p-value equals
P(X ≥ 14) + P(X ≤ 6) = 0.05766 + 0.05766 = 0.115.
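
The exact P-value for the coin example can be computed directly from the binomial distribution. This is a short R sketch; the comparison with `binom.test` is included only as a consistency check.

```r
# Exact two-sided binomial test of H0: p = 0.5 with 14 heads in 20 tosses
n <- 20
p0 <- 0.5

upper_tail <- pbinom(13, n, p0, lower.tail = FALSE)  # P(X >= 14)
lower_tail <- pbinom(6, n, p0)                       # P(X <= 6)
p_exact <- upper_tail + lower_tail

# R's built-in exact test gives the same two-sided P-value here
p_builtin <- binom.test(14, n, p = p0)$p.value

print(round(c(upper_tail, lower_tail, p_exact, p_builtin), 5))
```

Each tail is about 0.05766, so the exact two-sided P-value is about 0.115, as stated above.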

- In a survey by Media General and the Associated Press, 813 of the 1084 respondents indicated support for a ban on household aerosols. At the 1% significance level, test the claim that more than 70% of the population supports the ban.

- In a Roper Organization poll of 2000 adults, 1280 have money in regular savings accounts. Use this sample data to test the claim that less than 65% of all adults have money in regular savings accounts. Use a 5% level of significance.

- According to a Harris Poll, 71% of Americans believe that the overall cost of lawsuits is too high. If a random sample of 500 people results in 74% who hold that belief, test the claim that the actual percentage is 71%. Use a 10% significance level.

Chapter 9: Statistical Inference: Significance Tests about Hypotheses

Section 9.3: Significance Tests About Means

- Steps of a Significance Test about a Population Mean
- Summary of P-values for Different Alternative Hypotheses
- Example: Significance Test for a Population Mean
- Results of Two-Sided Tests and Results of Confidence Intervals Agree
- What If the Population Does Not Satisfy the Normality Assumption?
- Regardless of Robustness, Look at the Data

- Step 1: Assumptions
- The variable is quantitative
- The data are obtained using randomization
- The population distribution is approximately normal. This is most crucial when n is small and Ha is one-sided.

- Step 2: Hypotheses:
- The null hypothesis has the form:
- H0: µ = µ0

- The alternative hypothesis has the form:
- Ha: µ > µ0 (one-sided test) or
- Ha: µ < µ0 (one-sided test) or
- Ha: µ ≠ µ0 (two-sided test)

- Step 3: Test Statistic
- The test statistic measures how far the sample mean falls from the null hypothesis value µ0, as measured by the number of standard errors between them
- The test statistic is: \(t = \dfrac{\bar{x} - \mu_0}{se}\), where \(se = s/\sqrt{n}\)

- Step 4: P-value
- The P-value summarizes the evidence
- It describes how unusual the data would be if H0 were true

- Step 5: Conclusion
- We summarize the test by reporting and interpreting the P-value

- A study compared different psychological therapies for teenage girls suffering from anorexia
- The variable of interest was each girl’s weight change: ‘weight at the end of the study’ – ‘weight at the beginning of the study’

- One of the therapies was cognitive therapy
- In this study, 29 girls received the therapeutic treatment
- The weight changes for the 29 girls had a sample mean of 3.00 pounds and standard deviation of 7.32 pounds

- How can we frame this investigation in the context of a significance test that can detect whether the therapy was effective?
- Null hypothesis: “no effect”
- Alternative hypothesis: therapy is “effective”

- Step 1: Assumptions
- The variable (weight change) is quantitative
- The subjects are a good representation of all girls with anorexia.
- The population distribution is approximately normal

- Step 2: Hypotheses
- H0: µ = 0
- Ha: µ > 0

- Step 3: Test Statistic
- \(t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{3.00 - 0}{7.32/\sqrt{29}} = 2.21\)

- Step 4: P-value
The P-value is the area to the right of t = 2.21 for the t sampling distribution with 28 df. This value is 0.018.

- If the treatment had no effect, the probability of obtaining a sample this extreme or more extreme would be 0.018

- Step 5: Conclusion
- The small P-value of 0.018 provides considerable evidence against the null hypothesis (the hypothesis that the therapy had no effect)

- “The therapy had a statistically significant positive effect on weight (mean change = 3 pounds, n = 29, t = 2.21, P-value = 0.018)”
- The effect, however, may be small in practical terms
- 95% CI for µ: (0.2, 5.8) pounds
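
The test statistic, P-value, and confidence interval for the anorexia example can all be reproduced from the summary statistics. This is a minimal R sketch; the variable names are illustrative.

```r
# One-sided t test of H0: mu = 0 against Ha: mu > 0 from summary statistics
xbar <- 3.00    # sample mean weight change (pounds)
s <- 7.32       # sample standard deviation
n <- 29         # number of girls
mu0 <- 0        # null hypothesis value ("no effect")

se <- s / sqrt(n)                    # estimated standard error
t_stat <- (xbar - mu0) / se          # test statistic
df <- n - 1

# Right-tail probability because Ha is mu > mu0
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# 95% confidence interval for mu
margin <- qt(0.975, df = df) * se
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(c(t = t_stat, p_value = p_value), 3))
print(round(ci, 1))
```

The interval shows why the effect may be small in practical terms: the plausible values for the mean weight gain start at only about 0.2 pounds.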

- After 16 weeks on a diet, 41 subjects lost an average of 9.7 kg with a standard deviation of 3.4 kg
- Calculate the P-value for testing H0: µ = 0 against Ha: µ < 0

- Conclusions about means using two-sided significance tests are consistent with conclusions using confidence intervals
- If P-value ≤ 0.05 in a two-sided test, a 95% confidence interval does not contain the value specified by the null hypothesis
- If P-value > 0.05 in a two-sided test, a 95% confidence interval does contain the value specified by the null hypothesis

- For large samples (roughly about 30 or more) this assumption is usually not important
- The sampling distribution of \(\bar{x}\) is approximately normal regardless of the population distribution

- In the case of small samples, we cannot assume that the sampling distribution of \(\bar{x}\) is approximately normal
- Two-sided inferences using the t distribution are robust against violations of the normal population assumption. They still usually work well if the actual population distribution is not normal
- The test does NOT work well for a one-sided test with small n when the population distribution is highly skewed

- Whether n is small or large, you should look at the data to check for severe skew or for severe outliers
- In these cases, the sample mean could be a misleading measure

Chapter 9: Statistical Inference: Significance Tests about Hypotheses

Section 9.4: Decisions and Types of Errors in Significance Tests

- Type I and Type II Errors
- Significance Test Results
- Type I Errors
- Type II Errors
- α, β, and Power

- When H0 is true, a Type I Error occurs when H0 is rejected
- When H0 is false, a Type II Error occurs when H0 is not rejected
- As P(Type I Error) goes Down, P(Type II Error) goes Up
- The two probabilities are inversely related

- If we reject H0 when in fact H0 is true, this is a Type I error.
- If we decide there is a significant relationship in the population (reject the null hypothesis):
- This is an incorrect decision only if H0 is true.
- The probability of this incorrect decision is equal to α.

- If we reject the null hypothesis when it is true and α = 0.05:
- There really is no relationship and the extremity of the test statistic is due to chance.
- About 5% of all samples from this population will lead us to incorrectly reject the null hypothesis and conclude significance.

- Suppose H0 is true. The probability of rejecting H0, thereby committing a Type I error, equals the significance level, α, for the test.
- We can control the probability of a Type I error by our choice of the significance level
- The more serious the consequences of a Type I error, the smaller α should be

- If we fail to reject H0 when in fact Ha is true, this is a Type II error.
- If we decide not to reject the null hypothesis and thus allow for the plausibility of the null hypothesis:
- We make an incorrect decision only if Ha is true.
- The probability of this incorrect decision is denoted by β

- The probability that a fixed level α significance test will reject H0 when a particular alternative value of the parameter is true is called the power of the test against that specific alternative value. Power = 1 − β.
- While α gives the probability of wrongly rejecting H0 when in fact H0 is true, power gives the probability of correctly rejecting H0 when in fact H0 should be rejected (because the value of the parameter is some specific value satisfying the alternative hypothesis)
- When µ is close to µ0, the test will find it hard to distinguish between the two (low power); however, when µ is far from µ0, the test will find it easier to find a difference (high power).
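
The relationship between power and the distance from the null value can be illustrated with a small calculation. The sketch below uses a one-sided test for a proportion rather than a mean, borrowing p0 = 1/3 and n = 116 from the astrology example; the alternative values 0.50 and 0.35 are assumptions picked for illustration, and `power_prop_test` is a hypothetical helper based on the normal approximation.

```r
# Approximate power of the one-sided z test of H0: p = p0 against Ha: p > p0
# when the true proportion is p1 (normal approximation)
power_prop_test <- function(p0, p1, n, alpha = 0.05) {
  se0 <- sqrt(p0 * (1 - p0) / n)           # standard error when H0 is true
  cutoff <- p0 + qnorm(1 - alpha) * se0    # reject H0 when phat >= cutoff
  se1 <- sqrt(p1 * (1 - p1) / n)           # standard error when p = p1
  pnorm((cutoff - p1) / se1, lower.tail = FALSE)  # P(phat >= cutoff | p = p1)
}

power_far <- power_prop_test(p0 = 1/3, p1 = 0.50, n = 116)   # alternative far from p0
power_near <- power_prop_test(p0 = 1/3, p1 = 0.35, n = 116)  # alternative close to p0

# beta, the probability of a Type II error, is 1 - power
beta_far <- 1 - power_far
beta_near <- 1 - power_near

print(round(c(power_far = power_far, power_near = power_near), 3))
```

Under these assumptions the power is close to 1 when the true proportion is 0.50 but only about 0.1 when it is 0.35, so a Type II error is very likely when the true value sits near the null value.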

Chapter 9: Statistical Inference: Significance Tests about Hypotheses

Section 9.5: Limitations of Significance Tests

- Statistical Significance vs. Practical Significance
- Significance Tests Are Less Useful Than Confidence Intervals
- Misinterpretations of Results of Significance Tests
- Where Did the Data Come From?

- When we conduct a significance test, its main relevance is studying whether the true parameter value:
- Differs from the value in H0, and
- If so, in which direction (above or below that value)

- The test does not tell us about the practical importance of the results

- When the sample size is very large, tiny deviations from the null hypothesis (with little practical consequence) may be found to be statistically significant.
- When the sample size is very small, large deviations from the null hypothesis (of great practical importance) might go undetected (statistically insignificant). So,
Statistical significance is not the same thing as practical significance.

- A small P-value, such as 0.001, is highly statistically significant, but it does not imply an important finding in any practical sense
- In particular, whenever the sample size is large, small P-values can occur even when the point estimate is near the parameter value in H0

- A significance test merely indicates whether the particular parameter value in H0 is plausible
- When a P-value is small, the significance test indicates that the hypothesized value is not plausible, but it tells us little about which potential parameter values are plausible
- A Confidence Interval is more informative, because it displays the entire set of believable values

- “Fail to Reject H0” does not mean “Accept H0”
- A P-value above 0.05 when the significance level is 0.05 does not mean that H0 is correct
- A test merely indicates whether a particular parameter value is plausible
- When we fail to reject H0: µ = 10, we do not mean to accept µ = 10, but mean that 10 is a plausible value for µ.