
Chapter 4: Correlation and Regression



  1. Chapter 4: Correlation and Regression Lecture PowerPoint Slides

  2. Chapter 4 Overview • 4.1 Scatterplots and Correlation • 4.2 Introduction to Regression • 4.3 Further Topics in Regression

  3. The Big Picture Where we are coming from and where we are headed… • Chapter 3 showed us methods for summarizing data using descriptive statistics, but only one variable at a time. • In Chapter 4, we learn how to analyze the relationship between two quantitative variables using scatterplots, correlation, and regression. • In Chapter 5, we will learn about probability, which we will need in order to perform statistical inference.

  4. 4.1: Scatterplots and Correlation Objectives: • Construct and interpret scatterplots for two quantitative variables. • Calculate and interpret the correlation coefficient. • Determine whether a linear correlation exists between two variables.

  5. Scatterplots Whenever you are examining the relationship between two quantitative variables, your best bet is to start with a scatterplot. A scatterplot is used to summarize the relationship between two quantitative variables that have been measured on the same element. A scatterplot is a graph of points (x, y), each of which represents one observation from the data set. One of the variables is measured along the horizontal axis and is called the x variable. The other variable is measured along the vertical axis and is called the y variable.

  6. Scatterplots The relationship between two quantitative variables can take many different forms. Four of the most common are: Positive linear relationship: As x increases, y also tends to increase. Negative linear relationship: As x increases, y tends to decrease. No apparent relationship: As x increases, y tends to remain unchanged. Nonlinear relationship: The x and y variables are related, but not in a way that can be approximated using a straight line.
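For readers following along in software, here is a minimal sketch of constructing a scatterplot in Python with matplotlib. The low/high temperature data are hypothetical, invented purely for illustration (they are not from the slides).

```python
# A minimal scatterplot sketch; the data below are hypothetical.
import matplotlib.pyplot as plt

low_temps = [30, 40, 45, 50, 55, 60]   # x variable (horizontal axis)
high_temps = [48, 55, 62, 65, 69, 74]  # y variable (vertical axis)

plt.scatter(low_temps, high_temps)
plt.xlabel("Low temperature (°F)")
plt.ylabel("High temperature (°F)")
plt.title("High temperature vs. low temperature")
plt.show()
```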

  7. Correlation Coefficient Scatterplots provide a visual description of the relationship between two quantitative variables. The correlation coefficient is a numerical measure for quantifying the linear relationship between two quantitative variables. The correlation coefficient r measures the strength and direction of the linear relationship between two variables. The correlation coefficient r is

  r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}

where s_x is the sample standard deviation of the x data values, and s_y is the sample standard deviation of the y data values.

  8. Calculating Correlation Coefficient

  9. Calculating Correlation Coefficient
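As a concrete illustration of the calculation, here is a short Python sketch that computes r directly from the definition above; the data are hypothetical, reused from the scatterplot sketch.

```python
# Computing r from its definition:
# r = sum((x - xbar)(y - ybar)) / ((n - 1) * sx * sy).
# The data are hypothetical, for illustration only.
import statistics

x = [30, 40, 45, 50, 55, 60]
y = [48, 55, 62, 65, 69, 74]
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)  # sample standard deviations

r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
print(round(r, 4))  # close to +1: strong positive linear relationship
```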

  10. Properties of r Divide the scatterplot into four regions with a vertical line through the mean x̄ and a horizontal line through the mean ȳ; Regions 1 and 3 are the quadrants where (x - x̄) and (y - ȳ) have the same sign, and Regions 2 and 4 are the quadrants where they have opposite signs. • If most of the data values fall in Regions 1 and 3, r will tend to be positive. • If most of the data values fall in Regions 2 and 4, r will tend to be negative. • If the four regions share the data values more or less equally, then r will be near zero.

  11. Properties of r • The correlation coefficient r always satisfies -1 ≤ r ≤ 1. • When r = +1, a perfect positive relationship exists between x and y. • Values of r near +1 indicate a positive relationship between x and y. • The closer r gets to +1, the stronger the evidence for a positive relationship. • The variables are said to be positively associated. • As x increases, y tends to increase. • When r = -1, a perfect negative relationship exists between x and y. • Values of r near -1 indicate a negative relationship between x and y. • The closer r gets to -1, the stronger the evidence for a negative relationship. • The variables are said to be negatively associated. • As x increases, y tends to decrease. • Values of r near 0 indicate there is no linear relationship between x and y. • The closer r gets to 0, the weaker the evidence for a linear relationship. • The variables are not linearly associated. • A nonlinear relationship may exist between x and y.

  12. Properties of r

  13. Test for Linear Correlation There is a simple comparison test that tells us whether the correlation coefficient is strong enough to conclude that the variables are correlated. • Comparison Test for Linear Correlation • Find the absolute value |r| of the correlation coefficient. • Turn to the Table of Critical Values for the Correlation Coefficient and select the row corresponding to the sample size n. • Compare |r| to the critical value from the table. • If |r| > critical value, you can conclude x and y are linearly correlated. • If r > 0, they are positively correlated. • If r < 0, they are negatively correlated. • If |r| is not greater than the critical value, then x and y are not linearly correlated.
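A small sketch of this comparison test in Python follows. The critical value must come from the Table of Critical Values for the given sample size n; the value 0.632 used below (a value commonly tabulated for n = 10) is an assumption for illustration only.

```python
# Comparison test for linear correlation; the critical value is a
# placeholder that would normally be looked up in the table for n.
def linear_correlation_test(r, critical_value):
    if abs(r) > critical_value:
        direction = "positively" if r > 0 else "negatively"
        return f"x and y are {direction} correlated"
    return "x and y are not linearly correlated"

print(linear_correlation_test(r=0.95, critical_value=0.632))  # assumed critical value
```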

  14. 4.2: Introduction to Regression Objectives: • Calculate the regression coefficients and write the equation of the regression line. • Interpret the slope and y-intercept of the regression line. • Use the regression equation to make predictions and understand prediction error.

  15. The Regression Line Section 4.1 introduced the correlation coefficient. In this section, we learn how to approximate the linear relationship between two numerical variables using the regression line and regression equation. We write the equation of the regression line as \hat{y} = b_1 x + b_0.

  16. The Regression Line Equation of the Regression Line The equation of the regression line that approximates the relationship between x and y is \hat{y} = b_1 x + b_0, where the regression coefficients are the slope, b_1, and the y-intercept, b_0. The equations of these coefficients are

  b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}

Note: The “hat” over the y (pronounced “y-hat”) indicates this is an estimate of y and not necessarily an actual value of y.
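As a sketch of how these formulas translate into code, here is a short Python computation of b_1 and b_0; the data are hypothetical, reused from the sketches above.

```python
# Regression coefficients from their definitions:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar.
import statistics

x = [30, 40, 45, 50, 55, 60]  # hypothetical data
y = [48, 55, 62, 65, 69, 74]

xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
print(f"y-hat = {b1:.4f} x + {b0:.4f}")
```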

  17. Interpreting Slope and y-Intercept • In statistics, we interpret the slope of the regression line as the estimated change in y per unit increase in x. • The y-intercept is interpreted as the estimated value of y when x equals 0. b_1 = 0.9: For each increase of 1°F in low temp, the estimated high temp increases by 0.9°F. b_0 = 20: When the low temp is 0°F, the estimated high temp is 20°F. • The slope b_1 and the correlation coefficient r always have the same sign. • b_1 is positive if and only if r is positive. • b_1 is negative if and only if r is negative.

  18. Predictions and Prediction Error We can use the regression equation to make estimates or predictions. For any value of x, the predicted value of y lies on the regression line. Example: Low Temp = 50°F gives \hat{y} = 0.9(50) + 20 = 65. Note: The predicted high temp for a city with a low temp of 50°F is 65°F. Dallas had a low temp of 50°F and an actual high temp of 70°F.

  19. Predictions and Prediction Error • The prediction error, or residual, y - \hat{y}, measures how far the predicted “y-hat” value is from the actual value of y observed in the data set. The prediction error may be positive or negative. • Positive prediction error: The data value lies above the regression line, so the observed value is greater than the predicted value for the given value of x. • Negative prediction error: The data value lies below the regression line, so the observed value is less than the predicted value for the given value of x. • Prediction error equal to zero: The data value lies directly on the regression line, so the observed value of y is exactly equal to what is predicted for the given value of x.
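A short sketch of the slides' Dallas example, computed in Python:

```python
# Prediction and residual for the slides' temperature example,
# using the fitted line y-hat = 0.9x + 20.
b1, b0 = 0.9, 20

low_temp = 50
predicted_high = b1 * low_temp + b0  # 0.9 * 50 + 20 = 65
actual_high = 70                     # Dallas's observed high temperature

residual = actual_high - predicted_high
print(predicted_high, residual)      # 65.0 5.0 -> Dallas lies above the line
```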

  20. Cautions with Regression The correlation coefficient and regression line are both sensitive to extreme values. Extrapolation consists of using the regression equation to make estimates or predictions based on x-values that are outside the range of the x-values in the data set. Extrapolation should be avoided, because the linear relationship may not hold outside that range.

  21. 4.3: Further Topics in Regression Analysis Objectives: • Calculate the sum of squares error (SSE), and use the standard error of the estimate s as a measure of a typical prediction error. • Describe how total variability, prediction error, and improvement are measured by the total sum of squares (SST), the sum of squares error (SSE), and the sum of squares regression (SSR). • Explain the meaning of the coefficient of determination r² as a measure of the usefulness of the regression.

  22. Sum of Squares Error (SSE) Consider the results for ten subjects who were given a set of short-term memory tasks. The memory score and time to memorize are given. The sum of squares error is the sum of the squared prediction errors, SSE = \sum (y - \hat{y})^2. The least-squares criterion states that the regression line is the line for which the SSE is minimized.

  23. Standard Error of the Estimate s The standard error of the estimate gives a measure of the typical residual; that is, s is a measure of the typical prediction error. If the typical prediction error is large, then the regression line may not be useful. The standard error of the estimate is s = \sqrt{\text{SSE}/(n - 2)}. For the memory data, SSE = 12 and n = 10, so s = \sqrt{12/8} \approx 1.2247.
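A quick check of this arithmetic in Python:

```python
# Standard error of the estimate, s = sqrt(SSE / (n - 2)),
# using the slide's values SSE = 12 and n = 10.
import math

sse, n = 12, 10
s = math.sqrt(sse / (n - 2))
print(round(s, 4))  # 1.2247 -> the typical prediction error
```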

  24. SST, SSR, and SSE The coefficient of determination r² depends on the values of two new statistics, SST and SSR. The least-squares criterion guarantees that SSE is the smallest possible value for a data set. However, this does not guarantee that the regression is useful. Suppose our best estimate for y was the average y value. Now suppose you found the difference between each y value and the average y and summed the squared differences. This would be the total sum of squares (SST): SST = \sum (y - \bar{y})^2.

  25. SST, SSR, and SSE Since we have a regression equation, we can make predictions for y that are (hopefully) more accurate than the average y value. The amount of improvement is the difference between y-hat and the average y. This leads us to the sum of squares regression (SSR): SSR = \sum (\hat{y} - \bar{y})^2. Relationship Among SST, SSR, and SSE: SST = SSR + SSE.

  26. Coefficient of Determination r² SSR represents the amount of variability in the response variable that is accounted for by the regression equation. SSE represents the amount of variability in y that is left unexplained after accounting for the relationship between x and y. Since we know that SST represents the sum of SSR and SSE, it makes sense to consider the ratio of SSR and SST, called the coefficient of determination r². Coefficient of Determination r² The coefficient of determination r² = SSR/SST measures the goodness of fit of the regression equation to the data. We interpret r² as the proportion of the variability in y that is accounted for by the linear relationship between y and x. The values that r² can take are 0 ≤ r² ≤ 1.
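To tie the pieces together, here is a sketch that computes SST, SSR, SSE, and r² from their definitions and verifies SST = SSR + SSE; the data and fitted line are hypothetical, reused from the earlier sketches.

```python
# SST, SSR, SSE, and r^2 from their definitions; hypothetical data.
import statistics

x = [30, 40, 45, 50, 55, 60]
y = [48, 55, 62, 65, 69, 74]

xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
y_hat = [b1 * xi + b0 for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                # total variability
ssr = sum((yh - ybar) ** 2 for yh in y_hat)            # explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained

print(round(sst, 2), round(ssr + sse, 2))  # equal: SST = SSR + SSE
print(round(ssr / sst, 4))                 # r^2, proportion explained
```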

  27. Chapter 4 Overview • 4.1 Scatterplots and Correlation • 4.2 Introduction to Regression • 4.3 Further Topics in Regression
