1 / 44

Chapter 10 Correlation and Regression

Chapter 10 Correlation and Regression. Scatter Diagrams and Linear Correlation. Note:. Studies of correlation and regression of two variables usually begin with a graph of paired data values ( x,y ). We call this a scatter diagram. Scatter Plot/Scatter Diagram.

gannon
Download Presentation

Chapter 10 Correlation and Regression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Chapter 10 Correlation and Regression

  2. Scatter Diagrams and Linear Correlation

  3. Note: • Studies of correlation and regression of two variables usually begin with a graph of paired data values (x,y). We call this a scatter diagram

  4. Scatter Plot/Scatter Diagram • It is a graph in which data pairs (x,y) are plotted as individual points on a grid with horizontal axis x and vertical axis y. We call x the explanatory variable and y the response variable.

  5. Why do we use scatter plot? • We use scatter plot to observe whether there seems to be a linear relationship between x and y values.

  6. Example: • Phosphorous is a chemical used in many household and industrial cleaning compound. Unfortunately, phosphorous tends to find its way into surface water, where it can kill fish, plants, and other wetland creatures. Phosphorous reduction programs are required by law and are monitored by the EPA. A random sample of eight sites in a California wetlands study gave the following information about phosphorous reduction in drainage water. X is a random variable that represents phosphorous concentration and y is a random variable that represents total phosphorous concentration. • Graph the scatter plot and then comment on the relationship between x and y

  7. Group Work • Here are the safety report of different divisions: • Make a scatter plot • Does a line fit the data reasonably well? • Draw a line that “fits best”

  8. Scatter Diagrams with its correlation

  9. Correlation coefficient r • It is a numerical measurement that assesses the strength of a linear relationship between two variables x and y • 1. r is a unitless measurement between -1 and 1. . If r=1, there is perfect positive linear correlations. If r=-1, there is perfect negative linear correlation. If r=0, then there is no linear correlation. The closer r is to 1 or -1, the better a line describes the relationship between the two variables x and y. • 2. Positive values of r imply that as x increases, y tends to increase. • Negative values of r imply that as x increases, y tends to decrease. • 3. The value of r is the same regardless of which variable is the explanatory variable and which is the response variable. In other words, the value of r is the same for the pairs (x,y) and the corresponding pairs (y,x) • 4. The value of r does not change when either variable is converted to different units.

  10. How do you compute the sample correlation coefficient r? • Obtain a random sample of n data pairs (x,y). The data pairs should have a bivariate normal distribution. This means that for a fixed value of x, the y values should have a normal distribution, and for a fixed y, the x values should have their own normal distribution • 1. Using the data pairs, compute • 2. With n=sample size, , you are ready to compute the sample correlation coefficient r using the computation formula

  11. Example: calculate r

  12. Group work: calculate r

  13. Note: • Usually a scatter diagram does not contain all possible data points that could be gathered. • r=sample correlation coefficient computed from a random sample of (x,y) data pairs. • = population correlation coefficient computed from all population data pairs (x,y) • In ordered pairs (x,y), x is called the explanatory variable and y is called the response variable. When r indicates a linear correlation between x and y, changes in values of y tend to respond to changes in values of x according to a linear model. A lurking variable is a variable that is neither an explanatory nor a response variable. Yet, a lurking variable may be responsible for changes in both x and y

  14. TI 83/TI 84 • Enter the data into two columns. Use Stat Plot and choose the first type. Use option 9:ZoomStat under Zoom. • CATALOG, find DiagnosticON, press enter twice. Then, press STAT, CALC, then option 8:LinReg(a+bx)

  15. Homework Practice • P503 #1-16 even

  16. Linear Regression and the Coefficient of Determination

  17. Least-squares criterion • The sum of the squares of the vertical distances from the data points (x,y) to the line is made as small as possible.

  18. What does it mean? • It means that for any given point, the distance from the point to the line has the least amount of error (the sum of the squares of the vertical distance from the points to the line be made as small as possible). We use the least-squares criterion to find the best linear equation.

  19. How to find the equation of the Least-Squares line • Slope: • Y-intercept= • Another way to calculate b is:

  20. Example: Find the least-squares liney=a+bx

  21. Sketch the scatter plot and the least-squares line

  22. Remember • The point is always on the least-squares line

  23. Using the Least-Squares Line for Prediction • Predicting values for x values that between observed x values in the data set is called interpolation • Predicting values of x values that are beyond observed x values in the data set is called extrapolation. Extrapolation may produce unrealistic forecasts.

  24. Group Work • Find the least-squares line. • Predict when x = 51 X= Super soldier serum y=Captain America

  25. Coefficient of determination • is the coefficient of determination. • 1) Compute the sample correlation r using the procedure for r. Then simply compute , the sample coefficient of determination • 2) The value is the ratio of explained variation over total variation. That is, is the fractional amount of total variation in y that can be explained by using the linear model • 3)Furthermore, is the fractional amount of total variation in y that is due to random chance or to the possibility of lurking variables that influence y.

  26. Homework Practice • Pg 520 #1-18 eoo

  27. Inferences for Correlation and Regression

  28. Sample Statistic to Population Parameter

  29. Important Note: To make inferences regarding the population correlation coefficient and the slope of the population least-squares line, we need to be sure that • The set (x,y) of ordered pairs is a random sample from the population of all possible such (x,y) pairs. • For each fixed value of x, the y values have a normal distribution. All of the y distributions have the same variance, and, for a given x value, the distribution of y values has a mean that lies on the least-squares line. We also assume that for a fixed y, each x has its own normal distribution. In most cases the result are still accurate if the distribution are simply mound-shaped and symmetric, and the y variances are approximately equal.

  30. How to test the Population Correlation Coefficient • 1) Use the null hypothesis . In the context of the application, state the alternate hypothesis ( and set the level of significance . • 2) Obtain a random sample of data pairs (x,y) and compute the sample test statistic • with degrees of freedom • 3) Use a Student’s t distribution and the type of test, one-tailed or two-tailed, to find (estimate) the P-value corresponding to the test statistic • 4) Conclude the test. If P-value If P-value • 5) Interpret your conclusion in the context of the application.

  31. Example: • Learn more, earn more! We have probably all heard this platitude. The question is whether or not there is some truth in the statement. Do college graduates have an improved chance at a better income? Is there a trend in the general population? Consider the following variables: x=percentage of the population 25 or older with at least four years of college and y = percentage growth in per capita income over the past seven years. A random sample of six communities in Ohio gave the info. Use 1% level of significance

  32. Answer • First you compute the sample correlation coefficient r. • with d.f=4 • 0.005<P-value<0.010 • Actual P-value=0.0092 so we reject null and conclude that population correlation coefficient between x and y is positive.

  33. Group Work • A medical research team is studying the effect of a new drug on red blood cells. Let x be a random variable representing milligrams of the drug given to a patient. Let y be a random variable representing red blood cells per cubic milliliter of whole blood. A random sample of n=7 volunteer patients gave the following result. • Use 1% level of significance to test claim that

  34. Standard Error of Estimate • It is to find the “error” from the data points to the best fit line. You want to minimalize this. • 1) Obtain a random sample of n data pairs (x,y) • 2)Use the procedure to find a and b from the sample least-squares line

  35. TI 83/TI 84 • The value of is given as s under STAT, TEST, Option E:LinREGTTest

  36. How to Find a Confidence Interval for a predicted y from the Least-Squares Line • 1) Obtain a random sample of data pairs (x,y) • 2) Use the procedure to find the least-squares line . You also need to find from the sample data and the standard error of estimate using equation of this section. • 3) The c confidence interval for y for a specified value x is

  37. Confidence interval Continue. • is the predicted value of y from the least-squares line for a specified x value • c=confidence level • n= number of data pairs • critical value from student’s t distribution for c confidence level using d.f.=n-2 • = standard error of estimate

  38. Example: Find the least-square equation Find r Find when x is at 45 Find the 95% confidence interval

  39. Inferences about the slope • It is important because it measures the rate at which y changes per unit change in x

  40. How to Test and find a confidence interval for • Obtain a random sample of 3 or more data pairs. • 1) Use the null hypothesis . Use an alternate hypothesis . Set the level of significance . • 2) sample test statistic • with d.f.= n-2 • 3) Use a Student’s t distribution and the type of test, one-tailed or two-tailed, find the P-value • 4) Conclude • 5) Interpret your result

  41. To Find a confidence interval for • Where • c=confidence level • n=number of data pairs • students t distribution critical value for confidence level c and d.f.=n-2 • standard error of estimate

  42. Example: • Plate tectonics and the spread of the ocrean floor are very important topics in modern studies of earthquakes and earth science in general. • A) Compute a,b, , , • B) Use 5% level of significance to test the claim that is positive • C) Find a 75% confidence interval for

  43. Group Work • How fast do puppies grow? X=age in weeks and y=weight in kilograms for a random sample of male wolf pups. • A) Find the a, b, and standard error • B) Use a 1% level of significance to test the claim that • C) Compute an 80% confidence interval for

  44. Homework Practice • Pg 543 #1-12 odd

More Related