
Chapter 4: Describing the Relation between Two Variables

Chapter 4: Describing the Relation between Two Variables. 4.1 Scatter Diagrams and Correlation 4.2 Least Squares Regression 4.3 Diagnostics on the Least Squares Regression Line. October 11, 2008. Variable Association.

brandi

Presentation Transcript


  1. Chapter 4: Describing the Relation between Two Variables 4.1 Scatter Diagrams and Correlation 4.2 Least Squares Regression 4.3 Diagnostics on the Least Squares Regression Line October 11, 2008

  2. Variable Association Question: In a population, are two or more variables linked? For example, do Math 127A students with brown eyes have a higher IQ than students with other eye colors?

  3. Bivariate Data Recall: A variable is any characteristic of the objects in the population that will be analyzed. Data is the value (categorical or quantitative) that is measured for a variable. If we have only one variable that is measured, then we call this univariate data. If two variables are measured simultaneously, then we call it bivariate data.

  4. Example 1 Consider the population of all cars in the State of Tennessee. Suppose we collect data on the number of miles on each car and the age of the car. One variable is the mileage of each car and a second variable is the age (in years) of each car. This would form a bivariate dataset. Question: Is there a relationship between the age of the car and the number of miles?

  5. Example 2 Consider the population of all undergraduates at Vanderbilt during the present academic year. Suppose that we survey each student to determine the number of hours that they watch television each week and their GPA at the end of the Spring 2007 semester. The two variables are: (1) hours of TV watching per week and (2) GPA. This forms a bivariate dataset. Question: Is there a relationship between these two variables for Vanderbilt undergraduates?

  6. Response & Explanatory Variables Definition: Suppose we have bivariate data for two variables in a population or sample. The response (or dependent) variable is the variable whose value can be explained by the values of the explanatory (or independent) variable.

  7. Association between Variables Definition: Consider two variables associated with a population. We say that an association exists between the two variables if a particular value for one of the variables is more likely to occur with certain values of the other variable.

  8. Association between Two Quantitative Variables We now consider a sample that contains information about two quantitative variables. We want to determine if an association between these two variables exists.

  9. Association between Sets (variables)

  10. One Approach One approach is to look at the descriptive characteristics (statistics) of each set of the bivariate dataset separately. Example: S = (-2,3,7,8,9) and T = (0,1,4,5,10). Range: 11 (S), 10 (T). Mean: 5 (S), 4 (T). Median: 7 (S), 4 (T). Sample SD: 4.53 (S), 3.94 (T). Conclusion: Not much help!
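The summary statistics on this slide can be reproduced with a short script. This is a minimal sketch using Python's standard `statistics` module; the helper name `summarize` is illustrative and does not come from the slides.

```python
import statistics

def summarize(values):
    """Return range, mean, median and sample standard deviation."""
    return {
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }

S = [-2, 3, 7, 8, 9]
T = [0, 1, 4, 5, 10]

for name, data in (("S", S), ("T", T)):
    print(name, summarize(data))
```

Each summary describes one set on its own, so none of these numbers says how S and T vary together. That is why the slide concludes they are not much help.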

  11. A Better Approach: Scatterplots Definition: The plot of the points of A as points in the xy-plane is called a scatterplot. Remark: Although it technically doesn’t matter, we choose the first set to be the explanatory variable (horizontal axis) and the second set to be the response variable (vertical axis).

  12. Example Suppose that we have bivariate data where one sample (the explanatory variable) is (1,3,4,6,9,12) and the other sample (the response variable) is (2,-1,3,0,1,4).

  13. Scatterplots & Excel It is easy to create scatterplots in Excel. Assuming that your data is listed in two columns (or two rows), you select Chart from the Insert Menu and then choose xy scatter from the different types of charts. Then use the chart wizard to construct the scatterplot.

  14. Example Explanatory Variable: GDP Response Variable: Internet Use

  15. Positive & Negative Associations Definition: We say that two numerical variables (x & y) have a positive association if as x increases, y also tends to increase. We say that they have a negative association if as x increases, y tends to decrease. If there is neither a positive nor a negative association, we say that there is no association.

  16. Positive & Negative Association in Scatterplot

  17. Example • Consider the bivariate data: • S = (0,1,2,3,4,5,6,7,8,9,10) (explanatory) • T = (4,4,5,6,4,4,5,9,5,11,6) (response) • Is there an association?

  18. Example • Consider the bivariate data: • S = (0,1,2,3,4,5,6,7,8,9,10) (explanatory) • T = (4,4,5,6,4,4,5,9,5,11,6) (response) • Is there an association? There appears to be a positive association between the explanatory and response variables.

  19. Example This example deals with the correlation between the Pat Buchanan and Ross Perot countywide votes in the 1996 and 2000 elections in Florida. Each dot (x,y) is a county in Florida with the first component the Perot vote and the second component the Buchanan vote for two different elections (1996, 2000).

  20. Generic Scatterplot Consider a bivariate set of quantitative data and suppose that we construct the scatterplot for this data.

  21. Linear Response Consider a bivariate set of quantitative data and suppose that we construct the scatterplot for this data. There appears to be a linear relationship between x and y in a “fuzzy” sense.

  22. The Linear Correlation Coefficient Consider a bivariate set of data. If we believe that there is a linear response between the two variables, then we can define a number (which we will denote by r) that is a measure of how much the scatterplot varies from a linear relationship between the two variables (x & y): y = mx + b. Remark: The correlation coefficient is sometimes called the Pearson Correlation Coefficient.

  23. Calculation of r
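The formula on this slide did not survive in the transcript. The sketch below assumes the standard sample Pearson formula, r = (1/(n-1)) Σ ((x_i - x̄)/s_x)((y_i - ȳ)/s_y), which is consistent with the sample standard deviations used in the by-hand regression example later in the chapter. The function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (assumed standard formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sum of products of z-scores, again divided by n - 1.
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Data from the earlier example (slides 17 and 18).
S = list(range(11))
T = [4, 4, 5, 6, 4, 4, 5, 9, 5, 11, 6]
r = pearson_r(S, T)
print(round(r, 2))
```

For the data of slides 17 and 18 the result is positive, in line with the positive association seen in that example.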

  24. Remark

  25. What does r tell us? Suppose we have a bivariate set of data with x arbitrary, but y = mx + b. That is, the two sets are linearly related. What is the correlation coefficient for this type of set? Example: Let m = 2 and b = 1. Then S = (1,2,3,4,…,12) and T = (3,5,7,9,…,25). Using the formula, we find r = 1.

  26. What does r tell us? Example: Let m = -2 and b = 25. Then S = (1,2,3,4,…,12) and T = (23,21,19,…,1). Using the formula, we find r = -1 .
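Both results follow from the definition of r, assuming the standard sample formula (the slide's own formula is not preserved in the transcript). A short derivation for exactly linear data with any slope m ≠ 0:

```latex
\begin{aligned}
y_i = m x_i + b \;&\Rightarrow\; \bar{y} = m\bar{x} + b, \qquad s_y = |m|\, s_x, \\
\frac{y_i - \bar{y}}{s_y} &= \frac{m\,(x_i - \bar{x})}{|m|\, s_x}
  = \operatorname{sign}(m)\,\frac{x_i - \bar{x}}{s_x}, \\
r &= \frac{1}{n-1}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x}\cdot\frac{y_i - \bar{y}}{s_y}
   = \operatorname{sign}(m)\,\frac{1}{s_x^{2}}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^{2}
   = \operatorname{sign}(m).
\end{aligned}
```

The last step uses s_x² = (1/(n-1)) Σ (x_i - x̄)². So r = 1 when m = 2 and r = -1 when m = -2, regardless of b.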

  27. Linear Correlation: r • If r = 1, then there is a perfect positive linear association between the variables. • If r = -1, then there is a perfect negative linear association between the variables. • If r = 0, then there is no linear correlation between the variables. • If 0 < r < 1, then there is some positive correlation, although the nearer that r is to zero, the weaker the correlation. • If -1 < r < 0, then there is some negative correlation between the variables. • If r = 0, it does not mean that there is no association, but rather no linear association. In other words, r measures the strength of the linear association between the two variables. The relationship between two variables may be nonlinear, yet you can approximate the nonlinear relationship by a linear relationship.

  28. The Bottom Line If you want to know if there is a linear association between two quantitative variables in a bivariate set, compute the correlation coefficient r. Its sign (+ or -) will tell you if there is a positive or negative association and the magnitude of r, |r|, will tell you the strength of the linear association.

  29. Example Consider bivariate data: S = (0,1,2,…,8,9) and T = (1.00,2.00,2.09,2.14,…,2.30,2.32). The data in set T was generated by the function: f(x) = x^(1/8) + 1. r = 0.72

  30. Example Consider the function y = f(x) = x^10 and the x-values (0, 0.1, 0.2, 0.3,…, 0.9, 1.0). We form a bivariate set with these points: ((0,0), (0.1, 10^-10),…,(0.9,0.348678), (1,1)). The correlation coefficient for this data is r = 0.669641. This indicates a medium strength linear correlation. However, it is a perfect nonlinear correlation with the nonlinear function x^10.
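The value of r in this example can be checked numerically. The sketch below assumes the standard sample Pearson formula, written in the equivalent form r = S_xy / sqrt(S_xx · S_yy); the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson r via sums of squared deviations (assumed standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# x-values 0, 0.1, ..., 1.0 and y = x^10, as in the slide.
xs = [i / 10 for i in range(11)]
ys = [x ** 10 for x in xs]

r = pearson_r(xs, ys)
print(round(r, 4))  # approximately 0.6696
```

Every point lies exactly on the curve y = x^10, yet r is well below 1, because r only measures how close the points are to a straight line.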

  31. Example Is there a linear association between gestation period and life expectancy?

  32. Note: The explanatory variable is the gestation period and the response variable is the life expectancy. Also, dogs and cats have the same data.

  33. Example The U.S. Federal Reserve Board provides data on the percentage of disposable personal income required to meet consumer loan payments and mortgage payments. The following table summarizes this yearly data over the past several years. Question: Are consumer debt and household debt correlated?

  34. Means: 7.22133 (consumer) 5.94333 (household) Sample SD: 0.623583 (consumer) 0.175526 (household) Correlation Coefficient: r = 0.117813

  35. Excel and Correlation Excel can be used to find the correlation coefficient for a set of bivariate data. In the Tools menu, select Data Analysis. In the Data Analysis window, select Correlation and follow the wizard. It produces what is called a correlation matrix. The number that occupies the 2nd row and 1st column is the correlation coefficient. It is possible to calculate the correlation coefficients between several variables using this tool.

  36. Least Squares Regression Section 4.2

  37. Reminder about Lines The equation of a straight line is: y = mx + b. The number m is called the slope of the line and the number b is called the y-intercept. If m > 0, then y increases with x, and if m < 0, then y decreases with x. Given two distinct points in the plane with different x-coordinates, one can find the numbers m and b. Points (x,y) that satisfy the same equation, y = mx + b, are said to be co-linear.
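Finding m and b from two points can be sketched as follows. The function names and the sample points are hypothetical, chosen only for illustration. A vertical line (two points with the same x) cannot be written as y = mx + b, so the sketch rejects that case.

```python
def line_through(p, q):
    """Slope m and y-intercept b of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b

def is_on_line(point, m, b, tol=1e-9):
    """True if the point satisfies y = m*x + b (within a small tolerance)."""
    x, y = point
    return abs(y - (m * x + b)) <= tol

m, b = line_through((1, 3), (4, 9))  # hypothetical points
print(m, b)                          # slope 2.0, intercept 1.0
print(is_on_line((2, 5), m, b))      # True: (2, 5) is co-linear with the two points
```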

  38. Remark

  39. Problem Given a set of points in the xy-plane that are not co-linear, no single line passes through all of them, and there are infinitely many lines that can be drawn through the cloud of points.

  40. Error and Residual

  41. Least Squares Line

  42. Lots of things to compute! • To compute the least squares line one must calculate: • the sample standard deviations of the two sets • the means of the two sets • the correlation between the two sets. • Fortunately, there is technology available to do this for us: http://www.shodor.org/unchem/math/lls/leastsq.html .

  43. Example (by hand) • Find the least squares line for the data set: ((-1,1),(0,2),(2,-1),(3,0)). • X = (-1,0,2,3) and Y = (1,2,-1,0). • Means: 1 and 0.5, respectively • Sample standard deviations: 1.83 and 1.29, respectively • Correlation coefficient: r = -0.71 • m = -0.71(1.29/1.83) = -0.50 • b = 0.5 - (-0.50)(1) = 1.0 • y = -0.5x + 1.0
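The hand computation above can be reproduced in a few lines. This sketch follows the same steps as the slide (means, sample standard deviations, r, then m = r·s_y/s_x and b = ȳ - m·x̄); the function name `least_squares_line` is illustrative, and the formula for r is the standard sample formula, assumed here because the transcript omits it.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (m, b, r) for the least squares line y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    # Sample correlation coefficient (assumed standard formula).
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    m = r * s_y / s_x          # slope, as on the slide
    b = mean_y - m * mean_x    # intercept, as on the slide
    return m, b, r

X = [-1, 0, 2, 3]
Y = [1, 2, -1, 0]
m, b, r = least_squares_line(X, Y)
print(round(r, 2), round(m, 2), round(b, 2))  # approximately -0.71, -0.5, 1.0
```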

  44. Example Anthropologists use bones to predict the height of individuals. x = length of femur (thighbone) in cm, y = height in cm. What is the predicted height of an individual with a 50 cm femur? The regression equation predicts: (2.4)(50) + 61.4 = 181.4 cm = 71.4 in
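The prediction is a direct evaluation of the regression equation. A minimal sketch, using the slope 2.4 and intercept 61.4 from the slide; the function name and the unit-conversion constant are additions for illustration.

```python
CM_PER_INCH = 2.54

def predicted_height_cm(femur_cm, slope=2.4, intercept=61.4):
    """Predicted height (cm) from femur length (cm), per the slide's equation."""
    return slope * femur_cm + intercept

height_cm = predicted_height_cm(50)
height_in = height_cm / CM_PER_INCH
print(f"{height_cm:.1f} cm = {height_in:.1f} in")  # 181.4 cm = 71.4 in
```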

  45. Interpreting the y-intercept • y-intercept: • The predicted value for y when x = 0. • Helps in plotting the line, because it gives the point where the least squares regression line crosses the y-axis. • May not have any interpretative value if no observations had x values near 0.

  46. Interpreting the Slope Slope: Measures the change in the predicted variable for every unit change in the explanatory variable. Hence, it is a rate of change between the explanatory variable and the predicted (response) variable. Note that slope has units (units of response variable divided by the units of the explanatory variable).

  47. Slopes and Association

  48. Example The population of the Detroit Metropolitan Area is summarized in the following table from 1950 to 2000:

  49. Residuals • They measure the difference between a data point (observation) and a prediction: y - (mx + b). • Every data point has a residual. • A residual with a large absolute value (±) indicates an unusual observation. • Large residuals can be found by constructing a histogram of the residuals.
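As an illustration, the residuals for the by-hand example on slide 43 (data ((-1,1),(0,2),(2,-1),(3,0)) with fitted line y = -0.5x + 1.0) can be computed as follows. The function name `residuals` is illustrative.

```python
def residuals(points, m, b):
    """Observed y minus predicted y for each data point."""
    return [y - (m * x + b) for x, y in points]

points = [(-1, 1), (0, 2), (2, -1), (3, 0)]
res = residuals(points, m=-0.5, b=1.0)
print(res)       # [-0.5, 1.0, -1.0, 0.5]
print(sum(res))  # 0.0

# The observation with the largest absolute residual is the most unusual one.
largest = max(zip(points, res), key=lambda pair: abs(pair[1]))
print(largest)
```

The residuals of a least squares line with an intercept always sum to zero, which gives a quick sanity check on the fitted line.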

  50. Example Research at NASA studied the relationship between the right humerus and right tibia of 11 rats that were sent into space on the Spacelab. Here is the data collected. Find a least-squares regression line with x being the right humerus and y the right tibia.
