1 / 93

CORRELATION AND REGRESSION

This example examines the relationship between education level and income in two different societies. The scattergrams show a clear positive association, but there are differences in the magnitude and dispersion of income for different education levels.

mcneal
Download Presentation

CORRELATION AND REGRESSION

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CORRELATION AND REGRESSION Topic #13

  2. An Example Recall sentence #11 in Problem Set #3A (and #9): If you want to get ahead, stay in school. Underlying this nagging parental advice is the following claimed empirical relationship: + LEVEL OF EDUCATION =====> LEVEL OF SUCCESS IN LIFE Suppose we collect data through by means of a survey asking respondents (say a representative sample of the population aged 35-55) to report the number of years of formal EDUCATION they completed and also their current INCOME (as an indicator of SUCCESS). We then analyze the association between the two interval variables in this reformulated hypothesis. + LEVEL OF EDUCATION =========> LEVEL OF INCOME (# of years reported) ($000 per year) Since these are both interval continuous variables, we analyze their association by means of a scattergram.

  3. Having collected such data in two rather different societies A and B, we produce these two scattergrams.

  4. An Example (cont.) • Note that the two scattergrams are drawn with the same horizontal and vertical scales. • With respect to the vertical axis in scattergram A, this violates the usual guideline that scales be set to minimize “white space” in the scattergram. • But here we want to facilitate comparison between the two charts. • Both scattergrams show a clear positive association between the two variables, i.e., the plotted points in both form an upward-sloping pattern running from Low – Low to High – High. • At the same time there are obvious differences between the two scattergrams (and thus between the relation-ships between INCOME and EDUCATION in societies A and B).

  5. Questions For Discussion • In which society, A or B, is the hypothesis most powerfully confirmed? • In which society, A or B, is their a greater incentive for people to stay in school? • Which society, A or B, does the U.S. more closely resemble? • How might we characterize the difference between societies A and B?

  6. An Example (cont.) • We can visually compare and contrast the nature of the associations between the two variables in the two scattergrams by drawing a number of vertical strips in each scattergram (as we did in the earlier Pearson scattergram of SON’S HEIGHT by FATHER’S HEIGHT). • Points that lie within each vertical strip represent respondents who have (just about) the same value on the independent (horizontal) variable of EDUCATION. • Within each strip, we can estimate (by “eyeball methods”) the average magnitude of the dependent (vertical) variable INCOME and put a mark at the appropriate level.

  7. Average Income for Selected Levels of Education

  8. We can connect these marks to form a line of averages that is apparently (close to being) a straight line.

  9. An Example (cont.) • Now we can assess two distinct characteristics of the relationships between EDUCATION and INCOME in scattergrams A and B. • How much the does the average level of INCOME change among people with different levels of education? • How much dispersion of INCOME there is among people with the same level of EDUCATION?

  10. An Example (cont.) • In both scattergams, the line of averages is upward-sloping, indicating a clear apparent positive effect on EDUCATION on INCOME. • But in the scattergram for society A, the upward slope of the line of averages is fairly shallow. • The line of averages indicates that average INCOME increases by only about $1000 for each additional year of EDUCATION. • On the other hand, in the scattergram for society B, the upward slope of the line of averages is much steeper. • The graph in Figure 1B indicates that average INCOME increases by about $4000 for each additional year of EDUCATION. • In this sense, EDUCATION is on average more “rewarding” in society B than A.

  11. An Example (cont.) • There is another difference between the two scattergrams. • In scattergram A, there is almost no dispersion within each vertical strip (and almost no dispersion around the line of averages as a whole). • In scattergram B, there is a lot of dispersion within each vertical strip (and around the line of averages as a whole). • We can put this point in substantive language. • In society A, while additional years of EDUCATION produce rewards in terms of INCOME that are modest (as we saw before), these modest rewards are essentially certain. • In society B, while additional years of EDUCATION produce on average much more substantial rewards in terms of INCOME (as we saw before), these large expected rewards are highly uncertain and are indeed realized only on average. • For example, in scattergram B (but not A), we can find many pairs of cases such that one case has (much) higher EDUCATION but the other case has (much) higher INCOME.

  12. An Example (cont.) • This means that in society B, while EDUCATION has a big impact on EDUCATION, there are evidently other (independent) variables (maybe family wealth, ambition, career choice, athletic or other talent, just plain luck, etc.) that also have major effects on LEVEL OF INCOME. • In contrast, in society A it appears that LEVEL OF EDUCATION (almost) wholly determines LEVEL OF INCOME and that essentially nothing else matters. • Another difference between the two societies is that, while both societies have similar distributions of EDUCATION, their INCOME distributions are quite different. • A is quite egalitarian with respect to INCOME, which ranges only from about $40,000 to about $60,000, while B is considerably less egalitarian with respect to INCOME, which ranges from under to $10,000 to at least $100,000 — and possibly higher.) • In summary, in society A the INCOME rewards of EDUCATION are modest but essentially certain, while in society B the INCOME rewards of EDUCATION are substantial on average but quite uncertain in individual cases.

  13. Two Kinds of “Strength of Association” • This example illustrates that, given bivariate data for interval variables, “strength of association” between them can mean either of two quite different things: • the independent variable has a very reliable or certain association with the dependent variable, as is true for society A but not B, or • the independent variable has on average a big impact on the dependent variable, as is true for society B but not A. • There are two bivariate summary statistics that capture these two different kinds of strength of association between interval variables: • the first second is called the regression coefficient, customarily designated b, which is the slope of the line of averages; and • the second is called the correlation coefficient, customarily designated r , which is (more or less) determined by the magnitude of dispersion in each vertical strip. • In scattergrams A and B, A has the greater correlation coefficient and B has the greater regression coefficient.

  14. Review: The Equation of a Line • Having drawn (at least by “eyeball methods”) the line of averages in a scattergram, it is convenient to write an equation for the line. • You should recall from high-school algebra that, given any graph with a horizontal axis X and a vertical axis Y, any straight line drawn on the graph has an equation of the form (using the symbols you probably used in high school) y = m x + b , where • m is the slope of the line expressed as Δ y / Δ x (“change in y” divided by “change in x” or “rise over run”), and • b is the y-intercept (i.e., the value of y when x = 0). • Evidently to further torment students, in college statistics the symbol b is used in place of m to represent the slope of the line and the symbol a is used in place of b to represent the intercept (and a is customarily placed in front of the bx term), so the equation for a straight line is usually written as y = a + b x .

  15. Slope and Intercept in Scattergrams A and B

  16. Equation of a Line (cont.) • The equation for the line of averages in scattergram A appears to be approximately Y = a + b × x AVERAGE INCOME = $40,000 + $1000 × EDUCATION . • The equation for the line of averages in Figure B appears to be approximately AVERAGE INCOME = $10,000 + $4000 × EDUCATION . • Given such an equation (or formula), we can take any value for the independent variable EDUCATION, plug it into the appropriate formula above, and calculate or “predict” the corresponding average or “expected” value of the dependent variable INCOME. • In society A, such a prediction is likely to be quite reliable, because there is very little dispersion in any vertical strip and the association/correlation between the two variables is almost perfect • In society B, this prediction will be much less reliable or “fuzzier,” because there is considerable dispersion in every vertical strip and the association/correlation between the two variables is much less than perfect.

  17. Before we used eyeball methods to draw the line ofaverages in the SON’S HEIGHT by FATHER’S HEIGHT scattergram. Let’s write its equation in the formSON’S HEIGHT = a + b × FATHER’S HEIGHT .

  18. Equation for Son’s Height • What is the equation for the line of averages? • SH = 63 + 0.5 × FH • SH = 63 + 0.5 × 74 = 63 + 37 = 100 (8' 4") • Clearly something is wrong: • 63" is not the true intercept, i.e., it is • not average SH when FH = 0, • but rather average SH when FH = 58". • My height is 16" greater than 58", so my son’s expected height is • 63 + 0.5 × 16 = 72" • To read the true y-intercept a on the chart, we need to extend the FH scale down to FH = 0 to see where the line of averages intersects the true SH axis.

  19. Determining the “Line of Averages” [Regression Line] and the Degree of Association [Correlation] Numerically • Clearly statisticians are not satisfied with “eyeballing” a scattergram and • visually estimating the slope and intercept of the line averages, or (especially) • guessing at the degree of association between the DEPENDENT and INDEPENDENT variables. • Before we can make exact numerical calculations, we must have an exact definition of the “line of averages.” • This will lead to precise formula for the slope, intercept, and association/correlation.

  20. Example: Worksheet in Topic #13

  21. Correlation and Regression (cont.) Consider the scattergram to the right, which displays the bivariate numerical data for x and y in the Sample Problem presented on p. 9 of Handout #13. The “vertical strips” kind of argument simply will not work, because most strips include no data points and only one strip (for x≈ 5) includes more than one point. Using this simple example, we will now proceed a little more formally.

  22. Correlation and Regression (cont.) • Suppose we were to ask what horizontal line (i.e., a line for which the slope b = 0, so its equation is simply y = a) would “best fit” (come as close as possible to) the plotted points. • In order to answer this question, we must have a specific criterion for “best fitting.” • The one statisticians use is called the least squares criterion. • For each horizontal line, we calculate the sum (or mean) of squared deviations of the y-observations from a line. • We’re looking for the horizontal line that minimizes this sum/mean. • Lets try the horizontal line y = 6: • The sum of squared deviations = 138, • so the mean squared deviation is 23. • Can we do better that this?

  23. Correlation and Regression (cont.) • That is, we want to find the horizontal line such that the squared vertical deviations from the line to each point are (in total or on average) as small as possible. • You might remember (or should be able to guess) what line this is — it is the line y = mean of y , or in this case y = 5. • Recall from Handout #6, p. 4, point (e) that “the sum (or average) of the squared deviations from the mean is less than the sum of the squared deviations from any other value of the variable.” • You should also recall that we have a special name for “the average squared deviation from the mean (of y)” – namely, the variance (of y) • This (or its positive square root, i.e., the standard deviation of y) is the standard univariate measure of dispersion in y.

  24. Correlation and Regression (cont.) • So the line y = 5 is the best fitting horizontal line by the least squares criterion. • Now suppose that, in finding the best fitting line by the least squares criterion, we are no longer restricted to horizontal lines but can tip the line up or down, so it has a non-zero slope and its height varies with the independent variable x. • More particularly, suppose that we pivot such straight lines on the point that is right in the middle of the all data plotted points, • specifically the point that represents the mean of x and the mean of y (in this case, x = 4, y = 5), • until it has a slope that best fits all the plotted points by the same least squares criterion. • In our example, we can clearly improve the fit by tipping the line counter-clockwise.

  25. Tipping the Horizontal Line Can Improve the Fit by the Least Squares Criterion

  26. Finding the Best Fitting Line • The question is how far we must tip the line to get the best fit (by the least squares criterion). • If we tip it too far, the fit becomes worse. • When we have found this best fitting (by the least squares criterion) line, we have found what is thereby defined as the regression line. • Fortunately, we don’t have to find this best fitting line by trial and error methods. • There are formulas for finding slope of the regression line (i.e., the regression coefficient) and intercept from the numerical data. • There is a related formula for finding the correlation coefficient from the numerical data, which tells us how well this “best fitting” regression line fits the data points.

  27. Association Between X and Y? • Mean Squared Deviation from regression line equals 7.9375. • The degree of assiciation would be related to how big the MSD is compared with MSD around the best fitting horizontal line, which equals 22. • Degree of association = 1 – 7.9375 22 = 1 - .3608 = 0.6392 • Note that this number must be between 0 and 1 (but cannot be negative).

  28. Correlation and Regression (cont.) • One statistical theorem tells us that the regression line goes through the point corresponding to the mean of x and the mean of y. • Another theorem gives us formulas to calculate the slope and intercept of the regression line, given the numerical data. • A third theorem • gives us the formula to calculate the correlation coefficient, and also • tells us that the squared correlation coefficient r2 measures how much we can improve the goodness of fit (by the least squares criterion) when we move from the best fitting horizontal line to the best fitting tipped line.

  29. How to Calculate the Regression and Correlation Coefficients • Let’s call the independent variable X and the dependent variable Y. (This usage is standard.) • Set up a worksheet like the one that follows (also p. 9 of Handout #13).

  30. How to Calculate Coefficients (cont.) Compute the following univariate statistics: (1) the mean of X, (2) the mean of Y, (3) the variance of X, (4) the variance of Y, (5) the SD of X, and (6) the SD of Y.

  31. How to Calculate Coefficients (cont.) Now we make bivariate calculations for each case by multiplying its deviation from the mean with respect to X by its deviation from the mean with respect to Y. • This is called the crossproduct of the deviations.

  32. How to Calculate Coefficients (cont.) • Notice that a crossproduct in a given case: • is positive if, in that case, the X and Y values both deviate in the same direction (i.e., both upwards or both downwards) from their respective means; and • it is negative if, in that case, the X and Y values deviate in opposite directions (i.e., one upwards and one downwards) from their respective means. • In either event, the [absolute] crossproduct is large if both [absolute] deviations are large and small if either deviation is small. • The crossproduct is zero if either deviation is zero. • Thus: • (a) if there is a positive relationship between the two variables, most crossproducts are positive and many are large; • (b) if there is a negative relationship between the two variables, most crossproducts are negative and many are large; and • (c) if the two variables are unrelated, the crossproducts are positive in about half the cases and negative in about half the cases. • So, the average of all crossproducts is indicative of the association between variables.

  33. How to Calculate Coefficients (cont.) • Add up the crossproducts over all cases. • The sum (and thus the average) of crossproducts is • positive in the event of (a) above, • negative in the event of (b) above, and • (close to) zero in the event of (c) above. • Divide the sum of the crossproducts by the number of cases to get the average (mean) crossproduct. This average is called the covariance of X and Y, and its formula is the bivariate parallel to the univariatevariance (of X or Y).

  34. How to Calculate Coefficients (cont.) • Notice the following: • If the association between X and Y is positive, their covariance is positive (because most crossproducts are positive). • If the association between X and Y is negative, their covariance is negative (because most crossproducts are negative). • If there is [almost] no association between X and Y , their covariance is [approximately] zero (because positive and negative crossproducts [approximately] cancel out when added up). • Thus the covariance does measure the association between the two variables.

  35. How to Calculate Coefficients (cont.) • However, the covariance is not a (fully) valid measure of the association, because the magnitude of the average (positive or negative) crossproduct depends on not only • how closely the two variables are associated, but also • on the magnitude of their dispersions (as indicated by their standard deviations or variances). • So, for example: • two very closely (and positively) associated variables have a positive but small covariance if they both have small standard deviations; • two not so closely (but still positively) associated variables have a larger covariance if their standard deviations are sufficiently larger.

  36. Multiplying all Values by 10 does not Changethe Degree of Association between the Two Variables X and Y

  37. But Doing This Increases the Covariance of X and Y 100-Fold

  38. The Correlation Coefficient • The correlation coefficient is a measure of association only, which (like other measures of association) is standardized so that it always falls in the range from -1 to +1. • This is accomplished by dividing the covariance of X and Y by the standard deviation of X and also by the standard deviation of X. • This is equivalent to calculating the covariance of X and Y if the X and Y data is converted into standard scores. • The degree of correlation in a scattergram can be observed by looking only at the plotted points, the regression line, and the orientation of the horizontal and vertical axes. • The units of measurement on each axis make no difference and do not have to be shown.

  39. The Correlation Coefficient (cont.) • Divide the covariance of X and Y by the standard deviation of X and also by the standard deviation of Y. • This gives the correlation coefficientr, which measures the degree (or “reliability” or “completeness”) and direction of association between two interval variables X and Y. Cov (X,Y) Correlation coefficient = r = SD(X) × SD(Y) • If you calculate r to be greater than +1 or less than -1, you have made a mistake. • The correlation coefficient is the one measure of association for which you are expected to know how to use the calculating formula (above). • It is a good idea first to construct a scattergram of the data on X and Y and then to check whether your calculated correlation coefficient looks plausible in light of the scattergram.

  40. Calculating the Correlation Coefficient

  41. Properties of the Correlation Coefficient • The correlation coefficient formula is based on a ratio of “X-units × Y-units” (i.e., their respective deviations from the mean) divided by “X-units × Y-units” (i.e., their respective SDs). • This means that all units of measurement cancel out and the correlation coefficient a pure number that is not expressed in any units. • For example, suppose the correlation between the height (measured in inches) and weight (measured in pounds) of students in the class is r = +.45. • This is not r = +.45 inches or r = +.45 pounds, or r = +.45 pounds per inch, etc., • rather it is just r = +.45. • Moreover, the magnitude of the correlation coefficient is independent of the units used to measure the two variables. • If we measured students’ heights in feet, meters, etc. (instead of inches) and/or measured their weights in ounces, kilograms, etc. (instead of pounds), and then calculated the correlation coefficient based on the new numbers, it would be just the same as before, i.e., r = +.45. • In addition, if you check the formula above, you can see that r is symmetric, i.e., it is unchanged if we interchange X and Y. • Thus, the correlation between two variables is the same regardless of which variable is considered to be independent and which dependent.

More Related