Presentation Transcript


  1. Dr. Ka-fu Wong ECON1003 Analysis of Economic Data

  2. Chapter Thirteen Linear Regression and Correlation GOALS
  • Draw a scatter diagram.
  • Understand and interpret the terms dependent variable and independent variable.
  • Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
  • Conduct a test of hypothesis to determine if the population coefficient of correlation is different from zero.
  • Calculate the least squares regression line and interpret the slope and intercept values.
  • Construct and interpret a confidence interval and prediction interval for the dependent variable.
  • Set up and interpret an ANOVA table.

  3. Correlation Analysis
  • Correlation Analysis is a group of statistical techniques used to measure the strength of the association between two variables.
  • A Scatter Diagram is a chart that portrays the relationship between the two variables.
  • The Dependent Variable is the variable being predicted or estimated.
  • The Independent Variable provides the basis for estimation. It is the predictor variable.

  4. Types of Relationships
  • Direct vs. Inverse
  • Direct - X and Y increase together
  • Inverse - X and Y move in opposite directions
  • Linear vs. Curvilinear
  • Linear - A straight line best describes the relationship between X and Y
  • Curvilinear - A curved line best describes the relationship between X and Y

  5. Direct vs. Inverse Relationship [Figure: two scatter plots. Direct relationship (positive slope): Sales against Advertising. Inverse relationship (negative slope): Pollution Emissions against Anti-Pollution Expenditures.]

  6. Example • Suppose a university administrator wishes to determine whether any relationship exists between a student’s score on an entrance examination and that student’s cumulative GPA. A sample of eight students is taken. The results are shown below.

  7. Scatter Diagram: GPA vs. Exam Score [Figure: scatter plot of Cumulative GPA (2.00 to 4.00) against Exam Score (50 to 95).]

  8. Possible relationships between X and Y in Scatter Diagrams [Figure: six panels. (a) Direct linear. (b) Inverse linear. (c) Direct curvilinear. (d) Inverse curvilinear. (e) Inverse linear with more scattering. (f) No relationship.]

  9. The Coefficient of Correlation, r
  • The Coefficient of Correlation (r) is a measure of the strength of the linear relationship between two variables.
  • It requires interval or ratio-scaled data.
  • It can range from -1.00 to 1.00.
  • Values of -1.00 or 1.00 indicate perfect correlation.
  • Values close to 0.0 indicate weak correlation.
  • Negative values indicate an inverse relationship and positive values indicate a direct relationship.

  10. Formula for r We calculate the coefficient of correlation from the following formula: r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² · Σ(Y - Ȳ)²]
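The formula can be sketched directly in code. The exam-score/GPA pairs below are invented for illustration; they are not the slides' sample.

```python
import math

def pearson_r(x, y):
    # r = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical exam scores and cumulative GPAs for eight students
scores = [55, 60, 65, 70, 75, 80, 85, 90]
gpas = [2.3, 2.6, 2.5, 3.0, 2.9, 3.3, 3.4, 3.8]
r = pearson_r(scores, gpas)  # strong direct relationship, r close to +1
```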

  11. Perfect Negative Correlation (r = -1) [Figure: scatter plot with all points lying exactly on a downward-sloping line.]

  12. Perfect Positive Correlation (r = +1) [Figure: scatter plot with all points lying exactly on an upward-sloping line.]

  13. Zero Correlation (r = 0) [Figure: scatter plot with points showing no pattern.]

  14. Strong Positive Correlation (0 < r < 1) [Figure: scatter plot with points clustered closely around an upward-sloping line.]

  15. Coefficient of Determination
  • The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
  • It is the square of the coefficient of correlation.
  • It ranges from 0 to 1.
  • It does not give any information on the direction of the relationship between the variables.
  • Special cases: no correlation: r = 0, r² = 0; perfect negative correlation: r = -1, r² = 1; perfect positive correlation: r = +1, r² = 1.
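A two-line sketch of the point that r² discards the direction of the relationship:

```python
# Direct (r = 0.8) and inverse (r = -0.8) relationships of equal strength
r_direct, r_inverse = 0.8, -0.8

# Squaring discards the sign: both say 64% of the variation in Y
# is explained by X, with no hint of direction.
r2_direct = r_direct ** 2
r2_inverse = r_inverse ** 2
```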

  16. EXAMPLE 1 • Dan Ireland, the student body president at Toledo State University, is concerned about the cost to students of textbooks. He believes there is a relationship between the number of pages in the text and the selling price of the book. To provide insight into the problem he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram. Compute the correlation coefficient.

  17. Example 1 continued

  18. Example 1 continued

  19. Example 1 continued • The correlation between the number of pages and the selling price of the book is 0.614. This indicates a moderate association between the variables.

  20. EXAMPLE 1 continued: Is there a linear relation between the number of pages and the price of books?
  • Test the hypothesis that there is no correlation in the population. Use a .02 significance level.
  • Under the null hypothesis that there is no correlation in the population, the statistic t = r√(n - 2) / √(1 - r²) follows a Student t-distribution with (n - 2) degrees of freedom.

  21. EXAMPLE 1 continued
  • Step 1: H0: The correlation in the population is zero. H1: The correlation in the population is not zero.
  • Step 2: H0 is rejected if t > 3.143 or if t < -3.143. There are 6 degrees of freedom, found by n - 2 = 8 - 2 = 6.
  • Step 3: To find the value of the test statistic we use: t = r√(n - 2) / √(1 - r²) = 0.614 √6 / √(1 - 0.614²) ≈ 1.90.
  • Step 4: H0 is not rejected. We cannot reject the hypothesis that there is no correlation in the population. The amount of association could be due to chance.
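The four steps above can be checked numerically with the slide's figures (r = 0.614, n = 8, critical value 3.143):

```python
import math

r, n = 0.614, 8          # correlation and sample size from the example
t_critical = 3.143       # two-tailed critical value, 6 df, alpha = .02

# Test statistic under H0 of zero population correlation
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# H0 is rejected only if |t| exceeds the critical value
reject_h0 = abs(t) > t_critical
```

The statistic comes out near 1.9, well inside the critical bounds, matching the slide's conclusion that H0 is not rejected.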

  22. Regression Analysis
  • In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
  • The relationship between the variables is linear.
  • Both variables must be at least interval scale.

  23. Simple Linear Regression Model
  • The relationship between the variables is a linear function: Y = b0 + b1X + e, where b0 is the Y intercept, b1 is the slope (rise/run), and e is the random error.
  • b0 and b1 are unknown and are therefore estimated from the data.
  • Y is the dependent (response) variable; X is the independent (explanatory) variable.

  24. Finance Application: Market Model
  • One of the most important applications of linear regression is the market model.
  • It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm): R = b0 + b1Rm + e, where R is the rate of return on a particular stock and Rm is the rate of return on some major stock index.
  • The beta coefficient (b1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
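A sketch of estimating beta from return data. The two return series are invented for illustration, and NumPy's least squares polynomial fit stands in for the OLS formulas derived later in the deck.

```python
import numpy as np

# Hypothetical monthly rates of return (assumed data, for illustration only)
rm = np.array([0.010, -0.020, 0.030, 0.015, -0.010, 0.020])       # market index
r_stock = np.array([0.015, -0.030, 0.045, 0.020, -0.018, 0.028])  # one stock

# Fit R = b0 + b1 * Rm by least squares; the slope b1 is the stock's beta
b1, b0 = np.polyfit(rm, r_stock, 1)

# b1 > 1 means the stock's return moves more than one-for-one with the market
```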

  25. Assumptions Underlying Linear Regression
  • For each value of X, there is a group of Y values, and these Y values are normally distributed.
  • The means of these normal distributions of Y values all lie on the straight line of regression.
  • The standard deviations of these normal distributions are equal.
  • The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.

  26. Choosing the line that fits best
  • The estimates are determined by drawing a sample from the population of interest, calculating sample statistics, and producing a straight line that cuts into the data.
  • The question is: Which straight line fits best? [Figure: scatter plot with a candidate line drawn through the points.]

  27. Choosing the line that fits best
  • The best line is the one that minimizes the sum of squared vertical differences between the points and the line.
  • Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).
  • Line 1, with predicted values 1, 2, 3 and 4: sum of squared differences = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89.
  • Line 2, the horizontal line Y = 2.5: sum of squared differences = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99.
  • The smaller the sum of squared differences, the better the fit of the line to the data. That is, the line with the least sum of squares (of differences) fits the data best. Here the second (horizontal) line fits better.
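As an arithmetic check, the two sums of squared vertical differences can be recomputed directly from the four points:

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

# Line 1 predicts 1, 2, 3, 4 at x = 1, 2, 3, 4 (i.e. the line Y = X)
ss_line1 = sum((y - x) ** 2 for x, y in points)    # 1 + 4 + 2.25 + 0.64 = 7.89

# Line 2 is the horizontal line Y = 2.5
ss_line2 = sum((y - 2.5) ** 2 for x, y in points)  # 0.25 + 2.25 + 1 + 0.49 = 3.99

better_fit_is_horizontal = ss_line2 < ss_line1     # the horizontal line fits better
```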

  28. Choosing the line that fits best: Ordinary Least Squares (OLS) Principle
  • Straight lines can be described generally by Y = b0 + b1X.
  • Finding the best line with the smallest sum of squared differences is the same as solving: min over b0, b1 of Σ(Yi - b0 - b1Xi)².
  • Let b0* and b1* be the solution of the above problem. Y* = b0* + b1*X is known as the “average predicted value” (or simply “predicted value”) of Y for any X.

  29. Coefficient estimates from the ordinary least squares (OLS) principle • Solving the minimization problem implies the first order conditions: Σ(Yi - b0 - b1Xi) = 0 and ΣXi(Yi - b0 - b1Xi) = 0.

  30. Coefficient estimates from the ordinary least squares (OLS) principle • Solving the first order conditions implies b1* = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and b0* = Ȳ - b1*X̄.
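The closed-form estimates above translate directly into code. The check below uses points generated exactly on the line Y = 3 + 2X, so the estimates should recover those coefficients.

```python
def ols_fit(x, y):
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    # b0 = ybar - b1 * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on Y = 3 + 2X
b0, b1 = ols_fit([1, 2, 3, 4], [5, 7, 9, 11])
```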

  31. EXAMPLE 2 continued from Example 1 • Develop a regression equation for the information given in EXAMPLE 1. The information there can be used to estimate the selling price based on the number of pages.

  32. Example 2 continued from Example 1 The regression equation is: Y* = 48.0 + .05143X
  • The equation crosses the Y-axis at $48. A book with no pages would cost $48.
  • The slope of the line is .05143. Each additional page costs about $0.05 or five cents.
  • The sign of the slope b1 and the sign of r will always be the same.

  33. Example 2 continued from Example 1 We can use the regression equation to estimate values of Y. • The estimated selling price of an 800-page book is $89.14, found by Y* = 48.0 + .05143X = 48.0 + .05143(800) = 89.14.
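The estimate on the slide is a direct substitution into the fitted equation:

```python
# Fitted equation from Example 2: estimated price = 48.0 + .05143 * pages
def estimated_price(pages):
    return 48.0 + 0.05143 * pages

price_800 = estimated_price(800)  # 48.0 + 41.144, about $89.14
```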

  34. Standard Error of Estimate (denoted se or Sy.x)
  • Measures the reliability of the estimating equation
  • A measure of dispersion
  • Measures the variability, or scatter, of the observed values around the regression line

  35. Scatter Around the Regression Line [Figure: two scatter plots; tighter scatter around the line gives a more accurate estimator of the X, Y relationship, wider scatter a less accurate one.]

  36. Interpreting the Standard Error of the Estimate
  • se measures the dispersion of the points around the regression line
  • If se = 0, the equation is a “perfect” estimator
  • se is used to compute confidence intervals of the estimated value
  • Assumptions: observed Y values are normally distributed around each estimated value Y*, and the variance is constant

  37. Variation of Errors Around the Regression Line
  • Y values are normally distributed around the regression line.
  • For each X value, the “spread” or variance around the regression line is the same.
  [Figure: identical normal error distributions f(e) centered on the regression line at X1 and X2.]

  38. Scatter around the Regression Line [Figure: the regression line Y = b0 + b1X with parallel bands at Y = b0 + b1X ± 1se and Y = b0 + b1X ± 2se; about 68% of observations lie within ±1se of the line and about 95.5% within ±2se.]

  39. Example 3 continued from Example 1 and 2. Find the standard error of estimate for the problem involving the number of pages in a book and the selling price.
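A sketch of the computation; the data and the fitted line (Y = 2X) below are invented for illustration, not the textbook sample.

```python
import math

def std_error_of_estimate(x, y, b0, b1):
    # se = sqrt( sum of squared residuals / (n - 2) )
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical observations scattered around the line Y = 2X
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se = std_error_of_estimate(x, y, b0=0.0, b1=2.0)
```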

  40. Equations for the Interval Estimates
  • Confidence interval for the mean of Y at a given X: Y* ± t(n-2) se √(1/n + (X - X̄)² / Σ(X - X̄)²)
  • Prediction interval for an individual Y at a given X: Y* ± t(n-2) se √(1 + 1/n + (X - X̄)² / Σ(X - X̄)²)
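A sketch comparing the half-widths of the two intervals; every input fed in (t value, se, n, X̄, Σ(X - X̄)²) is assumed for illustration. Because of the extra 1 under the square root, the prediction interval for an individual Y is always wider than the confidence interval for the mean of Y.

```python
import math

def interval_half_widths(t, se, n, x, xbar, sxx):
    # Confidence interval half-width for the mean of Y at x
    ci = t * se * math.sqrt(1 / n + (x - xbar) ** 2 / sxx)
    # Prediction interval half-width for an individual Y at x (extra "1 +")
    pi = t * se * math.sqrt(1 + 1 / n + (x - xbar) ** 2 / sxx)
    return ci, pi

# Illustrative (assumed) inputs
ci, pi = interval_half_widths(t=2.447, se=10.0, n=8, x=800, xbar=650, sxx=100000.0)
```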

  41. Confidence Interval Estimate for Mean Response Y* = b0 + b1Xi • The following factors influence the width of the interval: the standard error, the sample size, and the value of X (the interval widens as X moves away from X̄).

  42. Confidence Interval continued from Examples 1, 2 and 3.
  • For books 800 pages long, what is the 95% confidence interval for the mean price?
  • This calls for a confidence interval on the average price of books 800 pages long.

  43. Prediction Interval continued from Examples 1, 2 and 3.
  • For a book 800 pages long, what is the 95% prediction interval for its price?
  • This calls for a prediction interval on the price of an individual book 800 pages long.

  44. Interpretation of Coefficients
  • Slope (b1): Estimated Y changes by b1 for each 1 unit increase in X
  • Y-Intercept (b0): Estimated value of Y when X = 0

  45. Test of Slope Coefficient (b1)
  • Tests if there is a linear relationship between X and Y
  • Involves the population slope β1
  • Hypotheses: H0: β1 = 0 (no linear relationship); H1: β1 ≠ 0 (linear relationship)
  • Theoretical basis is the sampling distribution of slopes

  46. Sampling Distribution of the Least Squares Coefficient Estimator
  • If the standard least squares assumptions hold, then b1 is an unbiased estimator of β1 and has population variance σ²(b1) = σ² / Σ(Xi - X̄)² and an unbiased sample variance estimator s²(b1) = se² / Σ(Xi - X̄)².
  • Bias = E(b1) - β1. “Unbiased” means E(b1) - β1 = 0.

  47. Basis for Inference About the Population Regression Slope • Let β1 be a population regression slope and b1 its least squares estimate based on n pairs of sample observations. Then, if the standard regression assumptions hold and it can also be assumed that the errors εi are normally distributed, the random variable t = (b1 - β1) / s(b1) is distributed as Student’s t with (n - 2) degrees of freedom. In addition, the central limit theorem enables us to conclude that this result is approximately valid for a wide range of non-normal distributions and large sample sizes, n.

  48. Tests of the Population Regression Slope
  • If the regression errors εi are normally distributed and the standard least squares assumptions hold (or if the distribution of b1 is approximately normal), the following tests have significance level α.
  • To test either null hypothesis H0: β1 = β1* or H0: β1 ≤ β1* against the alternative H1: β1 > β1*, the decision rule is to reject H0 if (b1 - β1*) / s(b1) > t(n-2, α).

  49. Tests of the Population Regression Slope • To test either null hypothesis H0: β1 = β1* or H0: β1 ≥ β1* against the alternative H1: β1 < β1*, the decision rule is to reject H0 if (b1 - β1*) / s(b1) < -t(n-2, α).

  50. Tests of the Population Regression Slope • To test the null hypothesis H0: β1 = β1* against the two-sided alternative H1: β1 ≠ β1*, the decision rule is to reject H0 if (b1 - β1*) / s(b1) > t(n-2, α/2) or (b1 - β1*) / s(b1) < -t(n-2, α/2). Equivalently, reject H0 if |b1 - β1*| / s(b1) > t(n-2, α/2).
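The two-tailed decision rule can be sketched with assumed numbers; the standard error s(b1) and the critical value below are illustrative stand-ins, not the textbook's computed values (only the slope 0.05143 comes from the deck's Example 2).

```python
def slope_t_stat(b1, beta1_star, s_b1):
    # t = (b1 - beta1*) / s(b1)
    return (b1 - beta1_star) / s_b1

t = slope_t_stat(b1=0.05143, beta1_star=0.0, s_b1=0.02)  # s(b1) assumed
t_critical = 2.447  # t with 6 df at alpha/2 = .025 (n = 8), two-tailed alpha = .05
reject_h0 = abs(t) > t_critical
```

With these assumed inputs the statistic exceeds the critical value, so H0: β1 = 0 would be rejected.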
