
Day 4 Correlation and Regression by Binam Ghimire


Presentation Transcript


  1. Day 4 Correlation and Regression by Binam Ghimire

  2. Example Data

  3. The scatter diagram of the data would appear as below:

  4. Alternatively a negative correlation would appear as below:

  5. Alternatively data with no correlation may appear as below:

  6. Correlation Scale: -1 = perfect negative correlation, 0 = no correlation, +1 = perfect positive correlation

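  7. Pearson’s product moment correlation coefficient (r)

     $$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

     |   | x | y | xy | x² | y² |
     |---|---|---|---|---|---|
     |   | 23 | 58 | 1334 | 529 | 3364 |
     |   | 17 | 50 | 850 | 289 | 2500 |
     |   | 24 | 54 | 1296 | 576 | 2916 |
     | ∑ | 212 | 516 | 11452 | 5000 | 27242 |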

  8. Pearson’s product moment correlation coefficient (r) or the formula may be

  9. Pearson’s product moment correlation coefficient (r) (2)

     $$r = \frac{10 \times 11452 - 212 \times 516}{\sqrt{\left[10 \times 5000 - (212)^2\right]\left[10 \times 27242 - (516)^2\right]}} = \frac{5128}{\sqrt{5056 \times 6164}} = 0.9186$$
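The arithmetic on this slide can be checked with a short Python sketch. It assumes only the column totals quoted in the example (n = 10, ∑x = 212, ∑y = 516, ∑xy = 11452, ∑x² = 5000, ∑y² = 27242); the function name `pearson_r_from_sums` is made up for illustration.

```python
import math

# Column totals from the worked example (n = 10 observations).
n = 10
sum_x, sum_y = 212, 516
sum_xy, sum_x2, sum_y2 = 11452, 5000, 27242


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r computed from raw sums (illustrative helper)."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 4))  # 0.9186
```

Working from the sums rather than the raw pairs mirrors the hand calculation, so each intermediate value (5128, 5056, 6164) can be compared with the slide.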

  10. Linear Regression • Need to establish a ‘line of best fit’ • The ‘freehand method’ has many drawbacks. • In some sense we need the ‘best fit’ to the data. To obtain this we do not use crude graphical techniques. We identify the ‘line of best fit’ or ‘least squares line.’

  11. Linear Regression (2) The equation for this line is Y = 30.10 + 1.014X

  12. Linear Regression (3) • The equation of this line is Y =30.10 +1.014X • But how is this obtained? • The scattered points illustrate the actual data, while the least squares line is an estimate of Y for a given value of X. Notice the distance between the scattered points and the line; this will give you some idea of how good a fit the line is.

  13. Linear Regression (4) • How do we determine the least squares line? • We simply need to determine the intercept (a) and the gradient (b). • The equation of the line is therefore Y = a + bX • You need to apply a little calculus (we will omit that process here) to develop the standard equations.

  14. Linear Regression Equations

     $$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

     $$b = \frac{10 \times 11452 - 212 \times 516}{10 \times 5000 - 44944} = \frac{5128}{5056} = 1.0142405$$

     Or

  15. Linear Regression Equations (2) And

     $$a = \bar{Y} - b\bar{X}$$

     or a = 51.6 – 1.0142405 × 21.2, so a = 30.098101. Rounding these values a little: Y = 30.10 + 1.014X
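As a check on the gradient and intercept, here is a minimal Python sketch that uses only the column totals from the example. The `predict` helper and the input value 20 are illustrative assumptions, not part of the slides.

```python
# Column totals from the worked example (n = 10 observations).
n = 10
sum_x, sum_y = 212, 516
sum_xy, sum_x2 = 11452, 5000

# Gradient (b) from the least squares formula.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept (a): mean of Y minus b times mean of X.
a = sum_y / n - b * (sum_x / n)


def predict(x):
    """Estimate Y for a given X using the fitted line (illustrative helper)."""
    return a + b * x


print(f"Y = {a:.2f} + {b:.3f}X")  # Y = 30.10 + 1.014X
print(f"{predict(20):.2f}")       # estimate of Y at a hypothetical X = 20
```

Keeping b at full precision before computing a avoids compounding rounding error; rounding is applied only when the equation is reported.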

  16. Coefficient of Determination • The coefficient of determination measures the proportion of the variation in the dependent variable (y) explained by the variation in the independent variable (x). • It is reported as $r^2$, the square of the product moment correlation coefficient. • It does not explain causation.

  17. Coefficient of Determination (2) • For our previous example: • $r^2 = (0.9186)^2 = 0.844$ • This means that 84.4% of the variation in cost is explained by variation in output volume. The remaining 15.6% of the variation is not explained.

  18. So far • Correlation is measured on a scale from -1 to +1 using Pearson’s product moment correlation coefficient (r). • Linear regression identifies the line of ‘best fit’ using the formula Y = a + bX • The coefficient of determination ($r^2$) measures the extent to which the dependent variable is explained by the independent variable.

  19. Question to learn the terminologies • The data below shows annual company income (£m) against year of trading.

     | Year | Income (£m) |
     |---|---|
     | 1 | 20 |
     | 2 | 23 |
     | 3 | 26 |
     | 4 | 28 |
     | 5 | 35 |

     • A regression of income on year gives the following results: r = 0.974, r squared = 0.948, intercept = 15.9, slope = 3.5 • Can we explain each of the results above? • Use the results above to make a forecast for company income for year 6. • What assumption is made in making this forecast?
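One way to check the regression output is to refit the line from the five data points. The sketch below does this in plain Python; the helper name `fit_line` is an assumption for illustration. Recomputing gives a slope of 3.5 and an intercept of ȳ – b·x̄ = 26.4 – 3.5 × 3 = 15.9. The year 6 forecast rests on the assumption that the linear trend continues beyond the observed years.

```python
import math

# Data from the question: year of trading and income (GBP millions).
years = [1, 2, 3, 4, 5]
income = [20, 23, 26, 28, 35]


def fit_line(xs, ys):
    """Least squares intercept, slope and correlation (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r = s_xy / math.sqrt(s_xx * s_yy)
    return intercept, slope, r


intercept, slope, r = fit_line(years, income)
forecast_year_6 = intercept + slope * 6  # extrapolates the fitted trend

print(f"r = {r:.3f}, r squared = {r * r:.3f}")
print(f"intercept = {intercept:.1f}, slope = {slope:.1f}")
print(f"forecast for year 6 = {forecast_year_6:.1f}")
```

Under these assumptions the forecast for year 6 comes out at about £36.9m.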

  20. Coefficient of Determination (3) • Relationship among SST, SSR, SSE: SST = SSR + SSE, where
     • SST = total sum of squares, given by $\sum (y_i - \bar{y})^2$
     • SSR = sum of squares due to regression
     • SSE = sum of squares due to error
     • Coefficient of determination: $r^2 = SSR / SST$

  21. Coefficient of Determination (4) • Relationship among SST, SSR, SSE: SST = SSR + SSE, i.e. Total Variation = Explained Variation + Unexplained Variation • Coefficient of determination: $r^2 = SSR / SST$
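The decomposition can be verified numerically. The sketch below reuses the income-versus-year data from the terminology question (slide 19) and assumes the usual definitions: SST sums squared deviations of y from its mean, SSR sums squared deviations of the fitted values from that mean, and SSE sums squared residuals. The variable names are illustrative.

```python
# Data from the income-versus-year question (slide 19).
years = [1, 2, 3, 4, 5]
income = [20, 23, 26, 28, 35]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(income) / n

# Least squares slope and intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, income))
s_xx = sum((x - mean_x) ** 2 for x in years)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in years]

sst = sum((y - mean_y) ** 2 for y in income)             # total variation
ssr = sum((f - mean_y) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(income, fitted))  # unexplained variation

r_squared = ssr / sst
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"r squared = {r_squared:.3f}")
```

For this data the three sums come out at roughly 129.2, 122.5 and 6.7, so SSR / SST reproduces the r squared of 0.948 quoted in the question.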

  22. Calculating the Standard Error of the Estimate (SEE) • SEE measures the accuracy of the prediction from a regression equation • It is the standard deviation of the error term • The lower the SEE, the greater the accuracy • $SEE = \sqrt{SSE / (n - 2)}$, where SSE = sum of squared errors and n = number of observations
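A small Python sketch of the SEE calculation follows. It assumes the standard form for a regression with one independent variable, SEE = √(SSE / (n – 2)), and reuses the income-versus-year data from slide 19 purely as an illustration.

```python
import math

# Data from the income-versus-year question (slide 19).
years = [1, 2, 3, 4, 5]
income = [20, 23, 26, 28, 35]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(income) / n

# Least squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, income)) / sum(
    (x - mean_x) ** 2 for x in years
)
intercept = mean_y - slope * mean_x

# Sum of squared errors: squared gaps between actual and fitted values.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, income))

# Assumed definition: divide by n - 2 because two parameters (a and b)
# were estimated from the data.
see = math.sqrt(sse / (n - 2))
print(round(see, 3))  # 1.494
```

An SEE of about 1.49 means the fitted line typically misses actual income by roughly £1.5m, which is small relative to incomes of £20m to £35m.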

  23. Covariance of Rates of Return Example: Calculate the covariance between the returns on the two stocks indicated below:

  24. Covariance Using Historical Data • R1 = 0.05, R2 = 0.07 • Σ = 0.0154 • Cov = 0.0154 / 2 = 0.0077

  25. Sample Correlation Coefficient • Correlation, ρ, is a standardized measure of covariance and is bounded by –1 and +1. • Example: The covariance of returns on two assets is 0.0051, σ1 = 7% and σ2 = 11%. Calculate ρ1,2.
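The example can be worked through in Python, assuming the usual relationship ρ = Cov / (σ1 × σ2) with the standard deviations expressed as decimals. The function name is illustrative.

```python
# Inputs from the example, with percentages written as decimals.
cov_12 = 0.0051
sigma_1 = 0.07
sigma_2 = 0.11


def correlation_from_covariance(cov, sd_a, sd_b):
    """Standardize a covariance by the two standard deviations."""
    return cov / (sd_a * sd_b)


rho_12 = correlation_from_covariance(cov_12, sigma_1, sigma_2)
print(round(rho_12, 4))  # 0.6623
```

Dividing by the product of the standard deviations removes the units, which is why the result always lies between –1 and +1.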

  26. Testing H0: Correlation = 0 • The test of whether the true correlation between two random variables is zero (i.e., there is no correlation) is a t-test based on the sample correlation coefficient, r. • With n (pairs of) observations the test statistic is:

     $$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

     • Degrees of freedom is n – 2

  27. Example Data: n = 10, r = 0.475 • Determine if the sample correlation is significant at the 5% level of significance.

     $$t = \frac{0.475 \times (8)^{0.5}}{\left[1 - (0.475)^2\right]^{0.5}} = \frac{1.3435}{0.88} = 1.5267$$

     • The two-tailed critical t-values at a 5% level of significance with df = 8 (n – 2) are ±2.306. Since –2.306 ≤ 1.5267 ≤ 2.306, the null hypothesis cannot be rejected, i.e. the correlation between variables X and Y is not significantly different from zero at the 5% significance level.
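The test can be reproduced with a few lines of Python. The critical value 2.306 is taken from the slide rather than computed, and the function name `correlation_t_statistic` is an assumption for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = 10
r = 0.475
t_critical = 2.306  # two-tailed, 5% level, df = 8 (value quoted on the slide)

t_stat = correlation_t_statistic(r, n)
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 4))  # 1.5267
print(reject_null)       # False
```

Because the absolute t statistic is compared with the critical value, the same check handles negative sample correlations.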
