
Chapter 11



Presentation Transcript


  1. Chapter 11 Regression and correlation methods

  2. Goals • To relate (associate) a continuous random variable, preferably normally distributed, to other variables Abdus Wahed BIOST 2041

  3. Terminology • Dependent variable (Y): the variable that is supposed to depend on others, e.g., birthweight • Independent variables, explanatory variables, or predictors (x): the variables used to predict the dependent variable, or to explain the variation in it, e.g., estriol levels

  4. Assumptions • Dependent variable: • Continuous, preferably normally distributed • Has a linear association with the predictors • Independent variable: • Fixed (not random)

  5. Simple Linear Regression Model • Let Y be the dependent variable and x the lone covariate. A linear regression then assumes that the true relationship between Y and x is given by E(Y|x) = α + βx (1)

  6. Simple Linear Regression Model • (1) can be written as Y = α + βx + e, (2) where e is an error term with mean 0 and variance σ².

  7. [Figure: error terms e]

  8. Implication • If there were a perfect linear relationship, every subject with the same value of x would have a common value of Y. • Deterministic relationship • The error term takes into account the inter-patient variability. • σ² = Var(Y | x) = Var(e).

  9. Parameters • α is the intercept of the line. • β is the slope of the line, referred to as the regression coefficient. • β < 0 indicates a negative linear association (the higher the x, the smaller the Y). • β = 0 indicates no linear relationship. • β > 0 indicates a positive linear association (the higher the x, the larger the Y). • β is the expected change in Y for a unit change in x.

  10. Data

  11. Goal • How to estimate α, β, and σ²? • Fitting regression lines • How to draw inference? Is the relationship we see just due to chance? • Inference about regression parameters

  12. Fitting Regression Line • Least squares method

  13. Least squares method • Idea: • Estimate α and β in a way that the observations are “closest” to the line • Impossible to achieve for every observation at once • Implement: • Estimate α and β in a way that the sum of squared deviations from the line is minimized.

  14. Least squares method • Minimize $\sum_i (y_i - \alpha - \beta x_i)^2$ over α and β • Least squares estimate of α: $a = \left(\sum y_i - b \sum x_i\right)/n$ • Least squares estimate of β: $b = \dfrac{\sum x_i y_i - \sum x_i \sum y_i / n}{\sum x_i^2 - \left(\sum x_i\right)^2 / n}$ • Estimated regression line: $\hat{y} = a + bx$
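The two closed-form estimates can be turned into a few lines of code. The following is only a minimal sketch: the function name `least_squares` and the four data points are made up for illustration and are not part of the course material.

```python
def least_squares(xs, ys):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    l_xy = sum_xy - sum_x * sum_y / n   # corrected sum of products
    l_xx = sum_xx - sum_x ** 2 / n      # corrected sum of squares of x

    b = l_xy / l_xx                     # slope estimate
    a = (sum_y - b * sum_x) / n         # intercept estimate
    return a, b


# Made-up points that lie exactly on y = 1 + 2x
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the illustrative points lie exactly on a line, the fitted intercept and slope recover that line.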

  15. Example 11.3 • Estimate the regression line for the birthweight data in Table 11.1, i.e., estimate the intercept a and the slope b • We do the following calculations (see the corresponding Excel file)

  16. Regression analysis for the data in Table 11.1 • Sum of products: 17500 (1) • Sum of x: 534 (2) • Sum of y: 992 (3) • Sum of squared x: 9876 (4) • Corrected sum of products: (1) - (2)*(3)/n, Lxy = 412 (5) • Corrected sum of squares: (4) - (2)*(2)/n, Lxx = 677.4194 (6) • Regression coefficient: (5)/(6), b = Lxy/Lxx = 0.60819 (7) • Intercept: [(3) - (7)*(2)]/n, a = 21.52343 • Estimated regression line: Birthweight (g/100) = 21.52 + 0.61 * Estriol (mg/24 hr)
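The arithmetic on this slide can be checked from the four sums alone. In the sketch below, the sample size n = 31 is an assumption: it is not stated on the slide, but it is the value that makes (2)*(3)/n equal 17500 - 412 = 17088, and it agrees with the 29 degrees of freedom quoted later for the F-test.

```python
# Sums quoted on the slide
sum_xy = 17500   # (1) sum of products
sum_x = 534      # (2) sum of x (estriol)
sum_y = 992      # (3) sum of y (birthweight, g/100)
sum_xx = 9876    # (4) sum of squared x
n = 31           # assumption: inferred, not stated on the slide

l_xy = sum_xy - sum_x * sum_y / n   # (5) corrected sum of products
l_xx = sum_xx - sum_x ** 2 / n      # (6) corrected sum of squares
b = l_xy / l_xx                     # (7) regression coefficient
a = (sum_y - b * sum_x) / n         # intercept

print(f"Lxy = {l_xy:.4f}, Lxx = {l_xx:.4f}")
print(f"b = {b:.5f}, a = {a:.5f}")
```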

  17. Regression Analysis: Interpretation • There is a positive association (statistically significant or not, we will test later) between birthweight and estriol levels. • For each 1 mg/24 hr increase in estriol level, the expected birthweight of the newborn increases by about 61 g.

  18. Prediction • The predicted value of Y for a given value of x is $\hat{y} = a + bx$

  19. Prediction • What is the estimated (predicted) birthweight if a pregnant woman has an estriol level of 15 mg/24 hr? $\hat{y} = 21.523 + 0.6082 \times 15 = 30.65$ (g/100) = 3065 g
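As a quick check of this prediction, the fitted line can be evaluated directly. This is a sketch only: the helper name `predict_birthweight` is hypothetical, and the coefficients are the unrounded estimates from the worked example.

```python
A = 21.52343   # intercept estimate (birthweight in g/100)
B = 0.60819    # slope estimate (g/100 per mg/24 hr of estriol)


def predict_birthweight(estriol):
    """Predicted birthweight (g/100) for an estriol level in mg/24 hr."""
    return A + B * estriol


y_hat = predict_birthweight(15)
print(f"{y_hat:.2f} g/100 = {y_hat * 100:.0f} g")
```

Using the rounded coefficients 21.52 and 0.61 instead gives 30.67, which is why the unrounded values are kept here.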

  20. Calibration • If low birthweight is defined as ≤ 2500 g, for what estriol level would the newborn be low birthweight? • That is, to what estriol level does a predicted birthweight of 2500 g correspond?

  21. Calibration • Solving $25 = a + bx$ for x gives $x = (25 - 21.523)/0.6082 = 5.72$ • Women with an estriol level of 5.72 mg/24 hr or lower are expected to have low-birthweight newborns
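Calibration simply inverts the fitted line. The sketch below assumes the estimates a = 21.52343 and b = 0.60819 from the worked example; the function name `calibrate` is hypothetical.

```python
A = 21.52343   # intercept estimate (birthweight in g/100)
B = 0.60819    # slope estimate


def calibrate(target_birthweight):
    """Estriol level (mg/24 hr) whose predicted birthweight equals the target (g/100)."""
    return (target_birthweight - A) / B


x_low = calibrate(25)   # 2500 g expressed in units of g/100
print(f"Estriol level for a predicted 2500 g: {x_low:.2f} mg/24 hr")
```

Because the slope is positive, any estriol level below this threshold yields a predicted birthweight under 2500 g.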

  22. Goodness of fit of a regression line • How good is x in predicting Y?

  23. Goodness of fit of a regression line • Residual sum of squares: $\text{Res SS} = \sum_i (y_i - \hat{y}_i)^2$ • Summary measure of the distance between the observed and predicted values • The smaller the Res SS, the better the regression line is at predicting Y

  24. Total variation in observed Y • Total sum of squares: $\text{Total SS} = \sum_i (y_i - \bar{y})^2$ • Summary measure of variation in Y

  25. Total variation in predicted Y • Regression sum of squares: $\text{Reg SS} = \sum_i (\hat{y}_i - \bar{y})^2$ • Summary measure of variation in predicted Y

  26. Goodness of fit of a regression line

  27. Goodness of fit of a regression line • It can be shown that Total SS = Reg SS + Res SS • The smaller the Res SS, the closer the Reg SS is to the Total SS, and the better the regression fit

  28. Coefficient of determination R² • R² = Reg SS / Total SS • R² is the proportion of total variation in Y explained by the regression on x. • R² lies between 0 and 1. • R² = 1 implies a perfect fit (all the points are on the line).

  29. F-test • Another formal way of assessing how good the regression of Y on x is, is the F-test. • The F-test compares Reg SS to Res SS: $F = \dfrac{\text{Reg SS}/1}{\text{Res SS}/(n-2)}$ • A larger F indicates a better regression fit

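  30. F-test • Test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ • Test statistic: $F = \dfrac{\text{Reg SS}/1}{\text{Res SS}/(n-2)}$ • Reject $H_0$ if $F > F_{1,\,n-2,\,1-\alpha}$ (here α denotes the significance level, not the intercept)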

  31. Summary of goodness of regression fit • We need to compute three quantities • Total SS • Reg SS • Res SS • Total SS = Lyy, where $L_{yy} = \sum y_i^2 - \left(\sum y_i\right)^2/n$ • Reg SS = b*Lxy • Res SS = Total SS – Reg SS

  32. Example 11.12 • Total SS: 674 • Reg SS: 250.57 • R²: 0.37 => 37% of the variation in birthweight is explained by the regression on estriol level • F: 17.16 • p-value: $P(F_{1,29} > 17.16) = 0.0003$ • H0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight
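The numbers in this example follow from the summary quantities already computed. The sketch below uses the usual form of the simple-regression F statistic, Reg SS divided by Res SS/(n - 2), and assumes n = 31, which is inferred from the 29 denominator degrees of freedom rather than stated. It stops short of the p-value because the Python standard library has no F distribution.

```python
# Summary quantities from the worked example
n = 31             # assumption: inferred from the F(1, 29) reference distribution
l_xy = 412.0       # corrected sum of products
b = 0.60819        # estimated slope
total_ss = 674.0   # Total SS = Lyy

reg_ss = b * l_xy                    # regression sum of squares
res_ss = total_ss - reg_ss           # residual sum of squares
r_squared = reg_ss / total_ss        # proportion of variation explained
f_stat = (reg_ss / 1) / (res_ss / (n - 2))

print(f"Reg SS = {reg_ss:.2f}, Res SS = {res_ss:.2f}")
print(f"R^2 = {r_squared:.2f}, F = {f_stat:.2f}")
```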

  33. T-test • The same hypothesis can be tested using a t-test.

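  34. T-test • Test statistic: $t = b / SE(b)$, where $SE(b) = s/\sqrt{L_{xx}}$ and $s^2 = \text{Res SS}/(n-2)$ • Under $H_0: \beta = 0$, t follows a t distribution with n − 2 degrees of freedom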

  35. T-test • P-value = $2\Pr(t_{n-2} > |t|)$ • 100(1−α)% CI for β: $b \pm t_{n-2,\,1-\alpha/2}\, SE(b)$

  36. Example 11.12 • Is the regression coefficient (slope) for the estriol level significantly different from zero? • s² = 14.6, s = 3.82 • SE(b) = 0.15, t = 4.14 • p = 0.00027 • 95% CI for the regression coefficient: (0.31, 0.91) • H0: β = 0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight
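These values can be reproduced from the earlier summary quantities using the standard formulas s² = Res SS/(n - 2) and SE(b) = s/√Lxx. Two assumptions in this sketch are not taken from the slides: n = 31 (inferred from the 29 degrees of freedom) and the critical value 2.045, which is the 97.5th percentile of the t distribution with 29 degrees of freedom as read from a standard t table.

```python
import math

n = 31                # assumption: inferred from the 29 degrees of freedom
b = 0.60819           # estimated slope
l_xx = 677.4194       # corrected sum of squares of x
total_ss = 674.0      # Total SS
reg_ss = 250.57       # Reg SS
t_crit = 2.045        # assumption: t table value for 29 df, two-sided 95%

s2 = (total_ss - reg_ss) / (n - 2)   # residual variance estimate
s = math.sqrt(s2)
se_b = s / math.sqrt(l_xx)           # standard error of the slope
t_stat = b / se_b

ci_low = b - t_crit * se_b
ci_high = b + t_crit * se_b

print(f"s^2 = {s2:.1f}, s = {s:.2f}, SE(b) = {se_b:.2f}, t = {t_stat:.2f}")
print(f"95% CI for beta: ({ci_low:.2f}, {ci_high:.2f})")
```

With a single predictor, the square of this t statistic equals the F statistic of the F-test, so the two tests lead to the same conclusion.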

  37. Correlation • Correlation is a quantitative measure of the strength of the linear relationship between two variables • Regression, on the other hand, is used for prediction • No distinction between dependent and independent variables is made when assessing correlation

  38. Correlation: Example 11.14

  39. Correlation

  40. Correlation coefficient • Population correlation coefficient (see Section 5.4.2 in my notes): $\rho = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$ • If X and Y could be measured on everyone in the population, we could calculate ρ.

  41. Interpretation of ρ • ρ lies between −1 and 1, • ρ = 0 implies no linear relationship, • ρ = −1 implies perfect negative linear relationship, • ρ = +1 implies perfect positive linear relationship.

  42. Sample correlation coefficient • Unfortunately, we cannot measure X and Y on everyone in the population. • We estimate ρ from the sample data as follows: $r = \dfrac{L_{xy}}{\sqrt{L_{xx} L_{yy}}}$
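To illustrate the estimate, the sample correlation for the birthweight and estriol data can be computed from the corrected sums used in the regression example. This value is not quoted on the slides; the sketch uses the standard formula r = Lxy/√(Lxx·Lyy), with Lyy taken to be the Total SS of 674.

```python
import math

l_xy = 412.0      # corrected sum of products
l_xx = 677.4194   # corrected sum of squares of x (estriol)
l_yy = 674.0      # corrected sum of squares of y (birthweight), i.e. Total SS

r = l_xy / math.sqrt(l_xx * l_yy)   # sample correlation coefficient

print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```

The result is about 0.61, and its square matches the R² of 0.37 from the regression, as expected when there is a single predictor.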

  43. Interpretation of r • r lies between −1 and 1, • r = 0 implies no linear relationship, • r = −1 implies perfect negative linear relationship, • r = +1 implies perfect positive linear relationship, • The closer |r| is to 1, the stronger the relationship is.

  44. Sample correlation coefficient r = 1

  45. Sample correlation coefficient r = −1

  46. Sample correlation coefficient r = 0

  47. Sample correlation coefficient r = 0.988

  48. Sample correlation coefficient r = 0.49

  49. Sample correlation coefficient r = −0.37
