
Biostatistics in Practice


Presentation Transcript


  1. Biostatistics in Practice
  Session 5: Methods for Assessing Associations
  Peter D. Christenson, Biostatistician
  http://gcrc.LABioMed.org/Biostat

  2. Readings for Session 5 from StatisticalPractice.com
  • Simple Linear Regression
    • Introduction to Simple Linear Regression
    • Transformations in Linear Regression
  • Multiple Regression
    • Introduction to Multiple Regression
    • What Does Multiple Regression Look Like?
    • Which Predictors are More Important?
  • Also, without any reading: Correlation

  3. Correlation
  • Visualize Y (vertical) by X (horizontal) in a scatterplot.
  • The Pearson correlation, r, is used to measure the association between two measures X and Y.
  • r ranges from -1 (perfect inverse association) to +1 (perfect direct association).
  • The value of r does not depend on:
    • the scales (units) of X and Y
    • which role X and Y assume, as in an X-Y plot
  • The value of r does depend on:
    • the ranges of X and Y
    • the values chosen for X, if X is fixed and Y is measured
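  The slides carry no code, but these properties of r are easy to demonstrate. Below is a minimal Python sketch (numpy and scipy, with made-up numbers, not data from the course) showing that r is unchanged by rescaling the units of X or by swapping the roles of X and Y:

      import numpy as np
      from scipy import stats

      # Invented illustration data: y rises roughly linearly with x
      x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
      y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

      r, p_value = stats.pearsonr(x, y)
      print(f"r = {r:.3f}")   # close to +1: strong direct association

      # r does not depend on scale (units) or on which variable plays X
      r_rescaled, _ = stats.pearsonr(100 * x, y)
      r_swapped, _ = stats.pearsonr(y, x)
      assert np.isclose(r, r_rescaled) and np.isclose(r, r_swapped)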

  4. Graphs and Values of Correlation

  5. Correlation Depends on Ranges of X and Y
  [Two scatterplots, panels A and B.] Graph B contains only the graph A points in the ellipse. Correlation is reduced in graph B. Thus: correlations for the same quantities X and Y may be quite different in different study populations.
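  This range-restriction effect can be simulated in a few lines. The data below are invented for illustration (they are not the slide's graphs): restricting to the middle of the X range, as in graph B, weakens r.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(10)
      x = rng.uniform(0, 100, 500)
      y = 0.5 * x + rng.normal(0, 10, 500)

      r_full = stats.pearsonr(x, y)[0]
      keep = (x > 40) & (x < 60)          # "graph B": middle of the X range only
      r_restricted = stats.pearsonr(x[keep], y[keep])[0]
      print(f"full range r = {r_full:.2f}, restricted r = {r_restricted:.2f}")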

  6. Regression
  • Again: Y (vertical) by X (horizontal) scatterplot, as with correlation. See next slide.
  • X and Y now assume different roles:
    • Y is an outcome, response, output, dependent variable.
    • X is an input, predictor, independent variable.
  • Regression analysis is used to:
    • Fit a straight line through the scatterplot.
    • Measure X-Y association, as correlation does.
    • Predict Y from X, and assess the precision of the prediction.
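  As a concrete illustration of all three uses (not from the slides; the data are simulated), scipy.stats.linregress fits the line, reports r, and gives the pieces needed for prediction:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      x = rng.uniform(0, 10, 50)
      y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 50)   # true line plus noise

      fit = stats.linregress(x, y)                 # least-squares straight line
      print(f"fitted line: y = {fit.intercept:.2f} + {fit.slope:.2f} x")
      print(f"predicted y at x = 7: {fit.intercept + 7 * fit.slope:.2f}")
      print(f"r = {fit.rvalue:.3f}")               # same r as Pearson correlation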

  7. Regression Example

  8. X-Y Association
  If slope = 0, then X and Y are not associated. But the slope measured from a sample will never be exactly 0. How different from 0 does a measured slope need to be in order to claim that X and Y are associated?
  Test H0: slope = 0 vs. HA: slope ≠ 0, with the rule: claim association (HA) if tc = |slope/SE(slope)| > t ≈ 2. There is then a 5% chance of claiming an X-Y association that really does not exist.
  Note the similarity to the t-test for means: tc = |mean/SE(mean)|. The formula for SE(slope) is in statistics books.
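  A hedged sketch of this test with simulated data: linregress reports SE(slope) as stderr and the two-sided p-value for H0: slope = 0, so tc can be formed directly.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(2)
      x = rng.uniform(0, 10, 40)
      y = 5.0 + 0.8 * x + rng.normal(0, 2.0, 40)

      fit = stats.linregress(x, y)
      t_c = fit.slope / fit.stderr      # fit.stderr is SE(slope)
      print(f"tc = {t_c:.2f}, p = {fit.pvalue:.4f}")
      # Claim association (HA) if |tc| > ~2, i.e. if p < 0.05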

  9. X-Y Association, Continued
  Refer to the graph of the example, 2 slides back. We are 95% sure that the true line for the X-Y association lies within the inner (….) band about the line estimated from our limited sample data.
  If our test of H0: slope = 0 vs. HA: slope ≠ 0 results in claiming HA, then the inner (….) band does not include a horizontal line, and vice-versa: X and Y are significantly associated.
  We can also test H0: ρ = 0 vs. HA: ρ ≠ 0, where ρ is the true correlation estimated by r. The result is identical to that for the slope. Thus, correlation and regression are equivalent methods for assessing whether two variables are linearly associated.
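  The equivalence of the two tests can be checked numerically. A minimal sketch with simulated data: the p-value for the slope and the p-value for ρ coincide.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)
      x = rng.normal(size=30)
      y = 1.0 + 0.5 * x + rng.normal(size=30)

      p_slope = stats.linregress(x, y).pvalue   # test of H0: slope = 0
      p_rho = stats.pearsonr(x, y)[1]           # test of H0: rho = 0
      print(p_slope, p_rho)
      assert np.isclose(p_slope, p_rho)         # identical, as the slide states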

  10. Prediction from Regression
  • Again, refer to the graph of the example, 3 slides back.
  • The regression line (e.g., y = 81.6 + 2.16x) is used for:
    • Predicting y for an individual with a known value of x. We are 95% sure that the individual's true y is between the outer (---) band endpoints vertically above x. This interval is analogous to mean ± 2SD.
    • Predicting the mean y for "all" subjects with a known value of x. We are 95% sure that this mean is between the inner (….) band endpoints vertically above x. This interval is analogous to mean ± 2SE.

  11. Example Software Output

  The regression equation is: Y = 81.6 + 2.16 X

  Predictor   Coeff     StdErr    T        P
  Constant    81.64     11.47      7.12    <0.0001
  X            2.1557    0.1122   19.21    <0.0001

  S = 21.72    R-Sq = 79.0%

  Predicted Values for X = 100:
  Fit: 297.21    SE(Fit): 2.17
  95% CI: 292.89 - 301.52   (range of the mean y of all subjects with x = 100, with 95% assurance)
  95% PI: 253.89 - 340.52   (range of y for an individual with x = 100, with 95% assurance)

  Notes: T = 19.21 = 2.1557/0.1122 would be between about -2 and 2 if slope = 0. Predicted y (Fit) = 81.6 + 2.16(100).
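  Output in this style (though not these exact numbers; the data behind the slide are not shown) can be reproduced with statsmodels. The simulated data below are hypothetical stand-ins:

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(4)
      x = rng.uniform(50, 150, 60)
      y = 81.6 + 2.16 * x + rng.normal(0, 21.7, 60)

      X = sm.add_constant(x)                  # adds the intercept column
      model = sm.OLS(y, X).fit()
      print(model.summary())                  # coefficients, SEs, t, p, R-sq

      # 95% CI for the mean y at x = 100, and 95% PI for an individual
      new_X = np.column_stack([np.ones(1), [100.0]])
      pred = model.get_prediction(new_X)
      print(pred.summary_frame(alpha=0.05))   # mean_ci_* and obs_ci_* columns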

  12. Regression Issues
  • We are assuming that the relation is linear.
  • We can generalize to more complicated non-linear associations.
  • Transformations, e.g., logarithmic, can be made to achieve linearity.
  • The vertical distances between the actual y's and the predicted y's (on the line) are called "residuals". Their magnitude should not depend on the value of x (e.g., should not tend to be larger for larger x), and they should be symmetrically distributed about 0. If not, transformations can often achieve this.
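  A small illustration of the residual check, and of a log transformation restoring it; the data are invented for the purpose (a curved relation whose spread grows with x):

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(5)
      x = rng.uniform(1, 100, 80)
      y = 2.0 * x**1.5 * rng.lognormal(0.0, 0.2, 80)   # spread grows with x

      # Residuals from a straight-line fit on the raw scale
      fit = stats.linregress(x, y)
      resid = y - (fit.intercept + fit.slope * x)

      # After log transforms, the relation is linear and the spread even
      fit_log = stats.linregress(np.log(x), np.log(y))
      resid_log = np.log(y) - (fit_log.intercept + fit_log.slope * np.log(x))

      # Compare residual spread for small vs large x
      mid = np.median(x)
      for name, r in (("raw", resid), ("log", resid_log)):
          print(f"{name}: SD below median x = {r[x < mid].std():.2f}, "
                f"above = {r[x >= mid].std():.2f}")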

  13. Multiple Regression: Geometric View
  "Multiple" refers to using more than one X (say X1 and X2) simultaneously to predict Y. Geometrically, this is fitting a slanted plane to a cloud of points.
  [Graph from the readings.] LHCY is the Y (homocysteine) to be predicted from the two X's: LCLC (folate) and LB12 (B12).
  LHCY = b0 + b1LCLC + b2LB12
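  A minimal sketch of fitting such a plane. The variable names follow the slide (LHCY, LCLC, LB12), but the data below are simulated stand-ins, not the study's:

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(6)
      lclc = rng.normal(2.0, 0.3, 100)   # hypothetical folate values
      lb12 = rng.normal(2.5, 0.3, 100)   # hypothetical B12 values
      lhcy = 3.0 - 0.5 * lclc - 0.3 * lb12 + rng.normal(0, 0.2, 100)

      X = sm.add_constant(np.column_stack([lclc, lb12]))
      plane = sm.OLS(lhcy, X).fit()
      print(plane.params)   # b0, b1, b2: the fitted plane's height and tilts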

  14. Multiple Regression: More General
  • More than 2 predictors can be used. The equation is for a "hyperplane": y = b0 + b1x1 + b2x2 + … + bkxk.
  • A more realistic functional form, more complex than a plane, can be used. For example, to fit curvature for x2, use y = b0 + b1x1 + b2x2 + b3x2².
  • If the predictors themselves are highly correlated, then the fitted equation is imprecise. [This is because the x1 and x2 data then lie along almost a line in the x1-x2 plane, so the fitted plane is like an unstable tabletop with the table legs not well-spaced.]
  • How many and which variables to include? Prediction strategies (e.g., stepwise) differ from assessing the "significance" of factors.
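  Both the curvature and the collinearity points can be illustrated in a few lines. In this sketch (simulated data), x2 is kept to a narrow positive range so that x2 and x2² are nearly collinear; variance inflation factors (VIFs), one common diagnostic, flag the resulting imprecision:

      import numpy as np
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor

      rng = np.random.default_rng(7)
      x1 = rng.normal(size=100)
      x2 = rng.uniform(5, 10, 100)            # narrow positive range
      y = 1.0 + 0.5 * x1 + 1.0 * x2 + 0.8 * x2**2 + rng.normal(0, 0.5, 100)

      # Curvature in x2: just add x2 squared as one more column
      X = sm.add_constant(np.column_stack([x1, x2, x2**2]))
      fit = sm.OLS(y, X).fit()
      print(fit.params)

      # x2 and x2^2 nearly lie on a line ("unstable tabletop");
      # large VIFs flag the imprecision this causes
      for i, name in enumerate(["x1", "x2", "x2^2"], start=1):
          print(name, round(variance_inflation_factor(X, i), 1))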

  15. Reading Example: HDL Cholesterol

  Parameter    Estimate    Std Error    T       Pr > |t|    Standardized Estimate
  Intercept     1.16448    0.28804      4.04    <.0001       0
  AGE          -0.00092    0.00125     -0.74    0.4602      -0.05735
  BMI          -0.01205    0.00295     -4.08    <.0001      -0.35719
  BLC           0.05055    0.02215      2.28    0.0239       0.17063
  PRSSY        -0.00041    0.00044     -0.95    0.3436      -0.09384
  DIAST         0.00255    0.00103      2.47    0.0147       0.23779
  GLUM         -0.00046    0.00018     -2.50    0.0135      -0.18691
  SKINF         0.00147    0.00183      0.81    0.4221       0.07108
  LCHOL         0.31109    0.10936      2.84    0.0051       0.20611

  The predictors are age, body mass index, blood vitamin C, systolic and diastolic blood pressures, skinfold thickness, and the log of total cholesterol.
  LHDL = 1.16 - 0.00092(Age) + … + 0.311(LCHOL)
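  The "Standardized Estimate" column comes from refitting with every variable z-scored, or equivalently multiplying each coefficient by SD(x)/SD(y). A sketch using simulated stand-ins for two of the predictors (not the study's data); note the standardized intercept is 0, as in the table:

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(11)
      age = rng.uniform(20, 70, 200)
      bmi = rng.normal(26, 4, 200)
      lhdl = 1.2 - 0.001 * age - 0.012 * bmi + rng.normal(0, 0.1, 200)

      X = np.column_stack([age, bmi])
      fit = sm.OLS(lhdl, sm.add_constant(X)).fit()

      # Refit with every variable z-scored; the slopes are then the
      # standardized estimates (and the intercept is 0)
      z = lambda v: (v - v.mean()) / v.std()
      Xz = np.column_stack([z(age), z(bmi)])
      fit_z = sm.OLS(z(lhdl), sm.add_constant(Xz)).fit()
      print(fit_z.params[1:])

      # Equivalent shortcut: coefficient * SD(x) / SD(y)
      print(fit.params[1:] * X.std(axis=0) / lhdl.std())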

  16. Reading Example: Coefficients
  • Interpretation of the coefficients ("parameter estimates") in the output LHDL = 1.16 - 0.00092(Age) + … + 0.311(LCHOL) on the previous slide:
    • The entire equation is needed for making predictions.
    • Each coefficient measures the difference in expected LHDL between 2 subjects if the factor differs by 1 unit between the two subjects, and if all other factors are the same. E.g., expected LHDL is 0.00092 lower in a subject whose age is 1 year greater, but who is the same as the other subject on all other factors.
    • The "all other factors the same" comparison may be unrealistic, or impossible.

  17. Reading Example: Predictors
  • P-values measure the "independent" effect of a factor; i.e., whether it is associated with the outcome (LHDL here) after accounting for all of the other effects in the model. (See the sketch after this slide.)
  • Which factors should be included in the equation? Should those that are not significant (p < 0.05) simply be removed?
  • In general, it depends on the goal:
    • For prediction: more predictors → less bias, but less precision. "Stepwise methods" balance this trade-off.
    • For the importance of a particular factor: include that factor plus the other factors that are either biologically or statistically related to the outcome.
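  A sketch of the "independent effect" idea with simulated data (names and numbers invented): a predictor that looks significant on its own can lose significance once a correlated factor is accounted for.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(8)
      x1 = rng.normal(size=200)
      x2 = 0.9 * x1 + 0.4 * rng.normal(size=200)    # x2 closely tracks x1
      y = 2.0 + 1.0 * x1 + rng.normal(0, 1.0, 200)  # only x1 truly drives y

      # Alone, x2 looks strongly associated with y...
      alone = sm.OLS(y, sm.add_constant(x2)).fit()
      print(f"x2 alone: p = {alone.pvalues[1]:.4g}")

      # ...but after accounting for x1, its "independent" effect is weak
      full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
      print(full.pvalues)   # x1 stays highly significant; x2 typically does not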
