Correlation and Regression

Slides by Brad Evanoff, MD, MPH • Talk by Brian Gage, MD, MSc

Overview of Correlation and Regression • Correlation seeks to establish whether a relationship exists between two variables • Regression seeks to use one variable to predict another variable • Both measure the extent of a linear relationship between two variables • Statistical tests are used to determine the strength of the relationship

Nondependent and Dependent Relationships • Types of Relationship • Nondependent (correlation) -- neither variable is the target. Example: protein and fat intake • Dependent (regression) -- the value of one variable is used to predict the value of another. Example: ACT and MCAT scores for medical school applicants, where MCAT is the dependent variable and ACT is the independent variable • Statistical Expressions • Correlation Coefficient -- index of nondependent relationship • Regression Coefficient -- index of dependent relationship

Example • Measure the daily fecal lipid and fecal energy for 20 children with cystic fibrosis • Plot each individual as a point on a graph which has fecal lipid on one axis and fecal energy on the other axis • What does the distribution of these values look like?

Pearson’s Product Moment Correlation Coefficient • The correlation coefficient, r, is a measure of the interdependent relationship between two continuous variables • For two variables, x and y, the correlation coefficient measures the extent to which greater values of x are associated with greater values of y

The value of r can range from -1 to +1 • Absolute values close to 1, with either sign, will represent a close correlation • Values close to 0 will represent little or no correlation
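
The definition of r and its range can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pearson_r` and the data values are hypothetical and are not the cystic fibrosis measurements from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data chosen to show the extremes and an intermediate value.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect_positive = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])  # exactly linear, rising
perfect_negative = pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # exactly linear, falling
noisy = pearson_r(x, [3.0, 1.0, 4.0, 1.0, 5.0])              # scattered points
```

The two exactly linear series give r of +1 and -1, while the scattered series gives a value between those limits.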

[Figure: scatterplot shown on the slide; r = ?]

Importance of Scatterplots and Examining the Data • Scatterplot F shows the relationship between temperature and number of nerve fiber discharges • The scatterplot demonstrates a strong relationship • However, the correlation coefficient, which only measures a linear relationship, has a value of zero • Note that scatterplot E also has an r value of zero, but clearly no relationship exists between the two variables
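
A small sketch of how a strong but non-linear relationship can produce r = 0. The inverted-U data below are hypothetical (they are not the temperature and nerve fiber measurements from the slide), and `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical inverted-U data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9 - v * v for v in x]

r = pearson_r(x, y)  # the rising and falling halves cancel out
```

Here y can be predicted perfectly from x, yet r is zero, which is why the scatterplot has to be examined before the coefficient is trusted.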

r values can be tested to see if an observed correlation is statistically significant • The same distinction between magnitude of effect and statistical significance must be made as for other tests - a large sample may make small correlations statistically significant yet clinically meaningless
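
The slides do not name the test. One standard choice, assumed here, is the t test of the null hypothesis that the population correlation is zero, with n - 2 degrees of freedom. The sketch below uses a hypothetical r of 0.10 to show how sample size alone changes the test statistic.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# The same weak correlation (hypothetical r = 0.10) at two sample sizes.
small_sample = t_statistic_for_r(0.10, 20)
large_sample = t_statistic_for_r(0.10, 1000)
```

With 20 subjects the statistic is well below the two-sided 5% critical value (about 2.10 for 18 degrees of freedom). With 1000 subjects it exceeds the large-sample critical value of about 1.96, even though r² is only 0.01.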

Coefficient of Determination, r² • To gauge the strength of the relationship between two variables, the correlation coefficient, r, is squared • r² shows how much of the variation in one measure (say, fecal energy) is accounted for by knowing the value of the other measure (fecal lipid loss)

For the cystic fibrosis patients, r = 0.42 and r² = 0.18 • 18% of the variation in fecal energy may be accounted for by knowing fecal lipid loss (or vice versa)
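
The "variation accounted for" reading of r² can be checked numerically. For a straight line fitted by least squares, r² equals one minus the ratio of the residual variation to the total variation in y. The sketch below uses hypothetical data and illustrative function names. Only the final line uses a number from the slides (r = 0.42).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def explained_fraction(x, y):
    """Fraction of the variation in y accounted for by the least-squares line on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # residual variation
    sst = sum((b - mean_y) ** 2 for b in y)                              # total variation
    return 1 - sse / sst

# Hypothetical data, not the cystic fibrosis measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 1.0, 4.0, 1.0, 5.0]
r_squared = pearson_r(x, y) ** 2
explained = explained_fraction(x, y)

# Arithmetic from the slide: r = 0.42 gives r² of about 0.18.
reported_r_squared = round(0.42 ** 2, 2)
```

The two quantities `r_squared` and `explained` agree, and squaring the reported r reproduces the 18% quoted for the cystic fibrosis patients.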

Caveats • Correlation does not imply causation • Correlation measures only linear association, and many biological systems are better described by curvilinear plots • This is one reason why data should always be looked at first (scatterplot)

The correlation coefficient assumes normally distributed data • The correlation coefficient is sensitive to extreme values • Non-normal distributions can be transformed (e.g., logarithmic transformation), or the data can be converted into ranks and a non-parametric correlation test used (Spearman’s rank correlation)
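
Spearman's rank correlation can be sketched as Pearson's r applied to the ranks of the data, with tied values sharing the average of their ranks. The data are hypothetical and include one extreme value to show the sensitivity mentioned above. The function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with one extreme y value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 60]
r_raw = pearson_r(x, y)
rho = spearman_rho(x, y)
```

In this example the single extreme value pulls Pearson's r down to roughly 0.70, while the rank-based coefficient stays near 0.94 because only the ordering of the values matters.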

Types of Coefficients
Type of Data                 Correlation Coefficient
Continuous v. Continuous     Pearson’s r
Continuous v. Ordinal        Jaspen’s Multiserial Coefficient (M)
Ordinal v. Ordinal           Spearman’s ρ (rho), Kendall’s τ (tau)

Linear Regression • Used when the goal is to predict the value of one characteristic from knowledge of another • Assumes a straight-line, or linear, relationship between two variables • However, a variable can be transformed first • When the term "simple" is used with regression, it refers to the situation where one explanatory (independent) variable is used to predict another variable • Multiple regression is used when there is more than one explanatory variable

If the point at which the line crosses the Y-axis (the intercept) is denoted β₀ and the slope of the line is denoted β₁, then Y = β₁X + β₀ • Like y = mx + b, with m = β₁ and b = β₀ • The slope is a measure of how much Y changes for a one-unit change in X

Because the points rarely fall along a perfect straight line, there is also an error term, e • The formula then becomes Y = β₁X + β₀ + e • The error term is a measure of the amount by which the actual Y values depart from the Y values predicted by the equation • Regression lines are fitted using a method called least squares, which finds the line that minimizes the sum of the squared errors
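
The least-squares fit for a single predictor has a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The sketch below uses hypothetical data; `b0` and `b1` stand for the intercept and slope.

```python
def least_squares_line(x, y):
    """Return (b0, b1): the intercept and slope minimizing the sum of squared errors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept: the line passes through (mean_x, mean_y)
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between observed y and the line's predictions."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data lying close to, but not exactly on, a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x, y)
best_sse = sum_squared_errors(x, y, b0, b1)
# Any other line, such as one with a slightly different slope, has a larger error sum.
other_sse = sum_squared_errors(x, y, b0, b1 + 0.1)
```

Comparing `best_sse` with `other_sse` shows the defining property of the fit: moving the slope away from the least-squares value increases the sum of squared errors.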

Example • Investigators want to be able to predict a potential medical school applicant’s MCAT scores from his or her previous ACT examination score • Create scatterplot of ACT and MCAT test scores • Calculate the regression equation for ACT scores and MCAT scores

[Figure: scatterplot of ACT and MCAT scores; r = ?]

Y´ = -1.61 + 0.406X, where Y´ is the predicted MCAT score and X is the ACT score • R = 0.62
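
Applying the fitted equation is plain arithmetic. In the sketch below the intercept and slope are the values from the slide. The ACT score of 25 is a hypothetical input, and the function name is illustrative.

```python
def predict_mcat(act_score):
    """Predicted MCAT score from the regression equation Y' = -1.61 + 0.406 X."""
    intercept, slope = -1.61, 0.406
    return intercept + slope * act_score

predicted = predict_mcat(25)               # 25 is a hypothetical ACT score
per_point = predict_mcat(26) - predicted   # change in prediction per one-point ACT increase
```

For an ACT score of 25 the predicted MCAT score is about 8.54, and each additional ACT point raises the prediction by the slope, 0.406. With R = 0.62, R² is about 0.38, so ACT score accounts for roughly 38% of the variation in MCAT score.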

This model of simple linear regression can be extended to situations where there is more than one independent variable of interest • The equation below shows a model which predicts Y based on three independent variables, X₁, X₂, and X₃ • Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃

Multiple Regression • Just like simple linear regression, but with more variables • Allows the independent effects of several variables to be studied at once; can examine contribution of any variable while controlling for effects of other variables • Useful when predictor (independent) variables and the outcome (dependent) variable are numerical (continuous) e.g., weight, age, Hct.

Multiple Regression • Y = estimated value for the dependent (outcome) variable • β₀ = intercept • βᵢ = partial regression coefficients: indicate how much Y changes for each unit of change in Xᵢ when all other variables in the model are held constant • Xᵢ = independent (predictor) variables

Multiple regression: R • Multiple R = correlation coefficient; indicates correspondence between Y values predicted by the model and Y values observed • R² = amount of variability in Y explained by variation in the X variables contained in the model • The model calculates partial R values (correlation coefficients of the individual variables) as well as R for the whole model
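
A minimal sketch of a multiple regression fit and of multiple R, using only the standard library. It assumes ordinary least squares solved through the normal equations, which is one common way to fit such a model. The slides do not say how the model is computed. The data, the two predictors, and all function names are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] for Y = b0 + b1*X1 + b2*X2 + ..."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coeffs, columns):
    """Predicted Y for every observation."""
    n = len(columns[0])
    return [coeffs[0] + sum(b * col[i] for b, col in zip(coeffs[1:], columns))
            for i in range(n)]

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is roughly 1 + 2*x1 + 3*x2 plus small errors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [9.2, 7.9, 18.7, 18.3, 29.1, 27.8]

coeffs = fit_ols([x1, x2], y)
y_hat = predict(coeffs, [x1, x2])
mean_y = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - sse / sst                # share of the variability in Y explained by the model
multiple_r = pearson_r(y, y_hat)  # correlation between observed and predicted Y
```

For a least-squares model with an intercept, the square of `multiple_r` equals `r2`, which links the two definitions given on the slide.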

Results of Stepwise Regression Predicting Resident Performance

Building A Multiple Regression Model • Usual case: picking a few “significant” variables from many candidate variables • Variables can be included because of clinical significance (“forced” into the model) or because of statistical significance • Statistical significance usually determined by a stepwise process

Forward Selection • Picks the X variable with the highest R and puts it in the model • Then looks for the X variable which will increase R² by the largest amount • A test for statistical significance is performed (using the F test) • If statistically significant, the new variable is included in the model, and the variable giving the next largest increase in R² is tested • The selection stops when no variable can be added which significantly increases R²
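
The forward selection steps can be sketched as a loop. Several parts are assumptions for illustration and do not come from the slides: the model is fitted by ordinary least squares through the normal equations, a fixed F-to-enter cutoff of 4.0 stands in for a lookup of the F distribution, and the data and names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R² of the least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    coeffs = solve(xtx, xty)
    y_hat = [sum(b * v for b, v in zip(coeffs, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def forward_selection(candidates, y, f_to_enter=4.0):
    """Return predictor names in order of entry; f_to_enter is an assumed fixed cutoff."""
    n = len(y)
    selected, current_r2 = [], 0.0
    remaining = list(candidates)
    while remaining:
        df_resid = n - (len(selected) + 1) - 1  # residual df after adding one variable
        if df_resid <= 0:
            break
        # R² of the model obtained by adding each remaining variable in turn.
        trials = {name: r_squared([candidates[s] for s in selected + [name]], y)
                  for name in remaining}
        best = max(trials, key=trials.get)
        new_r2 = trials[best]
        unexplained = 1.0 - new_r2
        # Partial F statistic for the increase in R² from adding one variable.
        f_stat = (float("inf") if unexplained <= 0
                  else (new_r2 - current_r2) / (unexplained / df_resid))
        if f_stat < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current_r2 = new_r2
    return selected

# Hypothetical data: y depends strongly on x1, moderately on x2, and not on x3.
candidates = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [3.5, 3.5, 6.5, 6.5, 12.5, 12.5, 17.5, 17.5]
entered = forward_selection(candidates, y)
```

On these data x1 enters first, x2 enters second, and x3 is never added because it does not increase R².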

Backwards Elimination • Starts with all variables in the model • Removes the X variable which results in the smallest change in R² • Continues to remove variables from the model until removal produces a statistically significant drop in R²
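
Backward elimination can be sketched in the same way. The least-squares fit through the normal equations, the fixed F-to-remove cutoff of 4.0 used in place of an F distribution lookup, and the hypothetical data and names are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R² of the least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    coeffs = solve(xtx, xty)
    y_hat = [sum(b * v for b, v in zip(coeffs, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def backward_elimination(candidates, y, f_to_remove=4.0):
    """Return the predictor names kept; f_to_remove is an assumed fixed cutoff."""
    n = len(y)
    kept = list(candidates)  # start with every variable in the model
    while kept:
        full_r2 = r_squared([candidates[s] for s in kept], y)
        df_resid = n - len(kept) - 1
        unexplained = 1.0 - full_r2
        if df_resid <= 0 or unexplained <= 0:
            break
        # R² of each reduced model obtained by dropping one variable.
        reduced = {name: r_squared([candidates[s] for s in kept if s != name], y)
                   for name in kept}
        weakest = max(reduced, key=reduced.get)  # its removal changes R² the least
        # Partial F statistic for the drop in R² caused by removing that variable.
        f_stat = (full_r2 - reduced[weakest]) / (unexplained / df_resid)
        if f_stat >= f_to_remove:
            break  # removal would produce a significant drop, so stop here
        kept.remove(weakest)
    return kept

# Hypothetical data: y depends strongly on x1, moderately on x2, and not on x3.
candidates = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [3.5, 3.5, 6.5, 6.5, 12.5, 12.5, 17.5, 17.5]
kept = backward_elimination(candidates, y)
```

On these data x3 is removed first because dropping it leaves R² unchanged, and the procedure then stops because removing x2 would cause a large drop in R².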

Stepwise regression • Similar to forward selection, but after each new X is added to the model, all X variables already in the model are re-checked to see if the addition of the new variable has affected their significance • Bizarre, but unfortunately true: running forward selection, backward elimination, and stepwise regression on the same data often gives different answers

Multiple Regression: Caveats • Try not to include predictor variables which are highly correlated with each other • One X may force the other out, with strange results • Overfitting: too many variables make for an unstable model • Common rule of thumb: need > 10 subjects (or events) for each X variable • Model assumes normal distribution for Y variable • Highly skewed data may give misleading results

Table. Multivariate Analysis: Independent Predictors of Warfarin Dose
Entry into Model  Variable                      Coefficient  Change in Warfarin Dose, % (95% CI)  P value
0                 Intercept                     +0.404       -                                    -
1                 Age, per decade               –0.0084      –8 (–5 to –11)                       <0.001
2                 BSA, per SD                   +0.50        +14 (+8 to +18)                      <0.001
3                 SNPs, per allele              –0.25        –22 (–16 to –28)                     <0.001
4                 Amiodarone                    –0.34        –29 (–16 to –40)                     0.001
5                 Target INR, per 0.5 increase  +0.38        +21 (+9 to +34)                      <0.001
6                 Simvastatin                   –0.13        –13 (–2 to –22)                      0.03
7                 White race                    –0.123       –12 (–3 to –20)                      0.01
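
The slides do not state the scale on which warfarin dose was modeled. One reading, offered strictly as an assumption, is that the model was fitted to the natural logarithm of dose, so that a coefficient b corresponds to a percent change of 100 × (exp(b) − 1). The sketch below checks that reading against three rows of the table where the coefficient and the percentage appear to use the same unit. The function name is illustrative.

```python
import math

def percent_change(coefficient):
    """Percent change in dose implied by a coefficient, ASSUMING a natural-log dose model."""
    return 100.0 * (math.exp(coefficient) - 1.0)

# Coefficients copied from the table; the log-scale interpretation is an assumption.
rows = {
    "SNPs, per allele": -0.25,
    "Amiodarone": -0.34,
    "White race": -0.123,
}
implied = {name: round(percent_change(b)) for name, b in rows.items()}
```

Under this assumption the three coefficients give -22%, -29%, and -12%, matching the percentages in the table for those rows. Rows reported per decade, per SD, or per 0.5 increase would need the coefficient rescaled to the stated unit before the same conversion applies.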
