Linear regression Brian Healy, PhD BIO203
Previous classes • Hypothesis testing • Parametric • Nonparametric • Correlation
What are we doing today? • Linear regression • Continuous outcome with continuous, dichotomous or categorical predictor • Equation: y = b0 + b1x1 • Interpretation of coefficients • Connection between regression and • correlation • t-test • ANOVA
Big picture • Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome. • Extensions of linear regression allow • Dichotomous outcomes- logistic regression • Survival analysis- Cox proportional hazards regression • Repeated measures • Amazingly, many of the analyses we have learned can be completed using linear regression
Example • Yesterday, we investigated the association between age and BPF using a correlation coefficient • Can we fit a line to this data?
Quick math review • As you remember from high school math, the basic equation of a line is given by y=mx+b where m is the slope and b is the y-intercept • One definition of m is that for every one unit increase in x, there is an m unit increase in y • One definition of b is the value of y when x is equal to zero
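The slope and intercept definitions above can be checked with a tiny sketch (the line y = 2x + 1 here is just an illustration, not course data):

```python
# a line y = m*x + b: the slope m is the change in y per one-unit increase in x
def line(x, m=2.0, b=1.0):
    return m * x + b

rise = line(4) - line(3)   # a one-unit increase in x raises y by m = 2.0
intercept = line(0)        # b = 1.0, the value of y when x equals zero
```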
Picture • Look at the data in this picture • Does there seem to be a correlation (linear relationship) in the data? • Is the data perfectly linear? • Could we fit a line to this data?
What is linear regression? • Linear regression tries to find the best line (curve) to fit the data • The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points
How do we find the best line? • Let’s look at three candidate lines • Which do you think is the best? • What is a way to determine the best line to use?
Residuals • The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is yi = b0 + b1xi + ei, where ei is the deviation from the population line. This deviation is called the residual (see picture: it is the distance from the line for patient 1, e1)
Least squares • The method employed to find the best line is called least squares. This method finds the values of b0 and b1 that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the ei^2
Estimates of regression coefficients • Once we have solved the least squares equation, we obtain estimates for the b's, which we refer to as b0hat and b1hat • The final least squares equation is yhat = b0hat + b1hat*x1, where yhat is the estimated mean value of y for a value of x1
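As a sketch of what least squares produces, the simple-regression estimates have a closed form: b1hat = Sxy/Sxx and b0hat = ybar - b1hat*xbar. Illustrated on made-up data (not the BPF data):

```python
# minimal least squares sketch on made-up data (roughly y = 2x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
# slope: Sxy / Sxx, the value that minimizes the sum of squared residuals e_i^2
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
b0_hat = ybar - b1_hat * xbar   # intercept: the line passes through (xbar, ybar)
```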
Assumptions of linear regression • Linearity • Linear relationship between outcome and predictors • E(Y|X=x) = b0 + b1x1 + b2x2^2 is still a linear regression equation because each of the b's is to the first power • Normality of the residuals • The residuals, ei, are normally distributed, N(0, s^2) • Homoscedasticity of the residuals • The residuals, ei, have the same variance • Independence • All of the data points are independent • Correlated data points can be taken into account using multivariate and longitudinal data methods
Linearity assumption • One of the assumptions of linear regression is that the relationship between the predictors and the outcomes is linear • We call this the population regression line E(Y | X=x) = mu_{y|x} = b0 + b1x • This equation says that the mean of y given a specific value of x is defined by the b coefficients • The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before
Normality and homoscedasticity assumption • Two other assumptions of linear regression are related to the ei's • Normality- the distribution of the residuals is normal • Homoscedasticity- the variance of y given x is the same for all values of x (picture: the distribution of y-values at each value of x is normal with the same variance)
Example • Here is a regression equation for the comparison of age and BPF
Results • The estimated regression equation: BPFhat = 0.957 - 0.0029*age
(STATA output: estimated slope and estimated intercept)
Interpretation of regression coefficients • The final regression equation is BPFhat = 0.957 - 0.0029*age • The coefficients mean • the estimate of the mean BPF for a patient with an age of 0 is 0.957 (b0hat) • an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (b1hat)
Unanswered questions • Is the estimate of b1 (b1hat) significantly different than zero? In other words, is there a significant relationship between the predictor and the outcome? • Have the assumptions of regression been met?
Estimate of variance for the bhat's • In order to determine if there is a significant association, we need an estimate of the variance of b0hat and b1hat • s_{y|x} is the residual standard deviation of y after accounting for x (the standard deviation from regression, or root mean square error)
Test statistic • For both regression coefficients, we use a t-statistic to test any specific hypothesis • Each has n-2 degrees of freedom (the sample size minus the number of parameters estimated) • What is the usual null hypothesis for b1?
Hypothesis test • H0: b1=0 • Continuous outcome, continuous predictor • Linear regression • Test statistic: t=-3.67 (27 dof) • p-value=0.0011 • Since the p-value is less than 0.05, we reject the null hypothesis • We conclude that there is a significant association between age and BPF
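A minimal sketch of how this t-statistic is built: b1hat divided by its standard error s_{y|x}/sqrt(Sxx). The ages and BPF values below are simulated for illustration, not the study's data:

```python
import math
import random

random.seed(0)
n = 29   # same sample size as the example, so n - 2 = 27 dof
ages = [20 + 40 * random.random() for _ in range(n)]
# hypothetical BPF values from a true line 0.96 - 0.003*age plus noise
bpf = [0.96 - 0.003 * a + random.gauss(0, 0.02) for a in ages]

xbar = sum(ages) / n
ybar = sum(bpf) / n
sxx = sum((x - xbar) ** 2 for x in ages)
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(ages, bpf)) / sxx
b0_hat = ybar - b1_hat * xbar

# residual standard deviation s_{y|x}, using n - 2 degrees of freedom
resid = [y - (b0_hat + b1_hat * x) for x, y in zip(ages, bpf)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

se_b1 = s / math.sqrt(sxx)
t_stat = b1_hat / se_b1   # compare to a t distribution with n - 2 dof
```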
(STATA output: estimated slope, p-value for slope, and estimated intercept)
Comparison to correlation • In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation • We get the same p-value!! • Our conclusion is exactly the same!! • There are other relationships we will see later
Confidence interval for b1 • As we have done previously, we can construct a confidence interval for the regression coefficients • Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution • The interpretation of the confidence interval is the same as we have seen previously
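Using the slide's reported slope and t-statistic, the 95% confidence interval can be sketched as follows. The standard error is implied by b1hat/t, and 2.052 is the two-sided 95% t cutoff for 27 dof taken from a t table:

```python
b1_hat = -0.0029           # estimated slope from the slides
t_stat = -3.67             # reported t statistic, 27 dof
se_b1 = b1_hat / t_stat    # implied standard error, about 0.00079
t_crit = 2.052             # 95% t cutoff with 27 dof (instead of 1.96)

ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
# the interval excludes 0, consistent with the p-value of 0.0011
```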
Intercept • STATA also provides a test statistic and p-value for the estimate of the intercept • This is for H0: b0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at an age of 0 • Since BPF can't be 0 at age 0, this test is not really of interest • We can center covariates to make this test meaningful
Prediction • Beyond determining if there is a significant association, linear regression can also be used to make predictions • Using the regression equation, we can predict the BPF for patients with specific age values • Ex. A patient with age=40: BPFhat = 0.957 - 0.0029*40 = 0.841 • The expected BPF for a patient of age 40 based on our experiment is 0.841
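The age-40 prediction can be reproduced directly from the estimated equation:

```python
b0_hat, b1_hat = 0.957, -0.0029   # estimates from the fitted regression

def predict_bpf(age):
    """Estimated mean BPF at a given age, from BPFhat = b0hat + b1hat*age."""
    return b0_hat + b1_hat * age

bpf_40 = predict_bpf(40)   # 0.957 - 0.0029*40 = 0.841
```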
Extrapolation • Can we predict the BPF for a patient with age 80? What assumption would we be making?
Confidence interval for prediction • We can place a confidence interval around our predicted mean value • This corresponds to the plausible values for the mean BPF at a specific age • To calculate a confidence interval for the predicted mean value, we need an estimate of variability in the predicted mean
Confidence interval • Note that the standard error has a different magnitude depending on the x value. In particular, the magnitude is smallest when x equals the mean of x • Since the test statistic is based on the t-distribution, our confidence interval is yhat ± t_{n-2} * se(yhat) • This confidence interval is rarely used for hypothesis testing because we rarely have a specific hypothesized value for the mean of y at a given x
Prediction interval • A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a given sample size • Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval • What would you estimate as the value for a single new observation? • Do you think a prediction interval is narrower or wider?
Prediction interval • Confidence intervals are always tighter than prediction intervals • The variability in the prediction of a single observation contains two types of variability • Variability of the estimate of the mean (confidence interval) • Variability around the estimate of the mean (residual variability)
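The two sources of variability show up in the standard error formulas: the standard error for the mean at x0 uses 1/n + (x0-xbar)^2/Sxx, while the prediction standard error adds 1 inside the square root for the residual variability. A sketch on made-up data:

```python
import math

# made-up data to compare the two standard errors at a point x0
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x0 = 3.0
# variability of the estimated mean at x0 (confidence interval)
se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
# adds the residual variability around that mean (prediction interval)
se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
```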
Conclusions • The prediction interval is always wider than the confidence interval • It is common to find significant differences between groups but not be able to predict very accurately for individuals • To predict accurately for a single patient, we need limited overlap of the distributions. Increasing the sample size decreases the standard error of the mean, but it does not reduce the residual variability, so it does not help here
How good is our model? • Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data • R2=coefficient of determination=proportion of variance in the outcome explained by the model • When we have only one predictor, it is the proportion of the variance in y explained by x
R2 • What if all of the variability in y was explained by x? • What would R2 equal? • What does this tell you about the correlation between x and y? • What if the correlation between x and y is negative? • What if none of the variability in y is explained by x? • What would R2 equal? • What is the correlation between x and y in this case?
r vs. R2 • R2 = (Pearson's correlation coefficient)^2 = r^2 • Since |r| is at most 1, R2 is always less than or equal to |r| • r=0.1, R2=0.01 • r=0.5, R2=0.25
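The identity R2 = r^2 for simple linear regression can be verified numerically on made-up data:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy     # proportion of variance in y explained by x
r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
# r_squared equals r**2 (up to floating point) with a single predictor
```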
Evaluation of model • Linear regression requires several assumptions • Linearity • Homoscedasticity • Normality • Independence-usually from study design • We must determine if the model assumptions were reasonable, or a different model may be needed • Statistical research has investigated relaxing each of these assumptions
Scatter plot • A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see • Are there any outliers? • Is the relationship between x and y approximately linear? • Is the variance in the data approximately constant for all values of x?
Tests for the assumptions • There are several different ways to test the assumptions of linear regression • Graphical • Statistical • Many of the tests use the residuals, which are the distances between the fitted line and the observed outcomes
Residual plot: if the assumptions of linear regression are met, we will observe a random scatter of points
Investigating linearity • Scatter plot of predictor vs outcome • What do you notice here? • One way to handle this is to transform the predictor to include a quadratic or other term
Aging • Research has shown that the decrease in BPF in normal people is fairly slow up until age 65, and then there is a steeper drop
Fitted line: note how the majority of the values are above the fitted line in the middle and below the fitted line at the two ends
What if we fit a line for this? • Residual plot shows a non-random scatter because the relationship is not really linear
What can we do? • If the relationship between x and y is not linear, we can try a transformation of the values • Possible transformations • Add a quadratic term • Fit a spline. This is when there is a slope for a certain part of the curve and a different slope for the rest of the curve
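A sketch of the two transformations, using a hypothetical knot at age 65 (a choice motivated by the aging slide, not a fitted value): each function returns the predictor columns you would feed into the regression.

```python
# sketch: transformed predictor terms for a nonlinear age effect
def quadratic_terms(age):
    return (age, age ** 2)               # regress on age and age^2

def linear_spline_terms(age, knot=65.0):
    # hinge term is 0 before the knot, so the slope before the knot is b1
    # and after the knot it becomes b1 + b2
    return (age, max(0.0, age - knot))
```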