
# Biostat Review - PowerPoint PPT Presentation



## Biostat Review

November 29, 2012

### Objectives

• Review hw#8

• Review of last two lectures

• Linear regression

• Simple and multiple

• Logistic regression

### Simple linear regression

• The objective of regression analysis is to predict or estimate the value of the response (outcome) that is associated with a fixed value of the explanatory variable (predictor).

### Simple linear regression

• The regression line equation is $\mu_{y|x} = \alpha + \beta x$

• The “best” line is the one with the α and β that minimize the sum of the squared residuals, $\sum e_i^2$ (hence the name “least squares”)

• The slope β is the change in the mean value of y that corresponds to a one-unit increase in x
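For reference, the least-squares criterion and its closed-form solution for a single predictor can be written out as below. This is the standard textbook result rather than something shown on the slides; $\bar{x}$ and $\bar{y}$ denote the sample means and $e_i$ the residuals.

```latex
\begin{aligned}
(\hat{\alpha}, \hat{\beta}) &= \arg\min_{\alpha,\beta} \sum_{i=1}^{n} e_i^2,
  \qquad e_i = y_i - (\alpha + \beta x_i) \\
\hat{\beta} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
  \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
\end{aligned}
```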

### Assumptions of the linear model

• conditional mean of the outcome is linear

• observed outcomes are independent

• residuals (ε) are normally distributed with mean 0

• constant variance (σ²)

• predictors are measured without error

### Simple linear regression example: Regression of FEV on age

$\widehat{\text{FEV}} = \hat{\alpha} + \hat{\beta}\,\text{age}$

regress yvar xvar

    . regress fev age

          Source |       SS       df       MS              Number of obs =     654
    -------------+------------------------------           F(  1,   652) =  872.18
           Model |  280.919154     1  280.919154           Prob > F      =  0.0000
        Residual |  210.000679   652  .322086931           R-squared     =  0.5722
           Total |  490.919833   653  .751791475           Root MSE      =  .56753

    ------------------------------------------------------------------------------
             fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
           _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
    ------------------------------------------------------------------------------

β̂ = Coef. for age

α̂ = _cons (short for constant)

### Interpretation of coefficients

β̂ = Coef. for age

• For every one-unit (one-year) increase in age, mean FEV increases by 0.22

α̂ = _cons (short for constant)

• When age = 0, the estimated mean FEV is 0.432; this is the y-intercept of the fitted line

### Model Fit

• R² represents the proportion of the variability in y that is explained by the regression on x

• Remember that R² tells us the fit of the model, with values closer to 1 indicating a better fit

    . regress fev age

          Source |       SS       df       MS              Number of obs =     654
    -------------+------------------------------           F(  1,   652) =  872.18
           Model |  280.919154     1  280.919154           Prob > F      =  0.0000
        Residual |  210.000679   652  .322086931           R-squared     =  0.5722
           Total |  490.919833   653  .751791475           Root MSE      =  .56753

    ------------------------------------------------------------------------------
             fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
           _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
    ------------------------------------------------------------------------------

√R² = .75652 (with a single predictor, this is the magnitude of the correlation between FEV and age)
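As a check on the output above, R² can be recomputed from the sums of squares in the ANOVA table. This is a worked calculation from the printed numbers, where SS stands for the "SS" column of the Stata output.

```latex
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
    = \frac{280.919154}{490.919833} \approx 0.5722,
\qquad
R^2 = 1 - \frac{SS_{\text{Residual}}}{SS_{\text{Total}}}
    = 1 - \frac{210.000679}{490.919833} \approx 0.5722
```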

### Model fit

• Residuals are the differences between the observed y values and the fitted regression line at each value of x ($y_i - \hat{y}_i$)

• If all the points lie along a straight line, the residuals are all 0

• If there is a lot of variability at each level of x, the residuals are large

• The sum of the squared residuals is what was minimized in the least squares method of fitting the line

### Use of residual plots for model fit

• A residual plot is a scatter plot

• Y-axis: residuals

• X-axis: fitted values of the outcome

• Stata code to get the residual plot:

    regress fev age
    rvfplot
    rvfplot, title(Fitted values versus residuals for regression of FEV on age)

### Why look at residual plot

• The spread of the residuals increases as the fitted FEV values increase, suggesting heteroscedasticity

• Heteroscedasticity reduces the precision of the estimates (and hence reduces power); it makes your standard errors larger

• Homoscedasticity: constant variability across all values of x (same standard deviation of y at each value of x); this is the constant variance (σ²) assumption

### Residual plots

• Of note

• rvfplot gives you residuals vs. fitted values (of the outcome)

• rvpplot ht gives you residuals vs. a predictor (here ht)
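A minimal runnable sketch of the two residual-plot commands. The FEV data from the slides are not shipped with Stata, so this uses the built-in `auto` dataset as a stand-in (an assumption for illustration only; `price` and `mpg` take the places of `fev` and `ht`).

```stata
* Stand-in data: Stata's shipped auto dataset, NOT the FEV data from the slides
sysuse auto, clear

* Fit the model first; both plot commands use the most recent regress results
regress price mpg

* Residuals (y-axis) against fitted values (x-axis)
rvfplot, yline(0)

* Residuals (y-axis) against one named predictor (x-axis)
rvpplot mpg, yline(0)
```

A fan shape in either plot, with the spread of the residuals growing from left to right, is the pattern the slides describe as heteroscedasticity.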

### Data transformation

• If your data show heteroscedasticity, you can transform your data

• Something to note

• Transforming your data does not inherently change your data

• Log transformation is the most common way to deal with heteroscedasticity
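A sketch of how a log-transformed outcome might be created and checked, again using the shipped `auto` dataset as a stand-in for the FEV data (an assumption; the variable name `ln_price` is made up here, mirroring the `ln_fev` variable used in the slides).

```stata
* Stand-in data: Stata's shipped auto dataset, NOT the FEV data from the slides
sysuse auto, clear

* Natural-log transform of the outcome (ln() is Stata's natural log)
generate ln_price = ln(price)

* Refit with the transformed outcome and look at the residual plot again
regress ln_price mpg
rvfplot, yline(0)
```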

### Log transformation of FEV data

• Do we still have heteroscedasticity?

### Log transformation Stata output

    . regress ln_fev age

          Source |       SS       df       MS              Number of obs =     654
    -------------+------------------------------           F(  1,   652) =  961.01
           Model |  43.2100544     1  43.2100544           Prob > F      =  0.0000
        Residual |  29.3158601   652  .044962976           R-squared     =  0.5958
           Total |  72.5259145   653  .111065719           Root MSE      =  .21204

    ------------------------------------------------------------------------------
          ln_fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0870833   .0028091    31.00   0.000     .0815673    .0925993
           _cons |    .050596    .029104     1.74   0.083    -.0065529    .1077449
    ------------------------------------------------------------------------------

### Interpretation of regression coefficients for transformed y value

• The regression equation is:

$\widehat{\ln(\text{FEV})} = \hat{\alpha} + \hat{\beta}\,\text{age} = 0.051 + 0.087\,\text{age}$

• So a one-year change in age corresponds to a 0.087 change in ln(FEV)

• The change is on a multiplicative scale, so if you exponentiate, you get a percent change in y

• $e^{0.087} \approx 1.09$, so a one-year increase in age corresponds to a 9% increase in FEV
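The step from an additive change in ln(FEV) to a percent change in FEV can be spelled out as follows. This is a worked version of the argument above, using the age coefficient 0.0870833 from the printed output.

```latex
\begin{aligned}
\ln\widehat{\text{FEV}}(\text{age}+1) - \ln\widehat{\text{FEV}}(\text{age}) &= \hat{\beta} \\
\frac{\widehat{\text{FEV}}(\text{age}+1)}{\widehat{\text{FEV}}(\text{age})} &= e^{\hat{\beta}}
  = e^{0.0870833} \approx 1.091
\end{aligned}
```

Each additional year of age therefore multiplies the estimated FEV by about 1.09, which is the 9% increase quoted above.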

### Categorical variable/predictor

• Previous example was of a predictor that was continuous

• Can also perform regression with a categorical predictor/variable

• If dichotomous

• By convention, code the two categories as 0 and 1

• E.g., sex is dichotomous: 0 for female, 1 for male

### Categorical independent variable

• Remember that the regression equation is

$\mu_{y|x} = \alpha + \beta x$

• The only values x can take are 0 and 1

• $\mu_{y|0} = \alpha$ and $\mu_{y|1} = \alpha + \beta$

• So the estimated mean FEV for females is $\hat{\alpha}$ and the estimated mean FEV for males is $\hat{\alpha} + \hat{\beta}$

• Testing the null hypothesis that β = 0 asks whether the two group means are equal

• Similar to a two-sample t-test
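The link between a regression on a 0/1 predictor and a t-test can be checked directly. A minimal sketch, using Stata's shipped `auto` dataset as a stand-in for the FEV data (an assumption; `foreign` is a 0/1 variable playing the role of sex, and `t_reg` is a scalar name made up here). The comparison is with the equal-variance two-sample t-test, which is what `ttest` runs by default.

```stata
* Stand-in data: Stata's shipped auto dataset, NOT the FEV data from the slides
sysuse auto, clear

* Regression on a dichotomous predictor
regress mpg i.foreign
scalar t_reg = _b[1.foreign] / _se[1.foreign]

* Equal-variance two-sample t-test on the same variables
ttest mpg, by(foreign)

* Same magnitude; the sign differs only because ttest subtracts in the other order
display "regression t = " t_reg "    t-test t = " r(t)
```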

### Categorical variable/predictor

• What if you have more than two categories within a predictor (non-dichotomous)?

• One is set to be the reference category.

### Categorical independent variables

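• Then the regression equation is:

$y = \alpha + \beta_1 x_{\text{Asian/PI}} + \beta_2 x_{\text{Other}} + \varepsilon$

• For race group = White (reference)

$\hat{y} = \hat{\alpha} + \hat{\beta}_1 \cdot 0 + \hat{\beta}_2 \cdot 0 = \hat{\alpha}$

• For race group = Asian/PI

$\hat{y} = \hat{\alpha} + \hat{\beta}_1 \cdot 1 + \hat{\beta}_2 \cdot 0 = \hat{\alpha} + \hat{\beta}_1$

• For race group = Other

$\hat{y} = \hat{\alpha} + \hat{\beta}_1 \cdot 0 + \hat{\beta}_2 \cdot 1 = \hat{\alpha} + \hat{\beta}_2$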

### Categorical independent variables

• In Stata, you just add the prefix “i.” (i.variable) to identify a variable as categorical

• Stata takes the lowest number as the reference group

• You can change this with the prefix “b#.” (commonly written ib#.variable), where # is the numeric value of the group that you want to be the reference group.
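A short sketch of the factor-variable prefixes, using the shipped `auto` dataset as a stand-in (an assumption; `rep78` is a categorical variable coded 1 to 5 and takes the place of the race variable from the slides).

```stata
* Stand-in data: Stata's shipped auto dataset, NOT the data from the slides
sysuse auto, clear

* Default: the lowest value (rep78 = 1) is the reference category
regress price i.rep78

* Make rep78 = 3 the reference category instead
regress price ib3.rep78
```

The fitted values are identical in both models; only the comparison group that each coefficient refers to changes.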

### Multiple regression

• Additional explanatory variables might add to our understanding of a dependent variable

• We can posit the population equation

$\mu_{y|x_1,x_2,\ldots,x_q} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$

• α is the mean of y when all the explanatory variables are 0

• $\beta_i$ is the change in the mean value of y that corresponds to a 1 unit change in $x_i$ when all the other explanatory variables are held constant

### Multiple regression

• regress outcomevar predictorvar1 predictorvar2…

### Multiple regression

    . regress fev age ht

          Source |       SS       df       MS              Number of obs =     654
    -------------+------------------------------           F(  2,   651) = 1067.96
           Model |  376.244941     2  188.122471           Prob > F      =  0.0000
        Residual |  114.674892   651  .176151908           R-squared     =  0.7664
           Total |  490.919833   653  .751791475           Root MSE      =   .4197

    ------------------------------------------------------------------------------
             fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0542807   .0091061     5.96   0.000     .0363998    .0721616
              ht |   .1097118   .0047162    23.26   0.000      .100451    .1189726
           _cons |  -4.610466   .2242706   -20.56   0.000    -5.050847   -4.170085
    ------------------------------------------------------------------------------

• R² never decreases as you add more variables to the model

• The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters

• Note that the beta for age decreased (from 0.222 in the age-only model to 0.054 once height is included)
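The printed outputs do not include the Adj R-squared line, so the values below are calculated here from the mean squares (the "MS" column) rather than read from the slides. This assumes the usual definition of adjusted R², with n observations and k predictors.

```latex
\begin{aligned}
R^2_{\text{adj}} &= 1 - \frac{SS_{\text{Residual}}/(n-k-1)}{SS_{\text{Total}}/(n-1)}
                  = 1 - \frac{MS_{\text{Residual}}}{MS_{\text{Total}}} \\
\text{fev on age:} \quad R^2_{\text{adj}} &= 1 - \frac{0.322086931}{0.751791475} \approx 0.5716 \\
\text{fev on age, ht:} \quad R^2_{\text{adj}} &= 1 - \frac{0.176151908}{0.751791475} \approx 0.7657
\end{aligned}
```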

### How do you interpret the coefficients?

• Age

• When height is held constant, for every 1 unit (in this case year) increase in age, mean FEV increases by 0.054 units

### You can fit both continuous and categorical predictors

    . regress fev age smoke

          Source |       SS       df       MS              Number of obs =     654
    -------------+------------------------------           F(  2,   651) =  443.25
           Model |  283.058247     2  141.529123           Prob > F      =  0.0000
        Residual |  207.861587   651  .319295832           R-squared     =  0.5766
           Total |  490.919833   653  .751791475           Root MSE      =  .56506

    ------------------------------------------------------------------------------
             fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .2306046   .0081844    28.18   0.000     .2145336    .2466755
           smoke |  -.2089949   .0807453    -2.59   0.010    -.3675476   -.0504421
           _cons |   .3673731   .0814357     4.51   0.000     .2074647    .5272814
    ------------------------------------------------------------------------------

• The model is $\widehat{\text{FEV}} = \hat{\alpha} + \hat{\beta}_1\,\text{age} + \hat{\beta}_2 X_{\text{smoke}}$

• So for non-smokers, we have $\widehat{\text{FEV}} = \hat{\alpha} + \hat{\beta}_1\,\text{age}$ (because $X_{\text{smoke}} = 0$)

• For smokers, $\widehat{\text{FEV}} = \hat{\alpha} + \hat{\beta}_1\,\text{age} + \hat{\beta}_2$ (because $X_{\text{smoke}} = 1$)

• So $\hat{\beta}_2$ is the mean difference in FEV for smokers versus non-smokers at each age

• When you have one continuous variable and one dichotomous variable, you can think of fitting two lines that only differ in y-intercept by the coefficient of the dichotomous variable (in this case smoke)

• E.g., $\hat{\beta}_2 = -0.209$
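The two parallel lines can be made concrete by plugging the printed coefficients into the model at a single age. A small sketch; the scalar names `alpha`, `beta1`, and `beta2` are made up here, and age 10 is an arbitrary illustrative value.

```stata
* Coefficients copied from the "regress fev age smoke" output above
scalar alpha = .3673731
scalar beta1 = .2306046
scalar beta2 = -.2089949

* Estimated mean FEV at age 10 for a non-smoker and for a smoker
display "non-smoker, age 10: " alpha + beta1*10
display "smoker,     age 10: " alpha + beta1*10 + beta2
```

Whatever age is plugged in, the two results differ by the smoke coefficient, -0.209.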

### Linear regression summary

• Intercept is the mean value of the outcome for an individual with all predictor values equal to zero

• Slope (simple regression): mean change in the outcome per unit change in the predictor

• Slope (multiple regression): mean change in the outcome per unit change in the predictor, holding other variables constant

• R-squared is the proportion of total variance in the outcome explained by the regression model

• Adjusted R-squared accounts for the number of predictors in the model

### Logistic regression

• Linear regression: continuous outcome

• Logistic regression: dichotomous outcome

• E.g., disease vs. no disease, or alive vs. dead

• Model the probability of the disease

### Logistic regression

• Need an equation that will follow rules of probability

• Specifically, the probability needs to be between 0 and 1

• A model of the form p = α + βx could take on negative values or values greater than 1

• $p = e^{\alpha + \beta x}$ is an improvement because it cannot be negative, but it could still be greater than 1

### Logistic regression

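• The logistic function $p = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$ always lies between 0 and 1

• This function = 0.5 when α + βx = 0

• When β > 0, the function models the probability as increasing slowly at low values of x, then rising steeply, then leveling off again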

### Logistic regression

• $\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$

• So instead of assuming that the relationship between x and p is linear, we are assuming that the relationship between ln(p/(1-p)) and x is linear.

• ln(p/(1-p)) is called the logit function

• It is a transformation

• While the relationship between p and x is not linear, the other side of the equation, α + βx, is linear in x
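Solving the logit equation for p shows why the fitted probability always stays between 0 and 1. This is a standard algebraic step, written out here for reference.

```latex
\begin{aligned}
\ln\!\left(\frac{p}{1-p}\right) &= \alpha + \beta x \\
\frac{p}{1-p} &= e^{\alpha + \beta x} \\
p &= \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{1}{1 + e^{-(\alpha + \beta x)}}
\end{aligned}
```

When $\alpha + \beta x = 0$ this gives $p = 1/2$, and because $e^{\alpha + \beta x}$ is always positive, p can never reach 0 or 1.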

### Logistic regression

• Stata code

• logistic outcomevar predictorvar1 predictorvar2…, coef

• The coef option gives you the coefficient, β

• This β is the natural log of the odds ratio, ln(OR)

• To get the odds ratio, raise e to the power β

• Odds ratio = $e^{\beta}$

• Or you could just use this Stata code instead (omit the coef option)

• logistic outcomevar predictorvar1 predictorvar2…

### Interpret these coefficients

    . logistic coldany i.rested_mostly, coef

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(1)      =      19.71
                                                      Prob > chi2     =     0.0000
    Log likelihood =  -323.5717                       Pseudo R2       =     0.0296

    ------------------------------------------------------------------------------
         coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1.rested_m~y |  -.9343999   .2187794    -4.27   0.000      -1.3632   -.5056001
           _cons |  -.2527658   .1077594    -2.35   0.019    -.4639704   -.0415612
    ------------------------------------------------------------------------------

### Interpret these coefficients

• Cold data (from previous slide)

• β = -0.934

• The natural log of the odds ratio of getting a cold, comparing someone who was rested with someone who was not rested, is -0.934

• If you exponentiate it ($e^{-0.934}$), you get 0.39

• Therefore, another way of interpreting this is that the odds of getting a cold for someone who was rested are 0.39 times the odds for someone who was not rested
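The exponentiation can be done with Stata's `display` command. A small sketch using the coefficients from the `coef` output; the scalar names are made up here, and the last two lines assume, as the interpretation above does, that 0 codes "not rested" and 1 codes "rested".

```stata
* Coefficients copied from the "logistic coldany i.rested_mostly, coef" output
scalar b_rested = -.9343999
scalar b_cons   = -.2527658

* Odds ratio for rested vs. not rested
display exp(b_rested)

* Implied probability of a cold: not rested, then rested
display invlogit(b_cons)
display invlogit(b_cons + b_rested)
```

The first line gives an odds ratio of about 0.39; the implied probabilities of a cold are roughly 0.44 for those not rested and 0.23 for those rested.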

### Or get Stata to calculate the odds ratio for you!

logistic depvar indepvar

    . logistic coldany i.rested_mostly

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(1)      =      19.71
                                                      Prob > chi2     =     0.0000
    Log likelihood =  -323.5717                       Pseudo R2       =     0.0296

    ------------------------------------------------------------------------------
         coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1.rested_m~y |   .3928215   .0859413    -4.27   0.000     .2558409    .6031435
    ------------------------------------------------------------------------------

Odds ratio = $e^{\beta} = e^{-0.934} \approx 0.39$

### Interpretation when you have a continuous variable

    . logistic coldany age

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(1)      =      23.77
                                                      Prob > chi2     =     0.0000
    Log likelihood = -322.05172                       Pseudo R2       =     0.0356

    ------------------------------------------------------------------------------
         coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9624413   .0081519    -4.52   0.000     .9465958    .9785521
    ------------------------------------------------------------------------------

• Interpretation of the coefficients: the odds ratio is for a one-unit change in the predictor

• For this example, 0.962 is the odds ratio for a one-year difference in age

### Continuous explanatory variable

• To find the OR for a 10-year change in age

    . logistic coldany age, coef

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(1)      =      23.77
                                                      Prob > chi2     =     0.0000
    Log likelihood = -322.05172                       Pseudo R2       =     0.0356

    ------------------------------------------------------------------------------
         coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |  -.0382822     .00847    -4.52   0.000    -.0548831   -.0216813
           _cons |    .906605   .3167295     2.86   0.004     .2858265    1.527383
    ------------------------------------------------------------------------------

OR for a 10-year change in age = exp(10 × -.0382822) = 0.682

### Or you can also generate a new variable

• To find the OR for a 10-year change in age

    . gen age_10=age/10
    (2 missing values generated)

    . logistic coldany age_10

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(1)      =      23.77
                                                      Prob > chi2     =     0.0000
    Log likelihood = -322.05172                       Pseudo R2       =     0.0356

    ------------------------------------------------------------------------------
         coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          age_10 |   .6819344   .0577599    -4.52   0.000     .5776247    .8050807
    ------------------------------------------------------------------------------

This is nice because Stata will calculate your confidence interval as well!
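The rescaled model and the hand calculation agree, which can be confirmed by exponentiating ten times the per-year coefficient and its confidence limits. A small sketch; the scalar names are made up here, and the numbers are copied from the `logistic coldany age, coef` output.

```stata
* Per-year log odds ratio and its 95% confidence limits (from the coef output)
scalar b_age  = -.0382822
scalar lo_age = -.0548831
scalar hi_age = -.0216813

* Odds ratio and confidence limits for a 10-year difference in age
display exp(10*b_age)
display exp(10*lo_age)
display exp(10*hi_age)
```

These reproduce the age_10 row above (about 0.682, with limits of about 0.578 and 0.805).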

### Interpret this output

    . logistic coldany age_10 i.smoke

    Logistic regression                               Number of obs   =        504
                                                      LR chi2(2)      =      23.89
                                                      Prob > chi2     =     0.0000
    Log likelihood = -321.99014                       Pseudo R2       =     0.0358

    ------------------------------------------------------------------------------
         coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          age_10 |   .6835216   .0580647    -4.48   0.000     .5786864     .807349
         1.smoke |   1.128027   .3863511     0.35   0.725     .5764767     2.20728
    ------------------------------------------------------------------------------

### Correct interpretations

For this example the 0.684 is the odds ratio for a ten-year difference in age when you hold smoking status constant

1.13 is the odds ratio for smoking when you hold age constant

    . logistic sex fev

    Logistic regression                               Number of obs   =        654
                                                      LR chi2(1)      =      29.18
                                                      Prob > chi2     =     0.0000
    Log likelihood = -438.47993                       Pseudo R2       =     0.0322

    ------------------------------------------------------------------------------
             sex | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             fev |   1.660774   .1617468     5.21   0.000     1.372176     2.01007
           _cons |    .279198   .0742534    -4.80   0.000     .1657805    .4702094
    ------------------------------------------------------------------------------

• The z (Wald) test statistic in the logistic results is the ratio of the estimated regression coefficient for the predictor (fev) to its standard error, and follows (approximately) a standard normal distribution

• The log-likelihood is a measure of support of the data for the model (the larger the likelihood and/or log-likelihood, the better the support)

• The statistic "chi2" is the likelihood ratio statistic for comparing this model, which includes fev, with the simpler model containing no predictors
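The z statistic can be recovered from the odds-ratio output by moving back to the coefficient scale. This sketch assumes that the standard error Stata prints next to an odds ratio is the coefficient's standard error multiplied by the odds ratio, so dividing by the odds ratio undoes it; $\hat{b}$ denotes the log odds ratio for fev.

```latex
\begin{aligned}
\hat{b} &= \ln(1.660774) \approx 0.5073 \\
SE(\hat{b}) &\approx \frac{SE(\widehat{OR})}{\widehat{OR}} = \frac{0.1617468}{1.660774} \approx 0.0974 \\
z &= \frac{\hat{b}}{SE(\hat{b})} \approx \frac{0.5073}{0.0974} \approx 5.21
\end{aligned}
```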

### Summary Logistic regression

• The log-odds of the outcome is linear in x, with intercept α and slope β1.

• The "intercept" coefficient α gives the log-odds of the outcome for x = 0.

• The "slope" coefficient β1 gives the change in log-odds of the outcome for a unit increase in x. This is the log odds ratio associated with a unit increase in x.

• Outcome risk (P) is between 0 and 1 for all values of x