
Biostat Review


November 29, 2012

- Review hw#8
- Review of last two lectures
- Linear regression
- Simple and multiple

- Logistic regression

- The objective of regression analysis is to predict or estimate the value of the response (outcome) that is associated with a fixed value of the explanatory variable (predictor).

- The regression line equation is μy|x = α + βx
- The “best” line is the one whose α and β minimize the sum of the squared residuals Σeᵢ² (hence the name “least squares”)
- The slope β is the change in the mean value of y that corresponds to a one-unit increase in x
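As a rough illustration of the least-squares idea above, the sketch below computes the closed-form estimates (slope = Sxy/Sxx, intercept = ȳ - slope·x̄) in plain Python and checks that nudging the slope makes the sum of squared residuals larger. The data points and the function names are made up for illustration; they are not from the FEV data.

```python
def least_squares_fit(xs, ys):
    """Return (alpha_hat, beta_hat) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def sum_squared_residuals(xs, ys, alpha, beta):
    """Sum of (observed y - fitted y)^2 for the line alpha + beta * x."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (not from the FEV dataset)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = least_squares_fit(xs, ys)
best = sum_squared_residuals(xs, ys, alpha_hat, beta_hat)
nudged = sum_squared_residuals(xs, ys, alpha_hat, beta_hat + 0.1)
print(alpha_hat, beta_hat, best, nudged)
```

Any other choice of intercept and slope should give a sum of squared residuals at least as large as the fitted one.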

- Assumptions of linear regression
- The conditional mean of the outcome is linear in the predictor
- Observed outcomes are independent
- Residuals (ε) are normally distributed with mean 0
- Residuals have constant variance (σ²)
- Predictors are measured without error

regress yvar xvar

. regress fev age

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =  872.18
       Model |  280.919154     1  280.919154           Prob > F      =  0.0000
    Residual |  210.000679   652  .322086931           R-squared     =  0.5722
-------------+------------------------------           Adj R-squared =  0.5716
       Total |  490.919833   653  .751791475           Root MSE      =  .56753

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
       _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
------------------------------------------------------------------------------

β̂ = Coef. for age

- For every one-unit (one-year) increase in age, mean FEV increases by 0.22

α̂ = _cons (short for constant)

- When age = 0, the estimated mean FEV is 0.432 (the y-intercept of the fitted line)

- R² represents the proportion of the variability in the outcome that is explained by the regression on x
- R² describes the fit of the model: values closer to 1 indicate a better fit
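To make the link between the ANOVA table and the summary statistics concrete, the sketch below recomputes R-squared, adjusted R-squared, the F statistic, and Root MSE from the sums of squares in the regress fev age output. The variable names are my own; the formulas are the standard ones, and the results should agree with the Stata output up to rounding.

```python
import math

# Sums of squares and degrees of freedom copied from "regress fev age"
ss_model, df_model = 280.919154, 1
ss_resid, df_resid = 210.000679, 652
ss_total, df_total = 490.919833, 653

ms_resid = ss_resid / df_resid           # residual mean square
ms_total = ss_total / df_total           # total mean square

r_squared = ss_model / ss_total          # share of variability explained
adj_r_squared = 1 - ms_resid / ms_total  # penalized for number of predictors
f_stat = (ss_model / df_model) / ms_resid
root_mse = math.sqrt(ms_resid)

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(f_stat, 2), round(root_mse, 5))
```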

regress fev age (same output as above): R-squared = 0.5722, Adj R-squared = 0.5716

=.75652

- Residuals are the differences between the observed y values and the regression line at each value of x (yᵢ - ŷᵢ)
- If all the points lie along a straight line, the residuals are all 0
- If there is a lot of variability at each level of x, the residuals are large
- The sum of the squared residuals is what was minimized in the least squares method of fitting the line

- A residual plot is a scatter plot
- Y-axis: residuals
- X-axis: fitted (predicted) values of the outcome

- Stata code to get residual plot:
regress fev age

rvfplot

rvfplot, title(Fitted values versus residuals for regression of FEV on age)

- The spread of the residuals increases as the fitted FEV values increase, suggesting heteroscedasticity
- Heteroscedasticity reduces the precision of the estimates (hence reduces power) and can distort your standard errors
- Homoscedasticity: constant variability of y across all values of x (same standard deviation of y at each value of x); this is the constant variance (σ²) assumption

- Of note
- rvfplot gives you residuals vs. fitted values (outcome)
- rvpplot ht gives you residuals vs. a predictor (here ht)

- So if you have heteroscedasticity in your data, you can transform your data
- Something to note
- Transforming your data changes the scale, not the underlying information

- Log transformation is the most common way to deal with heteroscedasticity

- Do we still have heteroscedasticity?

. regress ln_fev age

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =  961.01
       Model |  43.2100544     1  43.2100544           Prob > F      =  0.0000
    Residual |  29.3158601   652  .044962976           R-squared     =  0.5958
-------------+------------------------------           Adj R-squared =  0.5952
       Total |  72.5259145   653  .111065719           Root MSE      =  .21204

------------------------------------------------------------------------------
      ln_fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0870833   .0028091    31.00   0.000     .0815673    .0925993
       _cons |    .050596    .029104     1.74   0.083    -.0065529    .1077449
------------------------------------------------------------------------------

- The regression equation is:
ln(FEV) = α̂ + β̂ age

= 0.051 + 0.087 age

- So a one-year increase in age corresponds to a 0.087 increase in ln(FEV)
- The change is on a multiplicative scale, so if you exponentiate, you get a percent change in y
- e^0.087 = 1.09, so a one-year increase in age corresponds to a 9% increase in FEV
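The back-transformation above can be checked with a few lines of Python. This is only a sketch: the coefficient and confidence limits are copied from the regress ln_fev age output, and exponentiating the confidence limits to get an interval for the percent change is a standard step that is not shown on the slides.

```python
import math

# Coefficient and 95% CI for age, copied from "regress ln_fev age"
beta_age = 0.0870833
ci_low, ci_high = 0.0815673, 0.0925993

ratio = math.exp(beta_age)        # multiplicative change in FEV per year of age
percent_change = (ratio - 1) * 100

# Exponentiate the CI limits to express the interval as a percent change
percent_ci = tuple((math.exp(b) - 1) * 100 for b in (ci_low, ci_high))

print(round(ratio, 2), round(percent_change, 1), percent_ci)
```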

- The previous example used a continuous predictor
- Can also perform regression with a categorical predictor/variable
- If dichotomous
- By convention, code the two categories as 0 and 1
- E.g., sex is dichotomous: 0 for female, 1 for male

- Remember that the regression equation is
μy|x = α + βx

- The only values x can take are 0 and 1
- μy|0 = α and μy|1 = α + β
- So the estimated mean FEV for females is α̂ and the estimated mean FEV for males is α̂ + β̂
- Testing the null hypothesis that β = 0 is equivalent to a two-sample t-test (assuming equal variances)
- What if you have more than two categories within a predictor (non-dichotomous)?
- One is set to be the reference category.

- Then the regression equation is:
y = α + β1 xAsian/PI + β2 xOther + ε

- For race group = White (reference)
ŷ = α̂ + β̂1·0 + β̂2·0 = α̂

- For race group = Asian/PI
ŷ = α̂ + β̂1·1 + β̂2·0 = α̂ + β̂1

- For race group = Other
ŷ = α̂ + β̂1·0 + β̂2·1 = α̂ + β̂2

- In Stata, prefix the variable name with “i.” (i.variable) to identify it as a categorical variable
- Stata takes the lowest number as the reference group
- You can change this with the prefix “ib#.” (ib#.variable), where # is the numeric value of the group that you want to be the reference group
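The indicator (dummy) coding that the i. prefix sets up can be sketched by hand as follows. The helper names, the sample values, and the coefficients are all hypothetical; the point is only that the reference group gets no indicator and its mean is the intercept.

```python
def indicator_columns(values, reference):
    """Build one 0/1 column per non-reference category."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def predicted_mean(alpha, betas, group):
    """alpha plus the coefficient of the group's indicator (none for the reference)."""
    return alpha + sum(beta for level, beta in betas.items() if level == group)


# Hypothetical race values with White as the reference category
race = ["White", "Asian/PI", "Other", "White", "Asian/PI"]
dummies = indicator_columns(race, reference="White")

# Hypothetical coefficients, for illustration only
alpha_hat = 2.5
beta_hats = {"Asian/PI": -0.3, "Other": 0.1}

for group in ("White", "Asian/PI", "Other"):
    print(group, predicted_mean(alpha_hat, beta_hats, group))
```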

- Additional explanatory variables might add to our understanding of a dependent variable
- We can posit the population equation
μy|x1,x2,...,xq = α + β1x1 + β2x2 + ... + βqxq

- α is the mean of y when all the explanatory variables are 0
- βi is the change in the mean value of y that corresponds to a 1-unit change in xi when all the other explanatory variables are held constant

- Stata command (just add the additional predictors)
- regress outcomevar predictorvar1 predictorvar2…

. regress fev age ht

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) = 1067.96
       Model |  376.244941     2  188.122471           Prob > F      =  0.0000
    Residual |  114.674892   651  .176151908           R-squared     =  0.7664
-------------+------------------------------           Adj R-squared =  0.7657
       Total |  490.919833   653  .751791475           Root MSE      =   .4197

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0542807   .0091061     5.96   0.000     .0363998    .0721616
          ht |   .1097118   .0047162    23.26   0.000      .100451    .1189726
       _cons |  -4.610466   .2242706   -20.56   0.000    -5.050847   -4.170085
------------------------------------------------------------------------------

- R² never decreases as you add more variables to the model
- The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters
- Note that the coefficient for age decreased (from 0.222 to 0.054) once height was added

- Age
- When height is held constant, every 1-unit (in this case 1-year) increase in age corresponds to a 0.054-unit increase in mean FEV
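A small sketch of what "held constant" means in practice: plug the coefficients from regress fev age ht into the fitted equation and compare two predictions that differ only in age. The height value used here is an arbitrary placeholder, in whatever units ht is recorded in the dataset.

```python
# Coefficients copied from "regress fev age ht"
ALPHA_HAT = -4.610466
BETA_AGE = 0.0542807
BETA_HT = 0.1097118


def predicted_fev(age, ht):
    """Fitted mean FEV for a given age and height."""
    return ALPHA_HAT + BETA_AGE * age + BETA_HT * ht


ht_fixed = 60.0  # hypothetical height, same for both predictions

# Two individuals of the same height, one year apart in age
diff = predicted_fev(11, ht_fixed) - predicted_fev(10, ht_fixed)
print(diff)  # equals the age coefficient, whatever ht_fixed is
```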

. regress fev age smoke

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) =  443.25
       Model |  283.058247     2  141.529123           Prob > F      =  0.0000
    Residual |  207.861587   651  .319295832           R-squared     =  0.5766
-------------+------------------------------           Adj R-squared =  0.5753
       Total |  490.919833   653  .751791475           Root MSE      =  .56506

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .2306046   .0081844    28.18   0.000     .2145336    .2466755
       smoke |  -.2089949   .0807453    -2.59   0.010    -.3675476   -.0504421
       _cons |   .3673731   .0814357     4.51   0.000     .2074647    .5272814
------------------------------------------------------------------------------

- The model is fêv = α̂ + β̂1 age + β̂2 Xsmoke
- So for non-smokers, we have fêv = α̂ + β̂1 age (b/c Xsmoke = 0)
- For smokers, fêv = α̂ + β̂1 age + β̂2 (b/c Xsmoke = 1)
- So β̂2 is the mean difference in FEV for smokers versus non-smokers at each age

- When you have one continuous variable and one dichotomous variable, you can think of fitting two parallel lines whose y-intercepts differ by the coefficient of the dichotomous variable (in this case smoke)
- E.g. β̂2 = -0.209: at any given age, the estimated mean FEV for smokers is 0.209 lower than for non-smokers
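The "two parallel lines" picture can be checked numerically. The sketch below uses the coefficients from regress fev age smoke and shows that the gap between the smoker and non-smoker lines is the same at every age; the ages chosen are arbitrary.

```python
# Coefficients copied from "regress fev age smoke"
ALPHA_HAT = 0.3673731
BETA_AGE = 0.2306046
BETA_SMOKE = -0.2089949


def predicted_fev(age, smoke):
    """Fitted mean FEV; smoke is 0 for non-smokers and 1 for smokers."""
    return ALPHA_HAT + BETA_AGE * age + BETA_SMOKE * smoke


# Intercepts of the two lines (same slope, different intercept)
intercept_nonsmoker = predicted_fev(0, 0)
intercept_smoker = predicted_fev(0, 1)

# Vertical gap between the lines at a few arbitrary ages
gaps = [predicted_fev(age, 1) - predicted_fev(age, 0) for age in (5, 10, 15)]
print(intercept_nonsmoker, intercept_smoker, gaps)
```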

- Intercept: the mean value of the outcome for an individual with all predictors equal to zero
- Slope (simple regression): the mean change in the outcome per unit change in the predictor
- Slope (multiple regression): the mean change in the outcome per unit change in the predictor, holding other variables constant
- R-squared: the proportion of total variance in the outcome explained by the regression model
- Adjusted R-squared: accounts for the number of predictors in the model

- Linear regression
- Continuous outcome

- Logistic regression
- Dichotomous outcome
- E.g., disease vs. no disease, or alive vs. dead

- Model the probability of the disease

- Need an equation that will follow the rules of probability
- Specifically, the probability needs to be between 0 and 1

- A model of the form p = α + βx would be able to take on negative values or values greater than 1
- p = e^(α + βx) is an improvement because it cannot be negative, but it could still be greater than 1

- How about the function p = e^(α + βx) / (1 + e^(α + βx))?
- This function equals 0.5 when α + βx = 0
- The function models the probability as increasing slowly with x, then rising steeply, then leveling off again
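The properties listed above can be verified with a short sketch of the logistic function, here written as 1 / (1 + e^-(α + βx)), which is algebraically the same as e^(α + βx) / (1 + e^(α + βx)). The values of α and β are arbitrary placeholders.

```python
import math


def logistic(alpha, beta, x):
    """Probability implied by a logistic model with linear predictor alpha + beta * x."""
    z = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-z))


alpha, beta = 0.0, 0.5   # arbitrary illustrative values
xs = list(range(-50, 51))
probs = [logistic(alpha, beta, x) for x in xs]

print(logistic(alpha, beta, 0))  # linear predictor is 0 here, so p = 0.5
print(min(probs), max(probs))    # stays strictly between 0 and 1
```

With a positive β the curve is S-shaped: flat near 0, steep around α + βx = 0, and flat again near 1.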

- ln(p/(1-p)) = α + βx
- So instead of assuming that the relationship between x and p is linear, we are assuming that the relationship between ln(p/(1-p)) and x is linear.
- ln(p/(1-p)) is called the logit function
- It is a transformation
- While the relationship between x and p is not linear, the other side of the equation, α + βx, is linear
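For reference, the logit equation can be solved for p in two steps, which recovers the S-shaped probability function. This is a standard derivation rather than something taken from the slides:

```latex
\ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x
\quad\Longrightarrow\quad
\frac{p}{1-p} = e^{\alpha + \beta x}
\quad\Longrightarrow\quad
p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
```

The middle expression is the odds of the outcome, which is why exponentiated coefficients are read as odds ratios.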

- Stata code
- logistic outcomevar predictorvar1 predictorvar2…, coef
- The coef option gives you the coefficient, β
- This β, when you are interpreting it, is actually ln(OR)
- To get the odds ratio, exponentiate β (raise e to the power β)
- Odds ratio = e^β

- Or you could just use this Stata code instead (omit the coef option) to get odds ratios directly
- logistic outcomevar predictorvar1 predictorvar2…

. logistic coldany i.rested_mostly, coef

Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      19.71
                                                  Prob > chi2     =     0.0000
Log likelihood =  -323.5717                       Pseudo R2       =     0.0296

------------------------------------------------------------------------------
     coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.rested_m~y |  -.9343999   .2187794    -4.27   0.000      -1.3632   -.5056001
       _cons |  -.2527658   .1077594    -2.35   0.019    -.4639704   -.0415612
------------------------------------------------------------------------------

- Cold data (from previous slide)
- β = -0.934
- The natural log of the odds ratio of getting a cold, comparing someone who was rested with someone who was not rested, is -0.934
- If you exponentiate it (e^-0.934), you get 0.39
- Therefore, another way of interpreting this is that the odds of getting a cold for someone who was rested are 0.39 times the odds for someone who was not rested
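The conversion from the coef output to an odds ratio is a one-line calculation. The sketch below exponentiates the coefficient and its confidence limits from the logistic coldany i.rested_mostly, coef output; the variable names are my own.

```python
import math

# Coefficient and 95% CI for 1.rested_mostly, copied from the coef output
beta = -0.9343999
ci_low, ci_high = -1.3632, -0.5056001

odds_ratio = math.exp(beta)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 4), tuple(round(v, 4) for v in or_ci))
```

These values should agree, up to rounding, with what Stata reports when the coef option is omitted.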

logistic depvar indepvar

. logistic coldany i.rested_mostly

Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      19.71
                                                  Prob > chi2     =     0.0000
Log likelihood =  -323.5717                       Pseudo R2       =     0.0296

------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.rested_m~y |   .3928215   .0859413    -4.27   0.000     .2558409    .6031435
------------------------------------------------------------------------------

Odds Ratio = e^β

. logistic coldany age

Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356

------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9624413   .0081519    -4.52   0.000     .9465958    .9785521
------------------------------------------------------------------------------

- Interpretation of the coefficients: the odds ratio is for a one-unit change in the predictor
- For this example, 0.962 is the odds ratio for a one-year difference in age

- To find the OR for a 10-year change in age

. logistic coldany age, coef

Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356

------------------------------------------------------------------------------
     coldany |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0382822     .00847    -4.52   0.000    -.0548831   -.0216813
       _cons |    .906605   .3167295     2.86   0.004     .2858265    1.527383
------------------------------------------------------------------------------

OR for a 10-year change in age = exp(10*-.0382) = 0.682
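The same calculation can be sketched in Python, extended to the confidence limits. The numbers are copied from the logistic coldany age, coef output; multiplying the confidence limits by 10 before exponentiating is a standard step that the slide does not show.

```python
import math

# Coefficient and 95% CI for age, copied from "logistic coldany age, coef"
beta_age = -0.0382822
ci_low, ci_high = -0.0548831, -0.0216813
years = 10

or_1 = math.exp(beta_age)            # odds ratio per one year
or_10 = math.exp(years * beta_age)   # odds ratio per ten years
or_10_ci = (math.exp(years * ci_low), math.exp(years * ci_high))

# Equivalent route: raise the one-year odds ratio to the 10th power
print(round(or_10, 3), round(or_1 ** years, 3), or_10_ci)
```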

- To find the OR for a 10-year change in age
. gen age_10=age/10
(2 missing values generated)

. logistic coldany age_10

Logistic regression                               Number of obs   =        504
                                                  LR chi2(1)      =      23.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -322.05172                       Pseudo R2       =     0.0356

------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      age_10 |   .6819344   .0577599    -4.52   0.000     .5776247    .8050807
------------------------------------------------------------------------------

This is nice because Stata will calculate your confidence interval as well!

. logistic coldany age_10 i.smoke

Logistic regression                               Number of obs   =        504
                                                  LR chi2(2)      =      23.89
                                                  Prob > chi2     =     0.0000
Log likelihood = -321.99014                       Pseudo R2       =     0.0358

------------------------------------------------------------------------------
     coldany | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      age_10 |   .6835216   .0580647    -4.48   0.000     .5786864     .807349
     1.smoke |   1.128027   .3863511     0.35   0.725     .5764767     2.20728
------------------------------------------------------------------------------

For this example the 0.684 is the odds ratio for a ten-year difference in age when you hold smoking status constant

1.13 is the odds ratio for smoking when you hold age constant

. logistic sex fev

Logistic regression                               Number of obs   =        654
                                                  LR chi2(1)      =      29.18
                                                  Prob > chi2     =     0.0000
Log likelihood = -438.47993                       Pseudo R2       =     0.0322

------------------------------------------------------------------------------
         sex | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fev |   1.660774   .1617468     5.21   0.000     1.372176     2.01007
       _cons |    .279198   .0742534    -4.80   0.000     .1657805    .4702094
------------------------------------------------------------------------------

The z (Wald) test statistic in the logistic results is the ratio of the estimated regression coefficient (the log odds ratio) for the predictor (fev) to its standard error, and it follows (approximately) a standard normal distribution under the null hypothesis
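As a sketch of how the Wald statistic is formed, the code below uses the coldany-on-age model, because that model appears earlier in both its coef and odds-ratio forms. It also checks the delta-method relationship SE(OR) ≈ OR × SE(β), which I am assuming is how the standard error in the odds-ratio table is obtained.

```python
# Values copied from "logistic coldany age, coef" and "logistic coldany age"
coef = -0.0382822      # log odds ratio for age
se_coef = 0.00847      # standard error of the coefficient
odds_ratio = 0.9624413

# Wald statistic: coefficient divided by its standard error
z = coef / se_coef

# Assumed delta-method standard error on the odds-ratio scale
se_or = odds_ratio * se_coef

print(round(z, 2), round(se_or, 7))
```

Note that z is computed on the log-odds scale, which is why the same z appears in both versions of the output.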

The log-likelihood is a measure of support of the data for the model (the larger the likelihood and/or log-likelihood, the better the support).

The statistic "chi2" is the likelihood ratio statistic for comparing this model, which includes fev, to the simpler one containing no predictors
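The same likelihood ratio idea works for any pair of nested models. As a sketch, the code below compares the two coldany models shown earlier (age_10 alone versus age_10 plus smoke) using their reported log likelihoods; the p-value line relies on the fact that a chi-square statistic with 1 degree of freedom has upper tail probability erfc(sqrt(x/2)).

```python
import math

# Log likelihoods copied from the two coldany outputs (both fit on 504 obs)
ll_age_only = -322.05172    # logistic coldany age_10
ll_age_smoke = -321.99014   # logistic coldany age_10 i.smoke

# Likelihood ratio statistic for adding smoke (1 extra parameter)
lr_stat = 2 * (ll_age_smoke - ll_age_only)

# Upper-tail p-value for a chi-square statistic with 1 degree of freedom
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(lr_stat, 5), round(p_value, 3))
```

The statistic is also the difference between the two reported LR chi2 values (23.89 - 23.77 = 0.12), and the p-value is close to the Wald p-value reported for smoke.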

- The log-odds of the outcome is linear in x, with intercept α and slope β1.
- The "intercept" coefficient α gives the log-odds of the outcome for x = 0.
- The "slope" coefficient β1 gives the change in log-odds of the outcome for a unit increase in x. This is the log odds ratio associated with a unit increase in x.
- Outcome risk (P) is between 0 and 1 for all values of x