Biostat review



Biostat Review

November 29, 2012



Objectives

  • Review hw#8

  • Review of last two lectures

  • Linear regression

    • Simple and multiple

  • Logistic regression



Review hw#8


Simple linear regression

Simple linear regression

  • The objective of regression analysis is to predict or estimate the value of the response (outcome) that is associated with a fixed value of the explanatory variable (predictor).



Simple linear regression

  • The regression line equation is μy|x = α + βx

  • The “best” line is the one that finds the α and β that minimize the sum of the squared residuals Σei2 (hence the name “least squares”)

  • The slope β is the change in the mean value of y that corresponds to a one-unit increase in x



Assumptions of the linear model

  • conditional mean of the outcome is linear

  • observed outcomes are independent

  • residuals (ε) follow a normal distribution with mean 0

  • constant variance (σ2)

  • predictors are measured without error


Simple linear regression example: regression of FEV on age, FÊV = α̂ + β̂ age

regress yvar xvar

. regress fev age

Source | SS df MS Number of obs = 654

-------------+------------------------------ F( 1, 652) = 872.18

Model | 280.919154 1 280.919154 Prob > F = 0.0000

Residual | 210.000679 652 .322086931 R-squared = 0.5722

-------------+------------------------------ Adj R-squared = 0.5716

Total | 490.919833 653 .751791475 Root MSE = .56753

------------------------------------------------------------------------------

fev | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .222041 .0075185 29.53 0.000 .2072777 .2368043

_cons | .4316481 .0778954 5.54 0.000 .278692 .5846042

------------------------------------------------------------------------------

β̂ = Coef. for age

α̂ = _cons (short for constant)



Interpretation of coefficients

β̂ = Coef. for age

  • For every one-unit increase in age there is an increase in mean FEV of 0.22

    α̂ = _cons (short for constant)

  • When age = 0, the predicted mean FEV is 0.431 (an extrapolation, since no subjects in the data have age 0)
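Plugging the estimates from the Stata output into the fitted line gives predictions. A small Python sketch (coefficients copied from the output above; the helper function is mine, not part of the slides):

```python
# Fitted line from the Stata output: FEV_hat = 0.4316481 + 0.222041 * age
alpha_hat, beta_hat = 0.4316481, 0.222041

def predicted_fev(age):
    """Predicted mean FEV at a given age, per the fitted model."""
    return alpha_hat + beta_hat * age

# Each extra year of age adds beta_hat = 0.222 to the predicted mean FEV
diff = predicted_fev(11) - predicted_fev(10)
```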



Model Fit

  • R2 represents the proportion of the variability in y that is explained by the regression on X

  • Remember that R2 tells us the fit of the model, with values closer to 1 having a better fit
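R2 can be read straight off the ANOVA portion of the Stata output above (Model SS, Residual SS, Total SS); a quick check in Python:

```python
# Sums of squares from the Stata output above
ss_model, ss_residual = 280.919154, 210.000679
ss_total = ss_model + ss_residual      # 490.919833, as in the output

# R-squared = proportion of total variability explained by the regression
r_squared = ss_model / ss_total        # matches "R-squared = 0.5722"
```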



regress fev age

Source | SS df MS Number of obs = 654

-------------+------------------------------ F( 1, 652) = 872.18

Model | 280.919154 1 280.919154 Prob > F = 0.0000

Residual | 210.000679 652 .322086931 R-squared = 0.5722

-------------+------------------------------ Adj R-squared = 0.5716

Total | 490.919833 653 .751791475 Root MSE = .56753

------------------------------------------------------------------------------

fev | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .222041 .0075185 29.53 0.000 .2072777 .2368043

_cons | .4316481 .0778954 5.54 0.000 .278692 .5846042

------------------------------------------------------------------------------

r = √R2 = √0.5722 ≈ 0.7565



Model fit

  • Residuals are the difference between the observed y values and the regression line for each value of x (yi - ŷi)

  • If all the points lie along a straight line, the residuals are all 0

  • If there is a lot of variability at each level of x, the residuals are large

  • The sum of the squared residuals is what was minimized in the least squares method of fitting the line



Use of residual plots for model fit

  • Residual plot is a scatter plot

    • Y-axis: residuals

    • X-axis: fitted values of the outcome

  • Stata code to get residual plot:

    regress fev age

    rvfplot



rvfplot, title(Fitted values versus residuals for regression of FEV on age)



Why look at residual plot

  • The spread of the residuals increases as the fitted FEV values increase, suggesting heteroscedasticity

  • Heteroscedasticity reduces the precision of the estimates (hence reduces power) and makes your standard errors larger

  • Homoscedasticity: constant variability across all values of x (same standard deviation for each value of y), the constant variance (σ2) assumption



Residual plots

  • Of note

    • rvfplot gives you Residuals vs. Fitted values (outcome)

    • rvpplot ht gives you Residuals vs. Predictor (here, ht)



Data transformation

  • So if you have heteroscedasticity in your data, you can transform your data

  • Something to note

    • Transforming your data does not fundamentally change your data, only the scale on which it is analyzed

  • Log transformation is the most common way to deal with heteroscedasticity



Log transformation of FEV data

  • Do we still have heteroscedasticity?



Log transformation stata output

. regress ln_fev age

Source | SS df MS Number of obs = 654

-------------+------------------------------ F( 1, 652) = 961.01

Model | 43.2100544 1 43.2100544 Prob > F = 0.0000

Residual | 29.3158601 652 .044962976 R-squared = 0.5958

-------------+------------------------------ Adj R-squared = 0.5952

Total | 72.5259145 653 .111065719 Root MSE = .21204

------------------------------------------------------------------------------

ln_fev | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0870833 .0028091 31.00 0.000 .0815673 .0925993

_cons | .050596 .029104 1.74 0.083 -.0065529 .1077449

-------------------------------------------------------------



Interpretation of regression coefficients for transformed y value

  • The regression equation is:

    ln(FEV) = α̂ + β̂ age

    = 0.051 + 0.087 age

  • So a one-year change in age corresponds to a 0.087 change in ln(FEV)

  • The change is on a multiplicative scale, so if you exponentiate, you get a percent change in y

  • e^0.087 = 1.09, so a one-year change in age corresponds to a 9% increase in FEV
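The exponentiation step above can be verified in Python (slope value from the slide):

```python
import math

# Slope from the ln(FEV) regression
beta_hat = 0.087

ratio = math.exp(beta_hat)             # multiplicative change per year of age
percent_change = 100 * (ratio - 1)     # about a 9% increase in FEV per year
```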



Categorical variable/predictor

  • Previous example was of a predictor that was continuous

  • Can also perform regression with a categorical predictor/variable

  • If dichotomous

    • Convention: use 0 vs. 1

    • e.g., sex is dichotomous: 0 for female, 1 for male



Categorical independent variable

  • Remember that the regression equation is

    μy|x = α + βx

  • The only values x can take are 0 and 1

  • μy|0 = α and μy|1 = α + β

  • So the estimated mean FEV for females is α̂ and the estimated mean FEV for males is α̂ + β̂

  • We test the null hypothesis that β = 0

  • This is similar to a two-sample t-test
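The t-test connection can be checked numerically; a Python sketch with made-up data (x is the 0/1 sex indicator; the outcome values are illustrative, not the FEV dataset):

```python
import numpy as np

# Illustrative data: x = 0 for female, 1 for male
x = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([2.0, 2.4, 2.2, 3.0, 3.4, 3.2])

beta_hat, alpha_hat = np.polyfit(x, y, 1)  # slope, intercept

# With a 0/1 predictor the intercept equals the group-0 mean and the
# slope equals the difference in group means (the two-sample t-test quantity)
mean0 = y[x == 0].mean()
mean1 = y[x == 1].mean()
```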



Categorical variable/predictor

  • What if you have more than two categories within a predictor (non-dichotomous)?

  • One is set to be the reference category.



Categorical independent variables

  • Then the regression equation is:

    y = α + β1 xAsian/PI + β2 xOther + ε

  • For race group = White (reference)

    ŷ = α̂ + β̂1·0 + β̂2·0 = α̂

  • For race group = Asian/PI

    ŷ = α̂ + β̂1·1 + β̂2·0 = α̂ + β̂1

  • For race group = Other

    ŷ = α̂ + β̂1·0 + β̂2·1 = α̂ + β̂2



Categorical independent variables

  • In Stata you just place an “i.” prefix on the variable (i.variable) to identify it as a categorical variable

  • Stata takes the lowest number as the reference group

  • You can change this with the prefix “b#.variable”, where # is the number value of the group that you want to be the reference group.



Multiple regression

  • Additional explanatory variables might add to our understanding of a dependent variable

  • We can posit the population equation

    μy|x1,x2,...,xq = α + β1x1 + β2x2 + ... + βqxq

  • α is the mean of y when all the explanatory variables are 0

  • βi is the change in the mean value of y that corresponds to a one-unit change in xi when all the other explanatory variables are held constant



Multiple regression

  • Stata command (just add the additional predictors)

    • regress outcomevar predictorvar1 predictorvar2…
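The same fit can be sketched outside Stata; a minimal Python analogue using an explicit design matrix (synthetic data with a known relationship, not the FEV dataset):

```python
import numpy as np

# Synthetic data built so that y = 1 + 0.5*x1 + 2*x2 exactly
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 0.5 * x1 + 2.0 * x2

# Design matrix: a column of ones (intercept) plus one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # [alpha, beta1, beta2]
```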



Multiple regression

. regress fev age ht

Source | SS df MS Number of obs = 654

-------------+------------------------------ F( 2, 651) = 1067.96

Model | 376.244941 2 188.122471 Prob > F = 0.0000

Residual | 114.674892 651 .176151908 R-squared = 0.7664

-------------+------------------------------ Adj R-squared = 0.7657

Total | 490.919833 653 .751791475 Root MSE = .4197

------------------------------------------------------------------------------

fev | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0542807 .0091061 5.96 0.000 .0363998 .0721616

ht | .1097118 .0047162 23.26 0.000 .100451 .1189726

_cons | -4.610466 .2242706 -20.56 0.000 -5.050847 -4.170085

------------------------------------------------------------------------------

  • R2 will always increase as you add more variables into the model

  • The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters

  • Note that the beta for age decreased (from 0.222 to 0.054) once height was added to the model
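The adjusted R-squared in the output can be reproduced from R2, n, and the number of predictors; a quick Python check (values from the output above):

```python
# From the Stata output: n = 654 observations, k = 2 predictors
n, k = 654, 2
r2 = 0.7664

# Adjusted R-squared penalizes each added parameter
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # matches "Adj R-squared"
```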



How do you interpret the coefficients?

  • Age

    • When height is held constant, for every 1 unit (in this case year) increase in age you will have a 0.054 unit increase in FEV



You can fit both continuous and categorical predictors

. regress fev age smoke

Source | SS df MS Number of obs = 654

-------------+------------------------------ F( 2, 651) = 443.25

Model | 283.058247 2 141.529123 Prob > F = 0.0000

Residual | 207.861587 651 .319295832 R-squared = 0.5766

-------------+------------------------------ Adj R-squared = 0.5753

Total | 490.919833 653 .751791475 Root MSE = .56506

------------------------------------------------------------------------------

fev | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .2306046 .0081844 28.18 0.000 .2145336 .2466755

smoke | -.2089949 .0807453 -2.59 0.010 -.3675476 -.0504421

_cons | .3673731 .0814357 4.51 0.000 .2074647 .5272814

------------------------------------------------------------------------------

  • The model is fêv = α̂ + β̂1 age + β̂2 Xsmoke

  • So for non-smokers, we have fêv = α̂ + β̂1 age (b/c Xsmoke = 0)

  • For smokers, fêv = α̂ + β̂1 age + β̂2 (b/c Xsmoke = 1)

    • So β̂2 is the mean difference in FEV for smokers versus non-smokers at each age



  • When you have one continuous variable and one dichotomous variable, you can think of fitting two lines that only differ in y intercept by the coefficient of the dichotomous variable (in this case smoke)

  • E.g. β̂2 = -0.209, so the smokers' line lies 0.209 below the non-smokers' line at every age



Linear regression summary

  • Intercept: the mean value of the outcome for an individual with all predictors equal to zero

  • Slope (simple regression): the mean change in the outcome per unit change in the predictor

  • Slope (multiple regression): the mean change in the outcome per unit change in the predictor, holding the other variables constant

  • R-squared is the proportion of total variance in the outcome explained by the regression model

  • Adjusted R-squared also accounts for the number of predictors in the model



Logistic regression

  • Linear regression

    • Continuous outcome

  • Logistic regression

    • Dichotomous outcome

      • e.g., disease vs. no disease, or alive vs. dead

    • Model the probability of the disease



Logistic regression

  • Need an equation that will follow rules of probability

    • Specifically that probability needs to be between 0-1

  • A model of the form p = α + βx would be able to take on negative values or values more than 1

  • p = e^(α + βx) is an improvement because it cannot be negative, but it still could be greater than 1



Logistic regression

  • How about the function p = e^(α + βx) / (1 + e^(α + βx))?

  • This function = 0.5 when α + βx = 0

  • The function models the probability slowly increasing over the value of x, until there is a steep rise, and another leveling off
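The shape claims above can be checked directly with a small Python sketch of the logistic function:

```python
import math

def logistic(t):
    """p = e^t / (1 + e^t), where t = alpha + beta*x; always strictly in (0, 1)."""
    return math.exp(t) / (1 + math.exp(t))

half = logistic(0.0)   # exactly 0.5 when alpha + beta*x = 0

# Slow rise, steep middle, levelling off: p increases monotonically in t
probs = [logistic(t) for t in (-6.0, -2.0, 0.0, 2.0, 6.0)]
```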



Logistic regression

  • ln(p/(1-p)) = α + βx

  • So instead of assuming that the relationship between x and p is linear, we are assuming that the relationship between ln(p/(1-p)) and x is linear.

  • ln(p/(1-p)) is called the logit function

  • It is a transformation

  • While the outcome is not linear, the other side of the equation, α + βx, is linear



Logistic regression

  • Stata code

    • logistic outcomevar predictorvar1 predictorvar2 …, coef

      • The coef option gives you the coefficient, β

      • This β, when you are interpreting it, is actually ln(OR)

      • To get the odds ratio, you need to exponentiate β

        • Odds ratio = e^β

      • Or you could just use this Stata code instead (leave off the coef option) and get odds ratios directly

        • logistic outcomevar predictorvar1 predictorvar2 …



Interpret these coefficients

. logistic coldany i.rested_mostly, coef

Logistic regression Number of obs = 504

LR chi2(1) = 19.71

Prob > chi2 = 0.0000

Log likelihood = -323.5717 Pseudo R2 = 0.0296

------------------------------------------------------------------------------

coldany | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.rested_m~y | -.9343999 .2187794 -4.27 0.000 -1.3632 -.5056001

_cons | -.2527658 .1077594 -2.35 0.019 -.4639704 -.0415612

------------------------------------------------------------------------------



Interpret these coefficients

  • Cold data (from previous slide)

  • β = -0.934

    • The natural log of the odds ratio of getting a cold for someone who was rested compared to someone who was not rested is -0.934

    • If you exponentiate it, you get e^(-0.934) = 0.39

    • Therefore another way of interpreting this is that the odds ratio of getting a cold for someone who was rested compared to someone who was not rested is 0.39
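The arithmetic can be checked against the Stata output (coefficient and confidence bounds copied from the output above):

```python
import math

beta_hat = -0.9343999                 # Coef. for 1.rested_mostly

odds_ratio = math.exp(beta_hat)       # matches the Odds Ratio Stata reports
# Exponentiating the CI bounds for beta gives the CI for the odds ratio
ci_low, ci_high = math.exp(-1.3632), math.exp(-0.5056001)
```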



Or get stata to calculate the odds ratio for you!

logistic depvar indepvar

. logistic coldany i.rested_mostly

Logistic regression Number of obs = 504

LR chi2(1) = 19.71

Prob > chi2 = 0.0000

Log likelihood = -323.5717 Pseudo R2 = 0.0296

------------------------------------------------------------------------------

coldany | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.rested_m~y | .3928215 .0859413 -4.27 0.000 .2558409 .6031435

------------------------------------------------------------------------------

Odds Ratio 0.3928 = e^(-0.934)



Interpretation when you have a continuous variable


. logistic coldany age

Logistic regression Number of obs = 504

LR chi2(1) = 23.77

Prob > chi2 = 0.0000

Log likelihood = -322.05172 Pseudo R2 = 0.0356

------------------------------------------------------------------------------

coldany | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .9624413 .0081519 -4.52 0.000 .9465958 .9785521

------------------------------------------------------------------------------

  • Interpretation of the coefficients: The odds ratio is for a one unit change in the predictor

  • For this example, 0.962 is the odds ratio for a one-year difference in age



Continuous explanatory variable

  • To find the OR for a 10-year change in age


    . logistic coldany age, coef

    Logistic regression Number of obs = 504

    LR chi2(1) = 23.77

    Prob > chi2 = 0.0000

    Log likelihood = -322.05172 Pseudo R2 = 0.0356

    ------------------------------------------------------------------------------

    coldany | Coef. Std. Err. z P>|z| [95% Conf. Interval]

    -------------+----------------------------------------------------------------

    age | -.0382822 .00847 -4.52 0.000 -.0548831 -.0216813

    _cons | .906605 .3167295 2.86 0.004 .2858265 1.527383

    ------------------------------------------------------------------------------

    OR for a 10-year change in age = exp(10*-.0382) = 0.682
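That multiply-then-exponentiate step can be confirmed in Python (per-year coefficient from the output above):

```python
import math

beta_per_year = -0.0382822    # Coef. for age from the Stata output

# OR for a 10-year difference: exponentiate 10 times the per-year coefficient
or_10yr = math.exp(10 * beta_per_year)   # matches the age_10 odds ratio
```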



Or you can also generate a new variable

  • To find the OR for a 10-year change in age

    . gen age_10=age/10

    (2 missing values generated)

    . logistic coldany age_10

    Logistic regression Number of obs = 504

    LR chi2(1) = 23.77

    Prob > chi2 = 0.0000

    Log likelihood = -322.05172 Pseudo R2 = 0.0356

    ------------------------------------------------------------------------------

    coldany | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

    -------------+----------------------------------------------------------------

    age_10 | .6819344 .0577599 -4.52 0.000 .5776247 .8050807

    ------------------------------------------------------------------------------

    This is nice because Stata will calculate your confidence interval as well!



Interpret this output

. logistic coldany age_10 i.smoke

Logistic regression Number of obs = 504

LR chi2(2) = 23.89

Prob > chi2 = 0.0000

Log likelihood = -321.99014 Pseudo R2 = 0.0358

------------------------------------------------------------------------------

coldany | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age_10 | .6835216 .0580647 -4.48 0.000 .5786864 .807349

1.smoke | 1.128027 .3863511 0.35 0.725 .5764767 2.20728

------------------------------------------------------------------------------

.



Correct interpretations

For this example, 0.684 is the odds ratio for a ten-year difference in age when you hold smoking status constant

1.13 is the odds ratio for smoking when you hold age constant



. logistic sex fev

Logistic regression Number of obs = 654

LR chi2(1) = 29.18

Prob > chi2 = 0.0000

Log likelihood = -438.47993 Pseudo R2 = 0.0322

------------------------------------------------------------------------------

sex | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

fev | 1.660774 .1617468 5.21 0.000 1.372176 2.01007

_cons | .279198 .0742534 -4.80 0.000 .1657805 .4702094

------------------------------------------------------------------------------

The z (Wald) test statistic in the logistic results is the ratio of the estimated regression coefficient for the predictor (fev) to its standard error, and follows (approximately) a standard normal distribution.

The log-likelihood is a measure of support of the data for the model (the larger the likelihood and/or log-likelihood, the better the support).

The statistic "chi2" is the likelihood ratio statistic for comparing this model including fev to the simpler one containing no predictors.



Summary Logistic regression

  • The log-odds of the outcome is linear in x, with intercept α and slope β1.

  • The "intercept" coefficient α gives the log-odds of the outcome for x = 0.

  • The "slope" coefficient β1 gives the change in log-odds of the outcome for a unit increase in x. This is the log odds ratio associated with a unit increase in x.

  • Outcome risk (p) is between 0 and 1 for all values of x

