1 / 37

Introduction to L ogistic R egression

Introduction to L ogistic R egression. Jean-Claude Desenclos, Rachid Salmi, Alain Moren, Thomas Grein. Content. Simple and multiple linear regression Simple logistic regression The logistic function Estimation of parameters Interpretation of coefficients Multiple logistic regression

errin
Download Presentation

Introduction to L ogistic R egression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Logistic Regression Jean-Claude Desenclos, Rachid Salmi, Alain Moren, Thomas Grein

  2. Content • Simple and multiple linear regression • Simple logistic regression • The logistic function • Estimation of parameters • Interpretation of coefficients • Multiple logistic regression • Interpretation of coefficients • Coding of variables • Model building

  3. Simple linear regression Table 1 Age and systolic blood pressure (SBP) among 33 adult women

  4. SBP (mm Hg) Age (years) adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974

  5. Simple linear regression • Relation between 2 continuous variables (SBP and age) • Regression coefficient b1 • Measures associationbetween y and x • Amount by which y changes on average when x changes by one unit • Least squares method y Slope x

  6. Multiple linear regression • Relation between a continuous variable and a setofi continuous variables • Partial regression coefficients bi • Amount by which y changes on average when xi changes by one unit and all the other xis remain constant • Measures association between xi and y adjusted for all other xi • Example • SBP versus age, weight, height, etc

  7. Multiple linear regression Predicted Predictor variables Response variable Explanatory variables Outcome variable Covariables Dependent Independent variables

  8. General linear models • Family of regression models • Outcome variable determines choice of model • Uses • Estimate force of association between outcome and covariates • Control of confounding • Model building, risk prediction Outcome Model Continuous Linear regression Counts Poisson regression Survival Cox model Binomial Logistic regression

  9. Logistic regression • Models relationship betweenset of variables xi • dichotomous (yes/no) • categorical (social class, ...) • continuous (age, ...) and • dichotomous (binary) variable Y • Dichotomous outcome very common situation in biology and epidemiology

  10. Logistic regression (1) Table 2 Age and signs of coronary heart disease (CD)

  11. How can we analyse these data? • Compare mean age of diseased and non-diseased • Non-diseased: 38.6 years • Diseased: 58.7 years (p<0.0001) • Linear regression?

  12. Dot-plot: Data from Table 2

  13. Logistic regression (2) Table 3Prevalence (%) of signs of CD according to age group

  14. Dot-plot: Data from Table 3 Diseased % Age group

  15. Logistic function (1) Probability ofdisease x

  16. { logit of P(y|x) Logistic transformation

  17. Advantages of Logit • Properties of a linear regression model • Logit between -  and +  • Probability (P) constrained between 0 and 1 • Directly related to odds of disease

  18. Interpretation of coefficient b

  19. Interpretation of coefficient b • b=increase in logarithm of odds ratio for a one unit increase in x • Test of the hypothesis that b=0 (Wald test) • Interval testing

  20. Example • Risk of developing coronary heart disease (CD) by age (<55 and 55+ years)

  21. Logistic Regression Model

  22. Fitting equation to the data • Linear regression: Least squares • Logistic regression: Maximum likelihood • Likelihood function • Estimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other values • Practically easier to work with log-likelihood

  23. Maximum likelihood • Iterative computing • Choice of an arbitrary value for the coefficients (usually 0) • Computing of log-likelihood • Variation of coefficients’ values • Reiteration until maximisation (plateau) • Results • Maximum Likelihood Estimates (MLE) for  and  • Estimates of P(y) for a given value of x

  24. Multiple logistic regression • More than one independent variable • Dichotomous, ordinal, nominal, continuous … • Interpretation of bi • Increase in log-odds for a one unit increase in xi with all the other xis constant • Measures association between xi and log-odds adjusted for all other xi

  25. Effect modification • Effect modification • Can be modelled by including interaction terms

  26. Statistical testing • Question • Does model including given independent variable provide more information about dependent variable than model without this variable? • Three tests • Likelihood ratio statistic (LRS) • Wald test • Score test

  27. Likelihood ratio statistic • Compares two nested models Log(odds) =  + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) =  + 1x1 + 2x2 (model 2) • LR statistic -2 log (likelihood model 2 / likelihood model 1) = -2 log (likelihood model 2) minus -2log (likelihood model 1) LR statistic is a 2 with DF = number of extra parameters in model

  28. Example P Probability for cardiac arrest Exc 1= lack of exercise, 0 = exercise Smk 1= smokers, 0= non-smokers adapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998

  29. Interaction between smoking and exercise? • Product term b3 = -0.4604 (SE 0.5332) Wald test = 0.75 (1df) -2log(L) = 342.092 with interaction term = 342.836 without interaction term  LR statistic = 0.74 (1df), p = 0.39  No evidence of any interaction

  30. Coding of variables (1) • Dichotomous variables: yes = 1, no = 0 • Continuous variables • Increase in OR for a one unit change in exposure variable • Logistic model is multiplicative OR increases exponentially with x • If OR = 2 for a oneunit change in exposure and x increases from 2 to 5: OR = 2 x 2 x 2 = 23 = 8 • Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable

  31. Continuous variable? • Relationship between SBP>160 mmHg and body weight • Introduce BW as continuous variable? • Code weight as single variable, eg. 3 equal classes: 40-60 kg =0, 60-80 kg =1, 80-100 kg =2 • Compatible with assumption of multiplicative model • If not compatible, use indicator variables

  32. Coding of variables (2) • Nominal variables or ordinal with unequal classes: • Tobacco smoked: no=0, grey=1, brown=2, blond=3 • Model assumes that OR for blond tobacco = OR for grey tobacco3 • Use indicator variables (dummy variables)

  33. Indicator variables: Type of tobacco • Neutralises artificial hierarchy between classes in the variable "type of tobacco" • No assumptions made • 3 variables (3 df) in model using same reference • OR for each type of tobacco adjusted for the others in reference to non-smoking

  34. Model Building • Depends of study objectives : • single hypothesis testing : estimate the effect of a given variable adjusted for all potential available confounders • exploratory study : identify a set of variables independently associated to the outcome • to predict : predict the outcome with the least varieble possible (parsimony principle) • Judgement criteria • conditional vs non conditional (matching ?) • statistical (p) • what is the hypothesis, what is (are) the variable(s) of interest, what are the confounding variables... • any colinear variables… • biologic pathway and plausibilty…

  35. Model building strategy • Hosmer & Lemeshow approach • Start with a saturated model including variables «biologically » important + variables with p<0.2 • Define variables to keep in the model (forced) • Stepwise backward elimination • Exclusion of least « significant variables » base on statitical test and no important variation of coefficient until obtaining a « satifactory » model • Add and test interation terms • Final model with interaction term if any • Check model fit (residual analysis, diagnotic procedures, Lemeshow test…)

  36. Be careful • Make sure you choosed the correct model • Be careful of the black box ! • So easy to do with computers ! • Coding of variables is a fundamental issue • You should know where you want to go (you are the investigator !) • Be careful of automatic procedures • You should be able to justify why you did one way or another • Do not base your stratgy on statistical test • Ask for advice !

  37. Reference++++ • Hosmer DW, Lemeshow S. Applied logistic regression. Wiley & Sons, New York, 1989

More Related