Statistical Analysis SC504/HS927 Spring Term 2008

Statistical Analysis SC504/HS927 Spring Term 2008. Introduction to Logistic Regression Dr. Daniel Nehring. Outline. Preliminaries: The SPSS syntax Linear regression and logistic regression OLS with a binary dependent variable Principles of logistic regression


Presentation Transcript


  1. Statistical Analysis SC504/HS927 Spring Term 2008 Introduction to Logistic Regression Dr. Daniel Nehring

  2. Outline • Preliminaries: The SPSS syntax • Linear regression and logistic regression • OLS with a binary dependent variable • Principles of logistic regression • Interpreting logistic regression coefficients • Advanced principles of logistic regression (for self-study) • Source: http://privatewww.essex.ac.uk/~dfnehr

  3. PRELIMINARIES

  4. The SPSS syntax • Simple programming language allowing access to all SPSS operations • Access to operations not covered in the main interface • Accessible through syntax windows • Accessible through ‘Paste’ buttons in every window of the main interface • Documentation available in ‘Help’ menu

  5. Using SPSS syntax files • Saved in a separate file format through the syntax window • Run commands by highlighting them and pressing the arrow button. • Comments can be entered into the syntax. • Copy-paste operations allow easy learning of the syntax. • The syntax is preferable at all times to the main interface to keep a log of work and identify and correct mistakes.

  6. PART I

  7. Simple linear regression • Relation between 2 continuous variables • Regression coefficient b1 • Measures association between y and x • Amount by which y changes on average when x changes by one unit • Least squares method • [Figure: regression line of y on x, with the slope marked]
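
As a minimal sketch of the least squares method for one predictor, the slope b1 and intercept can be computed from sums of squares. The function name `least_squares_line`, the intercept name `b0` and the toy data are illustrative assumptions, not taken from the slides.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Toy data lying exactly on y = 1 + 2x (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = least_squares_line(xs, ys)
print(b0, b1)
```

Here b1 is the average change in y for a one-unit change in x, as described on the slide.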

  8. Multiple linear regression • Relation between a continuous variable and a set of i continuous variables • Partial regression coefficients bi • Amount by which y changes on average when xi changes by one unit and all the other xi remain constant • Measures association between xi and y adjusted for all other xi

  9. Multiple linear regression • y: predicted, response, or dependent variable • xi: predictor, explanatory, or independent variables

  10. OLS with a binary dependent variable • Binary variables can take only 2 possible values: • yes/no (e.g. educated to degree level, smoker/non-smoker) • success/failure (e.g. of a medical treatment) • Coded 1 or 0 (by convention 1 = yes/success) • Using OLS for a binary dependent variable → predicted values can be interpreted as probabilities; expected to lie between 0 and 1 • But nothing constrains the regression model to predict values between 0 and 1; values less than 0 or greater than 1 are possible and have no logical interpretation • Approaches which ensure that predicted values lie between 0 and 1 are required, such as logistic regression
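
The problem with OLS on a 0/1 outcome can be illustrated with a small sketch. The data below are made up for illustration, and `least_squares_line` is a hypothetical helper; the point is only that the fitted line produces "probabilities" below 0 and above 1.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for an OLS fit of y on a single predictor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical binary outcome: 0 for low x, 1 for high x.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = least_squares_line(xs, ys)
predictions = [b0 + b1 * x for x in xs]

# The smallest prediction is negative and the largest exceeds 1,
# so neither can be read as a probability.
print(min(predictions), max(predictions))
```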

  11. Fitting equation to the data • Linear regression: Least squares • Logistic regression: Maximum likelihood • Likelihood function • Estimates parameters with property that likelihood (probability) of observed data is higher than for any other values • Practically easier to work with log-likelihood

  12. Maximum Likelihood Estimation (MLE) • OLS cannot be used for logistic regression since the relationship between the dependent and independent variable is non-linear • MLE is used instead to estimate coefficients on independent variables (parameters) • Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
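
To make the idea concrete, here is a rough sketch of maximum likelihood estimation for a logistic model with one predictor, using Newton-Raphson steps on the log-likelihood. This is one common iterative scheme, not necessarily what SPSS does internally; the data, function names and the fixed iteration count are assumptions for illustration.

```python
import math


def inv_logit(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(a, b, xs, ys):
    """Log-likelihood of the observed 0/1 outcomes under intercept a, slope b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def gradient(a, b, xs, ys):
    """First derivatives of the log-likelihood with respect to a and b."""
    g_a = sum(y - inv_logit(a + b * x) for x, y in zip(xs, ys))
    g_b = sum((y - inv_logit(a + b * x)) * x for x, y in zip(xs, ys))
    return g_a, g_b


def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson: repeatedly solve the 2x2 system H * step = gradient."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a, g_b = gradient(a, b, xs, ys)
        ws = [inv_logit(a + b * x) * (1 - inv_logit(a + b * x)) for x in xs]
        h_aa = sum(ws)
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(w * x * x for w, x in zip(ws, xs))
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Hypothetical data; the outcomes overlap, so a finite maximum exists.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_logistic(xs, ys)
print(a_hat, b_hat, log_likelihood(a_hat, b_hat, xs, ys))
```

At the fitted values the gradient is approximately zero, and moving either parameter away from them lowers the log-likelihood, which is the defining property of the estimates described on the slide.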

  13. Logistic regression • Models relationship between a set of variables xi • dichotomous (yes/no) • categorical (social class, ...) • continuous (age, ...) • and a dichotomous (binary) variable Y

  14. PART II

  15. Logistic regression (1) • ‘Logistic regression’ or ‘logit’ • p is the probability of an event occurring • 1-p is the probability of the event not occurring • p can take any value from 0 to 1 • the odds of the event occurring = $\frac{p}{1-p}$ • the dependent variable in a logistic regression is the natural log of the odds: $\ln\left(\frac{p}{1-p}\right)$

  16. Logistic regression (2) • ln(.) can take any value, whereas p will always range from 0 to 1 • the equation to be estimated is: $\ln\left(\frac{p}{1-p}\right) = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
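
A short sketch of the log-odds transformation and its inverse shows why the log of the odds is unbounded while p stays between 0 and 1. The function names `logit` and `inv_logit` are conventional but are not taken from the slides.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: turns any real number back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Log odds are negative below p = 0.5, zero at 0.5, positive above it.
for p in (0.1, 0.5, 0.9):
    print(p, logit(p))

# Even extreme log odds map back to probabilities strictly inside (0, 1).
for z in (-30.0, 0.0, 30.0):
    print(z, inv_logit(z))
```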

  17. Logistic regression (3) • Logistic transformation • $\ln\left(\frac{P(y|x)}{1-P(y|x)}\right)$ is the logit of P(y|x)

  18. Predicting p • let $z_i = a + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$ • then to predict p for individual i, $p_i = \frac{e^{z_i}}{1 + e^{z_i}}$

  19. Logistic function (1) • [Figure: logistic curve; vertical axis: probability of event y; horizontal axis: x]

  20. PART III

  21. Interpreting logistic regression coefficients • intercept is the value of the ‘log of the odds’ when all independent variables are zero • each slope coefficient is the change in log odds from a 1-unit increase in the independent variable, controlling for the effects of other variables • two problems: • log odds are not easy to interpret • the change in the probability p from a 1-unit increase in one independent variable depends on the values of the other independent variables • but the exponent of b ($e^b$) is not dependent on the values of other independent variables and is the odds ratio

  22. Odds ratio • odds ratio for coefficient on a dummy variable, e.g. female = 1 for women, 0 for men • odds ratio = ratio of the odds of the event occurring for women to the odds of its occurring for men • odds for women are $e^b$ times odds for men
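
The following sketch checks numerically that the odds ratio for a dummy variable is the same whatever the other variables are, while the difference in probabilities is not. It borrows the coefficients quoted in the worked disability example later in this transcript (a = -0.912, 0.078 per year of age over 65, -0.27 for low income); the function and variable names are assumptions.

```python
import math

# Coefficients as quoted in the worked example in this transcript.
A = -0.912
B_AGE = 0.078
B_LOW_INCOME = -0.27


def odds(age_over_65, low_income):
    """Odds of the event: the exponent of the linear predictor."""
    return math.exp(A + B_AGE * age_over_65 + B_LOW_INCOME * low_income)


def prob(age_over_65, low_income):
    """Probability of the event: odds / (1 + odds)."""
    o = odds(age_over_65, low_income)
    return o / (1.0 + o)


# Odds ratio for the low-income dummy at age 65 and at age 85.
or_65 = odds(0, 1) / odds(0, 0)
or_85 = odds(20, 1) / odds(20, 0)

# Difference in probability for the same comparison at each age.
diff_65 = prob(0, 1) - prob(0, 0)
diff_85 = prob(20, 1) - prob(20, 0)

print(or_65, or_85, math.exp(B_LOW_INCOME))  # all three agree
print(diff_65, diff_85)                      # these two differ
```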

  23. General rules for interpreting logistic regression coefficients • if b1 > 0, X1 increases p • if b1 < 0, X1 decreases p • if odds ratio > 1, X1 increases p • if odds ratio < 1, X1 decreases p • if CI for b1 includes 0, X1 does not have a statistically significant effect on p • if CI for odds ratio includes 1, X1 does not have a statistically significant effect on p

  24. An example: modelling the relationship between disability, age and income in the 65+ population • dependent variable = presence of disability (1 = yes, 0 = no) • independent variables: • X1 age in years in excess of 65 (i.e. 65 → 0, 70 → 5) • X2 whether has low income (in lowest third of the income distribution) • data: Health Survey for England, 2000

  25. Example: logistic regression estimate for probability of being disabled, people aged 65+

  26. PART IV

  27. Odds, log odds, odds ratios and probabilities

  28. Odds, odds ratios and probabilities • pj = 0.2, i.e. a 20% probability • oddsj = 0.2/(1-0.2) = 0.2/0.8 = 0.25 • pk = 0.4 • oddsk = 0.4/0.6 = 0.67 • relative probability/risk pj/pk = 0.2/0.4 = 0.5 • odds ratio, oddsj/oddsk = 0.25/0.67 = 0.37 • odds ratio is not equal to relative probability/risk • except approximately if pj and pk are small
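
The arithmetic on this slide can be reproduced in a few lines. The second pair of probabilities (0.002 and 0.004) is an added, hypothetical example chosen to show the small-p case; the function names are assumptions.

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


def relative_risk(p_j, p_k):
    """Ratio of the two probabilities."""
    return p_j / p_k


def odds_ratio(p_j, p_k):
    """Ratio of the two odds."""
    return odds(p_j) / odds(p_k)


# The slide's values: the two measures clearly disagree.
rr_large = relative_risk(0.2, 0.4)
or_large = odds_ratio(0.2, 0.4)
print(odds(0.2), odds(0.4), rr_large, or_large)

# Hypothetical small probabilities: the two measures nearly coincide.
rr_small = relative_risk(0.002, 0.004)
or_small = odds_ratio(0.002, 0.004)
print(rr_small, or_small)
```

Using the unrounded odds, the odds ratio for the slide's values is 0.375; the slide's 0.37 comes from rounding oddsk to 0.67 first.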

  29. Points to note from logit example.xls • if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying ‘women have a probability 50% higher than men’. Only if both p’s are small can you say this. • better to calculate probabilities for example cases and compare these

  30. Predicting p • let $z_i = a + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$ • then to predict p for individual i, $p_i = \frac{e^{z_i}}{1 + e^{z_i}}$

  31. E.g.: Predicting a probability from our model • Predict disability for someone on low income aged 75: • Add up the linear equation: a (= -0.912) + 10 × 0.078 [age over 65, i.e. 75 - 65 = 10] + 1 × (-0.27) = -0.402 • Take the exponent of it to get the odds of being disabled = 0.669 • Put the odds over 1 + the odds to give the probability = c. 0.4, or a 40 per cent chance of being disabled
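
The three steps of this worked example can be written out as a short sketch. The coefficients are those quoted on the slide; the function name `predict_disability` and its argument names are assumptions.

```python
import math

# Coefficients quoted on the slide.
A = -0.912            # intercept
B_AGE = 0.078         # per year of age over 65
B_LOW_INCOME = -0.27  # low-income dummy


def predict_disability(age, low_income):
    """Return (linear predictor, odds, probability) for one individual."""
    z = A + B_AGE * (age - 65) + B_LOW_INCOME * low_income  # step 1
    odds = math.exp(z)                                       # step 2
    p = odds / (1.0 + odds)                                  # step 3
    return z, odds, p


z, odds, p = predict_disability(age=75, low_income=1)
print(round(z, 3), round(odds, 3), round(p, 3))  # -0.402 0.669 0.401
```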

  32. Goodness of fit in logistic regressions • based on improvements in the likelihood of observing the sample • use a chi-square test with the test statistic = $-2(\ln L_R - \ln L_U)$ • where R and U indicate restricted and unrestricted models • unrestricted: all independent variables in model • restricted: all or a subset of variables excluded from the model (their coefficients restricted to be 0)
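
As a sketch of this test, the statistic can be computed in the usual likelihood-ratio form, minus twice the difference between the restricted and unrestricted log-likelihoods. The log-likelihood values below are invented for illustration, and the p-value helper covers only the case of a single excluded variable (1 degree of freedom), where the chi-square tail probability reduces to a complementary error function.

```python
import math


def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood-ratio test statistic: -2 * (ln L_R - ln L_U)."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)


def chi2_pvalue_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods from two fitted models.
loglik_r = -120.0  # restricted: one variable excluded
loglik_u = -115.0  # unrestricted: all variables included

stat = lr_statistic(loglik_r, loglik_u)
p_value = chi2_pvalue_df1(stat)
print(stat, p_value)

# With 1 df the 5% critical value is about 3.84, so the excluded
# variable significantly improves the fit in this made-up case.
print(stat > 3.841)
```

When more than one variable is excluded, the degrees of freedom equal the number of restricted coefficients and a general chi-square routine is needed instead.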

  33. Statistical significance of coefficient estimates in logistic regressions • Calculated using standard errors as in OLS • for large n, |t| > 1.96 means that, if the true value of the coefficient were 0, an estimate this far from 0 would arise with a probability of 5% or lower, i.e. p ≤ 0.05

  34. 95% confidence intervals for logistic regression coefficient estimates • For CIs of odds ratios calculate CIs for coefficients and take their exponents
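
A minimal sketch of the significance test and the confidence intervals: the interval is built on the coefficient scale and then exponentiated to give the interval for the odds ratio. The coefficient 0.078 is the age coefficient quoted in the worked example in this transcript; the standard error of 0.01 is a hypothetical value, as is the function name.

```python
import math


def coefficient_summary(b, se, crit=1.96):
    """Test ratio, 95% CI for b, and the matching CI for the odds ratio."""
    ratio = b / se
    lower, upper = b - crit * se, b + crit * se
    return {
        "ratio": ratio,
        "ci_b": (lower, upper),
        "odds_ratio": math.exp(b),
        "ci_odds_ratio": (math.exp(lower), math.exp(upper)),
    }


# b from the worked example; se is an assumed value for illustration.
summary = coefficient_summary(b=0.078, se=0.01)
print(summary)

# The CI for b excludes 0 exactly when the CI for the odds ratio excludes 1.
b_lo, b_hi = summary["ci_b"]
or_lo, or_hi = summary["ci_odds_ratio"]
print(b_lo > 0, or_lo > 1)
```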
