Statistical Analysis SC504/HS927 Spring Term 2008

Introduction to Logistic Regression

Dr. Daniel Nehring

- Preliminaries: The SPSS syntax
- Linear regression and logistic regression
- OLS with a binary dependent variable
- Principles of logistic regression
- Interpreting logistic regression coefficients
- Advanced principles of logistic regression (for self-study)
- Source: http://privatewww.essex.ac.uk/~dfnehr

- Simple programming language allowing access to all SPSS operations
- Access to operations not covered in the main interface
- Accessible through syntax windows
- Accessible through ‘Paste’ buttons in every window of the main interface
- Documentation available in ‘Help’ menu

- Saved in a separate file format through the syntax window
- Run commands by highlighting them and pressing the arrow button.
- Comments can be entered into the syntax.
- Copy-paste operations allow easy learning of the syntax.
- The syntax is preferable at all times to the main interface to keep a log of work and identify and correct mistakes.

- Relation between 2 continuous variables
Regression coefficient b1

- Measures association between y and x
- Amount by which y changes on average when x changes by one unit
- Least squares method

[Figure: regression line of y against x, with the slope marked]

- Relation between a continuous variable and a set of i continuous variables
- Partial regression coefficients bi
- Amount by which y changes on average when xi changes by one unit and all the other x variables remain constant
- Measures association between xi and y, adjusted for all the other x variables

y: predicted / response / dependent variable

x: predictor / explanatory / independent variables

- Binary variables can take only 2 possible values:
- yes/no (e.g. educated to degree level, smoker/non-smoker)
- success/failure (e.g. of a medical treatment)

- Coded 1 or 0 (by convention 1=yes/ success)
- When OLS is used for a binary dependent variable, predicted values can be interpreted as probabilities and are expected to lie between 0 and 1
- But nothing constrains the regression model to predict values between 0 and 1; values less than 0 and greater than 1 are possible and have no logical interpretation
- Approaches such as logistic regression, which ensure that predicted values lie between 0 and 1, are required
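
The problem with OLS on a binary outcome can be seen in a small sketch. The ten data points below are made up for illustration (they are not from the course data), and `ols_fit` is a hypothetical helper that applies the textbook least-squares formulas for a single predictor.

```python
# Illustrative only: made-up data, not taken from the course materials.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Binary dependent variable coded 1 = yes, 0 = no.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = ols_fit(x, y)
predicted = [intercept + slope * xi for xi in x]

print(f"lowest predicted value:  {min(predicted):.3f}")
print(f"highest predicted value: {max(predicted):.3f}")
```

With these data the lowest predicted value falls below 0 and the highest rises above 1, so neither can be read as a probability.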

- Linear regression: Least squares
- Logistic regression: Maximum likelihood
- Likelihood function
- Estimates parameters with the property that the likelihood (probability) of the observed data is higher than for any other parameter values
- Practically easier to work with log-likelihood

- OLS cannot be used for logistic regression since the relationship between the dependent and independent variable is non-linear
- MLE is used instead to estimate coefficients on independent variables (parameters)
- Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
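
As a rough sketch of what maximum likelihood estimation does, the code below fits a one-predictor logistic model by Newton-Raphson steps on the log-likelihood. The data are made up, and `fit_logit` is a hypothetical helper written for this illustration; statistical packages use more robust routines, so treat this as a picture of the idea rather than of any package's implementation.

```python
import math

def sigmoid(z):
    """Logistic transformation: maps any real z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, x, y):
    """Log-likelihood of binary outcomes y given intercept a and slope b."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(a + b * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logit(x, y, n_iter=25):
    """Maximise the log-likelihood by Newton-Raphson (assumed, simplified)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            w = p * (1 - p)
            g_a += yi - p
            g_b += (yi - p) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Solve the 2x2 system (information matrix) * step = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Made-up data in which the outcome becomes more likely as x increases.
x = list(range(10))
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a_hat, b_hat = fit_logit(x, y)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
print(f"log-likelihood at the estimates: {log_likelihood(a_hat, b_hat, x, y):.3f}")
print(f"log-likelihood at a = b = 0:     {log_likelihood(0.0, 0.0, x, y):.3f}")
```

The log-likelihood at the fitted values is higher than at any other pair of values tried, which is the defining property of the maximum likelihood estimates.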

- Models relationship between a set of variables xi
- dichotomous (yes/no)
- categorical (social class, ...)
- continuous (age, ...)
- and a dichotomous (binary) variable Y

- ‘Logistic regression’ or ‘logit’
- p is the probability of an event occurring
- 1-p is the probability of the event not occurring
- p can take any value from 0 to 1
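- the odds of the event occurring = p / (1 - p)
- the dependent variable in a logistic regression is the natural log of the odds: ln(p / (1 - p))

- ln(.) can take any value, while p will always range from 0 to 1
- the equation to be estimated is:

ln(p / (1 - p)) = a + b1x1 + b2x2 + ... + bkxk

The left-hand side is the logit of P(y|x).

Logistic transformation

let z = a + b1x1 + b2x2 + ... + bkxk

then to predict p for individual i,

pi = e^zi / (1 + e^zi)

[Figure: logistic curve; vertical axis: probability of event y; horizontal axis: x]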
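
The link between probabilities, odds and log odds can be sketched in a few lines. The function names `odds`, `logit` and `inverse_logit` are chosen for this illustration; the formulas are the standard ones, odds = p / (1 - p), logit = ln(odds), and p = e^z / (1 + e^z), which is the same as 1 / (1 + e^-z).

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def inverse_logit(z):
    """Logistic transformation: turns any log odds z back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.2, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  log odds = {z:+.3f}  "
          f"back to p = {inverse_logit(z):.1f}")

# However extreme the log odds, the predicted probability stays inside (0, 1).
for z in (-10, 0, 10):
    print(f"log odds = {z:+d}  p = {inverse_logit(z):.5f}")
```

This is why the log odds can sit on the left-hand side of a linear equation: they are unbounded, while the probabilities recovered from them never leave the 0 to 1 range.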

- intercept is value of ‘log of the odds’ when all independent variables are zero
- each slope coefficient is the change in log odds from a 1-unit increase in the independent variable, controlling for the effects of other variables
- two problems:
- log odds not easy to interpret
- change in probability from a 1-unit increase in one independent variable depends on the values of the other independent variables

- but the exponent of b (e^b) does not depend on the values of the other independent variables and is the odds ratio

- odds ratio for coefficient on a dummy variable, e.g. female=1 for women, 0 for men
- odds ratio = ratio of the odds of event occurring for women to the odds of its occurring for men
- odds for women are e^b times the odds for men
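
A short sketch can show why the odds ratio is convenient. All three coefficients below are hypothetical values picked for illustration (they are not estimates from any data set). The ratio of women's odds to men's odds equals e^b at every age, while the gap in probabilities changes with age.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
a = -1.0          # intercept
b_female = 0.405  # coefficient on the dummy female (1 = women, 0 = men)
b_age = 0.05      # coefficient on age in years over 65

def event_odds(female, age):
    """Odds of the event, e^z, for a given sex and age."""
    return math.exp(a + b_female * female + b_age * age)

def probability(female, age):
    """Probability of the event: odds / (1 + odds)."""
    o = event_odds(female, age)
    return o / (1 + o)

for age in (0, 10):
    odds_ratio = event_odds(1, age) / event_odds(0, age)
    gap = probability(1, age) - probability(0, age)
    print(f"age over 65 = {age:2d}  odds ratio = {odds_ratio:.3f}  "
          f"difference in p = {gap:.3f}")

print(f"e^b = {math.exp(b_female):.3f}")
```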

if b1 > 0, X1 increases p

if b1 < 0, X1 decreases p

if odds ratio >1, X1 increases p

if odds ratio < 1, X1 decreases p

if CI for b1 includes 0, X1 does not have a statistically significant effect on p

if CI for odds ratio includes 1, X1 does not have a statistically significant effect on p

- dependent variable = presence of disability (1=yes,0=no)
- independent variables:
X1 age in years (in excess of 65, i.e. 65 = 0, 70 = 5)

X2 whether has low income (in lowest 3rd of the income distribution)

- data: Health Survey for England, 2000

- pj = 0.2, i.e. a 20% probability
- oddsj = 0.2/(1-0.2) = 0.2/0.8 = 0.25
- pk = 0.4
- oddsk = 0.4/0.6 = 0.67
- relative probability/risk pj/pk = 0.2/0.4 = 0.5
- odds ratio, oddsj/oddsk = 0.25/0.67 = 0.37
- odds ratio is not equal to relative probability/risk
- except approximately, if pj and pk are small
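
The arithmetic above can be checked with a short sketch. The first pair of probabilities (0.2 and 0.4) comes from the example; the second pair (0.02 and 0.04) is an added assumption, used only to show what happens when both probabilities are small.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def compare(p_j, p_k):
    """Return (relative risk, odds ratio) of j relative to k."""
    relative_risk = p_j / p_k
    odds_ratio = odds(p_j) / odds(p_k)
    return relative_risk, odds_ratio

# Probabilities from the example: the two measures differ clearly.
rr_large, or_large = compare(0.2, 0.4)
print(f"p = 0.2 vs 0.4:   relative risk = {rr_large:.3f}, odds ratio = {or_large:.3f}")

# Assumed small probabilities: the two measures nearly coincide.
rr_small, or_small = compare(0.02, 0.04)
print(f"p = 0.02 vs 0.04: relative risk = {rr_small:.3f}, odds ratio = {or_small:.3f}")
```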

- if you see an odds ratio of e.g. 1.5 for a dummy variable indicating female, beware of saying ‘women have a probability 50% higher than men’. Only if both p’s are small can you say this.
- better to calculate probabilities for example cases and compare these

let zi = a + b1x1i + b2x2i

then to predict p for individual i,

pi = e^zi / (1 + e^zi)

- Predict disability for someone on low income aged 75:
- Add up the linear equation: a (= -0.912) + 10 (years of age over 65) * 0.078 + 1 * (-0.27) = -0.402
- Take the exponent of it to get the odds of being disabled: e^(-0.402) = 0.669
- Put the odds over 1 + the odds to give the probability: 0.669/1.669 = c. 0.4, or a 40 per cent chance of being disabled
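
The same prediction can be reproduced step by step. The coefficients are the ones quoted in the example; the variable names are chosen for this sketch.

```python
import math

# Coefficients as quoted in the worked example.
a = -0.912             # intercept
b_age = 0.078          # per year of age in excess of 65
b_low_income = -0.27   # 1 = in lowest third of the income distribution

# Case to predict: someone aged 75 on low income.
years_over_65 = 75 - 65
low_income = 1

# Step 1: add up the linear equation (the log odds).
z = a + b_age * years_over_65 + b_low_income * low_income

# Step 2: take the exponent to get the odds of being disabled.
odds = math.exp(z)

# Step 3: put the odds over 1 + the odds to get the probability.
p = odds / (1 + odds)

print(f"log odds    = {z:.3f}")
print(f"odds        = {odds:.3f}")
print(f"probability = {p:.3f}")
```

Changing `years_over_65` or `low_income` and rerunning gives predicted probabilities for other example cases, which can then be compared directly.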

- based on improvements in the likelihood of observing the sample
- use a chi-square test with the test statistic = -2(ln LR - ln LU), where L is the maximised likelihood
- and R and U indicate the restricted and unrestricted models
- unrestricted – all independent variables in model
- restricted – all or a subset of variables excluded from the model (their coefficients restricted to be 0)
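
A minimal sketch of this comparison, assuming the usual likelihood ratio statistic of -2 times (restricted log-likelihood minus unrestricted log-likelihood). The two log-likelihood values are hypothetical, and the p-value formula used here holds only when a single coefficient is restricted (1 degree of freedom); with more restrictions a chi-square table or a statistics package is needed.

```python
import math

def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio test statistic: -2 * (ln L_R - ln L_U)."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)

def chi2_pvalue_1df(statistic):
    """Upper-tail chi-square probability for 1 degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical log-likelihoods, for illustration only.
loglik_restricted = -120.0    # model with one variable excluded
loglik_unrestricted = -115.0  # model with all independent variables

stat = lr_statistic(loglik_restricted, loglik_unrestricted)
p_value = chi2_pvalue_1df(stat)

print(f"test statistic = {stat:.2f}")
print(f"p-value (1 df) = {p_value:.4f}")
print("excluded variable improves the fit" if p_value < 0.05
      else "no significant improvement")
```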

- Calculated using standard errors as in OLS
- for large n, |t| > 1.96 means that, if the true value of the coefficient were 0, an estimate this far from 0 would occur with a probability of 5% or lower, i.e. p ≤ 0.05

- For CIs of odds ratios calculate CIs for coefficients and take their exponents
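
The steps for a coefficient and its odds ratio can be sketched as follows. The coefficient 0.078 is the age coefficient from the disability example, but the standard error of 0.010 is a hypothetical value assumed here, since the example does not report one.

```python
import math

b = 0.078    # age coefficient from the disability example
se = 0.010   # hypothetical standard error, assumed for illustration
z_crit = 1.96  # large-sample critical value for a 95% interval

# Test statistic: coefficient divided by its standard error.
t = b / se

# 95% confidence interval for the coefficient.
lower_b = b - z_crit * se
upper_b = b + z_crit * se

# Exponentiate the coefficient and its limits to get the odds ratio and its CI.
odds_ratio = math.exp(b)
lower_or = math.exp(lower_b)
upper_or = math.exp(upper_b)

print(f"t = {t:.2f}")
print(f"95% CI for b:          ({lower_b:.4f}, {upper_b:.4f})")
print(f"odds ratio = {odds_ratio:.3f}")
print(f"95% CI for odds ratio: ({lower_or:.3f}, {upper_or:.3f})")

# The two checks agree: the CI for b excludes 0 exactly when the CI for
# the odds ratio excludes 1.
print("CI for b excludes 0:         ", not (lower_b <= 0 <= upper_b))
print("CI for odds ratio excludes 1:", not (lower_or <= 1 <= upper_or))
```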