**1. **Logistic Regression and the New: Residual Logistic Regression
F. Berenice Baez-Revueltas
Wei Zhu

**2. **Outline
Logistic Regression
Confounding Variables
Controlling for Confounding Variables
Residual Linear Regression
Residual Logistic Regression
Examples
Discussion
Future Work

**3. **1. Logistic Regression Model
In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable.

**4. **A popular model for a categorical response variable
The logistic regression model is the most popular model for binary data.
It is generally used to study the relationship between a binary response variable and a group of predictors, which can be either continuous or categorical.
Y = 1 (true, success, YES, etc.) or
Y = 0 (false, failure, NO, etc.)
The logistic regression model can be extended to a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model, in contrast to the "binomial" logistic regression for a binary response variable.

**5. **More on the rationale of the logistic regression model
Consider a binary response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor:
$$\ln\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \beta_0 + \beta_1 x$$
This model can be rewritten as
$$P(Y=1\mid x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$
E(Y|x) = P(Y=1|x) * 1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x, and the logistic form respects this bound. The following linear model may violate it for some values of x:
$$P(Y=1\mid x) = \beta_0 + \beta_1 x$$
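The contrast between the two forms can be checked numerically. The sketch below is only an illustration: the coefficient values and the grid of x values are hypothetical and do not come from the slides.

```python
import math

def logistic_probability(x, beta0, beta1):
    """P(Y=1|x) under the logistic model: inverse logit of beta0 + beta1*x."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))

def linear_probability(x, beta0, beta1):
    """P(Y=1|x) under a plain linear model: beta0 + beta1*x, with no bound."""
    return beta0 + beta1 * x

# Hypothetical coefficients and predictor values, chosen for illustration.
BETA0, BETA1 = 0.5, 0.2
xs = [-10, -2, 0, 2, 10]

logistic_values = [logistic_probability(x, BETA0, BETA1) for x in xs]
linear_values = [linear_probability(x, BETA0, BETA1) for x in xs]

for x, p_logit, p_lin in zip(xs, logistic_values, linear_values):
    print(f"x={x:>4}  logistic={p_logit:.3f}  linear={p_lin:.3f}")
```

With these values the linear form leaves the unit interval at the extreme x values, while the logistic form stays strictly inside it.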

**6. **More on the properties of the logistic regression model
In simple logistic regression, the regression coefficient of x is the log of the odds ratio of a success event (Y=1) for a unit increase in x.
For multiple predictor variables x_1, ..., x_n, the logistic regression model is
$$\ln\frac{P(Y=1\mid x_1,\dots,x_n)}{1-P(Y=1\mid x_1,\dots,x_n)} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
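The odds-ratio interpretation of the coefficient can be verified directly. This is a minimal sketch with hypothetical coefficient values: for any starting x, the ratio of the odds at x+1 to the odds at x should equal the exponential of the slope.

```python
import math

def success_probability(x, beta0, beta1):
    """P(Y=1|x) for a simple logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    """Odds of success corresponding to probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
BETA0, BETA1 = -1.0, 0.7

# Odds ratio for a unit increase in x, evaluated at several starting points.
odds_ratios = [
    odds(success_probability(x + 1, BETA0, BETA1))
    / odds(success_probability(x, BETA0, BETA1))
    for x in (-3.0, 0.0, 2.5)
]

# The log of each odds ratio should recover the slope BETA1.
log_odds_ratios = [math.log(r) for r in odds_ratios]
print(log_odds_ratios)
```

The odds ratio does not depend on the starting value of x, which is what makes the coefficient interpretable on its own.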

**7. **Logistic Regression, SAS Procedure
http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm
Proc Logistic
This page shows an example of logistic regression with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is high writing test score (honcomp), where a writing score greater than or equal to 60 is considered high and a score below 60 is considered low. We explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used on this page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.
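/* Create the binary response: honcomp = 1 if write >= 60, else 0 */
data logit;
set "c:\temp\hsb2";
honcomp = (write >= 60);
run;
/* descending: model the probability that honcomp = 1 */
proc logistic data=logit descending;
model honcomp = female read science;
run;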

**8. **Logistic Regression, SAS Output

**9. **2. Confounding Variables
Correlated with both the dependent and independent variables
Represent a major threat to the validity of inferences on cause and effect
Add to multicollinearity
Can lead to over- or underestimation of an effect, and can even change the direction of the conclusion
They add error in the interpretation of what may be an accurate measurement

**10. **For a variable to be a confounder it needs to have
A relationship with the exposure
A relationship with the outcome even in the absence of the exposure (not an intermediary)
No position on the causal pathway
An uneven distribution across comparison groups

**12. **3. Controlling for Confounding Variables
In study designs
Restriction
Random allocation of subjects to study groups to attempt to even out unknown confounders
Matching subjects using potential confounders

**13. **In data analysis
Stratified analysis using the Mantel-Haenszel method to adjust for confounders
Case-control studies
Cohort studies
Restriction (still possible, but it means discarding data)
Model fitting using regression techniques
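The stratified adjustment mentioned above can be sketched with the standard Mantel-Haenszel pooled odds ratio. The 2x2 counts below are hypothetical, built so that the exposure has no effect within either stratum while the crude, unstratified table suggests a strong one.

```python
def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio over strata.

    Each stratum is (a, b, c, d): exposed cases, exposed non-cases,
    unexposed cases, unexposed non-cases.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

def crude_odds_ratio(strata):
    """Odds ratio from the single table obtained by ignoring the strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)

# Hypothetical counts: stratum 1 is low risk with few exposed subjects,
# stratum 2 is high risk with many exposed subjects.
strata = [
    (2, 18, 8, 72),
    (40, 40, 10, 10),
]

or_mh = mantel_haenszel_odds_ratio(strata)
or_crude = crude_odds_ratio(strata)
print(f"Mantel-Haenszel OR = {or_mh:.3f}, crude OR = {or_crude:.3f}")
```

The gap between the crude and the pooled estimate is the distortion introduced by the stratifying variable acting as a confounder.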

**14. **Pros and Cons of Controlling Methods
Matching methods call for subjects with exactly the same characteristics
Risk of over- or under-matching
Cohort studies can lead to too much loss of information when excluding subjects
Some strata might become too thin to be informative, which also causes loss of information
Regression methods, if well handled, can control for confounding factors

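**15. **4. Residual Linear Regression
Consider a dependent variable Y and a set of n independent covariates X_1, ..., X_n, of which the first k (k<n) are potential confounding factors.
The initial model includes only the confounding variables:
$$Y = \alpha_0 + \alpha_1 X_1 + \dots + \alpha_k X_k + \epsilon$$
Residuals are calculated from this model. Let
$$e = Y - \hat{Y} = Y - (\hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \dots + \hat{\alpha}_k X_k)$$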

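**16. **The residuals e are assumed to have the following properties:
Zero mean
Homoscedasticity
Normally distributed
This residual is taken as the new dependent variable. That is, the new model to be fitted is
$$e = \beta_0 + \beta_{k+1} X_{k+1} + \dots + \beta_n X_n + \varepsilon$$
which is equivalent to:
$$Y = \hat{Y} + \beta_0 + \beta_{k+1} X_{k+1} + \dots + \beta_n X_n + \varepsilon$$
where $\hat{Y}$ is the fitted value from the initial model that uses only the confounding variables.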
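The two-stage procedure can be sketched in a few lines. This is a minimal illustration, assuming a single confounder and a single covariate of interest so that each stage is a simple least-squares line. The data and the helper name `fit_simple_ols` are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: one confounder and one covariate of interest.
confounder = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
covariate = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
response = [4.1, 3.2, 8.9, 8.1, 14.2, 12.8, 19.1, 17.9]

# Stage 1: regress the response on the confounder only.
alpha0, alpha1 = fit_simple_ols(confounder, response)
fitted_stage1 = [alpha0 + alpha1 * c for c in confounder]
residuals = [y - f for y, f in zip(response, fitted_stage1)]

# Stage 2: regress the stage-1 residuals on the covariate of interest.
beta0, beta_cov = fit_simple_ols(covariate, residuals)

# Equivalent form: stage-1 fitted value plus the stage-2 linear term.
recombined = [f + beta0 + beta_cov * x for f, x in zip(fitted_stage1, covariate)]
print(f"stage 1: {alpha0:.3f} + {alpha1:.3f} * confounder")
print(f"stage 2: {beta0:.3f} + {beta_cov:.3f} * covariate")
```

Because stage 1 includes an intercept, the stage-1 residuals sum to zero by construction, which matches the zero-mean property listed for the residuals.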

**17. **The Usual Logistic Regression Approach to "Control for" Confounders
Consider a binary outcome Y and n covariates X_1, ..., X_n, of which the first k (k<n) are potential confounding factors.
The usual way to "control for" these confounding variables is simply to put all n variables in the same model:
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n, \qquad p = P(Y=1\mid X_1,\dots,X_n)$$

**18. **5. Residual Logistic Regression
Each subject has a binary outcome Y.
Consider n covariates X_1, ..., X_n, where the first k (k<n) are potential confounding factors.
The initial model, with p as the probability of success, analyzes only the confounding effect:
$$\ln\frac{p}{1-p} = \alpha_0 + \alpha_1 X_1 + \dots + \alpha_k X_k$$

**19. **Method 1
The effect of the confounding variables is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.
That is, let
The new model to be fitted is
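One plausible reading of this step is sketched below. It is an assumption, not a formula taken from the slides: the fitted linear predictor from the confounder-only model, written here with the hypothetical symbol $\hat{T}$, is carried into the second-level model as a fixed term (an offset with coefficient 1), and only the coefficients of the variables of interest are estimated.

```latex
% Assumption: \hat{T} denotes the fitted confounder effect from the initial model.
\begin{align*}
\hat{T} &= \hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \dots + \hat{\alpha}_k X_k \\
% Assumed second-level model: \hat{T} is held fixed, the \beta terms are estimated.
\ln\frac{p}{1-p} &= \hat{T} + \beta_0 + \beta_{k+1} X_{k+1} + \dots + \beta_n X_n
\end{align*}
```

Under this reading the second-level coefficients keep their log odds ratio interpretation, since the model is still a logistic regression.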

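**20. **Method 2
Pearson residuals are calculated from the initial model (Hosmer and Lemeshow, 1989):
$$Z = \frac{Y - \hat{p}}{\sqrt{\hat{p}\,(1-\hat{p})}}$$
where $\hat{p}$ is the estimated probability of success based on the confounding variables alone:
$$\hat{p} = \frac{\exp(\hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \dots + \hat{\alpha}_k X_k)}{1 + \exp(\hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \dots + \hat{\alpha}_k X_k)}$$
The second-level regression uses this residual as the new dependent variable.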

**21. **The new dependent variable is therefore Z, and because it is no longer dichotomous we can apply a multiple linear regression model to analyze the effect of the remaining covariates.
The new model to be fitted is a linear regression model:
$$Z = \beta_0 + \beta_{k+1} X_{k+1} + \dots + \beta_n X_n + \varepsilon$$
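The Method 2 pipeline can be sketched end to end under simplifying assumptions: one confounder, one covariate of interest, a hand-written Newton-Raphson fit for the first-level logistic model, and a simple least-squares line for the second level. The data and all function names are hypothetical and are not the authors' implementation.

```python
import math

def fit_logistic_one_predictor(x, y, iterations=25):
    """Newton-Raphson fit of logit P(Y=1) = a0 + a1*x. Returns (a0, a1)."""
    a0, a1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a0 + a1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix entries.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def pearson_residuals(y, p_hat):
    """Pearson residual (y - p) / sqrt(p * (1 - p)) for each observation."""
    return [(yi - pi) / math.sqrt(pi * (1.0 - pi)) for yi, pi in zip(y, p_hat)]

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: binary outcome, one confounder, one covariate.
confounder = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
covariate = [0.3, 1.2, 2.5, 0.8, 1.1, 2.9, 3.1, 0.5, 2.2, 2.8]
outcome = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# First level: logistic regression on the confounder alone.
alpha0, alpha1 = fit_logistic_one_predictor(confounder, outcome)
p_hat = [1.0 / (1.0 + math.exp(-(alpha0 + alpha1 * c))) for c in confounder]

# Pearson residuals become the new, continuous dependent variable Z.
z = pearson_residuals(outcome, p_hat)

# Second level: linear regression of Z on the covariate of interest.
beta0, beta_cov = fit_simple_ols(covariate, z)
print(f"first level: logit(p) = {alpha0:.3f} + {alpha1:.3f} * confounder")
print(f"second level: Z = {beta0:.3f} + {beta_cov:.3f} * covariate")
```

The second-level slope is on the scale of the Pearson residual rather than the log odds, so it cannot be exponentiated into an odds ratio.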

**22. **6. Example 1
Data: Low Birth Weight
Low: Indicator of birth weight less than 2.5 kg
Age: Mother's age in years
Lwt: Mother's weight in pounds
Smk: Smoking status during pregnancy
Ht: History of hypertension

**23. **Potential confounding factor: Age
Model for p (probability of low birth weight)
Logistic regression
Residual logistic regression
initial model
Method 1
Method 2

**24. **Results

**25. **Example 2
Data: Alzheimer patients
Decline: Whether the subject's cognitive capabilities deteriorate or not
Age: Subject's age
Gender: Subject's gender
MMS: Mini Mental Score
PDS: Psychometric deterioration scale
HDT: Depression scale

**26. **Potential confounding factors: Age, Gender
Model for p (probability of declining)
Logistic regression
Residual logistic regression
initial model
Method 1
Method 2

**27. **Results

**28. **7. Discussion
The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
Method 1 is designed to control for confounding factors; however, in the given examples Method 1 yields results similar to those of the usual logistic regression approach.
Method 2 appears to be more accurate, with some standard errors substantially reduced and thus the p-values for some regressors substantially smaller. However, it does not yield odds ratios as Method 1 can.

**29. **8. Future Work
We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results.
We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variables will be taken into account.

**30. **Selected References
Menard, S. Applied Logistic Regression Analysis. Series: Quantitative Applications in the Social Sciences. Sage University Series.
Lemeshow, S.; Teres, D.; Avrunin, J.S. and Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356.
Hosmer, D.W.; Jovanovic, B. and Lemeshow, S. Best Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.
Pregibon, D. Logistic Regression Diagnostics. The Annals of Statistics 9(4), 705-724. 1981.

**31. **Questions?