Slide 1 Logistic Regression and the New: Residual Logistic Regression

F. Berenice Baez-Revueltas

Wei Zhu

Slide 2 ### Outline

- Logistic Regression
- Confounding Variables
- Controlling for Confounding Variables
- Residual Linear Regression
- Residual Logistic Regression
- Examples
- Discussion
- Future Work

Slide 3 In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable.

### 1. Logistic Regression Model

Slide 4 ### A popular model for a categorical response variable

- The logistic regression model is the most popular model for binary data.
- It is generally used to study the relationship between a binary response variable and a group of predictors (which can be either continuous or categorical):
  Y = 1 (true, success, YES, etc.) or
  Y = 0 (false, failure, NO, etc.)

- The logistic regression model can be extended to model a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the ‘binomial’ logistic regression for a binary response variable).

Slide 5 ### More on the rationale of the logistic regression model

Slide 6 ### More on the properties of the logistic regression model

- In simple logistic regression, the regression coefficient is the log of the odds ratio of a success event (Y = 1) for a unit change in x.
- For multiple predictor variables $x_1, \dots, x_n$, the logistic regression model is $\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$, where $\pi = P(Y = 1)$.
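
The log-odds-ratio interpretation of the coefficient can be checked numerically. The sketch below is only an illustration: the data are made up, and `fit_logistic` is a hypothetical helper (a plain Newton-Raphson fit written with NumPy), not the SAS procedure used in the talk. With a single binary predictor, the exponentiated slope should match the odds ratio computed directly from the 2x2 table.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted P(Y = 1)
        w = p * (1.0 - p)                       # Bernoulli variance weights
        score = X.T @ (y - p)                   # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])           # Fisher information matrix
        beta = beta + np.linalg.solve(info, score)
    return beta

# Made-up data: 30 of 100 successes when x = 0, 60 of 100 successes when x = 1.
x = np.repeat([0.0, 0.0, 1.0, 1.0], [30, 70, 60, 40])
y = np.repeat([1.0, 0.0, 1.0, 0.0], [30, 70, 60, 40])
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logistic(X, y)
odds_ratio = np.exp(beta_hat[1])             # exp(slope) = odds ratio per unit change in x
sample_odds_ratio = (60 * 70) / (40 * 30)    # odds ratio read off the 2x2 table
print(odds_ratio, sample_odds_ratio)
```

Because the model with one binary predictor is saturated, the two printed values agree up to floating-point error.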

Slide 7 ### Logistic Regression, SAS Procedure

Slide 8 ### Logistic Regression, SAS Output

Slide 9 ### 2. Confounding Variables

- Are correlated with both the dependent and independent variables
- Represent a major threat to the validity of inferences on cause and effect
- Add to multicollinearity
- Can lead to over- or underestimation of an effect, and can even change the direction of the conclusion
- Add error to the interpretation of what may be an accurate measurement

Slide 10 For a variable to be a confounder it needs to have

- Relationship with the exposure
- Relationship with the outcome even in the absence of the exposure (not an intermediary)
- Not on the causal pathway
- Uneven distribution in comparison groups

[Diagram: Exposure → Outcome, with a Third variable related to both]

Slide 11 [Diagrams: Birth order and Down Syndrome, with Maternal Age as the third variable; Alcohol and Lung Cancer, with Smoking as the third variable]

Confounding

Maternal age is correlated with birth order and is a risk factor for Down syndrome, even if birth order is low.

No Confounding

Smoking is correlated with alcohol consumption and is a risk factor for lung cancer, even for persons who don’t drink alcohol.

Slide 12 ### 3. Controlling for Confounding Variables

- In study designs
  - Restriction
  - Random allocation of subjects to study groups to attempt to even out unknown confounders
  - Matching subjects using potential confounders

Slide 13 - In data analysis
  - Stratified analysis using the Mantel-Haenszel method to adjust for confounders
    - Case-control studies
    - Cohort studies
  - Restriction (still possible, but it means throwing data away)
  - Model fitting using regression techniques
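
To make the stratified adjustment concrete, here is a minimal sketch of the Mantel-Haenszel pooled odds ratio, $\sum_i (a_i d_i / n_i) \big/ \sum_i (b_i c_i / n_i)$, over 2x2 tables indexed by stratum $i$. The counts are invented for illustration: within each stratum of the confounder the exposure has no effect, yet the crude table, which ignores the strata, suggests a strong association.

```python
import numpy as np

# Invented counts, one row per stratum of the confounder.
# Columns: exposed cases (a), exposed non-cases (b),
#          unexposed cases (c), unexposed non-cases (d).
strata = np.array([
    [10.0, 90.0, 50.0, 450.0],
    [200.0, 300.0, 40.0, 60.0],
])

def odds_ratio(a, b, c, d):
    """Odds ratio of a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata."""
    a, b, c, d = tables.T
    n = a + b + c + d                       # stratum sizes
    return np.sum(a * d / n) / np.sum(b * c / n)

crude = odds_ratio(*strata.sum(axis=0))     # collapses the strata
adjusted = mantel_haenszel_or(strata)       # adjusts for the stratifying variable
print(f"crude OR = {crude:.3f}, Mantel-Haenszel OR = {adjusted:.3f}")
```

In this example, the exposure is more common in the high-risk stratum, so all of the crude association is due to the confounder.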

Slide 14 ### Pros and Cons of Controlling Methods

- Matching methods call for subjects with exactly the same characteristics
- Risk of over- or under-matching
- Cohort studies can lose too much information when subjects are excluded
- Some strata might become too thin to yield significant results, which is also a loss of information
- Regression methods, if handled well, can control for confounding factors

Slide 15 ### 4. Residual Linear Regression

- Consider a dependent variable Y and a set of n independent covariates, of which the first k (k < n) are potential confounding factors
- An initial model is fitted with only the confounding variables as predictors
- Residuals are calculated from this model

Slide 16 The residuals have the following properties:
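
A minimal sketch of the two-stage idea, assuming ordinary least squares with an intercept at both stages and simulated data (the variable names and coefficients below are illustrative, not taken from the talk). Under that assumption, two standard least-squares facts hold for the first-stage residuals: they sum to zero, and they are orthogonal to every confounder in the initial model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 500

# Simulated data: two confounders and one variable of interest correlated with the first.
conf = rng.normal(size=(n_obs, 2))
interest = 0.5 * conf[:, 0] + rng.normal(size=n_obs)
y = 1.0 + conf @ np.array([2.0, -1.0]) + 0.7 * interest + rng.normal(size=n_obs)

# Stage 1: regress Y on the confounders only and keep the residuals.
C = np.column_stack([np.ones(n_obs), conf])
alpha_hat, *_ = np.linalg.lstsq(C, y, rcond=None)
resid = y - C @ alpha_hat

# Stage 2: regress the residuals on the variable(s) of interest.
V = np.column_stack([np.ones(n_obs), interest])
beta_hat, *_ = np.linalg.lstsq(V, resid, rcond=None)

print("sum of residuals:", resid.sum())
print("residuals x confounder columns:", C.T @ resid)
print("second-stage coefficients:", beta_hat)
```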

Slide 17 ### The Usual Logistic Regression Approach to ‘Control for’ Confounders

- Consider a binary outcome Y and n covariates, where the first k (k < n) are potential confounding factors
- The usual way to ‘control for’ these confounding variables is simply to put all n variables in the same model: $\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$, where $\pi = P(Y = 1)$

Slide 18 ### 5. Residual Logistic Regression

- Each subject has a binary outcome Y
- Consider n covariates, where the first k(k<n) are potential confounding factors
- Initial model for the probability of success, where only the confounding effect is analyzed

Slide 19 ### Method 1

- The effect of the confounding variables is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.
- That is, let
- The new model to be fitted is
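
One possible reading of Method 1 can be sketched as follows. This is an assumption, not the authors' stated formula: the fitted linear predictor from the initial confounder-only logistic model is carried into the second-level logistic model as a fixed offset, and only the coefficients of the variables of interest (plus an intercept) are re-estimated. The data are simulated, and `fit_logistic` is a hypothetical Newton-Raphson helper.

```python
import numpy as np

def fit_logistic(X, y, offset=None, n_iter=25):
    """Newton-Raphson logistic fit with an optional fixed offset in the linear predictor."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(offset + X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n_obs = 2000
conf = rng.normal(size=n_obs)                     # simulated confounder
interest = 0.5 * conf + rng.normal(size=n_obs)    # simulated variable of interest
eta = -0.3 + 0.8 * conf + 0.6 * interest
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Initial model: confounder only.
C = np.column_stack([np.ones(n_obs), conf])
alpha_hat = fit_logistic(C, y)
retained = C @ alpha_hat                          # assumed form of the retained confounding effect

# Second-level model: variable of interest, with the retained effect held fixed as an offset.
V = np.column_stack([np.ones(n_obs), interest])
beta_hat = fit_logistic(V, y, offset=retained)
odds_ratio = np.exp(beta_hat[1])                  # still interpretable as an odds ratio
print(beta_hat, odds_ratio)
```

Because the second level is still a logistic model, its coefficients exponentiate to odds ratios, which matches the remark in the Discussion that Method 1 can yield them.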

Slide 20 Slide 21 Therefore the new dependent variable is Z, and because it is no longer dichotomous, we can apply a multiple linear regression model to analyze the effect of the rest of the covariates.

The new model to be fitted is a linear regression model
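
The second-level linear regression on Z can be sketched as below. The definition of Z is an assumption here: it is taken to be the raw residual, the observed outcome minus the fitted probability from the initial confounder-only logistic model. The data are simulated, and `fit_logistic` is a hypothetical Newton-Raphson helper.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

rng = np.random.default_rng(3)
n_obs = 2000
conf = rng.normal(size=n_obs)                     # simulated confounder
interest = 0.5 * conf + rng.normal(size=n_obs)    # simulated variable of interest
eta = -0.3 + 0.8 * conf + 0.6 * interest
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Initial model: logistic regression on the confounder only.
C = np.column_stack([np.ones(n_obs), conf])
alpha_hat = fit_logistic(C, y)
pi_hat = 1.0 / (1.0 + np.exp(-(C @ alpha_hat)))

# Assumed definition of Z: the raw residual, which is continuous rather than 0/1.
z = y - pi_hat

# Second level: ordinary least squares of Z on the variable of interest.
V = np.column_stack([np.ones(n_obs), interest])
beta_hat, *_ = np.linalg.lstsq(V, z, rcond=None)
print(beta_hat)
```

The second-level coefficients are on the residual scale, not the log-odds scale, so they do not exponentiate to odds ratios.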

Slide 22 ### 6. Example 1

- Data: Low Birth Weight
  - Low: Indicator of birth weight less than 2.5 kg
  - Age: Mother’s age in years
  - Lwt: Mother’s weight in pounds
  - Smk: Smoking status during pregnancy
  - Ht: History of hypertension

Correlation matrix with alpha=0.05

Slide 23 - Potential confounding factor: Age
- Model for the probability of low birth weight
- Logistic regression
- Residual logistic regression
initial model

Slide 24 ### Results

[Results table: RLR Method 2; Conf. factors]

Slide 25 ### Example 2

Correlation matrix with alpha=0.05

Slide 26 - Potential confounding factors: Age, Gender
- Model for the probability of declining
- Logistic regression
- Residual logistic regression
initial model

Slide 27 ### Results

[Results table: RLR Method 2; Conf. factors]

Slide 28 ### 7. Discussion

- The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
- Method 1 is designed to control for confounding factors; however, in the given examples Method 1 yields results similar to those of the usual logistic regression approach.
- Method 2 appears to be more accurate, with some standard errors (SE) markedly reduced, so the p-values for some regressors are considerably smaller. However, unlike Method 1, it does not yield odds ratios.

Slide 29 ### 8. Future Work

We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results.

We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variables will be taken into account.

