Confounders and Interactions: An Introduction

Manoranjan Pal, Indian Statistical Institute, Kolkata, India (Manoranjan.pal@gmail.com, mano@isical.ac.in)

An Example • Data were collected from students in one department of a university on the following variables: • Number of visits to the theatre per month (z) • Score in the final examination (y) • The simple correlation coefficient between y and z (ryz) was 0.20, which was statistically significant because the sample size was moderately large. • The same study was repeated in other departments of the university. Every time the correlation was positive and significant. • Interpretation: the more often you visit the theatre, the better your examination result will be. This interpretation was hard to believe.

An Example (Continued) • The statisticians were puzzled. After a long investigation they found that the students who visited the theatre more often were the more intelligent ones: they needed less time to study and so could spend more time on other things. • The IQ (x) of the same set of students in the department was then measured. The computed correlations were rxy = 0.8, ryz = 0.2 and rxz = 0.6. • Still the paradox was not resolved.

Solution - 1 • One statistician suggested the following: • Fix the IQ (x) and compute the correlation coefficient between y and z separately for each IQ value. • This was not practicable as it stood: the sample was too small for such an exercise. • The sample size was increased and the correlation coefficient between y and z was computed for each IQ value. • Each time the value was negative, but the values differed from one IQ level to another.

Solution - 2 • The effect of x was eliminated from both y and z, and the correlation coefficient between the adjusted y and z was computed. It was negative. • How do we eliminate the effect of x? • We assume that linear relations exist between these variables, i.e., y = a + b x and z = c + d x (apart from the errors in the equations). The two regressions were fitted, the residuals of y and z were obtained, and the correlation between the two sets of residuals was computed. This is the correlation coefficient between y and z after eliminating the effect of x, and it was negative. • This is known as the partial correlation coefficient.
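The residual procedure can be sketched in a few lines. The data below are simulated purely for illustration (they are not the data of the lecture), with coefficients chosen as an assumption so that the simple correlation between y and z is mildly positive while the partial correlation is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data (an assumption, not the lecture's data):
# IQ (x) raises both the exam score (y) and theatre visits (z),
# while theatre visits themselves lower the score slightly.
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(scale=0.8, size=n)
y = 1.0 * x - 0.4 * z + rng.normal(scale=0.5, size=n)


def residuals(v, x):
    """Residuals of the fitted least-squares line v = a + b*x."""
    b, a = np.polyfit(x, v, 1)  # polyfit returns slope first, then intercept
    return v - (a + b * x)


# Simple correlation between y and z, ignoring x
r_yz = np.corrcoef(y, z)[0, 1]

# Partial correlation: correlate the residuals of y on x with those of z on x
r_partial = np.corrcoef(residuals(y, x), residuals(z, x))[0, 1]

print(f"simple r_yz = {r_yz:.3f}, partial r_yz.x = {r_partial:.3f}")
```

In this simulated setting the simple correlation comes out positive and the residual-based correlation negative, mirroring the theatre example.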

Discussions • Fortunately, it is not necessary to go through all these steps to find the partial correlation coefficient. We can use the following formula: $r_{yz.x} = \dfrac{r_{yz} - r_{xy}\, r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$ • With rxy = 0.8, ryz = 0.2 and rxz = 0.6 this gives ryz.x = (0.2 - 0.48)/0.48 ≈ -0.58, clearly a negative value. • Solution 1 gives a different estimate of the correlation coefficient for each value of x. • If we assume that the correlation coefficient is the same in every stratum (i.e., at every fixed value of x), the estimates will be more or less close to one another, and close to -0.58 in this example. • If x, y and z jointly follow a trivariate normal distribution, then theoretically the correlation coefficient between y and z is the same for every value of x. • Thus Solution 1 needs no distributional assumptions but gives multiple answers, whereas Solution 2 gives a unique answer that is valid only under restrictive assumptions.
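As a quick check of the arithmetic, the standard first-order partial correlation formula, r_yz.x = (r_yz - r_xy r_xz) / sqrt((1 - r_xy²)(1 - r_xz²)), can be evaluated directly with the three correlations quoted in the example. The function name is an illustrative choice.

```python
import math


def partial_corr(r_yz, r_xy, r_xz):
    """First-order partial correlation of y and z given x."""
    return (r_yz - r_xy * r_xz) / math.sqrt((1 - r_xy**2) * (1 - r_xz**2))


# Correlations from the theatre example
r = partial_corr(r_yz=0.2, r_xy=0.8, r_xz=0.6)
print(round(r, 4))  # -0.5833
```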

Partial Correlation to Regression • Correlation and regression coefficients are related. In the equation y = a + b x, b is positive if and only if rxy is positive. Testing the significance of b is the same as testing the significance of rxy. • In the equation y = a + b x + c z, c is positive if and only if ryz.x is positive. Testing the significance of c is the same as testing the significance of ryz.x. • If we want to find the relation between y and z, and the variable x affects both, then we should include both x and z as regressors. • This is why the regression coefficients in a multiple linear regression are known as partial regression coefficients. • Here x is called the confounding variable. Not every such third variable is a confounding variable: a confounding variable should be a true cause of variation in the explained variable.
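The sign change of the coefficient of z once x enters the regression can be illustrated with a small least-squares sketch. The data are simulated under assumed coefficients (not taken from the lecture), and `numpy.linalg.lstsq` is used simply as a convenient least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated data (assumed coefficients): x drives both y and z
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(scale=0.8, size=n)
y = 1.0 * x - 0.4 * z + rng.normal(scale=0.5, size=n)

# Regression omitting the confounder: y = a + c*z
A_simple = np.column_stack([np.ones(n), z])
c_simple = np.linalg.lstsq(A_simple, y, rcond=None)[0][1]

# Regression including the confounder: y = a + b*x + c*z
A_full = np.column_stack([np.ones(n), x, z])
a_hat, b_hat, c_partial = np.linalg.lstsq(A_full, y, rcond=None)[0]

print(f"c without x: {c_simple:.3f}")   # positive in this simulation
print(f"c with x:    {c_partial:.3f}")  # negative, same sign as r_yz.x
```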

Another Illustration of Confounding • Diabetes is associated with hypertension. • Does diabetes cause hypertension? • Does hypertension cause diabetes? • Another way in which diabetes and hypertension may be related is when both are caused by a third variable, Factor X. For hypertension and diabetes, Factor X might be obesity. • In that case we should not conclude that diabetes causes hypertension; the two may have no direct causal relationship at all. We should rather say that: • The relationship between hypertension and diabetes is confounded by obesity. Obesity would be termed a confounding variable in this relationship.

Confounders are true causes of disease.

Definition of Confounding • A confounder: • 1) Is associated with exposure • 2) Is associated with disease • 3) Is NOT a consequence of exposure (i.e. not occurring between exposure and disease)

MEDIATING VARIABLE (SYNONYM: INTERVENING VARIABLE): EXPOSURE → MEDIATOR → DISEASE. AN EXPOSURE THAT PRECEDES A MEDIATOR IN A CAUSAL CHAIN IS CALLED AN ANTECEDENT VARIABLE.

Mediation • A mediation effect occurs when the third variable (mediator, M) carries the influence of a given independent variable (X) to a given dependent variable (Y). • Mediation models explain how an effect occurred by hypothesizing a causal sequence.

Confounding Vs. Mediation • In mediation the exposure occurs first, followed by the mediator and then the outcome, so the sequence conceptually follows an experimental design. • Confounders are often demographic variables that typically cannot be changed in an experimental design. Mediators are by definition capable of being changed and are often selected because they can be changed readily.

Another Example: No Confounding

A Different Example • A group of scientists wanted to find the effect of IQ and of the time spent studying for an examination on the examination result. The linear model they used was $y_t = \alpha + \beta x_t + \gamma z_t + e_t$. • They fitted the model and the fit was good. However, one of the scientists noticed that the residuals did not show a random pattern when the data were arranged in increasing order of IQ. They then investigated the behaviour of the data more closely, which was possible because the sample size was large. • They fixed IQ at different values and, for each, plotted a scatter diagram of the result against study hours. Every scatter diagram showed a linear relation, but the slope changed whenever the value of IQ was changed. Surprisingly, the slope increased systematically as IQ increased.

The Revised Model • Now look at the model again: $y_t = \alpha + \beta x_t + \gamma z_t + e_t$. • We interpret $\beta$ as the average change in y when x is increased by one unit with z held fixed. But why should the value of $\beta$ change when z is fixed at some other value? Ideally the intercept should absorb $\gamma z_t$, so that only the intercept changes and not the slope. • This means that the model was wrongly specified. If $\beta$ changes (here, increases) as z increases, then $\beta$ is not a constant. We may replace $\beta$ by $(\beta + \delta z_t)$ to get $y_t = \alpha + (\beta + \delta z_t) x_t + \gamma z_t + e_t$, i.e. $y_t = \alpha + \beta x_t + \gamma z_t + \delta x_t z_t + e_t$. • This phenomenon is known as the interaction effect between x and z. It is symmetric: one arrives at the same model by letting the coefficient of $z_t$ vary with $x_t$ in the same way.
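A minimal sketch of fitting the revised model: the product of the two regressors is added as an extra column and ordinary least squares is applied. The data are simulated; the coefficient values, the variable ranges and the reading of x as study hours and z as a standardized IQ score are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical regressors: x = study hours, z = standardized IQ
x = rng.uniform(0, 10, size=n)
z = rng.uniform(-2, 2, size=n)

# Assumed true coefficients of y = alpha + beta*x + gamma*z + delta*x*z + e
alpha, beta, gamma, delta = 1.0, 2.0, 0.5, 0.3
y = alpha + beta * x + gamma * z + delta * x * z + rng.normal(size=n)

# Design matrix with the interaction column x*z
X = np.column_stack([np.ones(n), x, z, x * z])
a_hat, b_hat, g_hat, d_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The slope of y on x now depends on the level of z: b_hat + d_hat * z
for z0 in (-1.0, 0.0, 1.0):
    print(f"z = {z0:+.0f}: slope on x = {b_hat + d_hat * z0:.3f}")
```

The printed slopes rise with z, which is the systematic pattern the scientists saw in their scatter diagrams.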

No interaction Vs. Interaction • No interaction: disease increases with age, and this association is the same for males and females. • Interaction: gender interacts with age if the effect of age on disease is not the same in each gender.

Examples • Suppose aspirin protects against heart attacks, but only in men and not in women. We then say that gender moderates the relationship between aspirin and heart attacks, because the effect is different in the two sexes. We can also say that there is an interaction between sex and aspirin in their effect on heart disease. • In individuals with high cholesterol levels, smoking produces a higher relative risk of heart disease than it does in individuals with low cholesterol levels. Smoking interacts with cholesterol in its effect on heart disease.

The Implications • The implication is that when x or z is increased there is an additional change in the expected value of y, over and above the linear effect. • If x is increased by one unit for fixed z, the change in y is $\beta + \delta z_t$ instead of $\beta$ only, and if z is increased by one unit for fixed x, the change in y is $\gamma + \delta x_t$. If both x and z are increased by one unit, the change in y is $\beta + \gamma + \delta x_t + \delta z_t + \delta$. • For binary variables taking only the values 0 and 1, the corresponding changes in y are $\beta$, $\gamma$ and $\beta + \gamma + \delta$ respectively, assuming that x and z both start at 0. This is clear from the following table. Expected values of y at different values of x and z:
          z = 0          z = 1
x = 0     α              α + γ
x = 1     α + β          α + β + γ + δ
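The change when both variables move together can be verified in a few lines. The derivation below is a sketch that writes the interaction model as E(y) = α + βx + γz + δxz; the symbols β, γ and δ are labels assumed here for the two slope coefficients and the interaction coefficient.

```latex
\begin{align*}
E(y \mid x+1,\, z+1) - E(y \mid x,\, z)
  &= \beta\,[(x+1) - x] + \gamma\,[(z+1) - z] + \delta\,[(x+1)(z+1) - xz] \\
  &= \beta + \gamma + \delta\,(x + z + 1) \\
  &= \beta + \gamma + \delta x + \delta z + \delta .
\end{align*}
```

Setting x = z = 0 gives β + γ + δ, which is the binary-variable case.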

The Implications (Continued) • Since y measures the effect (disease, say) of the exposures x and/or z, the number of cases of disease at each combination of exposure levels reflects these expected values, so the odds ratios will differ across levels. • Interaction between two variables (with respect to a response variable) is said to exist when the association between one of these variables (which may be called the exposure variable) and the response variable, generally measured by the odds ratio or relative risk, is different at different levels of the other exposure variable. • For example, the odds ratio that measures the association between cigarette smoking and lung cancer may be smaller among individuals who consume large quantities of beta carotene in their food than the analogous odds ratio among persons who consume little or no beta carotene.
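To make the idea concrete, the sketch below computes the smoking and lung cancer odds ratio separately at two levels of beta carotene intake. The 2 x 2 counts are invented for illustration only and carry no empirical meaning.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2 x 2 table.

    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


# Hypothetical counts (smokers vs non-smokers) within each intake level
strata = {
    "low beta carotene": (60, 40, 20, 80),
    "high beta carotene": (40, 60, 25, 75),
}

ors = {level: odds_ratio(*table) for level, table in strata.items()}
for level, value in ors.items():
    print(f"{level}: OR = {value:.1f}")
# low beta carotene: OR = 6.0
# high beta carotene: OR = 2.0
```

Because the two stratum-specific odds ratios differ, beta carotene intake would be described as interacting with smoking in these made-up data.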

THE INTERACTING OR EFFECT-MODIFYING VARIABLE IS ALSO KNOWN AS A MODERATOR VARIABLE. Diagram: EXPOSURE → DISEASE, with the MODERATOR acting on the arrow between them. A moderator variable is one that moderates or modifies the way in which the exposure and the disease are related. When an exposure has different effects on disease at different values of a variable, that variable is called a modifier.

Methods to reduce confounding • during study design: • Randomization • Restriction • Matching • during study analysis: • Stratified analysis • Mathematical regression

Randomized controlled trial • Randomized controlled trial: A method where the study population is divided randomly in order to mitigate the chances of self-selection by participants or bias by the study designers. Before the experiment begins, the testers will assign the members of the participant pool to their groups, using a randomization process such as the use of a random number generator. • For example, in a study on the effects of exercise, the conclusions would be less valid if participants were given a choice if they wanted to belong to the control group which would not exercise or the intervention group which would be willing to take part in an exercise program. The study would then capture other variables besides exercise, such as pre-experiment health levels and motivation to adopt healthy activities. From the observer’s side, the experimenter may choose candidates who are more likely to show the results the study wants to see or may interpret subjective results (more energetic, positive attitude) in a way favorable to their desires.
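A minimal sketch of the assignment step, using Python's standard random number generator. The participant labels, the function name and the two-group split are hypothetical; a real trial would follow its own allocation protocol.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(participants)  # copy, so the input list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant pool of 20 people
pool = [f"P{i:02d}" for i in range(1, 21)]
control, intervention = randomize(pool, seed=42)
print("control:     ", sorted(control))
print("intervention:", sorted(intervention))
```

Because group membership is decided by the generator before the experiment begins, neither the participants nor the experimenter choose who exercises.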

Case-Control Studies • In a case-control study the researcher retrospectively determines which individuals were exposed to the agent or treatment, or the prevalence of a variable, in each of the study groups. Matching distributes potential confounders equally between the two groups, cases and controls. For example, if somebody wants to study the causes of myocardial infarction and thinks that age is a probable confounding variable, each 67-year-old infarct patient is matched with a healthy 67-year-old "control" person. In case-control studies, the variables most often matched are age and sex. • Drawback: matched case-control studies are feasible only when it is easy to find controls, i.e., persons whose status with respect to all known potential confounding factors is the same as that of the case. Suppose a case-control study attempts to find the cause of a given disease in a person who is 1) 45 years old, 2) African-American, 3) from Alaska, 4) an avid football player, 5) vegetarian, and 6) working in education. A theoretically perfect control would be a person who, in addition to not having the disease being investigated, matches all these characteristics and has no diseases that the patient does not also have; finding such a control would be an enormous task.

A Hypothetical Example

Cohort studies • Cohort studies: A group of people is chosen who do not have the outcome of interest (for example, myocardial infarction). The investigator then measures a variety of variables that might be relevant to the development of the condition. Over a period of time the people in the sample are observed to see whether they develop the outcome of interest (that is, myocardial infarction). • Internal controls: In single cohort studies, those people who do not develop the outcome of interest are used as internal controls. • External controls: Where two cohorts are used, one group has been exposed to or treated with the agent of interest and the other has not, thereby acting as an external control. • A degree of matching is also possible in cohort studies, by creating cohorts of people who share similar characteristics, so that all cohorts are comparable with regard to the possible confounding variable. For example, if age and sex are thought to be confounders, only males aged 40 to 50 would be included in a cohort study assessing the risk of myocardial infarction in cohorts that are either physically active or inactive. • Drawback: In cohort studies, the over-exclusion of input data may lead researchers to define too narrowly the set of similarly situated persons for whom they claim the study to be useful. Similarly, "over-stratification" of input data within a study may reduce the sample size in a given stratum to the point where conclusions drawn from that stratum alone are no longer statistically reliable.

Double blinding • Double blinding conceals the experimental group membership of the participants from both the trial population and the observers. By preventing the participants from knowing whether they are receiving treatment or not, the placebo effect should be the same for the control and treatment groups. By preventing the observers from knowing the participants' group membership, there should be no bias from researchers treating the groups differently or interpreting the outcomes differently.

Stratification • Stratification: As in the example above, physical activity is thought to be a behaviour that protects against myocardial infarction, and age is assumed to be a possible confounder. The sampled data are then stratified by age group, which means that the association between activity and infarction is analysed separately within each age group. If the risk ratios within the age groups (age strata) are similar to one another but differ from the crude, unstratified risk ratio, age must be viewed as a confounding variable; if the stratum-specific risk ratios differ markedly from one another, age is instead modifying the effect (interaction). Statistical tools, among them the Mantel–Haenszel methods, account for the stratification of data sets.
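The comparison of crude, stratum-specific and pooled risk ratios can be sketched as follows. The counts are invented for illustration (they are not study data), and the pooled estimate uses the usual Mantel–Haenszel risk-ratio formula, RR_MH = Σ(a_i n0_i / N_i) / Σ(c_i n1_i / N_i), where a_i and c_i are the exposed and unexposed cases and n1_i, n0_i the exposed and unexposed totals in stratum i.

```python
def risk_ratio(a, n1, c, n0):
    """Risk in the exposed (a/n1) divided by risk in the unexposed (c/n0)."""
    return (a / n1) / (c / n0)


def mantel_haenszel_rr(tables):
    """Mantel-Haenszel pooled risk ratio over strata of (a, n1, c, n0)."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in tables)
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in tables)
    return num / den


# Hypothetical counts per age stratum:
# (infarcts among active, number active, infarcts among inactive, number inactive)
strata = {
    "40-49": (8, 800, 4, 200),
    "60-69": (10, 200, 80, 800),
}

# Crude table: add the strata together, ignoring age
crude = tuple(sum(col) for col in zip(*strata.values()))

rr_crude = risk_ratio(*crude)
rr_by_age = {age: risk_ratio(*t) for age, t in strata.items()}
rr_mh = mantel_haenszel_rr(strata.values())

print(f"crude RR = {rr_crude:.2f}")            # about 0.21
for age, rr in rr_by_age.items():
    print(f"age {age}: RR = {rr:.2f}")         # 0.50 in both strata
print(f"Mantel-Haenszel RR = {rr_mh:.2f}")     # 0.50
```

In these made-up data the two age strata agree with each other (risk ratio 0.5) but not with the crude figure, because the older, higher-risk group is also the less active one.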

Stratification of Confounding Variable • When ascertaining the association between two factors, exposure and disease, we may have: • Both discrete with 2 levels each: 2 x 2 table • Both discrete with more levels: r x c table • Level of disease continuous and exposure discrete or continuous: usual regression • Level of disease discrete and exposure discrete or continuous: regression, but needs special attention • When a third variable is considered, it may be included as an additional regressor or one may use stratification: • Repeat the analysis within every level of that variable • E.g. gender, age, breed, farm etc. • Stratification addresses confounding and also reveals interaction

The Problem with Stratification as a Solution to Confounding • Stratification may sometimes cause bias. Consider a pair of dice, die A and die B. They are of course independent: the roll of one tells you nothing about the roll of the other. What if we stratify on the sum of the dice? • Look at the stratum where the sum is, for example, 7. In this stratum, if we know A then we know B: if A is 1, B must be 6; if A is 3, B must be 4. • Earlier, we said that A and B were independent. Once we stratify on the sum, however, knowing A tells us B. We have induced a relationship between A and B that did not otherwise exist.
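The dice argument can be checked by enumerating all 36 equally likely outcomes. The sketch below computes the correlation between A and B over all outcomes and then within the stratum where the sum is 7; the helper function is an illustrative, hand-written Pearson correlation.

```python
from itertools import product

# All 36 equally likely outcomes (A, B) of two fair dice
rolls = list(product(range(1, 7), repeat=2))


def corr(pairs):
    """Pearson correlation of the two coordinates of a list of pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / (var_a * var_b) ** 0.5


overall = corr(rolls)
within_sum_7 = corr([(a, b) for a, b in rolls if a + b == 7])

print(f"all outcomes: r = {overall:.2f}")       # 0: A and B are independent
print(f"sum equal 7:  r = {within_sum_7:.2f}")  # -1: B is determined by A
```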

Holding the Extraneous Variable Constant • For example, if you want to control for gender using this strategy, you would include only females in your research study (or only males). If there is still a relationship between the variables of interest, say motivation and test grades, you can tell that it is not due to gender, because gender has been made a constant (by including only one gender in the study).

Statistical Control • Statistical control is based on the following logic: examine the relationship between the variables at each level of the control/extraneous variable. In practice the computer does this for you, but that is what it does. • One type of statistical control is called partial correlation. This technique shows the correlation between two quantitative variables after statistically controlling for one or more quantitative control/extraneous variables. • A second type of statistical control is called ANCOVA (analysis of covariance). This technique shows the relationship between a categorical independent variable and a quantitative dependent variable after statistically controlling for one or more quantitative control/extraneous variables.

LOGISTIC REGRESSION A Note Compiled by MANORANJAN PAL ECONOMIC RESEARCH UNIT INDIAN STATISTICAL INSTITUTE 203 BARRACKPUR TRUNK ROAD KOLKATA – 700 108

Binary Dependent Variable • In this case the dependent variable takes only one of two values for each unit/individual. • Often an individual economic agent must choose one of two alternatives, for example: • A household must decide whether to buy or rent a suitable dwelling. • A consumer must choose which of two types of shopping areas to visit. • A person must choose one of two modes of transportation available. • A person must decide whether or not to attend college.

The Linear Probability Model (LPM) Let $y_i = 1$ if an event A occurs and $y_i = 0$ if the event does not occur. Suppose the probability that it occurs is $P_i$. Then $E(y_i) = 1 \times P_i + 0 \times (1 - P_i) = P_i$. We assume that $P_i$ depends on the explanatory variable $x_i$, which is a vector. Thus $y_i = P_i + e_i = x_i'\beta + e_i$, $i = 1, 2, \ldots, T$, ...(01) where T is the size of the sample. For a given $x_i$ we now have the following distribution of the error ...(02):
$y_i$     $e_i$                  $\Pr(e_i)$
1         $1 - x_i'\beta$        $x_i'\beta$
0         $-x_i'\beta$           $1 - x_i'\beta$

Problems with LPMs • $E(y_i) = P_i = x_i'\beta$ may not lie within the unit interval. • $\mathrm{Var}(e_i) = (-x_i'\beta)^2 (1 - x_i'\beta) + (1 - x_i'\beta)^2 (x_i'\beta) = (x_i'\beta)(1 - x_i'\beta) = E(y_i)\,[1 - E(y_i)]$ ⇒ introduces heteroscedasticity. • $e_i$ takes only two values, $-x_i'\beta$ and $1 - x_i'\beta$ ⇒ the normality assumption is violated. However, • $E(e_i) = (1 - x_i'\beta)(x_i'\beta) + (-x_i'\beta)(1 - x_i'\beta) = 0$ ⇒ the only solace.
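The two moments of the error term can be checked numerically from its two-point distribution. In this small sketch p stands for the success probability (the value of x_i'β at a given x_i); the function name is illustrative.

```python
def error_moments(p):
    """Mean and variance of e = y - p when y is 1 with probability p, else 0."""
    # e takes the value (1 - p) with probability p and (-p) with probability (1 - p)
    outcomes = [(1 - p, p), (-p, 1 - p)]
    mean = sum(e * prob for e, prob in outcomes)
    var = sum((e - mean) ** 2 * prob for e, prob in outcomes)
    return mean, var


for p in (0.1, 0.5, 0.9):
    mean, var = error_moments(p)
    print(f"p = {p}: E(e) = {mean:.3f}, Var(e) = {var:.3f}, p(1-p) = {p * (1 - p):.3f}")
```

The variance changes with p, and hence with x_i, which is exactly the heteroscedasticity noted above.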

GLS Estimation of LPM All T observations can be written together as $y = X\beta + e$. It follows that the covariance matrix of e is $\mathrm{Cov}(e) = E(ee') = \Phi$, where $\Phi$ is a diagonal matrix with ith diagonal element $E(y_i)\,[1 - E(y_i)]$. Suppose the number of choice outcomes $y_i$ observed for each $x_i$, say $n_i$, is just one, that is, $n_i = 1$. In that case feasible GLS can be carried out by first estimating $\beta$ by OLS, which is consistent though inefficient, giving $b$, and then constructing $\hat\Phi$ as a diagonal matrix with ith diagonal element $x_i'b\,(1 - x_i'b)$. Since $\hat\Phi$ is diagonal, feasible GLS is easily applied using WLS (Weighted Least Squares). That is, multiplying each observation on the dependent and independent variables by the square root of the reciprocal of the estimated error variance yields a transformed model, OLS estimation of which produces the feasible GLS estimates. Caution: the transformed (weighted) regression has no intercept term, because the constant column is also multiplied by the weight.

The Problem with GLS Estimation of LPM While this estimation procedure is consistent, an obvious difficulty exists. If $x_i'b$ falls outside the (0,1) interval, the matrix $\hat\Phi$ has negative or undefined elements on its diagonal. If this occurs one must modify $\hat\Phi$, either by deleting the observations for which the problem occurs or by setting the value of $x_i'b$ to 0.01 or 0.99, say, and proceeding accordingly. While this does not affect the asymptotic properties of the feasible GLS procedure, it is clearly an awkward position to be in, especially since predictions based on the feasible GLS estimates, $\hat y_i = x_i'\tilde\beta$, may also fall outside the (0,1) interval.
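The two-step procedure for n_i = 1 can be sketched as follows: OLS first, then weighted least squares with weights built from the OLS fitted probabilities, which are clipped to [0.01, 0.99] as suggested above. The data are simulated and the true coefficients (0.2 and 0.5) are assumptions chosen so that the true probabilities stay inside the unit interval.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2000

# Simulated binary data with true probabilities P_i = 0.2 + 0.5 * x_i
x = rng.uniform(0, 1, size=T)
P = 0.2 + 0.5 * x
y = (rng.uniform(size=T) < P).astype(float)
X = np.column_stack([np.ones(T), x])

# Step 1: OLS, consistent but inefficient
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 2: estimated error variances p(1 - p) from clipped fitted probabilities
p_hat = np.clip(X @ b_ols, 0.01, 0.99)
w = 1.0 / np.sqrt(p_hat * (1.0 - p_hat))

# Step 3: WLS as OLS on the transformed data. Every column, including the
# column of ones, is multiplied by w, so the transformed model has no intercept.
b_fgls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]

print("OLS: ", np.round(b_ols, 3))
print("FGLS:", np.round(b_fgls, 3))
```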

The Case of Repeated Observations Let $n_i > 1$. The sample proportion of occurrences of the event is $p_i = y_i/n_i$, where $y_i$ is now the number of successes out of $n_i$ trials. Since $E(p_i) = P_i = x_i'\beta$, the model can be rewritten as $p_i = P_i + e_i = x_i'\beta + e_i$, $i = 1, 2, \ldots, T$, where $e_i$ is now the difference between $p_i$ and its expectation $P_i$. The full set of T observations is then written as $p = X\beta + e$. Since the sample proportions $p_i$ are related to the true proportions $P_i$ by $p_i = P_i + e_i$, the error term $e_i$ has zero mean and variance $P_i(1 - P_i)/n_i$, the same as a sample proportion based on $n_i$ Bernoulli trials.

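Estimation Under the Case of Repeated Observations The covariance matrix of e is $\Phi = \mathrm{diag}\big(P_i(1 - P_i)/n_i\big)$, and the appropriate estimator for $\beta$ is the GLS estimator $\hat\beta = (X'\Phi^{-1}X)^{-1}X'\Phi^{-1}p$. If the true proportions $P_i$ are not known, then a feasible GLS estimator is $\tilde\beta = (X'\hat\Phi^{-1}X)^{-1}X'\hat\Phi^{-1}p$, where $\hat\Phi$ is $\Phi$ with each $P_i$ replaced by an estimate such as the sample proportion $p_i$.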
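A minimal sketch of feasible GLS with repeated observations, assuming the standard GLS form (X'Φ⁻¹X)⁻¹X'Φ⁻¹p with the diagonal of Φ estimated by p_i(1 - p_i)/n_i from the sample proportions. The design points, the number of repetitions and the true coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical design: T = 11 values of x, each observed n_i = 200 times
x = np.linspace(0.0, 1.0, 11)
n_i = np.full(x.size, 200)
P = 0.2 + 0.5 * x                    # assumed true probabilities

successes = rng.binomial(n_i, P)     # y_i: number of successes out of n_i
p = successes / n_i                  # sample proportions

X = np.column_stack([np.ones(x.size), x])

# Estimated diagonal of the error covariance matrix: p_i (1 - p_i) / n_i
phi_hat = p * (1.0 - p) / n_i

# Feasible GLS: (X' Phi^-1 X)^-1 X' Phi^-1 p, using the diagonal structure
Xt_phi_inv = X.T / phi_hat           # each column t of X' is divided by phi_hat[t]
b_fgls = np.linalg.solve(Xt_phi_inv @ X, Xt_phi_inv @ p)

print("feasible GLS estimate:", np.round(b_fgls, 3))
```

The sketch would fail if some p_i were exactly 0 or 1, since the estimated variance would then be zero; with the repetitions assumed here that is practically impossible.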

Some Alternative Estimations

Questionable Value of R² as a Measure of Goodness of Fit • The conventionally computed R² is of limited value in dichotomous response models. To see why, consider the following figure. Corresponding to a given X, Y is either 0 or 1. Therefore, all the Y values lie either along the X axis or along the line corresponding to 1, and generally no LPM is expected to fit such a scatter well. As a result, the conventionally computed R² is likely to be much lower than 1 for such models. In most practical applications the R² ranges between 0.2 and 0.6. R² in such models will be high, say in excess of 0.8, only when the actual scatter is very closely clustered around two points A and B (say), for in that case it is easy to fix the straight line by joining the two points A and B. In this case the predicted yi will be very close to either 0 or 1. • Thus, use of the coefficient of determination as a summary statistic should be avoided in models with a qualitative dependent variable.
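The point can be illustrated with a simulation in which the linear probability model is the true model and R² is nevertheless far below 1. The probabilities P_i = 0.05 + 0.9 x_i and the sample size are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 2000

# Simulated binary responses; the LPM is correctly specified here
x = rng.uniform(0, 1, size=T)
P = 0.05 + 0.9 * x
y = (rng.uniform(size=T) < P).astype(float)

# OLS fit of the linear probability model
X = np.column_stack([np.ones(T), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b

# Conventional R^2 = 1 - residual sum of squares / total sum of squares
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```

Even though the straight line is the right model in this simulation, the 0/1 responses cannot lie close to it, so R² stays low.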

LPM: The Case of High R²

The difficulty with the linear probability model Unfortunately, the predictor obtained from feasible GLS estimation can fall outside the zero-one interval. To ensure that the predicted proportion of successes falls within the unit interval, at least over the range of $x_i$ of interest, one may impose inequality restrictions of the form $0 \le x_i'\beta \le 1$, or the number of repetitions $n_i$ must be large enough for the sample proportion $p_i$ to be a reliable estimate of the probability $P_i$. The situation is illustrated in the following figure for the case $x_i'\beta = \beta_1 + \beta_2 x_{i2}$. Figure 1: Linear and non-linear probability models.

The difficulty with the linear probability model • As we have seen, the LPM is plagued by several problems, such as (1) non-normality of ei, (2) heteroscedasticity of ei, (3) the possibility of predicted values lying outside the 0–1 range, and (4) generally low R² values. Some of these problems are surmountable. For example, we can use WLS to resolve the heteroscedasticity problem, or increase the sample size to minimize the non-normality problem. By resorting to restricted least squares or mathematical programming techniques we can even make the estimated probabilities lie in the 0–1 interval. • But even then the fundamental problem with the LPM is that it is not logically a very attractive model, because it assumes that Pi = P(y=1|x) increases linearly with x, that is, the marginal or incremental effect of x remains constant throughout. This seems patently unrealistic. In reality one would expect Pi to be nonlinearly related to xi.

Alternatives to LPM As an alternative to the linear probability model, the probabilities Pi can be taken to be a nonlinear function of the explanatory variables. In the next sections two particular nonlinear probability models are discussed, based on the cumulative distribution functions of normal and logistic random variables. Two kinds of estimation procedures are applied: feasible GLS when repeated observations are available, and ML when ni = 1 or is small.

Probit and Logit Models Two choices of the nonlinear function Pi = g(xi) are the cumulative distribution functions of normal and logistic random variables. The former gives rise to the probit model and the latter to the logit model; that is, the logit model is based on the logistic cumulative distribution function (CDF).
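Both link functions map any real value of the linear index into the open interval (0, 1), which is what the linear probability model cannot guarantee. The sketch below evaluates the standard logistic CDF, 1/(1 + exp(-t)), and the standard normal CDF at a few illustrative index values t; the function names are chosen here for illustration.

```python
import math


def logistic_cdf(t):
    """Standard logistic CDF, the link behind the logit model."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_cdf(t):
    """Standard normal CDF, the link behind the probit model."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# Illustrative values of the linear index
for t in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"t = {t:+.0f}: logit P = {logistic_cdf(t):.4f}, probit P = {normal_cdf(t):.4f}")
```

Both curves equal 0.5 at t = 0 and flatten out near 0 and 1, so the marginal effect of x is no longer constant.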

The Logit Model
