Applied Epidemiologic Analysis

Applied Epidemiologic Analysis Patricia Cohen, Ph.D. Henian Chen, M.D., Ph. D. Teaching Assistants Julie Kranick Sylvia Taylor Chelsea Morroni Judith Weissman

Lecture 7 Categorical analysis Conditional logistic regression Unconditional logistic regression Introduction to stratifiers

Objectives • To understand the basic assumptions of analyses of case-control and cohort data • To see how assumptions about the predictor variables differ between categorical analyses and some regression models • To see the connection between stratified analyses and analyses incorporating all stratifiers as predictors

Categorical analysis:Analyses of tables of frequencies Assumptions / requirements • Adequate sample size in each table cell and in total • Independence of outcomes • no contagion effects • single event per person • For rates, homogeneity: probability of outcome is uniform for all time units in a stratum • e.g., doesn’t matter if 6 people are observed for 10 years or 10 people are observed for 6 years

Categorical analysis Does not assume that distributions of exposure and other predictors are fixed. In contrast, ordinary regression analysis assumes that distributions of independent variables are fixed (selected or created by the researchers, rather than whatever distributions happen to characterize the sampled population). Ordinary or “unconditional” logistic regression also assumes that independent variables are fixed.

Cateforical analysis of incidence rates:A single group in comparison to some expected rate Incidence per time unit in exposed group Incidence per time unit expected in the reference population for the same distribution of person-time (e.g., based on morbidity rates for equivalent age groups)

Categorical analysis:Single group in comparison to some expected rate Since person-time distribution is contant: ratio = standardized morbidity ratio Confidence limits on this rate ratio employ the Poisson distribution and maximum likelihood estimation. Should be adequate when E > 5.

Categorical analysis of 2 groups, exposed and unexposed Maximum likelihood estimates using the Poisson model are used to estimate rate ratios and rate difference or risk ratios and risk differences. Hand calculation of these estimates is rare, partly because of the inclusion of multiple confounders and/or exposures in the models.

Selecting an analytic model forcase-control (or cohort) data The ordinary least squares (OLS) method of analyzing dichotomous outcomes is problematic because the formal assumptions of the model (homoscedasticity) are necessarily violated. Nevertheless, for case-control data with similar sample sizes in the two groups, conclusions from OLS and logistic regression may well be similar.

Ordinary Least Squares This model uses as a link function the “identity” function: a difference in the value of the predictor is (linearly) related to a difference in the value of the outcome. When the outcome is disease or non-disease, this is equivalent to a difference in the proportion with the disease (incidence or prevalence). For a binary exposure the B = difference in proportion, or risk difference.

Logistic Regression Model The link function estimated by the logistic regression model (using maximum likelihood methods) is the log odds or logit. In this model, for a binary exposure the B = difference in the log odds of the outcome (disease). It is equivalent to an exponential odds model, so taking the anti-log provides the odds ratio, an estimate of risk ratio.

Other models • Exponential risk models – a log- linear risk model (requires an estimate of risk in the source population) • Probit model – assumes a normal distribution underlying outcome; used in bioassay and economics Note: These alternative models are designed to provide apprpriate statistical tests, but do not necessarily match the actual biological mechanisms.

Stratification of case-control data • A means of equating for stratifiers • Most often on sex and age categories • Note: If there is a non-trivial age difference there will be a remaining mean difference within categories.

Stratifying variables: standardization Standardization of rates or risks with regard to a stratifying variable. Example: Control group = 40% male Case group = 60% male Can standardize the case group to the control by weighting every female case by 1.5 and every male case by .67. So the sum of weights still = N in the case group.

Stratifying variables: standardization Thus, for every 100 cases we have: 60 males * .67 (= 40) 40 females * 1.5 (=60) Weighted N = 100. Could, alternatively, weight both case and control groups to equal male and female sizes.

Weighting for rate or risk difference, or unconditional logistic regression This weighting to produce equality on predictors can be done for hand calculation of rate or risk differences or for computer analyses of data by conditional or unconditional logistic regression. Note: This is only one reason for weighting observations. Another common reason is to take into account sampling strategies with unequal probabilities for inclusion. Such strategies often over-sample certain strata in order to improve the statistical power for analyses of subgroups.

Weighting It is useful to see this as analogous to what the analytic program does when inclusion of a predictor “equates” groups by removing effects of counfounders. Simple standardization assumes a uniform effect of exposure across strata: each stratum provides an estimate of the same quantity. Statistical tests of homogeneity are commonly used to decide whether this assumption is warranted.

Mantel – Haenszel Estimation Mantel – Haenszel estimation of uniform rate differences (using weights as described above applied to person – time) Preferred when some strata have fewer than 10 cases Unbiased, unlike maximum likelihood estimates, but larger SE (much larger for rate difference, not much for rate ratio)

First Study : Wine drinking and risk of non-Hodgkin’s lymphoma among men in the United States: a population based case-control study Reference:Nathaniel C. Briggs, Robert S. Levine, Linda D. Bobo, William P. Haliburton, Edward A. Brann, and Charles H. Hennekens, American Journal of Epidemiology, 156, No. 5, 454-462

The problem: Lymphoma study Non-Hodgkin’s lymphoma (NHL) is the fifth most common cancer in the United States with etiology mostly unknown. Can exploration of protective factors help move toward etiological understanding? Specifically, will this study strengthen prior weak evidence of lower NHL in wine drinkers?

Population studied, study design, and sample size : Lymphoma study 960 cases of NHL males born 1929 – 1953 and diagnosed 1984 – 1988 (without specific known risks such as HIV) 1717 controls of males recruited through random digit dialing and matched geographically

Measurement issues: Lymphoma Data collected by interviews regarding life-time habits Selection and inclusion of predictors in the analysis: * All odds ratios (OR) are adjusted for age, race/ethnicity, cancer registry, smoking history, and education. Odds ratios for each alcohol beverage type are adjusted for the other types. All odds ratios are in reference to nondrinkers.

The effect being estimated: Lymphoma Odds ratios of NHL associated with alcohol consumption by type and quantity over the life-time Basic analysis to answer study questions: Logistic regression analysis Test for the significance of the trend (dose-response) in the OR as dose increases

Conclusions: Lymphoma “Among wine drinkers, there was a significant linear decrease in risk of NHL with increasing quantity of wine intake. A more than twofold decrease in risk was seen for consumption of one wine drink or more per day.” Note that the p for trend tests the dose-response aspect.

Conclusions: Lymphoma Early age of onset of drinking was associated with decreased risk of NHL specifically for wine drinkers. Discussed biologic plausibility, probable effects of self-report, and data limitations (biases generally would be expected to lower effects) and age-sex limitations of sample.

Second Study : Occupation and Adult Gliomas Reference:Susan E. Carozza, Margaret Wrensch, Rei Miike, Beth Newman, Andrew F. Olshan, David A. Savitz, Michael Yost and Marion Lee American Journal of Epidemiology, 152, No 9, 838 - 846.

The problem: Gliomas Gliomas are the most common form of primary malignant brain tumor in adults. The etiology is largely unknown but prior evidence implicates occupational exposures associated with certain chemically-exposed industrial, agricultural and blue-collar workers.

Population studied, study design, and sample size : Gliomas 492 incident cases in San Francisco bay area, age over 20 462 controls recruited through random digit dialing, matched by : 5 year age group gender ethnicity (Note: 1/3 declined to participate. Controls more educated because of participation bias.)

Measurement issues: Gliomas Control variables: age (20- 54 vs 55+), gender, years of education, race Because of rapid death of cases, many proxy informants needed to supply information. How might these interviews be biased? Are the controls likely to be adequate?

Analyses • Exposure measures: • All jobs held at least 6 months in lifetime • All jobs up to 10 years previously (assuming a 10 year latency) • Within each, • ever employed • < 10 years • => 10 years

Logic of study: Gliomas If real, the association should increase with longer exposure. Also, if real, the effect should be more apparent when the latency period is excluded.

Gliomas Odd Ratios Virtually no odds ratios were statistically significantly different from 1.0. Nevertheless, several were discussed. Is this sensible?

Applied Epidemiologic Analysis