How to use data to get the right answer

How to use data to get “The Right Answer”

Donna Spiegelman

Departments of Epidemiology and Biostatistics

Harvard School of Public Health


Slide 2

- Standard designs & analyses sometimes do not adequately control for

- confounding

- information bias

- selection bias

Wrong answer?

- Agreed: We can be doing a better job

- Not agreed: HOW

Slide 3


What do we do?

“industry standard” END of mainstream epi methods

collect data on known & suspected time-varying confounders

MSMs, G-causal algorithm

Slide 4
Confounding – outstanding problems

  • unmeasured confounding

    • known or suspected confounders

    • unknown confounders

Fact: ~ 47% of US breast cancer incidence explained by known risk factors (Madigan et al., JNCI, 1987:1681-1695)

R² in most epi regressions (blood pressure, serum hormones): 20%-40% (Pediatric Task Force on BP Control in Children, Pediatrics, 2004; Hankinson, personal communication)

Undiscovered genes?

Unimagined environmental factors? Complex non-linear interactions?

Slide 5

Solution to confounding by unknown risk factors: randomization

VERY limited applicability

Outstanding questions:

a few strong risk factors or many weak ones?

many rare ones or a few common ones?

modeling of scenarios: do biases cancel?


Slide 6

Unmeasured confounding by known or suspected risk factors:

We can use the data to get ‘the right answer’!

Design: two-stage

Stage 1 (Di, Ei, C1i), i = 1, . . . , n

Stage 2 (Di, Ei, C1i, C2i), i = 1, . . . , n2

(Di, Ei, C1i, . ), i = n2+ 1, . . . , n1 + n2

n1 >> n2

Analysis: MLE of 2-stage likelihood


Weinberg & Wacholder, 1990; Zhao & Lipsitz, 1992;

Robins et al., 1994; + many others

Cain & Breslow, AJE, 1988

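Slide 7

f (D | E, C1, C2; β): pdf of complete data

Pr (I | D, E, C1), I = 1 if in stage 2, 0 otherwise

\[ f(D, I \mid E, C_1; \beta, \theta) = \int \Pr(I \mid D, E, C_1)\, f(D \mid E, C_1, c_2; \beta)\, f(c_2 \mid E, C_1; \theta)\, dc_2 \]

log-likelihood of 2-stage design =

\[ \sum_{\text{Stage 1}} \log f(D_i, I_i \mid E_i, C_{1i}; \beta, \theta) + \sum_{\text{Stage 2}} \log f(D_i \mid E_i, C_{1i}, C_{2i}; \beta) + \sum_{\text{Stage 2}} \log f(C_{2i} \mid E_i, C_{1i}; \theta) \]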
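The structure of the 2-stage likelihood can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not in the slides: D, E and C2 are all binary, C1 is omitted, both models are logistic, and sampling into stage 2 is completely at random, so Pr(I | D, E, C1) is constant and dropped. The function and parameter names are hypothetical.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def two_stage_loglik(beta, theta, stage1_only, stage2):
    """Log-likelihood for a simplified two-stage design.

    beta  = (b0, bE, bC): logit Pr(D=1 | E, C2) = b0 + bE*E + bC*C2
    theta = (t0, t1):     logit Pr(C2=1 | E)    = t0 + t1*E
    stage1_only: list of (d, e) for subjects with C2 unobserved
    stage2:      list of (d, e, c2) for subjects with C2 observed
    """
    b0, bE, bC = beta
    t0, t1 = theta

    def p_d(d, e, c):
        p = expit(b0 + bE * e + bC * c)
        return p if d == 1 else 1.0 - p

    def p_c(c, e):
        q = expit(t0 + t1 * e)
        return q if c == 1 else 1.0 - q

    ll = 0.0
    # Stage 1 only: C2 is missing, so sum (integrate) it out.
    for d, e in stage1_only:
        ll += math.log(sum(p_d(d, e, c) * p_c(c, e) for c in (0, 1)))
    # Stage 2: complete data contribute the disease model and the C2 model.
    for d, e, c in stage2:
        ll += math.log(p_d(d, e, c)) + math.log(p_c(c, e))
    return ll
```

Maximizing this function over beta and theta with any general-purpose optimizer would give the MLE for this toy version of the design.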

Slide 8

Example: Kyle Steenland, retrospective cohort study of lung cancer in relation to occupational silica exposure

(Steenland & Greenland, AJE 2004;160:384-392)

f (D | E, C); E = silica, C = smoking

\[ f(D \mid E) = \sum_{j} f(D \mid E, C = j)\, \Pr(C = j \mid E) \]

Pr (C = j | Ei) = [formula not preserved in the transcript]

n1 silica workers in retrospective cohort study

n2 silica workers in 1987 smoking prevalence study

n3 NHIS participants on general population smoking rates in 1986

n4 ACS prospective cohort data on smoking & lung cancer

Likelihood (silica + 1987 smoking data + US smoking data + ACS lung cancer & smoking data)

= Σ over silica cohort of log [f(Di | Ei)] + Σ over 1987 silica smoking data of log [ … ] (remaining terms not preserved in the transcript)



r=1,…, R levels of exposure

s=1,…, S levels of smoking

could treat as known

  • assume distribution of smoking during entire period ~ 1987
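The key step in this example, averaging the smoking-specific disease model over the smoking distribution among exposed workers, can be illustrated as follows. This is only a sketch: the risks and smoking prevalences below are made-up numbers, not values from Steenland & Greenland, and `marginal_risk` is a hypothetical helper name.

```python
def marginal_risk(risk_by_smoking, smoking_dist):
    """Return sum_j f(D=1 | E, C=j) * Pr(C=j | E) for one exposure level."""
    if len(risk_by_smoking) != len(smoking_dist):
        raise ValueError("need one risk per smoking level")
    if abs(sum(smoking_dist) - 1.0) > 1e-9:
        raise ValueError("smoking distribution must sum to 1")
    return sum(r * p for r, p in zip(risk_by_smoking, smoking_dist))

# Hypothetical numbers for illustration only (not from the study).
risk_by_smoking = [0.01, 0.05, 0.10]  # never, former, current smokers
smoking_dist = [0.3, 0.2, 0.5]        # Pr(C = j | E), e.g. from a smoking survey
risk = marginal_risk(risk_by_smoking, smoking_dist)
```

In the slide's setup the smoking distribution comes from separate data sources and may either be estimated jointly or treated as known.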

Slide 9

Obstacles:

software? Offsets + weights in PROC GENMOD



Result: The right answer?

Is it worth it?

Slide 10

INFORMATION BIAS:

What do we usually do?


What can we do?


main study/validation study measurement error methods



Carroll, Ruppert, Stefanski, 1995, Chapman + Hall

Rosner et al., AJE, 1990, 1992

Spiegelman, “Reliability studies” and “Validation studies”, Encyclopedia of Biostatistics

Robins et al., JASA, 1994

Slide 11

EXAMPLE

- 1731 men free of CHD (non-fatal MI, fatal CHD) at exam 4

- Followed for 10 years for CHD incidence (163 events, cumulative incidence = 9.4%)

- 1346 men with all risk factor information at exams 2+3 (subgroup of 1731 men)

- Risk factors in main study: Age, BMI, Serum Cholesterol, Serum Glucose, Smoking, SBP

- Risk factors in reproducibility study: Serum Cholesterol, BMI, Serum Glucose, SBP, Smoking

Slide 12

Example: (from Rosner, Spiegelman, Willett; AJE, 1992)

Framingham Heart Study

Reliability study: (n = 1346 men)

Subject i’s observed value at time j

Subject i’s true mean

Reliability Coefficients

CHOL 75%

GLUC 52%

BMI 95%

SBP 72%

Slide 13

Assumptions

1. Measurement error model



2. Disease incidence model



  • Pr (Di) is small

  • Measurement error independent of disease status

4. Reliability substudy “representative” of main study

Slide 14

The Procedure

― For one variable measured with unbiased, additive error

Z=X + U, where Corr (X,U) = 0 {simplest case}

Step 1. Run a logistic regression of D on Z, U in main study

Z: measured with error

U: measured without error (>1)

Slide 15

Step 2. Estimate reliability coefficient from reliability substudy (n2 subjects, r replicates)

Need same # of replicates per subject



within-person variance (estimated)
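The formulas for Step 2 did not survive in this transcript. One standard way to estimate a reliability coefficient from balanced replicates is the one-way ANOVA intraclass correlation. The sketch below shows that estimator; it is offered as an assumption about what the slide computed, not a reproduction of it. `reliability_coefficient` is a hypothetical name.

```python
from statistics import mean

def reliability_coefficient(replicates):
    """One-way ANOVA estimate of the reliability (intraclass correlation).

    replicates: list of per-subject lists, each holding the same number
    r >= 2 of repeated measurements.
    Returns (reliability, within_person_variance).
    """
    n = len(replicates)
    r = len(replicates[0])
    if r < 2 or any(len(row) != r for row in replicates):
        raise ValueError("need the same number (>= 2) of replicates per subject")
    subject_means = [mean(row) for row in replicates]
    grand_mean = mean(subject_means)
    # Between-person and within-person mean squares.
    msb = r * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(replicates, subject_means)
              for x in row) / (n * (r - 1))
    between_var = (msb - msw) / r  # estimated between-person variance
    return between_var / (between_var + msw), msw

# Toy data: 3 subjects, 2 replicates each.
rho, within_var = reliability_coefficient([[1, 3], [5, 7], [9, 11]])
```

The balanced-design check mirrors the slide's requirement that every subject has the same number of replicates.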

Slide 16

Step 3. Correct.





This contributes much less.

(Donner, Intl Stat Review, 1986)

95% C.I. for odds ratio:

= biologically meaningful comparison, e.g. 90th percentile - 10th percentile
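The correction formulas themselves are missing from this transcript. For the simplest case named earlier (one variable, unbiased additive error), the usual regression-calibration correction divides the uncorrected coefficient by the reliability coefficient and obtains the variance by the delta method. The sketch below follows that textbook form as an assumption. The argument names are hypothetical, and `delta` stands for the comparison increment.

```python
import math

def correct_for_measurement_error(beta, se_beta, rho, se_rho, delta, z=1.96):
    """Single-variable correction for unbiased, additive measurement error.

    beta, se_beta: uncorrected logistic coefficient and its standard error
    rho, se_rho:   reliability coefficient and its standard error
    delta:         increment for the odds ratio (e.g. 90th - 10th percentile)
    Returns (corrected_beta, corrected_se, (OR, lower, upper)).
    """
    beta_c = beta / rho
    # Delta-method variance: the first term carries over the main-study
    # uncertainty; the second reflects uncertainty in the reliability estimate.
    var_c = se_beta ** 2 / rho ** 2 + beta ** 2 * se_rho ** 2 / rho ** 4
    se_c = math.sqrt(var_c)
    odds_ratio = math.exp(beta_c * delta)
    lower = math.exp((beta_c - z * se_c) * delta)
    upper = math.exp((beta_c + z * se_c) * delta)
    return beta_c, se_c, (odds_ratio, lower, upper)
```

As a rough check against the deck's own numbers, dividing ln(2.21) by the cholesterol reliability of 0.75 gives an odds ratio of roughly 2.9. That is close to the 2.91 shown for CHOL on the results slide, although the published analysis corrected several variables jointly rather than one at a time.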

Slide 17

10-year cumulative incidence of CHD (163 events / 1731 men)

Odds ratios (95% CI)

Variable  Increment    Uncorrected          Corrected
CHOL      100 mg/dl    2.21 (1.43, 3.39)    2.91 (1.62, 5.24)
GLUC      34 mg/dl     1.27 (0.97, 1.66)    1.75 (0.87, 3.52)
BMI       9.7 kg/m2    1.64 (1.04, 2.58)    1.49 (0.92, 2.43)
SBP       49 mmHg      2.80 (1.85, 4.24)    3.93 (2.19, 7.05)
SMOKE     30 cig/day   1.70 (1.17, 2.47)    1.69 (1.16, 2.47)
AGE                    2.05 (1.27, 3.33)    1.89 (1.16, 3.07)
AGE                    3.21 (1.95, 5.29)    2.85 (1.72, 4.74)
AGE                    4.30 (2.06, 8.98)    3.73 (1.67, 8.35)



Slide 18

General framework for estimation and inference in failure time regression models

  • Main study/validation study designs

The data:

(Di, Ti, Xi, Vi), i = 1, . . ., n1 main study subjects

(Di, Ti, xi, Xi, Vi), i = n1 + 1, . . ., n1 + n2 validation study subjects


Ti = survival time

Di = 1 if case at Ti, 0 otherwise

xi = perfect exposure measurement

Xi = surrogate exposure measurement for x

Vi = other perfectly measured covariate data

- assume sampling into validation study is at random

Spiegelman and Logan, submitted

Slide 21

Effect of radon exposure on lung cancer mortality rates:

UNM uranium miners

Mortality RR (95% CI)

               Estimate (SE)    100 WLM           500 WLM
Uncorrected    3.52 (0.658)     1.4 (1.3, 1.6)    5.8 (3.1, 11)
EPL            5.00 (1.00)      1.7 (1.4, 2.0)    12 (4.6, 32)

  • > 30% attenuation in

  • policy implications for
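The rate ratios in the table can be approximately recovered from the coefficient and standard error in the first numeric column. The sketch below assumes, as an inference from the numbers rather than anything stated on the slide, that the coefficient is a log rate ratio per 1000 WLM and that the intervals are Wald intervals. `rate_ratio` is a hypothetical helper.

```python
import math

def rate_ratio(beta, se, increment_wlm, z=1.96):
    """Rate ratio and Wald 95% CI for a given exposure increment.

    Assumes beta is a log rate ratio per 1000 WLM (an inference, see above).
    Returns (RR, lower, upper).
    """
    scale = increment_wlm / 1000.0
    return tuple(math.exp(b * scale)
                 for b in (beta, beta - z * se, beta + z * se))

for label, beta, se in [("Uncorrected", 3.52, 0.658), ("EPL", 5.00, 1.00)]:
    for increment in (100, 500):
        rr, lower, upper = rate_ratio(beta, se, increment)
        print(f"{label:12s} {increment:4d} WLM  {rr:.2f} ({lower:.2f}, {upper:.2f})")
```

The printed values agree with the table to within rounding, which suggests the table entries were computed from unrounded estimates.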

Slide 22

Nutritional epidemiology:

Tworoger SS, Eliassen AH, Rosner B, Sluss P, Hankinson SE. Plasma prolactin concentrations and risk of premenopausal breast cancer. In press, Cancer Research, 2004.

Hankinson SE, Willett WC, Michaud DS, Manson JE, Colditz GA, Longcope C, Rosner B, Speizer FE. Plasma prolactin levels and subsequent risk of breast cancer in postmenopausal women. Journal of the National Cancer Institute 1999; 91:629-634.

Smith-Warner SA, Spiegelman D, Adami H, Beeson L, van den Brandt P, Folsom A, Fraser G, Freudenheim J, Goldbohm R, Graham S, Kushi L, Miller A, Rohan T, Speizer FE, Toniolo P, Willett WC, Wolk A, Zeleniuch-Jacquotte A, Hunter DJ. Types of dietary fat and breast cancer: a pooled analysis of cohort studies. International Journal of Cancer 2001; 92:767-774.

Holmes MD, Stampfer MJ, Wolf AM, Jones CP, Spiegelman D, Manson JE, Colditz GA. Can behavioral risk factors explain the difference in body mass index between African-American and European-American women? Ethnicity and Disease 1999; 8:331-339.

Rich-Edwards JW, Hu F, Michels K, Stampfer MJ, Manson JE, Rosner B, Willett WC. Breastfeeding in infancy and risk of cardiovascular disease in adult women. In press, Epidemiology, 2004.

Koh-Banerjee P, Chu NF, Spiegelman D, Rosner B, Colditz GA, Willett WC, Rimm EB. Prospective study of the association of changes in dietary intake, physical activity, alcohol consumption, and smoking with 9-year gain in waist circumference among 15,587 men. Am J Clin Nutr 2003; 78:719-727.

Koh-Banerjee P, Franz M, Sampson L, Liu S, Jacobs Jr. DR, Spiegelman D, Willett WC, Rimm EB. Changes in whole grain, bran and cereal fiber consumption in relation to 8-year weight gain among men. In press, Am J Clin Nutr, 2004.

Slide 23

Environmental epidemiology

Keshaviah AP, Weller EA, Spiegelman D. Occupational exposure to methyl tertiary-butyl ether in relation to key health symptom prevalence: the effect of measurement error correction. Environmetrics, 2002; 14:573-582.

Thurston SW, Williams P, Hauser R, Hu H, Hernandez-Avila M, Spiegelman D. A comparison of regression calibration methods for measurement error in main study/internal validation study designs. In press, Journal of Statistical Planning and Inference, 2004.

Fetal lead exposure in relation to birth weight; MS/IVS; bone lead vs. cord lead (r=0.19)

Weller EA, Milton DK, Eisen EA, Spiegelman D. Regression calibration for logistic regression with multiple surrogates for one exposure. Submitted for publication, 2004.

Metal working fluids exposure in relation to lung function; MS/EVS; job characteristics vs. personal monitors (r=0.82)

Horick N, Milton DK, Gold D, Weller E, Spiegelman D. Household dust endotoxin exposure and respiratory effects in infants: correction for measurement error bias. In preparation.

Li R, Weller EA, Dockery DW, Neas LM, Spiegelman D. Association of indoor nitrogen dioxide with respiratory symptoms in children: the effect of measurement error correction with multiple surrogates. In preparation.

Slide 24

SOFTWARE IS AVAILABLE!

  • http:/

SAS macros for regression calibration (Rosner et al., AJE, 1990, 1992; Spiegelman et al., AJCN, 1997; Spiegelman et al, SIM, 2001)

in main study/validation study designs

  • STATA (Carroll et al. SIMEX, regression calibration)

So why are methods under-utilized?

No validation data

Insufficient training of statisticians & epidemiologists

Either/or about assumptions

Slide 25

Quantitative correction for selection bias:

main study/‘selection’ study ML

large overlap w/ missing data literature: when D is missing, potential for selection bias


Little & Rubin, Wiley, 1986 Scharfstein et al., 1998 Rotnitzky et al., 1997 Robins et al., 1995



Slide 26

Basic idea:

Let I=1 if selected, 0 otherwise,

Pr (I | E, C) = selection probability

Selection study has data on those not in main study: (Di, Ei, Ci = (Ci, Ui)), i = 1, …, n2

Ui: surrogates for D, risk factors for D

Mail, phone, house visit to get data

IPW: \( W_i = \Pr(I_i = 1 \mid D_i, E_i, C_i)^{-1} \)

Use PROC GENMOD w/ robust variance + weights Wi; i=1, …, n1
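The weighting idea can be shown without SAS. The Python sketch below is a simplified stand-in, not the PROC GENMOD analysis: it assumes selection status is known for everyone in the target population and estimates selection probabilities as simple proportions within strata. The data and function names are hypothetical.

```python
from collections import defaultdict

def ipw_weights(records):
    """Inverse-probability-of-selection weights by stratum.

    records: iterable of (stratum, selected) pairs, one per person in the
    target population; stratum is any hashable label, e.g. (d, e, c).
    Returns {stratum: 1 / Pr(selected | stratum)} for strata that have at
    least one selected person.
    """
    totals = defaultdict(int)
    chosen = defaultdict(int)
    for stratum, selected in records:
        totals[stratum] += 1
        chosen[stratum] += int(selected)
    return {s: totals[s] / chosen[s] for s in totals if chosen[s] > 0}

def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical population: 10 exposed (4 selected), 10 unexposed (8 selected).
population = ([("exposed", True)] * 4 + [("exposed", False)] * 6 +
              [("unexposed", True)] * 8 + [("unexposed", False)] * 2)
w = ipw_weights(population)
selected = [s for s, chosen_flag in population if chosen_flag]
exposed_flag = [1 if s == "exposed" else 0 for s in selected]
naive = sum(exposed_flag) / len(exposed_flag)
adjusted = weighted_mean(exposed_flag, [w[s] for s in selected])
```

In this toy population the unweighted exposure prevalence among the selected is 4/12, while the weighted estimate recovers the population value of 0.5. A real analysis would model the selection probabilities and use a robust variance, as the slide indicates.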


For dependent censoring, (a.k.a. biased loss to follow-up)


Slide 27

CONCLUSIONS


Methods EXIST for efficient study design and valid data analysis when standard design with standard analysis gives the wrong answer


Why do epidemiologists routinely adjust for one source of bias only?

(confounding by measured risk factors)


Barriers to utilization

  • software gaps

  • software unfriendly, no QC

  • inadequate training of students + practitioners (Epi & Biostat)

  • are two-stage designs fundable @ NIH?