Using and understanding numbers in health news and research Heejung Bang, PhD Department of Public Health Weill Medical College of Cornell University
A rationale for today’s talk • Coffee was bad yesterday, is good today, and will be bad again tomorrow. • “It's the cure of the week or the killer of the week, the danger of the week,” says B. Kramer. • “I've seen so many contradictory studies with coffee that I've come to ignore them all,” says D. Berry. • What to believe? For now, you may just keep drinking coffee.
Hardly a day goes by without a new headline about the supposed health risks or benefits of something… Are these headlines justified? Often, the answer is NO.
R. Peto phrases the nature of the conflict this way: “Epidemiology is so beautiful and provides such an important perspective on human life and death, but an incredible amount of rubbish is published,” by which he means the results of observational studies that appear daily in the news media and often become the basis of public-health recommendations about what we should or should not do.
3 major reasons for coffee-like situations • Confounding • Multiple testing • Faulty design/sample selection
Topics to be covered today • Numbers in press release • Lies, Damned Lies & Statistics • Association vs. Causation • Experiment (e.g., RCT) vs. Observational study • Replicate or Perish • Hierarchy of evidence and study design • Meta-analysis • Multiple testing • Same words, different meanings? • Data sharing • Other Take-Home messages
1. Numbers in press release • No p-values, odds ratios, or hazard ratios in a press release! -- Ask people on the street “What is a p-value?” -- Only we would laugh if I made a statistical joke using 0.05, 1.96, 95%, etc.
What is a p-value? • In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, assuming the null hypothesis is true. -- If there is no hypothesis, there is no test and no p-value. • In current statistical training and practice, statistical testing and p-values are overemphasized. • However, a p-value (a single number between 0 and 1) can be useful for decision making. -- You cannot say “it depends” all the time, although it can be true.
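As a minimal sketch of this definition, assume a hypothetical experiment that is not from the talk: a coin is flipped 20 times and lands heads 15 times, and the null hypothesis is that the coin is fair. The one-sided p-value is then the probability of seeing 15 or more heads if the null hypothesis were true. The function name and the numbers are illustrative assumptions.

```python
from math import comb

def binomial_p_value_upper(n: int, k: int, p0: float = 0.5) -> float:
    """One-sided p-value: P(X >= k) when X ~ Binomial(n, p0) under the null."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 15 heads in 20 flips of a coin assumed fair under H0.
p = binomial_p_value_upper(n=20, k=15)
print(f"one-sided p-value = {p:.4f}")
```

Here the p-value is about 0.02: a result this extreme would be fairly unusual for a fair coin, but the number says nothing about how large or important the bias is.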
Numerator & denominator • Always try to check numerator and denominator (and when, how long) • Try to read footnotes under * -- 100% increase can be 1 → 2 cases -- 20% event rate can be 1 out of 5 samples
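The “100% increase can be 1 → 2 cases” point can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and the population size of 10,000 are assumptions, not figures from the talk.

```python
def describe_change(cases_before: int, cases_after: int, population: int) -> dict:
    """Express the same change as a relative increase and as an absolute increase in risk."""
    extra_cases = cases_after - cases_before
    return {
        "relative_increase": extra_cases / cases_before,  # what the headline reports
        "absolute_increase": extra_cases / population,    # change in risk per person
    }

# Hypothetical numbers: 1 case becomes 2 cases among 10,000 people.
change = describe_change(cases_before=1, cases_after=2, population=10_000)
print(f"relative increase: {change['relative_increase']:.0%}")
print(f"absolute increase: {change['absolute_increase']:.2%}")
```

The relative increase is 100%, while the absolute risk rises by only 0.01 percentage points. Both statements are true; only the second one needs the denominator.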
Large Number myths • With large N, one is more likely to find a difference when a difference truly exists: the notion of statistical power. • However, many fundamental problems (e.g., bias, confounding and wrong sample selection) CANNOT be cured by large N. (more later) • Combining multiple incorrect stories can create more serious problems than reporting a single incorrect story. (more later in meta-analysis) • N>200,000 is needed to detect a 20% reduction in mortality (Mann, Science 1990) • Comparing means (e.g., with a t-test) can be very misleading because, with large N, even a trivially small difference becomes statistically significant -- Perhaps, for DNA and race, Watson should see the entire distribution or SD!
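The last point can be illustrated with a rough sketch. It assumes a two-sample z-test with a known common standard deviation (a simplification of the t-test mentioned above) and a hypothetical difference of 1% of a standard deviation between two group means. Only the sample size changes between the rows.

```python
from math import erfc, sqrt

def two_sample_z_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

# Hypothetical effect: the two means differ by 0.01 standard deviations.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_p_value(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The difference is equally negligible in every row, yet with a million subjects per group it is “highly significant”. Significance here reflects N, not importance.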
2. Lies, damned lies & statistics • There are three kinds of lies --B Disraeli & M Twain --- Title speaks for itself • “J Robins makes statistics tell the truth: Numbers in the service of health” (Harvard Gazette interview) • If numbers/statistics are properly generated and used, they can be the best piece of empirical evidence. --- some empirical evidence is almost always good to have --- it is hard to fight with numbers (and age)!
Some Advice • Having no statistics is better than having bad statistics. • Just present your data (e.g., N=3) when statistics are not necessary. • Descriptive statistics vs. inferential statistics • If you use the wrong statistics, you can end up in the news. See ‘Statistical flaw trips up study of bad stats’, Nature 2006.
3. Association vs. Causation • The #1 error in health news: treating association as causation • In 1748, D. Hume stated ‘we may define a cause to be an object followed by another… where, if the first object had not been, the second never had existed.’ --- This is a true cause! A more profound quote from Hume is ‘All arguments concerning existence are founded on the relation of cause and effect.’
Misuses and abuses of “causes” • You should avoid the words ‘cause’, ‘responsible’, ‘influence’, ‘impact’ or ‘effect’ in your paper or press release (especially in the title) if the results come from observational studies. Use ‘association’ or ‘correlation’ instead. • Often, adding “may/might” is not enough. • The media misuse this and the public misunderstands it severely --- Every morning, we hear that new causes of some disease have been found.
50% risk reduction, 20% risk reduction, and so on. If you add them up, by now all causes of cancer (& many other diseases) should have been identified. • Almost all are associations, not causation. -- There are exceedingly many associated and correlated factors compared to true causes. -- A survey listed 246 suggested coronary risk factors (Hopkins & Williams 1981). -- I believe cancer has >1000 risk factors. A list of too many ‘don’t do’s is no better than ‘do anything’.
Why Association ≠ Causation? Confounders • aka third variable(s) • The biggest threat to any observational study. • Definition of ‘confound’: vt. Throw (things) into disorder; mix up; confuse. (Oxford Dictionary) • However, confounders CANNOT be defined in terms of statistical notions alone (Pearl)
Confounder samplers • Grey hair vs. heart attack • Stork vs. birth rate • Rock & Roll vs. HIV • Eating late & weight gain? • Drinking (or match-carrying) & lung cancer • No father’s name & infant mortality • Long legs & skin cancer • Vitamins/HRT, too? Any remedy? -- The first thing to do is ‘use common sense’: think about any other (hidden) factor or alternative explanation.
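The grey hair vs. heart attack example can be sketched with made-up numbers. In the hypothetical model below, age drives both the chance of having grey hair and the risk of a heart attack, and grey hair itself has no effect at all. Every probability and name in the code is an assumption chosen for illustration. The calculation uses expected counts, so no random simulation is involved.

```python
# Hypothetical model: age drives both grey hair and heart attack risk;
# grey hair itself has NO effect on heart attacks.
AGES = range(20, 81)
PEOPLE_PER_AGE = 1000

def p_grey(age: int) -> float:
    return (age - 20) / 60  # assumed: 0 at age 20, rising to 1 at age 80

def p_heart_attack(age: int) -> float:
    return 0.01 + 0.12 * (age - 20) / 60  # assumed: depends on age only

def risk_ratio(ages) -> float:
    """Risk ratio (grey vs. not grey) from expected counts over the given ages."""
    grey_n = sum(PEOPLE_PER_AGE * p_grey(a) for a in ages)
    plain_n = sum(PEOPLE_PER_AGE * (1 - p_grey(a)) for a in ages)
    grey_cases = sum(PEOPLE_PER_AGE * p_grey(a) * p_heart_attack(a) for a in ages)
    plain_cases = sum(PEOPLE_PER_AGE * (1 - p_grey(a)) * p_heart_attack(a) for a in ages)
    return (grey_cases / grey_n) / (plain_cases / plain_n)

crude_rr = risk_ratio(AGES)   # ignores age
rr_at_50 = risk_ratio([50])   # compares people of the same age
print(f"crude risk ratio: {crude_rr:.2f}")
print(f"risk ratio among 50-year-olds: {rr_at_50:.2f}")
```

Ignoring age, grey-haired people appear to have nearly twice the risk. Among people of the same age the ratio is 1, because the whole association came from the third variable.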
Common sense & serendipity Common sense is the basis for most of the ideas for designing scientific investigations. --- M Davidian although we should not ignore the importance of serendipity in science
By the way, why are ‘causes’ so important? • If causes can be removed, susceptibility ceases to matter (Rose 1985) and the outcome will not occur. Neither associated nor correlated factors have this power. • Fortunately, some efforts have been made: ‘Distinguishing Association from Causation: A Backgrounder for Journalists’ from the American Council on Science and Health
Greenland’s Dictum (Science 1995) There is nothing sinful about going out and getting evidence, like asking people how much do you drink and checking breast cancer records. There’s nothing sinful about seeing if that evidence correlates. There’s nothing sinful about checking for confounding variables. The sin comes in believing a causal hypothesis is true because your study came up with a positive result, or believing the opposite because your study was negative.
Association to causation? In 1965, Hill proposed the following set of causal criteria: • Strength • Consistency • Specificity • Temporality (i.e., cause before effect) • Biological gradient • Plausibility • Coherence • Experiment • Analogy However, Hill also said “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.”
Another big problem: bias and faulty design/samples • Selection bias: the distortion of a statistical analysis due to the method of collecting samples. • The easiest way to cheat (intentionally or unintentionally) -- Make group 1 : group 2 = healthy people : sick people. -- Oftentimes, treatment looks bad in observational studies. Why? -- Do a survey among your friends only -- People are different from the beginning?? (e.g., vegetarians vs. meat-lovers, HRT users vs. non-users) • Case-control study & matching: easy to say but hard to do correctly. -- Vitamin C and cancer • For any comparison, FAIRNESS is most important! -- Numerous other biases exist
Would you believe these p-values? (Cameron and Pauling, 1976) This famous study has failed to replicate 16 or so times! Pauling received two Nobel Prizes.
4. Experiment vs. Observational study • Although the arguing from experiments and observations by induction be no demonstration of general conclusions, yet it is the best way of arguing which the nature of things admits of. --- I Newton • Newton’s "experimental philosophy" of science: Science should not, as Descartes argued, be based on fundamental principles discovered by reason, but based on fundamental axioms shown to be true by experiments.
Why are clinical trials important? • The Randomized Controlled Trial (RCT) is the most common form of experiment on humans. • ‘Average causal effects’ can be estimated from an experiment. -- To know the true effect of treatment within a person, that person would have to be treated and untreated at the same time. • Experimentation trumps observation. (Power of the coin flip! Confounders are balanced across groups on average.) • Very difficult to cheat in RCTs (due to randomization and protocol). • “Causality: God knows but humans need a time machine. When God is busy and no time machine is available, an RCT would do.”
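The “power of the coin flip” can be sketched with a small simulation. It assumes a hypothetical cohort of 10,000 people aged 20 to 80 and compares two ways of deciding who gets treated: self-selection, where older people are more likely to take the treatment, and a fair coin. The cohort, the selection rule, and the seed are all illustrative assumptions.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
ages = [rng.uniform(20, 80) for _ in range(10_000)]  # hypothetical cohort

def mean(values):
    return sum(values) / len(values)

def age_gap(treated_flags):
    """Mean age of the treated minus mean age of the untreated."""
    treated = [a for a, t in zip(ages, treated_flags) if t]
    untreated = [a for a, t in zip(ages, treated_flags) if not t]
    return mean(treated) - mean(untreated)

# Observational setting (assumed): older people are more likely to take the treatment.
self_selected = [rng.random() < (a - 20) / 60 for a in ages]
# RCT: a fair coin decides, regardless of age.
randomized = [rng.random() < 0.5 for _ in ages]

gap_observational = age_gap(self_selected)
gap_randomized = age_gap(randomized)
print(f"age gap, self-selected: {gap_observational:.1f} years")
print(f"age gap, randomized:    {gap_randomized:.1f} years")
```

Under self-selection the treated group is much older, so any age-related outcome would be blamed on (or credited to) the treatment. Under the coin flip the two groups have nearly the same mean age, and the same holds on average for confounders nobody measured.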
Problems/issues of RCTs • Restrictive settings • Human subjects under experiments • Can be unethical or infeasible • Short terms • 1-2 treatments, 1-2 doses only • Limited generalizability • Other issues: blinding, drop-out, compliance
Problems/issues of observational studies • Bias & confounding • Post-hoc arguments about biological plausibility must be viewed with some skepticism since the human imagination seems capable of developing a rationale for most findings, however unanticipated (Ware 2003). i.e., retrospective rationalization. • We are tempted to ‘Pick & Choose’! • Data-dredging, Fishing expedition, Significance-chasing (p<0.05) • Observational studies can overcome some limitations of RCTs.
Ideal attitudes • RCTs and observational studies should complement each other rather than compete, -- because real life stories can be complicated. • When RCTs and observational studies conflict, generally (not always) go with RCTs. • Even if you conduct an observational study, try to think in an RCT way. (e.g., 1-2 a priori hypotheses, protocol, data analysis plan, ask yourself ‘Is this result likely to replicate in an RCT?’)
Quotes on observational studies • The data are still no more than observational, no matter how sophisticated the analytic methodology – anonymous reviewer • Observational studies are not a substitute for clinical trials no matter how sophisticated the statistical adjustment may seem – D. Freedman • No fancy statistical analysis is better than the quality of the data. Garbage in, garbage out, as they say. So whether the data is good enough to need this level of improvement, only time will tell. – J. Robins Remark: However, advanced statistical techniques, such as causal inference methods, may help.
Some studies are difficult • Diet/alcohol: Type/amount, How to measure? Do you remember what you ate last week? • Exercise/physical activity/SES: Can we measure? Do you tell the truth? -- people tend to say ‘yes’, ‘moderately’ • Long-term cumulative effects • Positive thinking and spirituality? • Quality and value of life: How to define and measure -- priceless?
5. Replicate or perish • Publish or perish: Old era vs. Replicate or perish: New era Replicability of the scientific findings can never be overemphasized. Results being ‘significant’ or ‘predictive’ without being replicable misinform the public and needlessly expend time and resources, and they are no service to investigators and science –S. Young Given that we currently have too many findings, often with low credibility, replication and rigorous evaluation become as important as or even more important than discovery - J. Ioannidis (2006) -- Pay more attention to 2nd study!
Examples of highly cited heart-disease studies that were later contradicted (Ioannidis 2005) -- The Nurses' Health Study, showing a 44% relative risk reduction in coronary disease in women receiving hormone therapy. Later refuted by the Women's Health Initiative, which found that hormone treatment significantly increases the risk of coronary events. -- Two large cohort studies, the Health Professionals Follow-Up Study and the Nurses' Health Study, and an RCT all found that vitamin E was associated with a significantly reduced risk of coronary artery disease. But larger randomized trials subsequently showed no benefit of vitamin E on coronary disease.
More Ioannidis • Ioannidis (2005) serves as a reminder of the perils of small trials, nonrandomized trials, and those using surrogate markers. • He concludes "Evidence from recent trials, no matter how impressive, should be interpreted with caution when only one trial is available. It is important to know whether other similar or larger trials are still ongoing or being planned. Therefore, transparent and thorough trial registration is of paramount importance to limit premature claims [of] efficacy."
More Freedman Modeling, the search for significance, the preference for novelty, and lack of interest in assumptions --- these norms are likely to generate a flood of nonreproducible results
What goes on top? The ANSWER is total evidence. An RCT can provide strong evidence for a causal effect, especially if its findings are replicated by other studies.
When you read the article, you may check the study design • Cross-sectional study: which is first? what is cause and what is effect? e.g., depression vs. obesity • Prospective cohort studies: much better but still not causal • Prospective is generally better than retrospective • RCT is better than non-RCT
7. Meta-analysis • Statistical technique for systematic literature review • There are 3 things you should not watch being made: law, sausage & meta-analysis • No data collection is needed, but nothing is free. • Can you find all studies in the universe, including the ones in researchers’ file drawers? Or at least an unbiased subsample? Can Google or PubMed do it? NO! • Publication bias (favoring positive studies), language bias, etc. • A much bigger problem in observational studies than in RCTs. • Combining multiple incorrect stories is worse than one incorrect story.
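The file-drawer problem can be sketched with a small simulation. It rests on strong, purely illustrative assumptions: the true effect is exactly zero, every study has the same standard error, and only studies with a significantly positive result (z > 1.96) get published. Real publication bias is messier than this selection rule.

```python
import random

rng = random.Random(7)   # fixed seed so the sketch is repeatable
TRUE_EFFECT = 0.0        # assumed: the treatment does nothing
STANDARD_ERROR = 0.2     # assumed: same precision in every study
N_STUDIES = 2000

estimates = [rng.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_STUDIES)]

# Assumed selection rule: only "significant positive" studies reach the journals.
published = [e for e in estimates if e / STANDARD_ERROR > 1.96]

def pooled_mean(values):
    # With equal standard errors, the inverse-variance (fixed-effect)
    # pooled estimate reduces to the simple mean.
    return sum(values) / len(values)

pooled_all = pooled_mean(estimates)
pooled_published = pooled_mean(published)
print(f"studies published: {len(published)} of {N_STUDIES}")
print(f"pooled effect, all studies:       {pooled_all:+.3f}")
print(f"pooled effect, published studies: {pooled_published:+.3f}")
```

Pooling every study recovers an effect near zero. Pooling only the published ones yields a clearly positive effect, built entirely from chance findings. A meta-analysis cannot correct for studies it never sees.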
Funny (real) titles of papers about meta-analysis • Meta-analysis: apples and oranges, or fruitless • Apples and oranges (and pears, oh my!): the search for moderators in meta-analysis • Of apples and oranges, file drawers and garbage: why validity issues in meta-analysis will not go away • Meta analysis/shmeta-analysis • Meta-analysis of clinical trials: a consumer's guide. • Publication bias in situ
8. Multiple testing • Multiple testing/comparisons refers to the testing of more than one hypothesis at a time. • When many hypotheses are tested, and each test has a specified Type I error probability (α), the probability that at least 1 Type I error is committed increases with the number of hypotheses. • Bonferroni method: test each hypothesis at α = 0.05/(# of tests) • A thorny issue for many researchers. -- Bonferroni might be the most hated statistician in history. -- ‘Escaping the Bonferroni iron claw in ecological studies’ by García et al. (2004)
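A short calculation shows how fast the chance of at least one false positive grows, and what the Bonferroni method does about it. The sketch assumes independent tests with every null hypothesis true, which is the simplest case rather than a realistic one. The function names are illustrative.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one Type I error) for n independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni method."""
    return alpha / n_tests

for m in (1, 5, 10, 20, 100):
    unadjusted = familywise_error_rate(0.05, m)
    adjusted = familywise_error_rate(bonferroni_alpha(0.05, m), m)
    print(f"{m:>3} tests: unadjusted = {unadjusted:.3f}, with Bonferroni = {adjusted:.3f}")
```

With 10 tests at α=0.05 each, the chance of at least one false positive is already about 40%. Bonferroni holds it at or below 5%, but the price is a much stricter threshold for each individual test.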
Two errors • Type I (false positive: rejecting H0 when it is true) vs. Type II (false negative: failing to reject H0 when it is false) -- Controlling Type I error is more important in statistics and in court. (e.g., innocent → guilty: disaster!) -- In other fields, Type II error can be more important. • α=p=0.05: is this the law in science? Do you commit only 5% errors in your life? • α=5% seems reasonable for one research question/publication.
Multiple testing in different forms • Subgroup analyses -- You should always do subgroup analyses but never believe them. – R. Peto -- Multiple testing adjustment and cross-validation may be solutions. • Trying different cutpoints (e.g., tertiles, quintiles, etc.) -- A priori chosen cutpoints or multiple testing adjustment can be solutions. • Nothing is free. To look more, you have to pay.
Multiple testing (underlying mechanism) Lottery tickets should not be free. In random and independent events as the lottery, the probability of having a winning number depends on the N of tickets you have purchased. When one evaluates the outcome of a scientific work, attention must be given not only to the potential interest of the ‘significant’ outcomes but also to the N of ‘lottery tickets’ the authors have ‘bought’. Those having many have a much higher chance of ‘winning a lottery prize’ than of getting a meaningful scientific result. It would be unfair not to distinguish between significant results of well-planned, powerful, sharply focused studies, and those from ‘fishing expeditions’, with a much higher probability of catching an old truck tyre than of a really big fish. --- García et al. (2004)
Multiple testing disaster I In the 1970s, every disease was reported to be associated with an HLA allele (schizophrenia, hypertension... you name it!). Researchers did case-control studies with 40 antigens, so there was a very high probability that at least one result was significant. This result was reported without any mention of the fact that it was the most significant of 40 tests. --- R. Elston
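To see what “the most significant of 40 tests” is worth, the sketch below treats the 40 tests as independent with no true association anywhere. This is an assumption made for simplicity; tests on real HLA antigens are correlated. Under it, each p-value is uniform on [0, 1], so the smallest of the 40 has a known distribution.

```python
N_TESTS = 40
ALPHA = 0.05

def prob_min_p_below(x: float, n_tests: int) -> float:
    """P(smallest of n independent null p-values <= x) = 1 - (1 - x)^n."""
    return 1.0 - (1.0 - x) ** n_tests

def median_of_min_p(n_tests: int) -> float:
    """The x with P(min p <= x) = 0.5, i.e. x = 1 - 0.5^(1/n)."""
    return 1.0 - 0.5 ** (1.0 / n_tests)

expected_false_positives = N_TESTS * ALPHA
chance_any_significant = prob_min_p_below(ALPHA, N_TESTS)
typical_best_p = median_of_min_p(N_TESTS)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one p < {ALPHA}: {chance_any_significant:.2f}")
print(f"median of the smallest p-value: {typical_best_p:.3f}")
```

Under these assumptions about two “significant” antigens are expected by chance alone, at least one appears in roughly 87% of such studies, and the best p-value is typically below 0.02. Reporting only that best result hides all of this.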
Multiple testing disaster II Association between reserpine (then a popular antihypertensive) and breast cancer. Shapiro (2004) gave the history. His team published initial results that were extensively covered by the media and had a huge impact on the research community. When the results did not replicate, he confessed that the initial findings were due to chance, arising from thousands of comparisons involving hundreds of outcomes and hundreds of exposures. He hopes that we have learned from his mistake for the future.
Multiple testing disaster III • You are what your mother eats (Mathews et al. 2008). • All over the news and the internet. Over 50,000 Google hits in the 1st week. • Numerous comparisons were conducted. • Sodium, calcium, potassium, etc. were significant (p<0.05), but sodium was dismissed on the grounds that it is hard to measure accurately. -- possible ‘pick and choose’! • Other problems: lack of biological credibility, difficulty of collecting dietary data.
Leaving no trace (Shaffer 2007) Usually these attempts through which the experimenter passed don't leave any traces; the public will only know the result that has been found worth pointing out; and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or can not be attributed to chance.
If you keep testing without controlling α • Everything is Dangerous – S. Young • It is fairly easy to find risk factors for premature morbidity or mortality. Indeed, given a large enough study and enough measured factors and outcomes, almost any potentially interesting variable will be linked to some health outcome – Christenfeld et al. 2004. • Even checking 1000 correlations can be a sin – S. Young The only thing to fear is fear itself… and everything else
Multiple testing adjustment • In RCTs: mandatory (by FDA) -- If not, taking more (interim) looks would eventually lead to the result you want • In genetic/genomic studies: almost mandatory -- think about the # of genes! • In observational studies: almost infeasible Realistic strategies can be: • α=5% for one hypothesis. Adjust for multiple testing or state clearly how many tests/comparisons you conducted. • Think and act in RCT ways.
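Why unadjusted interim looks are a problem can be sketched with a simulation. It assumes trials with no true effect, outcomes with known SD = 1, and the same nominal threshold |z| > 1.96 reused at each of 5 looks. The number of looks, the batch size, and the seed are illustrative assumptions. Actual trials use group-sequential boundaries, which are not shown here.

```python
import random
from math import sqrt

rng = random.Random(11)  # fixed seed so the sketch is repeatable
Z_CRITICAL = 1.96        # nominal two-sided 5% threshold, reused at every look
N_LOOKS = 5              # assumed number of analyses
BATCH_SIZE = 20          # assumed observations added before each look
N_TRIALS = 2000          # simulated trials, all with NO true effect

def trial_rejects(looks: int) -> bool:
    """Simulate one null trial; return True if any look crosses the threshold."""
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(BATCH_SIZE):
            total += rng.gauss(0.0, 1.0)
            n += 1
        z = total / sqrt(n)  # z-statistic for mean = 0 with known SD = 1
        if abs(z) > Z_CRITICAL:
            return True
    return False

rate_single_look = sum(trial_rejects(1) for _ in range(N_TRIALS)) / N_TRIALS
rate_five_looks = sum(trial_rejects(N_LOOKS) for _ in range(N_TRIALS)) / N_TRIALS
print(f"false positive rate, 1 look:  {rate_single_look:.3f}")
print(f"false positive rate, {N_LOOKS} looks: {rate_five_looks:.3f}")
```

With a single look the false positive rate stays near the nominal 5%. Stopping at the first of five looks that crosses 1.96 pushes it well above that, even though nothing is going on in any of the simulated trials.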