EPI-820 Evidence-Based Medicine

LECTURE 7: CLINICAL STATISTICAL INFERENCE
Mat Reeves BVSc, PhD

Objectives
• Understand the theoretical underpinnings of, and the flaws associated with, the current approach to clinical statistical testing (the frequentist approach).
• Understand the difference between testing and estimation.
• Understand the advantages of the CI and of CI functions.
• Understand the logic of a Bayesian approach.

Personal Statistical History…
• Post-DVM:
  - Clueless. Sceptical of the role of statistics
  - Thinks research = the search for P < 0.05
• PhD era:
  - Increasing obsession with statistical methods
  - Lots of tools! SLR, ANOVA, MLR, LR, LL & Cox
  - Thinks statistics = “real science”
• Post-PhD:
  - Healthy scepticism for the way stats are used
  - Stats = methods which have inherent limitations
  - Not a substitute for clear scientific thought or for understanding the “scientific method”

Review of Significance Tests
Substantive hypothesis: Cows on BST will tend to gain weight
Null hypothesis (Ho): the mean body wt. of cows treated with BST is not different from the mean body wt. of control cows (μx = μy)
Alternative hypothesis (Ha): the mean body wt. of cows treated with BST is different from the mean body wt. of control cows (μx ≠ μy)
- Logically, if Ho is refuted, Ha is confirmed
- The investigator seeks to 'nullify' Ho

Review of Significance Tests
Expt: 20 cows randomized to BST (X) and control (Y). Measure wt. gain. Calculate mean wt. change per group.

Review of Significance Tests
Assumptions:
i) The sample statistic (X - Y) is one instance of an infinitely large number of sample statistics obtained from an infinite number of replications of the expt. under the same conditions (frequentist assumption)
ii) Populations are normally distributed, with equal variance
iii) Ho is true

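Review of Significance Tests (t-test)

$$ t = \frac{\bar{X} - \bar{Y}}{SE_{\bar{X} - \bar{Y}}}, \qquad SE_{\bar{X} - \bar{Y}} = \sqrt{S^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}, \qquad df = (n_1 - 1) + (n_2 - 1) $$

Where:
$SE_{\bar{X} - \bar{Y}}$ = standard error of the difference between two independent means.
$S^2$ = estimate of pooled population variance
[Figure: sampling distribution of the test statistic, N (0, 1)]
- t may take on any value; no value is logically inconsistent with Ho! Smaller t values are more consistent with Ho being true.
- All else equal, larger n's increase the value of t (higher power).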
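The pooled two-sample t statistic described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the weight gains below are hypothetical numbers (not data from the lecture), the function name `pooled_t` is invented for this example, and the decision uses the tabulated two-sided 5% critical value of t with 18 df (2.101) rather than an exact P value.

```python
import math

def pooled_t(x, y):
    """Return (t, df) for the equal-variance two-sample t-test."""
    n1, n2 = len(x), len(y)
    mean_x = sum(x) / n1
    mean_y = sum(y) / n2
    # Sums of squared deviations within each group
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    df = (n1 - 1) + (n2 - 1)
    pooled_var = (ss_x + ss_y) / df                       # S^2
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))   # SE of the difference
    return (mean_x - mean_y) / se_diff, df

# Hypothetical weight gains (kg), 10 cows per group; NOT data from the lecture
bst = [32, 28, 35, 30, 27, 33, 31, 29, 36, 34]
control = [27, 25, 31, 26, 24, 30, 28, 22, 29, 28]

t, df = pooled_t(bst, control)
T_CRIT_DF18 = 2.101  # two-sided 5% critical value of t with 18 df
reject_ho = abs(t) > T_CRIT_DF18
print(f"t = {t:.2f}, df = {df}, reject Ho at the 5% level: {reject_ho}")
```

Note that the same mean difference with a larger pooled variance, or with fewer cows per group, would give a smaller t: the statistic reflects both effect size and precision.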

Review of Significance Tests
Large values of t indicate either:
i) the test assumptions are true and a rare event has occurred, or
ii) one of the assumptions of the test is false; by convention it is assumed that Ho is not true.
- By convention, the relative frequency of t at which we choose (ii) above as the logical conclusion is set to 5% (alpha level or significance level)
- Expt: t = 2.55, p = 0.02, reject Ho - result is significant

Review of Significance Tests
- Type I error (alpha) occurs 5% of the time when Ho is true
- Type II error (beta) occurs β% of the time when Ho is false
- Alpha and beta are inversely related
- Fixing alpha at 5% means Sp is 95%
- Beta is not set 'a priori', hence Se (power) tends to be low
- Scientific caution dictates that we set alpha small
- Scientific ignorance dictates that we ignore beta!

[Figure: Alpha and beta are inversely related]

Relationship between diagnostic test result and disease status

                            DISEASE
                      PRESENT (D+)       ABSENT (D-)
TEST  POSITIVE (T+)   a (TP)             b (FP)             PVP = a / (a + b)
      NEGATIVE (T-)   c (FN)             d (TN)             PVN = d / (c + d)

                      Se = a / (a + c)   Sp = d / (b + d)

Se = P(T+|D+)   Sp = P(T-|D-)

Relationship between significance test results and truth

                            TRUTH
                      Ho False           Ho True
SIGNF.  REJECT Ho     TP (1 - β)         FP, Type I (α)     PVP = TP / (TP + FP)
TEST    ACCEPT Ho     FN, Type II (β)    TN (1 - α)         PVN = TN / (TN + FN)

Se = TP / (TP + FN) = Power (1 - β)   Sp = TN / (TN + FP)

Power
- Probability of rejecting Ho when Ho is false
- Se = TP/(TP + FN) or (1 - β)
- Power is a function of:
i) Alpha (power increases when Ha is made one-sided, i.e., μx > μy, which is consistent with changing the cut-off value)
ii) Reliability (as measured by the SE of the difference)
   - Power increases with decreasing SE
   - SE decreases with increasing sample size (= decreased variance)
iii) Size of the treatment effect
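The three determinants of power listed above can be illustrated with a normal-approximation sketch. It assumes the test statistic is approximately normal with a known SE of the difference, and it counts only the rejection region in the direction of the true effect. The function name `approx_power` and all numbers are hypothetical choices for illustration.

```python
from statistics import NormalDist

def approx_power(effect, se, alpha=0.05, one_sided=False):
    """Approximate power of a z-test to detect a true difference `effect`.

    `se` is the standard error of the difference. Only the rejection region
    in the direction of the true effect is counted.
    """
    std_normal = NormalDist()
    tail = alpha if one_sided else alpha / 2
    z_crit = std_normal.inv_cdf(1 - tail)   # cut-off value for the test statistic
    return 1 - std_normal.cdf(z_crit - effect / se)

# Hypothetical numbers chosen so the baseline power is about 50%
power_two_sided = approx_power(1.96, se=1.0)
power_one_sided = approx_power(1.96, se=1.0, one_sided=True)  # i) cut-off moved
power_smaller_se = approx_power(1.96, se=0.5)                 # ii) SE halved
power_bigger_effect = approx_power(3.0, se=1.0)               # iii) larger effect

for label, value in [
    ("two-sided baseline", power_two_sided),
    ("one-sided Ha", power_one_sided),
    ("SE halved", power_smaller_se),
    ("larger effect", power_bigger_effect),
]:
    print(f"{label}: power = {value:.2f}")
```

Each of the three changes raises power relative to the baseline, which is the pattern the slide describes.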

The Consequences of Low Power
i) Negative results are difficult to interpret:
   - truly no effect, or
   - expt unable to detect a true difference
ii) Increased proportion of Type I errors in the literature
iii) Failure to identify many important associations
iv) Low power means low precision (indicated by the confidence interval)

Questions?
• What proportion of statistically significant findings published in the literature are false positive (Type I) errors?
• Which well-known measure is this proportion, and what elements does this figure therefore depend on?

Hypothetical outcomes of 500 experiments, α = 0.05, Power = 0.50, and 20% prevalence of false Ho's

                            TRUTH
                      Ho FALSE    Ho TRUE
SIGNF.  REJECT Ho        50          20        PV+ = 50/70 = 71%
TEST    ACCEPT Ho        50         380
        Total           100         400        N = 500

Se = 50%   Sp = 95%
If all signf. results are published, 29% are Type I errors
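The arithmetic behind this table can be reproduced with a short sketch that treats the significance test like a diagnostic test. The function name and the second scenario (a 5% prevalence of false Ho's) are hypothetical additions for illustration; only the first call uses the figures from the slide.

```python
def significant_result_table(n_experiments, prev_false_ho, alpha, power):
    """Expected 2x2 counts and the predictive value of a significant result."""
    n_ho_false = n_experiments * prev_false_ho
    n_ho_true = n_experiments - n_ho_false
    tp = power * n_ho_false    # Ho false and rejected
    fn = n_ho_false - tp       # Ho false but accepted (Type II errors)
    fp = alpha * n_ho_true     # Ho true but rejected (Type I errors)
    tn = n_ho_true - fp        # Ho true and accepted
    pvp = tp / (tp + fp)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "PVP": pvp, "false_positive_share": 1 - pvp}

# Figures from the slide: 500 experiments, 20% false Ho's, alpha 0.05, power 0.50
slide = significant_result_table(500, 0.20, 0.05, 0.50)

# Hypothetical variation: false Ho's are rarer (5%), everything else unchanged
low_prior = significant_result_table(500, 0.05, 0.05, 0.50)

print(f"slide: PV+ = {slide['PVP']:.0%}, Type I share = {slide['false_positive_share']:.0%}")
print(f"rarer false Ho's: PV+ = {low_prior['PVP']:.0%}")
```

As with any predictive value, the result depends on sensitivity (power), specificity (1 - alpha), and prevalence (the proportion of tested null hypotheses that are actually false).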

The P value
- The probability of obtaining a value of the test statistic (X) at least as large as the one observed, given that Ho is true
• P (>=X | Ho true)
Common Incorrect Interpretations
• It is NOT P (Ho true | Data)!!!
  - We can never state the probability of a hypothesis being true (under the frequentist approach)!
• It is NOT the probability that the results were due to chance!

Criticisms of Significance Tests
i) Decision vs Inference (Neyman-Pearson)
- Pioneers of modern statistics were interested in producing results that enabled decisions to be made
- Problem of automatic acceptance or rejection based on an arbitrary cutoff (P = 0.04 vs P = 0.06)
- Results should adjust your degree of belief in a hypothesis rather than forcing you to accept an artificial dichotomy
- "Intellectual economy"

Criticisms of Significance Tests
ii) Asymmetry of significance tests
- Frequently, the experimental data can be found to be consistent with a Ho of no effect or a Ho of a 20% increase
- Acceptance of both Ho's given the data leads to 2 very different conclusions!
- Asymmetry was recognized by Fisher, hence the convention is to identify theory with the Ha but to test the Ho
- "Is there an effect?" is the wrong question! Should ask: "What is the size of the effect?"

Criticisms of Significance Tests
iii) Corroborative power of significance tests
- Both Fisherian and Neyman-Pearson schools make no assumption about the prior probability of Ho
- Both schools presume Ho is almost always false
- Rejection of Ho does nothing to illuminate which of the vast number of Ha's are supported by the data!
- Failing to reject Ho does not prove Ho is true (Popper: 'we can falsify hypotheses but not confirm them')

Criticisms of Significance Tests
iv) Effect size and significance tests
- Test statistics and P values are a function of both effect size and sample size
- Cannot infer the size of an effect by inspection of the P value; reporting P < 0.00001 has no scientific merit!
- Highly significant results may be derived from trivial effects if the sample size is large.
- Confidence intervals give a plausible range for the unknown popl parameter (signf tests show what the parameter is not!)

Relationship between the Size of the Sample and the Size of the P Value
• Example RCT:
  • Intervention: new antibiotic (a/b) for pneumonia
  • Outcome: Recovery rate = % of patients in clinical recovery by 5 days
• Facts:
  • Known = Existing drug of choice results in a 35% recovery rate at 5 days
  • Unknown = New drug improves the recovery rate by 5 percentage points (to 40%)

P values Generated by RCT by Sample Size
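A minimal sketch of how such P values could be generated for the pneumonia example, assuming each trial happens to observe exactly the true recovery rates (35% vs 40%), equal numbers per arm, and a pooled two-proportion z-test. The sample sizes are hypothetical and the function name is invented for this illustration; the values it prints are not the figures from the lecture's table.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(p1, p2, n_per_arm):
    """Two-sided P value from a pooled z-test with equal arm sizes."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical sample sizes per arm; the observed effect is held fixed at 5 points
sample_sizes = (50, 200, 1000, 5000)
p_values = {n: two_proportion_p_value(0.35, 0.40, n) for n in sample_sizes}

for n, p in p_values.items():
    print(f"n per arm = {n:5d}   P = {p:.4g}")
```

The effect size is identical in every row; only the sample size changes, yet the P value moves from clearly non-significant to extremely small.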

Conclusion?
Significance testing should be abandoned and replaced with interval estimation (point estimate and CI)!
Why? Interval estimates:
- are not couched in pseudo-scientific hypothesis-testing language
- carry no decision-making implications
- give a plausible range for the unknown popl parameter
- give a clue as to sample size (width of the CI)
- avoid the danger of inferring a large effect when the result is highly significant

Interval estimation
- View "experimentation" as a measurement exercise
- Want an unbiased, precise measure of effect
- Point estimate: the best estimate of the true effect, given the data (aka MLE); it indicates the magnitude of effect (but is imprecise)
- Confidence intervals indicate the degree of precision of the estimate. They represent the set of all possible values for the parameter that are consistent with the data
- Width of the CI depends on variability and level of confidence (%)

Interval estimation
- 90% CI:
  • 90% of such intervals will include the true unknown popl. parameter (necessary frequentist interpretation)
  • It does not represent a 90% probability that the true unknown popl. parameter lies within this particular interval
- CIs indicate magnitude and precision.
- CIs are linked to alpha and hypothesis testing: (1 - alpha) = 95%

Interval estimation - Example

               OUTCOME
             -        +      Total
TRT A       13        7       20      P(success) = 35%
TRT B        6       14       20      P(success) = 70%

Significance test: P = 0.06 or NS!
Interval estimation of difference: 35% (95% CI = -1%, +71%)
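The contrast between the test and the estimate in this example can be sketched as follows. The slide does not state which methods were used, so both choices here are assumptions: a chi-square test with Yates' continuity correction (which gives a P value close to the reported 0.06) and a simple Wald interval for the difference in proportions. The Wald interval is narrower than the interval quoted on the slide, which presumably came from a more conservative method.

```python
import math

# Counts from the example above: successes out of 20 patients per arm
succ_a, n_a = 7, 20
succ_b, n_b = 14, 20
fail_a, fail_b = n_a - succ_a, n_b - succ_b

p_a, p_b = succ_a / n_a, succ_b / n_b
diff = p_b - p_a   # point estimate of the difference in success rates

# Assumed test: chi-square with Yates' continuity correction, 1 df
n = n_a + n_b
numerator = n * (abs(succ_a * fail_b - fail_a * succ_b) - n / 2) ** 2
denominator = n_a * n_b * (succ_a + succ_b) * (fail_a + fail_b)
chi2 = numerator / denominator
p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df

# Assumed interval: simple Wald 95% CI (not the slide's unstated method)
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:.0%}, P = {p_value:.3f}")
print(f"Wald 95% CI: {ci_low:.0%} to {ci_high:.0%}")
```

Whichever interval method is used, the estimate conveys what "NS" hides: the data are compatible with a very large treatment benefit and are far from reassuring about a lack of effect.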

Confidence Intervals
- CIs are non-uniform: the true parameter is more likely to be located centrally than near the limits. Therefore the precise location of the boundary is irrelevant!
- For a study to be reassuring about a lack of effect, the boundaries of the CI should be near the null value
• CIs have clear advantages over the p-value but still suffer from the necessary frequentist interpretation (a CI represents one member of a family of CIs produced by an infinite number of replications of the same experiment)
- CI functions

Which is the more important study?
[Figure: Study A and Study B; axis labels: null point, larger effect]

Importance of Beta (Type II error) and Sample Size in RCTs (Freiman et al 1978)
• Reviewed 71 “negative” (P > 0.05) RCTs published from 1960-77
• Assume 25% treatment effect:
  • 94% (N = 67) of trials had < 90% power
  • Only 15% (N = 10) had sufficient evidence to conclude no effect
• Assume 50% treatment effect:
  • 70% (N = 50) of trials had < 90% power
  • Only 32% (N = 16) had sufficient evidence to conclude no effect

The P Value Fallacy - Goodman
• Derives from the simultaneous application of the p-value as:
  • A long-run, error-based, deductive tool (Neyman-Pearson frequentist application), and
  • A short-run, evidential and inductive tool (i.e., what is the meaning of this particular result?)
• The p-value was never designed to serve these two conflicting roles

The Bayes Factor - Goodman
• Comparison of how well two hypotheses predict the data:

$$ \text{Bayes factor} = \frac{P(\text{Data} \mid H_o)}{P(\text{Data} \mid H_a)} $$

• Explicitly allows the incorporation of external evidence (in terms of prior probability/belief)
• Use of Bayesian statistics shows that the weight of evidence against the Ho is not as strong as the p-value suggests (Table 2)
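One way to see why the evidence is weaker than the p-value suggests is the minimum Bayes factor for a normally distributed test statistic, exp(-z²/2), which is the smallest Bayes factor (strongest evidence against Ho) that a given z can support. The sketch below assumes a two-sided P value and uses a 50% prior probability of Ho purely as an illustration. The function names are invented for this example, and the output is not a reproduction of the Table 2 cited above.

```python
import math
from statistics import NormalDist

def minimum_bayes_factor(p_value):
    """Smallest Bayes factor (Ho vs best-supported Ha) for a two-sided P value."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return math.exp(-z * z / 2)

def posterior_prob_ho(prior_prob_ho, bayes_factor):
    """Update a prior probability of Ho: posterior odds = prior odds x Bayes factor."""
    prior_odds = prior_prob_ho / (1 - prior_prob_ho)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

mbf_05 = minimum_bayes_factor(0.05)
post_05 = posterior_prob_ho(0.50, mbf_05)   # hypothetical 50% prior for Ho

for p in (0.05, 0.01, 0.001):
    mbf = minimum_bayes_factor(p)
    post = posterior_prob_ho(0.50, mbf)
    print(f"P = {p}: minimum Bayes factor = {mbf:.3f}, P(Ho | data) >= {post:.3f}")
```

Under these assumptions, a result with P = 0.05 leaves Ho with a posterior probability of at least roughly 13% when the prior was 50%, which is considerably weaker evidence than "P = 0.05" is often taken to imply.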
