Statistical Inference II: Pitfalls of hypothesis testing; confidence intervals/effect sizes

Pitfall 1: over-emphasis on p-values

  • Statistical significance does not guarantee clinical significance.

  • Example: a study of about 60,000 heart attack patients found that those admitted to the hospital on weekdays had a significantly longer hospital stay than those admitted to the hospital on weekends (p<.03), but the magnitude of the difference was too small to be important: 7.4 days (weekday admits) vs. 7.2 days (weekend admits).

Ref: Kostis et al. N Engl J Med 2007;356:1099-109.


Pitfall 1: over-emphasis on p-values

Clinically unimportant effects may be statistically significant if a study is large (and therefore, has a small standard error and extreme precision).

Pay attention to effect sizes and confidence intervals (see end of this lecture).
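As a rough illustration of how sample size alone can make a small difference "significant", the sketch below holds the 0.2-day difference from the example fixed and varies the number of patients per group. The standard deviation of 10 days is a hypothetical value chosen for illustration (the slides do not report one), and the p-value uses a simple normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

DIFF = 0.2       # days: 7.4 vs. 7.2, from the example
ASSUMED_SD = 10  # days: hypothetical, NOT reported in the slides

# Same effect size, increasingly large studies
p_by_n = {n: two_sided_p(DIFF, ASSUMED_SD, n) for n in (300, 3_000, 30_000)}
for n, p in p_by_n.items():
    print(f"n per group = {n:>6}: p = {p:.3f}")
```

Under these assumptions the effect size never changes; only the standard error shrinks as the groups get larger, and the p-value falls with it.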


Pitfall 2: association does not equal causation

  • Statistical significance does not imply a cause-effect relationship.

  • Interpret results in the context of the study design.


Pitfall 3: data dredging/multiple testing

  • In 1980, researchers at Duke randomized 1073 heart disease patients into two groups, but treated the groups equally.

  • Not surprisingly, there was no difference in survival.

  • Then they divided the patients into 18 subgroups based on prognostic factors.

  • In a subgroup of 397 patients (with three-vessel disease and an abnormal left ventricular contraction), survival of those in “group 1” was significantly different from survival of those in “group 2” (p<.025).

  • How could this be since there was no treatment?

(Lee et al. “Clinical judgment and statistics: lessons from a simulated randomized trial in coronary artery disease,” Circulation, 61: 508-515, 1980.)


Pitfall 3: multiple testing

  • The difference resulted from the combined effect of small imbalances in the subgroups.


Multiple testing

  • A significance level of 0.05 means that your false positive rate for one test is 5%.

  • If you run more than one test, your false positive rate will be higher than 5%.


Pitfall 3: multiple testing

  • If we compare survival of “treatment” and “control” within each of 18 subgroups, that’s 18 comparisons.

  • If these comparisons were independent, the chance of at least one false positive would be…


Multiple testing

With 18 independent comparisons, we have 60% chance of at least 1 false positive.


Multiple testing

With 18 independent comparisons, we expect about 1 false positive.
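The two figures above follow from elementary probability, assuming (as the slides do) that the 18 comparisons are independent and each is tested at a significance level of 0.05. A minimal check:

```python
ALPHA = 0.05   # significance level for each individual test
N_TESTS = 18   # number of subgroup comparisons

# P(at least one false positive) = 1 - P(no false positives in any test)
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

# Expected number of false positives across all tests
expected_false_positives = ALPHA * N_TESTS

print(f"P(at least one false positive) = {p_at_least_one:.2f}")
print(f"Expected false positives       = {expected_false_positives:.1f}")
```

If the comparisons are correlated rather than independent, the exact probability differs, but it still exceeds 5%.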


Sources of multiple testing


Results from class survey…

  • My research question was to test whether or not being born on odd or even days predicted anything about your future.

  • I discovered that people who were born on odd days wake up later and drink more alcohol than people born on even days; they also show a trend toward doing more homework (p=.04, p<.01, p=.09).

  • Those born on odd days wake up 42 minutes later (7:48 vs. 7:06 am); drink 2.6 more drinks per week (3.7 vs. 1.1); and do 8 more hours of homework (22 hrs/week vs. 14).


Results from class survey…

  • I can see the NEJM article title now…

  • “Being born on odd days predisposes you to alcoholism and laziness, but makes you a better med student.”


Results from class survey…

  • Assuming that this difference can’t be explained by astrology, it’s obviously an artifact!

  • What’s going on?…


Results from class survey…

  • After the odd/even day question, I asked you 25 other questions…

  • I ran 25 statistical tests (comparing the outcome variable between odd-day born people and even-day born people).

  • So, there was a high chance of finding at least one false positive!


P-value distribution for the 25 tests…

[Figure: distribution of the 25 p-values, annotated “My significant p-values!”]

Recall: Under the null hypothesis of no associations (which we’ll assume is true here!), p-values follow a uniform distribution…


Compare with…

Next, I generated 25 “p-values” from a random number generator (uniform distribution). These were the results from two runs…
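The random-number exercise described here can be sketched as follows. This is not the original code behind the slide: the seed and the number of repeated runs are arbitrary choices, and `count_significant` is a hypothetical helper name.

```python
import random

def count_significant(n_tests=25, alpha=0.05, rng=random):
    """Draw n_tests uniform 'p-values' and count how many fall below alpha."""
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)

rng = random.Random(0)  # arbitrary seed so the sketch is repeatable

# Two runs, as on the slide
print("Significant 'p-values' in two runs:",
      count_significant(rng=rng), count_significant(rng=rng))

# Many runs: how often does at least one 'p-value' fall below 0.05?
runs = [count_significant(rng=rng) for _ in range(10_000)]
share_with_any = sum(c > 0 for c in runs) / len(runs)
print(f"Share of runs with at least one 'significant' result: {share_with_any:.2f}")
```

With 25 independent tests the theoretical share is 1 − 0.95^25, roughly 72%, so a handful of "significant" results from pure noise is the expected outcome rather than a surprise.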


In the medical literature…

  • Hypothetical example:

    • Researchers wanted to compare nutrient intakes between women who had fractured and women who had not fractured.

    • They used a food-frequency questionnaire and a food diary to capture food intake.

    • From these two instruments, they calculated daily intakes of all the vitamins, minerals, macronutrients, antioxidants, etc.

    • Then they compared fracturers to non-fracturers on all nutrients from both questionnaires.

    • They found a statistically significant difference in vitamin K between the two groups (p<.05).

    • They had a lovely explanation of the role of vitamin K in injury repair, bone, clotting, etc.


In the medical literature…

  • Hypothetical example:

    • Of course, they found the association only on the FFQ, not the food diary.

    • What’s going on? Almost certainly artifactual (false positive!).


Factors indicative of chance findings

*Sterne JA and Smith GD. Sifting through the evidence—what’s wrong with significance tests? BMJ 2001; 322: 226-31.


Pitfall 4: high type II error (low statistical power)

  • Lack of statistical significance is not proof of the absence of an effect.

  • Example: A study of 36 postmenopausal women failed to find a significant relationship between hormone replacement therapy and prevention of vertebral fracture. The odds ratio and 95% CI were: 0.38 (0.12, 1.19), indicating a potentially meaningful clinical effect. Failure to find an effect may have been due to insufficient statistical power for this endpoint.

Ref: Wimalawansa et al. Am J Med 1998, 104:219-226.


Pitfall 4: high type II error (low statistical power)

Results that are not statistically significant should not be interpreted as “evidence of no effect,” but as “no evidence of effect.”

Studies may miss effects if they are insufficiently powered (lack precision).

Design adequately powered studies and report approximate study power if results are null.


Pitfall 5: the fallacy of comparing statistical significance

  • Presence of statistical significance in one group and lack of statistical significance in another group ≠ a significant difference between the groups.

  • Example: In a placebo-controlled randomized trial of DHA oil for eczema, researchers found a statistically significant improvement in the DHA group but not the placebo group. The abstract reports: “DHA, but not the control treatment, resulted in a significant clinical improvement of atopic eczema.” However, the improvement in the treatment group was not significantly better than the improvement in the placebo group, so this is actually a null result.


Misleading “significance comparisons”

Figure 3 from: Koch C, Dölle S, Metzger M, Rasche C, Jungclas H, Rühl R, Renz H, Worm M. Docosahexaenoic acid (DHA) supplementation in atopic eczema: a randomized, double-blind, controlled trial. Br J Dermatol. 2008 Apr;158(4):786-92. Epub 2008 Jan 30.


Within-group vs. between-group significance

Four hypothetical examples where within-group significance differs between two groups, but the between-group difference is not significant.*

*Within-group p-values are calculated using paired t-tests; between-group p-values are calculated using two-sample t-tests. Bolded inputs differ between the groups.
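To make the fallacy concrete, here is a small sketch with made-up numbers. The mean changes and standard errors are hypothetical, and the p-values use a normal approximation rather than the paired and two-sample t-tests named in the footnote, so the exact values would differ slightly in small samples.

```python
from math import sqrt
from statistics import NormalDist

def p_from_estimate(estimate, se):
    """Two-sided p-value for estimate/se under a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / se)))

# Hypothetical mean improvement (and its standard error) in each group
change_a, se_a = 2.5, 1.0   # e.g. treatment group
change_b, se_b = 1.0, 1.0   # e.g. placebo group

# Within-group tests: is each group's change different from zero?
p_within_a = p_from_estimate(change_a, se_a)
p_within_b = p_from_estimate(change_b, se_b)

# Between-group test: is the DIFFERENCE in changes different from zero?
se_diff = sqrt(se_a ** 2 + se_b ** 2)
p_between = p_from_estimate(change_a - change_b, se_diff)

print(f"within A: p = {p_within_a:.3f}")
print(f"within B: p = {p_within_b:.3f}")
print(f"between:  p = {p_between:.3f}")
```

With these inputs group A is significant on its own and group B is not, yet the between-group comparison, which is the one that answers the trial's question, is not significant.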


Within-group vs. between-group significance

Examples of statistical tests used to evaluate within-group effects versus statistical tests used to evaluate between-group effects


Within-subgroup significance vs. interaction

  • Similarly, presence of statistical significance in one subgroup but not the other ≠ a significant interaction.

  • Interaction example: the effect of a drug differs significantly in different subgroups.


Within-subgroup significance vs. interaction

Rates of biochemically verified prolonged abstinence at 3, 6, and 12 months from a four-arm randomized trial of smoking cessation*

*From Tables 2 and 3: Levine MD, Perkins KS, Kalarchian MA, et al. Bupropion and Cognitive Behavioral Therapy for Weight-Concerned Women Smokers. Arch Intern Med 2010;170:543-550.

**Interaction p-values were newly calculated from logistic regression based on the abstinence rates and sample sizes shown in this table.


Confidence intervals/effect sizes


Confidence intervals give:

  • A plausible range of values for a population parameter.

  • The precision of an estimate. (When sampling variability is high, the confidence interval will be wide to reflect the uncertainty of the observation.)

  • Statistical significance (if the 95% CI does not cross the null value, it is significant at .05).


Confidence Intervals: Estimating the Size of the Effect

(Sample statistic) ± (measure of how confident we want to be) × (standard error)


Common Levels of Confidence

  • Commonly used confidence levels are 90%, 95%, and 99%.

Confidence level    Z value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29
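The Z value for a given confidence level is the standard normal quantile that leaves half of the remaining probability in each tail. A short sketch using Python's standard library (`z_for_confidence` is a helper name introduced here for illustration):

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Z multiplier for a two-sided confidence interval at the given level."""
    tail = (1 - level) / 2              # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.80, 0.90, 0.95, 0.98, 0.99, 0.998, 0.999):
    print(f"{level:.1%}: z = {z_for_confidence(level):.3f}")
```

Printed tables round these multipliers, usually to two or three decimal places.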


The true meaning of a confidence interval

  • A computer simulation:

  • Imagine that the true population value is 10.

  • Have the computer take 50 samples of the same size from the same population and calculate the 95% confidence interval for each sample.

  • Here are the results…


95% Confidence Intervals

For a 95% confidence interval, you can be 95% confident that you captured the true population value.

3 misses out of 50 intervals = 6% error rate
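The simulation described on these slides can be sketched as below. The true value of 10 and the 50 samples come from the slides; the population standard deviation, the sample size, the normal population, and the seed are all hypothetical choices, so the number of misses in any one run will not necessarily be 3.

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_MEAN = 10      # true population value, from the slide
ASSUMED_SD = 2      # hypothetical population SD (not given in the slides)
SAMPLE_SIZE = 100   # hypothetical size of each sample

def count_misses(n_samples, rng):
    """Count 95% intervals (mean ± 1.96 × SE) that fail to contain TRUE_MEAN."""
    misses = 0
    for _ in range(n_samples):
        sample = [rng.gauss(TRUE_MEAN, ASSUMED_SD) for _ in range(SAMPLE_SIZE)]
        centre = mean(sample)
        se = stdev(sample) / sqrt(SAMPLE_SIZE)
        if not (centre - 1.96 * se <= TRUE_MEAN <= centre + 1.96 * se):
            misses += 1
    return misses

rng = random.Random(1)  # arbitrary seed so the run is repeatable
misses_in_50 = count_misses(50, rng)
print(f"{misses_in_50} of 50 intervals missed the true value")
```

Over many repetitions the miss rate settles near 5%; in a single batch of 50 intervals it fluctuates, which is why a 6% error rate in one run is unremarkable.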


Confidence Intervals for antidepressant study

(Sample statistic) ± (measure of how confident we want to be) × (standard error)

95% confidence interval: 10% ± (1.96) × (.033) = 4%-16%

99% confidence interval: 10% ± (2.58) × (.033) = 2%-18%
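The same arithmetic can be reproduced in a few lines. The estimate (10%) and standard error (.033) are taken from the slide; `confidence_interval` is a helper name introduced here, and the Z multipliers are computed rather than rounded.

```python
from statistics import NormalDist

def confidence_interval(estimate, se, level):
    """Return (lower, upper) for estimate ± z × se at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * se, estimate + z * se

ESTIMATE = 0.10   # difference in proportions, from the slide
SE = 0.033        # standard error, from the slide

ci95 = confidence_interval(ESTIMATE, SE, 0.95)
ci99 = confidence_interval(ESTIMATE, SE, 0.99)

print(f"95% CI: {ci95[0]:.1%} to {ci95[1]:.1%}")
print(f"99% CI: {ci99[0]:.1%} to {ci99[1]:.1%}")
```

The unrounded limits are about 3.5% to 16.5% and about 1.5% to 18.5%; the slide reports them as whole percentages.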



Duality with hypothesis tests

[Figure: the 95% confidence interval drawn on an axis of difference in proportions (0% to 11%), with the null value (no difference between cases and controls) marked.]

Null hypothesis: Difference in proportion of cases and controls who used antidepressants is 0%

Alternative hypothesis: Difference in proportion of cases and controls who used antidepressants is not 0%

P-value < .05


Duality with hypothesis tests

[Figure: the 99% confidence interval drawn on an axis of difference in proportions (0% to 11%), with the null value (no difference between cases and controls) marked.]

Null hypothesis: Difference in proportion of cases and controls who used antidepressants is 0%

Alternative hypothesis: Difference in proportion of cases and controls who used antidepressants is not 0%

P-value < .01
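The duality can be checked numerically with the antidepressant figures used earlier in the lecture (estimate 10%, standard error .033). This sketch assumes a normal approximation for the test statistic:

```python
from statistics import NormalDist

ESTIMATE = 0.10    # difference in proportions, from the lecture
SE = 0.033         # standard error, from the lecture
NULL_VALUE = 0.0   # no difference between cases and controls

# Z statistic and two-sided p-value for H0: difference = 0
z = (ESTIMATE - NULL_VALUE) / SE
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The p-value falls below .01, which agrees with the 99% confidence interval excluding the null value of 0%.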


Odds Ratio example: Antidepressant use and Heart Disease

                   Heart disease case    Control
Antidepressants    217                   871
No exposure        716                   4645

  • “Antidepressants as risk factor for ischaemic heart disease: case-control study in primary care”; Hippisley-Cox et al. BMJ 2001; 323; 666-669
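A minimal sketch of the odds ratio calculation from this 2×2 table. The cell counts are from the slide; the confidence interval uses the standard log-odds (Woolf) approximation, which is an assumption made here. It is an unadjusted calculation and need not reproduce an interval published from a different model.

```python
from math import exp, log, sqrt

# Cell counts from the 2x2 table
a, b = 217, 871    # antidepressant users: cases, controls
c, d = 716, 4645   # no exposure: cases, controls

# Odds ratio: odds of exposure among cases / odds of exposure among controls
odds_ratio = (a * d) / (b * c)

# Approximate 95% CI on the log scale (Woolf method, unadjusted)
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lower = exp(log(odds_ratio) - 1.96 * se_log_or)
upper = exp(log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, approximate 95% CI {lower:.2f} to {upper:.2f}")
```

The interval is built on the log scale because the sampling distribution of the log odds ratio is closer to normal than that of the odds ratio itself.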


From Table 2…

Odds ratio (95% CI) for any antidepressant drug ever: 1.62 (1.41 to 1.99)

[Figure: the 95% confidence interval drawn on an odds-ratio axis (0.80 to 2.20), with the null value of the odds ratio (1.0, no difference between cases and controls) marked.]

Is this a statistically significant association? YES

Null hypothesis: Proportion of cases who used antidepressants equals proportion of controls who used antidepressants.

Alternative hypothesis: Proportions are not equal.

P-value < .05


Review Question 1

A 95% confidence interval for a mean:

  • Is wider than a 99% confidence interval.

  • Is wider when the sample size is larger.

  • In repeated samples will include the population mean 95% of the time.

  • Will include 95% of the observations of a sample.


Review Question 1

A 95% confidence interval for a mean:

  • Is wider than a 99% confidence interval.

  • Is wider when the sample size is larger.

  • In repeated samples will include the population mean 95% of the time.

  • Will include 95% of the observations of a sample.

Answer: In repeated samples will include the population mean 95% of the time.


Review Question 2

Suppose we take a random sample of 100 people, both men and women. We form a 90% confidence interval of the true mean population height. Would we expect that confidence interval to be wider or narrower than if we had done everything the same but sampled only women?

  • Narrower

  • Wider

  • It is impossible to predict


Review Question 2

Suppose we take a random sample of 100 people, both men and women. We form a 90% confidence interval of the true mean population height. Would we expect that confidence interval to be wider or narrower than if we had done everything the same but sampled only women?

  • Narrower

  • Wider

  • It is impossible to predict

Standard deviation of height decreases, so standard error decreases.


Review Question 3

Suppose we take a random sample of 100 people, both men and women. We form a 90% confidence interval of the true mean population height. Would we expect that confidence interval to be wider or narrower than if we had done everything the same except sampled 200 people?

  • Narrower

  • Wider

  • It is impossible to predict


Review Question 3

Suppose we take a random sample of 100 people, both men and women. We form a 90% confidence interval of the true mean population height. Would we expect that confidence interval to be wider or narrower than if we had done everything the same except sampled 200 people?

  • Narrower

  • Wider

  • It is impossible to predict

N increases so standard error decreases.
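The reasoning can be checked with the usual formula for the standard error of a mean, SD/√n. The standard deviation of 10 cm below is a hypothetical value used only to make the arithmetic concrete, and `ci_width` is a helper name introduced here.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(sd, n, level=0.90):
    """Full width of a confidence interval for a mean: 2 × z × sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sd / sqrt(n)

ASSUMED_SD_CM = 10   # hypothetical SD of height; not given in the slides

width_100 = ci_width(ASSUMED_SD_CM, 100)
width_200 = ci_width(ASSUMED_SD_CM, 200)

print(f"n = 100: width = {width_100:.2f} cm")
print(f"n = 200: width = {width_200:.2f} cm")
print(f"ratio = {width_200 / width_100:.3f}")
```

Doubling the sample size shrinks the interval by a factor of 1/√2 (about 0.71), not by half, whatever standard deviation is assumed.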


Homework

  • Reading: continue reading textbook

  • Reading: multiple testing article

  • Problem Set 4

  • Journal Article/article review sheet

