Statistical Power in Research: An Overview

Power Winnifred Louis 15 July 2009

Overview of Workshop • Review of the concept of power • Review of antecedents of power • Review of power analyses and effect size calculations • DL and discussion of write-up guide • Intro to G*Power 3 • Examples of G*Power 3 usage

Power • Comes down to a “limitation” of the null hypothesis testing approach and a concern with decision errors • Recall: • Significant differences are defined with reference to a criterion (a controlled/acceptable rate) for committing type-1 errors, typically .05 • the type-1 error: finding a significant difference in the sample when it actually doesn’t exist in the population • type-1 error rate denoted α • However, relatively little attention has been paid to the type-2 error • the type-2 error: finding no significant difference in the sample when there is a difference in the population • type-2 error rate denoted β

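Reality vs Statistical Decisions • Reality H0 true, decision Reject H0: type-1 error (α) • Reality H0 true, decision Retain H0: correct decision (1 − α) • Reality H1 true, decision Reject H0: correct decision (1 − β, power) • Reality H1 true, decision Retain H0: type-2 error (β)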

power • power is: • the probability of correctly rejecting a false null hypothesis • the probability that the study will yield significant results if the research hypothesis is true • the probability of correctly identifying a true alternative hypothesis

sampling distributions • the distribution of a statistic that we would expect if we drew an infinite number of samples (of a given size) from the population • sampling distributions have means and SDs • can have a sampling distribution for any statistic, but the most common is the sampling distribution of the mean
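The idea of a sampling distribution can be illustrated with a small simulation. This is only a sketch: the population values (mean 50, SD 10), the sample size, and the function name `simulate_sample_means` are assumptions made for illustration, and a finite number of samples stands in for the "infinite number" in the definition.

```python
import random
import statistics

def simulate_sample_means(pop_mean, pop_sd, n, n_samples, seed=0):
    """Draw n_samples samples of size n from a normal population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(n_samples)
    ]

# Hypothetical population: mean 50, SD 10; samples of n = 25.
means = simulate_sample_means(pop_mean=50, pop_sd=10, n=25, n_samples=2000)

print(round(statistics.fmean(means), 2))  # mean of the sample means, near 50
print(round(statistics.stdev(means), 2))  # SD of the sample means, near 10 / sqrt(25) = 2
```

The SD of the simulated means is the standard error of the mean, which is much smaller than the SD of individual scores.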

Recall: Estimating pop means from sample means • Here the null hyp is true, H0: μ1 = μ2 • So if our test tells us that our sample difference between means falls into the shaded areas, we reject the null hypothesis. But 5% of the time we will do so incorrectly. • [Figure: sampling distribution under H0 with shaded rejection regions, α/2 = .025 (type I error) in each tail]

Here the null hyp is false • [Figure: sampling distributions under H0: μ1 = μ2 and H1: μ1 ≠ μ2, centred at μ1 and μ2, with α/2 = .025 in each tail of the H0 distribution]

[Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions with a critical-value line; to the left of the line: Don’t Reject H0; to the right of this line we reject the null hypothesis (Reject H0); POWER: 1 − β; α/2 = .025 in each tail]

[Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions] • Correct decision: rejection of H0, 1 − β (POWER) • Correct decision: acceptance of H0, 1 − α • type 1 error (α) • type 2 error (β)

factors that influence power 1. α level • remember the α level defines the probability of making a Type I error • the α level is typically .05, but it might change depending on how worried the experimenter is about type I and type II errors • the bigger the α, the more powerful the test (but the greater the risk of erroneously saying there’s an effect when there’s not ... type I error) • E.g., a one-tailed test puts all of α in one tail, which raises power for an effect in the predicted direction

factors that influence power: α level • [Figure: H0: μ1 = μ2 distribution with α = .025 (type I error) in each tail]

factors that influence power: α level • [Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions, POWER region marked, α = .025 in each tail]

factors that influence power: α level • [Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions, comparing rejection regions of α = .025 and α = .05]
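The effect of the α level on power can be sketched numerically. The function `approx_power` below is a hypothetical helper, not part of G*Power; it uses a normal approximation for comparing two independent group means, and the effect size (d = 0.5) and group size (n = 25) are illustrative assumptions.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05, two_tailed=True):
    """Normal-approximation power for comparing two independent group means."""
    z = NormalDist()
    shift = d * (n_per_group / 2) ** 0.5  # expected z for the mean difference
    if two_tailed:
        crit = z.inv_cdf(1 - alpha / 2)
        # Chance of landing in either rejection region.
        return z.cdf(shift - crit) + z.cdf(-shift - crit)
    crit = z.inv_cdf(1 - alpha)
    return z.cdf(shift - crit)

power_01 = approx_power(0.5, 25, alpha=0.01)
power_05 = approx_power(0.5, 25, alpha=0.05)
power_05_one_tail = approx_power(0.5, 25, alpha=0.05, two_tailed=False)

# Power rises as alpha is relaxed, and again when all of alpha sits in one tail.
print(round(power_01, 2), round(power_05, 2), round(power_05_one_tail, 2))
```

The price of each gain in power is a higher type I error rate, or, for the one-tailed test, no ability to detect an effect in the unpredicted direction.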

factors that influence power 2. the size of the effect (d) • the effect size is not something the experimenter can (usually) control - it represents how big the effect is in reality (the size of the relationship between the IV and the DV) • Independent of N (population level) • it stands to reason that with big effects you’re going to have more power than with small, subtle effects

factors that influence power: d • [Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions, α = .025 in each tail]

factors that influence power 3. sample size (N) • the bigger your sample size, the more power you have • large sample size allows small effects to emerge • or … big samples can act as a magnifying glass that detects small effects

factors that influence power 3. sample size (N) • you can see this when you look closely at formulas • the standard error of the mean tells us how much, on average, we’d expect a sample mean to differ from the population mean just by chance • the bigger the N, the smaller the standard error, and smaller standard errors = bigger z scores • Std err: $\sigma_{\bar{X}} = \sigma / \sqrt{N}$

factors that influence power 4. smaller variance of scores in the population (σ²) • small standard errors lead to more power. N is one thing that affects your standard error • the other thing is the variance of the population (σ²) • basically, the smaller the variance (spread) in scores, the smaller your standard error is going to be

factors that influence power: N & σ² • [Figure: H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions, α = .025 in each tail]
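How N and the population SD drive the standard error can be checked with a few lines. This is a minimal sketch using the usual formula for the standard error of the mean (σ divided by the square root of N); the function name and the values of σ and N are illustrative assumptions.

```python
def standard_error(sigma, n):
    """Standard error of the mean for population SD sigma and sample size n."""
    return sigma / n ** 0.5

# Quadrupling N halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(10, n))  # 2.0, then 1.0, then 0.5

# Halving the population SD also halves the standard error.
print(standard_error(5, 25))  # 1.0
```

Because N sits under a square root, gains from adding participants shrink as the sample grows, whereas reducing the spread of scores pays off in direct proportion.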

outcomes of interest • power determination • N determination • α, effect size, N, and power are all related: given any three, the fourth is determined

Effect sizes • Classic 1988 text (in the library) • Measures of group differences • Cohen’s d (t-test) • Cohen’s f (ANOVA) • Measures of association • Partial eta-squared (ηp²) • Eta-squared (η²) • Omega-squared (ω²) • R-squared (R²)

Measures of difference - d • When there are only two groups, d is the standardised difference between the two groups • to calculate an effect size (d), take the difference you expect to find between the means and divide it by the expected standard deviation of the population • conceptually, this tells us how many SDs apart we expect the populations (null and alternative) to be
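The calculation of d described above can be written out directly. This is a minimal sketch; the function name `cohens_d` is hypothetical, and the inputs (means of 5 and 10, population SD of 10) are the values used in the between-subjects t-test example in this workshop.

```python
def cohens_d(mean_1, mean_2, sd):
    """Standardised difference between two means, given the expected population SD."""
    return (mean_2 - mean_1) / sd

d = cohens_d(mean_1=5, mean_2=10, sd=10)
print(d)  # 0.5: the two populations are expected to sit half an SD apart
```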

Cohen’s conventions for d • Small effect - d=0.20; Medium effect - d=0.50; Large effect - d=0.80

overlap of distributions • [Figure: overlap of the H0: μ1 = μ2 and H1: μ1 ≠ μ2 distributions for small, medium, and large effects]

Measures of association - Eta-Squared • Eta squared is the proportion of the total variance in the DV that is attributed to an effect. • Partial eta-squared is the proportion of the leftover variance in the DV (after all other IVs are accounted for) that is attributable to the effect • Partial eta-squared is what SPSS gives you, but it is dodgy (it overestimates the effect)
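The difference between the two measures is easiest to see from an ANOVA table. The sketch below assumes the usual definitions (effect sum of squares over total sum of squares for eta-squared, effect over effect plus error for partial eta-squared); the sums of squares are made-up numbers for a hypothetical two-factor design.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of total DV variance attributed to the effect."""
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Proportion of variance left over after the other effects are removed."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares for factors A, B, their interaction, and error.
ss = {"A": 20.0, "B": 30.0, "AxB": 10.0, "error": 40.0}
ss_total = sum(ss.values())

eta_a = eta_squared(ss["A"], ss_total)
partial_eta_a = partial_eta_squared(ss["A"], ss["error"])
print(eta_a)                    # 0.2
print(round(partial_eta_a, 3))  # 0.333
```

With more than one effect in the design, the partial value is never smaller than eta-squared, because the variance explained by the other effects is dropped from the denominator.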

Measures of association - Omega-squared • Omega-squared is an estimate of the dependent variable population variability accounted for by the independent variable. • For a one-way between groups design: $\omega^2 = \frac{(p-1)(F-1)}{(p-1)(F-1) + np}$ • p = number of levels of the treatment variable, F = the F value, and n = the number of participants per treatment level • Equivalently: $\omega^2 = \frac{SS_{effect} - df_{effect}\,MS_{error}}{SS_{total} + MS_{error}}$
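Omega-squared can be computed either from the sums of squares or from p, F, and n as defined on this slide. The sketch below checks that the two routes agree for a one-way between groups design. The F-based expression is the standard algebraic equivalent of the sums-of-squares formula, and the ANOVA values are hypothetical.

```python
def omega_sq_from_ss(ss_effect, df_effect, ms_error, ss_total):
    """Omega-squared from the ANOVA sums of squares."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

def omega_sq_from_f(p, f_value, n):
    """Omega-squared from p levels, the F value, and n participants per level."""
    return (p - 1) * (f_value - 1) / ((p - 1) * (f_value - 1) + n * p)

# Hypothetical one-way design: p = 3 groups, n = 10 per group.
p, n = 3, 10
ss_effect, ss_error = 40.0, 135.0
df_effect, df_error = p - 1, p * (n - 1)
ms_error = ss_error / df_error                # 5.0
f_value = (ss_effect / df_effect) / ms_error  # 4.0

w_ss = omega_sq_from_ss(ss_effect, df_effect, ms_error, ss_effect + ss_error)
w_f = omega_sq_from_f(p, f_value, n)
print(round(w_ss, 3), round(w_f, 3))  # 0.167 0.167
```

For the same data, eta-squared would be 40 / 175, about .23, which shows how much the uncorrected measure sits above omega-squared in a small sample.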

Measures of difference - f • Cohen’s (1988) f for the one-way between groups analysis of variance can be calculated as follows: $f = \sqrt{\omega^2 / (1 - \omega^2)}$ • Or can use eta sq instead of omega • It is an averaged standardised difference between the 3 or more levels of the IV (even though the above formula doesn’t look like that) • Small effect - f=0.10; Medium effect - f=0.25; Large effect - f=0.40
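The conversion from a proportion-of-variance measure to Cohen’s f is short enough to check by hand. This sketch assumes the standard conversion (the square root of explained over unexplained variance); the input values are not from the slides but are back-calculated so that they land on the small, medium, and large conventions.

```python
from math import sqrt

def cohens_f(prop_variance):
    """Cohen's f from omega-squared (or eta-squared) as a proportion of variance."""
    return sqrt(prop_variance / (1 - prop_variance))

# Proportions of variance chosen to match f = .10, .25, and .40.
for label, value in (("small", 0.0099), ("medium", 0.0588), ("large", 0.1379)):
    print(label, round(cohens_f(value), 2))
```

A "medium" f of .25 therefore corresponds to an effect explaining only about 6% of the variance.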

Measures of association - R-Squared • R² is the proportion of variance explained by the model • In general R² is given by $R^2 = SS_{regression} / SS_{total}$ • Can be converted to effect size f² • f² = R²/(1 − R²) • Small effect - f²=0.02; • Medium effect - f²=0.15; • Large effect - f²=0.35
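The R-squared conversion can be sketched the same way. The function names are hypothetical, and the R² of .13 is an illustrative value chosen because it lands on the medium convention.

```python
def f_squared(r_squared):
    """Effect size f-squared from the proportion of variance explained."""
    return r_squared / (1 - r_squared)

def r_squared_from_f_squared(f_sq):
    """Inverse conversion, useful for reading the conventions as R-squared values."""
    return f_sq / (1 + f_sq)

print(round(f_squared(0.13), 2))                 # 0.15, a medium effect
print(round(r_squared_from_f_squared(0.35), 2))  # 0.26, the R-squared behind a large effect
```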

Summary of effect conventions From G*Power http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/user_manual/user_manual_02.html#input_val

estimating effect • prior literature • assessment of how great a difference is important • e.g., an effect on reading ability is only worth the trouble if it increases scores by at least half an SD • special conventions

side issues… • recall the logic of calculating estimates of effect size (i.e., criticisms of significance testing) • the tradition of significance testing is based upon an arbitrary rule leading to a yes/no decision • power illustrates further some of the caveats with significance testing • with a high N you will have enough power to detect a very small effect • if you cannot keep error variance low, a large effect may still be non-significant

side issues… • on the other hand… • sometimes very small effects are important • by employing strategies to increase power you have a better chance of detecting these small effects

power • Common constraints: • Cell size too small, b/c sample difficult to recruit or too little time / money • Small effects are often a focus of theoretical interest (especially in social / clinical / org) • DV is subject to multiple influences, so each IV has small impact • “Error” or residual variance is large, because many IVs unmeasured in experiment / survey are influencing DV • Interactions are of interest, and interactions draw on smaller cell sizes (and thus lower power) than tests of main effects [Cell means for interaction are based on n observations, while main effects are based on n x # of levels of other factors collapsed across]

determining power • sometimes, for practical reasons, it’s useful to try to calculate the power of your experiment before conducting it • if the power is very low, then there’s no point in conducting the experiment. Basically, you want to make sure you have a reasonable shot at getting an effect (if one exists!) • which is why grant reviewers want power calculations

Post hoc power calculations • Generally useless / difficult to interpret from the point of view of stats • Mandated within some fields • Examples of post hoc power write-ups online at http://www.psy.uq.edu.au/~wlouis

G*POWER • G*POWER is a FREE program that can make the calculations a lot easier • http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/ • Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191. • G*Power computes: • power values for given sample sizes, effect sizes, and alpha levels (post hoc power analyses) • sample sizes for given effect sizes, alpha levels, and power values (a priori power analyses) • suitable for most fundamental statistical methods • Note: some tests assume equal variance across groups and assume the population SD is known (in practice it is likely to be estimated from the sample)

Ok, let’s do it: BS t-test • two random samples of n = 25 • expect difference between means of 5 • two-tailed test, α = .05 • μ1 = 5 • μ2 = 10 • σ = 10

G*POWER

determining N • So, with that expected effect size and n we get power = ~.41 • We would correctly reject the null hyp (if it is false) only about 41% of the time • Is this good enough? • convention dictates that researchers should enter an experiment with no less than an 80% chance of getting an effect (presuming it exists), i.e., power of at least .80
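The power figure for this example can be roughly reproduced by hand. The sketch below is not G*Power’s method: it swaps in a normal approximation for the noncentral t distribution, so it comes out at about .42 rather than the ~.41 reported above. The inputs are the values from the BS t-test example.

```python
from statistics import NormalDist

# Values from the BS t-test example: means 5 and 10, SD 10, n = 25 per group.
mu_1, mu_2, sigma, n, alpha = 5, 10, 10, 25, 0.05

d = (mu_2 - mu_1) / sigma    # standardised effect size, 0.5
delta = d * (n / 2) ** 0.5   # expected z for the difference between means
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

# Probability that the observed z lands in either rejection region.
power = NormalDist().cdf(delta - z_crit) + NormalDist().cdf(-delta - z_crit)
print(round(power, 2))  # 0.42
```

The small gap from G*Power’s value arises because the t-based calculation allows for the SD being estimated from the sample.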

Determine n • Calculate effect size • Use power of .80 (convention)
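The same logic can be turned around to solve for n. This is a sketch under a normal approximation, with a hypothetical function name; G*Power’s t-based answer is slightly larger (64 per group is the commonly quoted figure for d = .5 at power .80).

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-tailed comparison of two independent means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # z cutting off the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.5))  # 63 under the normal approximation
print(n_per_group(0.2))  # a small effect needs far more participants
```

So the n = 25 per group in the example is well under half of what the .80 convention calls for.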

WS t-test • Within subjects designs more powerful than between subjects (control for individual differences) • WS t-test not very difficult in G*Power, but becomes trickier in ANOVA • Need to know correlation between timepoints (luckily SPSS paired t gives this) • Or can use the mean and SD of “difference” scores (also in SPSS output)

Method 1: Difference scores

dz = Mean diff / SD diff = .0167 / .0718 = .233
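Both routes to the within-subjects effect size can be sketched in a few lines. The function names are hypothetical. The first function reproduces the difference-score calculation above; the second uses the standard formula for the SD of difference scores, which applies when only the two SDs and the correlation between timepoints are available.

```python
from math import sqrt

def dz_from_differences(mean_diff, sd_diff):
    """Effect size dz from the mean and SD of the difference scores."""
    return mean_diff / sd_diff

def sd_diff_from_correlation(sd_1, sd_2, r):
    """SD of difference scores from the two SDs and the correlation between timepoints."""
    return sqrt(sd_1 ** 2 + sd_2 ** 2 - 2 * r * sd_1 * sd_2)

dz = dz_from_differences(0.0167, 0.0718)
print(round(dz, 3))  # 0.233

# The higher the correlation between timepoints, the smaller the SD of the differences.
print(sd_diff_from_correlation(1.0, 1.0, 0.5))  # 1.0
```

This is why within-subjects designs gain power: a strong correlation between timepoints shrinks the denominator of dz.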
