A primer on power

I’d like to title this post “a power primer,” but that’s the title of a 1992 Psychological Bulletin article by Jacob Cohen (the god of power analysis, now deceased). So instead I’ve titled it “a primer on power.” By changing a few words around I’ve very cleverly gone from academic plagiarism to paying homage. (And it really is one: I think Cohen’s article, and his lengthier works on power, should be required reading for behavioral scientists of all stripes.)

Power is one of the most misunderstood and/or underappreciated concepts in scientific research. Simply put, it refers to the probability of detecting an effect in your sample when it is in fact present in the population (i.e., when it’s ‘real’). If your study has, say, 90% power to detect a difference in the length of socks worn by basketball players as compared to soccer players, that means that if there really is a difference between basketball and soccer players’ sock length, there’s a 9 in 10 chance on average that you’ll be able to detect it in your sample.

In general, power is a good thing, and you want to have as much of it as you can. In an ideal world, scientific experiments would have 100% power to detect effects. Unfortunately, that doesn’t happen in the real world, because to have 100% power (i.e., complete certainty), you’d need to sample the entire population of interest, which isn’t very practical (that’s a lot of players, and twice as many socks). In practice, researchers’ sample sizes are constrained by resource considerations. And so, as a result, is power. Any time you conduct an experiment with a finite sample, you’re taking the risk that you might miss an effect even if it really does exist, simply because of blind (mis)fortune. And in general, the smaller your sample, the greater the probability of you missing an effect. This idea is intuitive enough to most people: it seems pretty obvious that if you want to know whether men are taller than women, you don’t want to base your judgment on the difference in height between just one man and one woman. If you did, you’d run the risk that you just happened to pick a particularly short man and/or a particularly tall woman. The more men and women you measure, the more the random variations from the mean average out, and the smaller the odds of mistakenly concluding that there’s no gender difference in height.
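For readers who like to see this rather than take it on faith, here is a minimal simulation sketch. It is only an illustration under assumptions that are mine, not part of the argument above: two normally distributed groups with a known standard deviation of 1, a true difference of half a standard deviation, and a two-sided z-test at the conventional 5% level. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_power(n_per_group, effect_size=0.5, alpha=0.05, trials=2000, seed=0):
    """Estimate power as the share of simulated experiments that detect a real effect.

    Assumes (for illustration only) two normal groups with known SD = 1,
    compared with a two-sided z-test. The effect is real in every trial.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for a two-sided test
    std_err = (2 / n_per_group) ** 0.5            # SE of the difference in means
    detections = 0
    for _ in range(trials):
        group_a = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if abs(diff / std_err) > z_crit:
            detections += 1
    return detections / trials


if __name__ == "__main__":
    for n in (5, 20, 80):
        print(f"n per group = {n}: estimated power = {simulated_power(n):.2f}")
```

The effect is present in every simulated experiment, so every non-detection is a miss due to sampling luck alone. The share of misses should shrink as the per-group sample grows.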

Where confusion starts to set in (and the impetus for this post) is that the intimate link between sample size and power often leads people (including many scientists) to suppose that there’s a single ‘right’ sample size for all research studies of a particular kind. It’s not uncommon to hear people say things like, “we can’t trust that study because it’s based on only 50 people! They need at least 300 to be able to say anything meaningful about the general population!” (Actually this sort of statement also betrays another kind of confusion that relates to the difference between Type I and Type II errors, but that’s a separate issue). The problem is that statistical power depends not only on sample size, but also on two other numbers: the size of the effect, and the stipulated false positive rate (also referred to as alpha, or the Type I error rate).

The importance of the first of these, effect size, is easy to see intuitively. Suppose that the average height difference between men and women were 2 feet rather than several inches. How hard would it be to detect that difference and conclude it exists? Not very. A group of curious alien taxonomists wouldn’t need to abduct very many humans before they figured the gender difference out, simply because the vast majority of men would be taller than the vast majority of women, and the difference would hit the aliens right between the antennae. On the other hand, if the mean height difference were only 1/10th of an inch, our aliens would need to abduct a lot of humans and measure them very carefully before they’d be in a good position to claim that a height difference exists. Simply put, if the effect you’re looking for is large, it takes fewer subjects to detect it. Or, more formally, one’s power to detect an effect increases monotonically with the magnitude of the effect, when holding sample size constant.
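To put rough numbers on the effect size point, the sketch below holds the sample fixed at 20 per group and varies only the standardized effect size. It relies on the textbook normal approximation for a two-sided comparison of two group means, which is my simplifying assumption; an exact t-test calculation would give somewhat lower power in small samples. The function name and the particular effect sizes are illustrative.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation (known, equal variances); effect_size is the
    difference in means expressed in standard deviations.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8, 2.0):
        print(f"effect size {d}: power = {approx_power(d, n_per_group=20):.2f}")
```

With the sample size held constant, power climbs steadily as the effect gets bigger. When the effect size is zero, the function returns the false positive rate itself.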

The second parameter, false positive rate, is somewhat less intuitive. The basic idea is that, since sampling is random and error necessarily creeps in, researchers will on rare occasions end up concluding that an effect exists in the population even though it doesn’t really. Just how often such errors occur is typically a matter of stipulation: scientists will decide that they can accept a false positive occurring, say, 1 out of every 20 times, and adjust their statistical tests accordingly. Conventionally, the false positive rate is set to 5% (and significance tests are therefore conducted at p < .05). Because the convention is so strong, it’s often easy to overlook the false positive rate in power calculations and just default to the standard 5% level. Nonetheless, there is a relationship: the more conservative your statistical test (i.e., the smaller the false positive rate you’re willing to accept), the lower your power gets, when holding sample size and effect size constant. In less technical terms, it’s kind of like saying that if you only want to be fairly sure that an effect holds true, you don’t need to look very hard. But if you want to be really sure, you need to double and triple-check. And double and triple-checking requires more observations (i.e., more subjects).
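One way to see the cost of a stricter false positive rate is to ask how many subjects you would need to keep power fixed as the threshold tightens. The sketch below uses the standard normal-approximation formula for comparing two group means. The medium effect size of 0.5, the 80% power target, and the function name are my own illustrative choices, and exact t-test calculations would ask for slightly more subjects.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided, two-group test.

    Normal approximation: n = 2 * ((z_alpha + z_power) / effect_size) ** 2,
    rounded up to a whole subject.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # stricter alpha -> larger cutoff
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for alpha in (0.05, 0.01, 0.001):
        print(f"alpha = {alpha}: about {n_per_group(0.5, alpha=alpha)} per group")
```

Each step toward a more conservative threshold raises the required sample, which is the double and triple-checking in numerical form.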

Given that power depends on these three parameters (sample size, effect size, and false positive rate), how much power is enough? It’s widely accepted that a reasonable level of power is 80-85%. I say ‘widely accepted’ because when people stop to think about what level of power they find acceptable, their answer tends to be in that ballpark (i.e., 4 times out of 5, your experiment will detect the effect you want if it really exists). But that’s not to say that most studies actually have that level of power in practice. One of the most remarkable findings (and one that’s been demonstrated over and over again) made by statisticians interested in power is that an absurdly large proportion of studies in many disciplines simply don’t have the necessary power to detect the effects they hypothesize. In the article I linked to at the beginning, Jacob Cohen points out that an analysis he conducted in 1960 indicated that the average social psychology study had only 48% power to detect moderate-sized effects. In Cohen’s words, “the chance of obtaining a significant result was about that of tossing a head with a fair coin” (p. 155). And that’s on average; presumably there are a good number of studies that have set out to identify effects they have no real chance of detecting even if they’re actually present in the population.

Cohen then went on to note that other statisticians conducting similar reviews have shown no improvement in the average level of power in the decades since. For anyone actively involved in research—or even to casual consumers of science—this should raise red flags all over the place. There really is no excuse for failing to do a simple power calculation before beginning to collect data. It’s not as though power calculation is a tedious process: all you have to do is plug two or three numbers into an online worksheet, and poof, you get your answer instantly. And yet many, maybe even most, scientists fail to do so.

In fairness, doing a power calculation isn’t quite that easy, because you rarely know the exact size of the effect you’re seeking. If you did, you probably wouldn’t need to do the study in the first place! While it’s easy to decide you’d like your study to have, say, 80% power, it’s not so easy to come up with a reasonable estimate of effect size.

Suppose for example that we want to know if there’s a correlation between people’s mood and the amount of television they watch daily. Let’s stipulate our power has to be around 80% (we don’t want to do our study if we don’t think there’s at least a 4 in 5 chance of detecting an effect), and we’ll test our hypothesis at the conventional level of p < .05. How many subjects do we need to collect data from? Well, depends. If the correlation between mood and television watching in the general population is large (canonically, around r = .5), we’re only going to need 29 people to have an 80% chance of detecting it. If it’s medium (say, r = .3), we’re going to have to round up 85 people. But if it’s only a small effect (say, r = .1, or an overlap of only 1% of the total variance in each measure), we’re faced with the daunting prospect of chasing down 785 subjects! Note that in all 3 of these cases, we’re assuming that there really is a correlation between mood and television-watching. The only difference is how strong that effect is.
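If you’d like to reproduce numbers of this kind yourself, the sketch below uses the Fisher z approximation for a two-sided test of a correlation. This is a common textbook shortcut and not necessarily the method behind the figures above, so expect it to land within a few subjects of them and not on them exactly; exact methods and published tables also differ slightly from one another. The function name is hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a population correlation r.

    Fisher z approximation for a two-sided test:
    n = ((z_alpha + z_power) / atanh(r)) ** 2 + 3, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


if __name__ == "__main__":
    for r in (0.5, 0.3, 0.1):
        print(f"r = {r}: about {n_for_correlation(r)} subjects")
```

The pattern is the one described above: going from a large correlation to a small one multiplies the required sample many times over, even though alpha and target power stay fixed.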

Of course, power calculations don’t always have to mean bad news. For example, in my area of research (functional neuroimaging), power calculations are often quite comforting. It’s an interesting quirk that people often criticize imaging studies for having small samples, when in fact imaging studies probably don’t have lower power on average than other kinds of studies (at least for standard experimental, within-subject analyses). The knee-jerk reaction is understandable though, because many psychologists (particularly in social or personality psychology) are used to working with samples in the hundreds. If that’s your background, it’s no surprise that when you come across neuroimaging studies that used samples of only 15 subjects (a pretty standard size), you’re going to think something’s horribly wrong.

In fact, there’s nothing wrong, because it turns out (fortuitously!) that effect sizes in functional neuroimaging studies tend to be huge. It’s not uncommon to see effect sizes around d = 2 (d is a standardized measure of effect size popularized by Cohen; it’s measured in standard deviations, so a d of 2 means the difference in neural activation between two experimental conditions is around 2 standard deviations). Effects that large are unheard of in most other disciplines. Consider that Cohen himself considered anything above d = 0.8 a ‘large’ effect (this is just a heuristic of course—the meaning of ‘large’ differs considerably across research areas!).

A quick power calculation reveals that a study with 12 subjects has essentially 100% power to detect an effect size of 2 at p < .05. Basically, if the population effect really is that big, you’re not going to miss it. In fact, with only 2 subjects, you’d still have an 88% shot of detecting it. This explains why early neuroimaging studies that often had only 3 or 4 subjects were able to obtain replicable results. In the early days, when little was known about the relationship between specific cognitive tasks and neural activity in humans, researchers used very broad experimental task contrasts specifically intended to elicit very large, very obvious changes in activation (e.g., comparing activation during a working memory task to a passive resting state). The effects were (not surprisingly, in hindsight) enormous. As time goes on and our knowledge of the functional neuroanatomy of cognition builds up, hypotheses become more subtle, and effect sizes diminish, requiring larger samples.

Of course, imaging studies usually don’t test for effects at p < .05, for reasons I won’t go into here (mainly the need to correct for multiple comparisons). Still, even at p < .001, a study with 15 subjects has 70% power. That’s not great, but it’s a comparable level to what you’ll find in many behavioral studies. Bump the sample up to 20 subjects, and power is now 92%, which is more than acceptable.

Hopefully, these examples make clear the importance of (a) conducting power calculations before starting to collect data, and (b) having some reasonable notion as to what the population effect size might be (e.g., based on related effects that have already been identified). Even if you’re never going to collect any data yourself, and just want to be an informed consumer of scientific literature, it pays to know something about power. Remember: effect size matters. The fact that a study only has 10 people doesn’t necessarily mean it’s too small to provide meaningful data. Conversely, a study can have thousands of subjects and still be underpowered.
