Visualizing Power

Primer on Statistical Significance

Null-hypothesis significance testing (NHST) is a controversial approach to social science research. Although I will not revisit the concerns in full, it’s important to understand the concepts underlying NHST so you can fully appreciate why this approach is criticized and why its flaws are, well, flaws. One ubiquitous, yet misunderstood, concept is the p-value. For your own knowledge:

…beginning with the assumption that the true effect is zero (i.e., the null hypothesis is true), a p-value indicates the proportion of test statistics, computed from hypothetical random samples, that are as extreme, or more extreme, than the test statistic observed in the current study.

(Spence & Stanley, 2018)

or, stated another way:

The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.

(Wasserstein & Lazar, 2016)

Simply, a p-value indicates the probability of obtaining your test statistic, or one more extreme, assuming the null: \(p(data | null)\). For ease, consider the following table:

| | Reality: No Effect (\(H_{0}\) True) | Reality: Effect (\(H_{1}\) True) |
|---|---|---|
| Research Result: No Effect | Correctly Fail to Reject \(H_{0}\) | Type II Error (\(\beta\)) |
| Research Result: Effect | Type I Error (\(\alpha\)) | Correctly Reject \(H_{0}\) |

When we conduct NHST, we assume that the null hypothesis is indeed true, despite never truly knowing this. As noted, a p-value indicates the probability of obtaining data as extreme as, or more extreme than, yours, given the null.

If the null hypothesis is true, why would our data be extreme?

In inferential statistics we make inferences about population-level parameters based on sample-level statistics. For example, we infer that a sample mean is indicative of a population mean. In NHST, we assume that population-level effects or associations (i.e., correlations, mean differences, etc.) are zero: null, nothing. If we sampled an infinite number of times from a population whose true effect or association was \(0\), our sample statistics would form a distribution around the population parameter. Consider a population correlation of \(\rho=0\). If we took infinite samples of size \(n\) from this population, let’s say 20 people, we would get a distribution that looks something like:
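(The script for this figure isn’t included above, but a minimal sketch is easy: since \(\rho=0\), two independent normal variables suffice. The seed here is arbitrary.)

set.seed(1)
# 10,000 correlations between two uncorrelated variables, each sample of n = 20
null_rs <- replicate(10000, cor(rnorm(20), rnorm(20)))

# Cut-offs enclosing the most extreme 5% of sample correlations (2.5% per tail)
quantile(null_rs, c(.025, .975))  # approximately -.444 and .444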

The above shows the distribution of 10,000 correlation coefficients derived from simulated samples from a population that has \(\rho=0\). Although not quite an infinite number of samples, I hope you get the idea. The red shaded region shows the two tails that encompass 5% of the samples (2.5% per tail). These extreme 5% are at the correlation coefficients \(r \ge .444\) or \(r \le -.444\). Thus, there is a low probability (5% or less) of getting a correlation beyond \(\pm.444\). If we ran a correlation test with sample \(n=20\) and \(\alpha = .05\), a correlation of \(r=.444\) would result in approximately \(p=.05\) (the actual value of \(p\) for \(n=20\) and \(r = .44377\) is \(p = .049996\), but there are rounding errors). Because the probability of such data is quite low given \(H_{0}: \rho = 0\), we reject the null hypothesis. Our data are very unlikely given a true null hypothesis, therefore we reject the null.
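As a quick sanity check, we can verify this r-to-p conversion by hand using the standard t transformation of a correlation:

# For a correlation r from a sample of size n, t = r * sqrt(n - 2) / sqrt(1 - r^2),
# with n - 2 degrees of freedom
r <- .44377
n <- 20
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # approximately .05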

The magnitude of the tails is arbitrary, for the most part, but has been set to a standard of 5% (corresponding to \(\alpha=.05\)) since the early 20th century, which is a further criticism of NHST. Nevertheless, we set our \(\alpha\) threshold/criterion ourselves, and as shown in the previous paragraph, our \(\alpha\) level influences the correlation coefficient we would consider statistically significant. Imagine we wanted to reduce our criterion, such that \(\alpha=.01\). What do you think would happen to the red shaded region in the above graph and to the critical correlation coefficient? Adjusting what we consider extreme to be more conservative, which is essentially what we are doing when we reduce our alpha, shifts the red regions outward, and our critical correlation coefficients become larger in absolute magnitude. In this case, our graph would be:

The red region represents the extreme 1% of the distribution (0.5% per tail). The blue region is where the original regions were for \(\alpha=.05\). The proportion of samples in the blue region is? You guessed it, 4%. These 4% of samples have correlations \(.444 \le |r| < .561\). Remember, these are for a sample of \(n=20\). The critical correlation coefficients differ based on sample size because sample size influences the dispersion of the distribution (i.e., it gets skinnier or fatter).

So, we set the criterion \(\alpha\) a priori, and our resultant p-value lets us decide whether our data are probable given \(H_{0}: \rho = 0\). If the p-value is lower than our criterion, we reject \(H_{0}\), suggesting only that the population effect is not zero. This is often called statistical significance. Otherwise, if our p-value is larger than our criterion, we decide that our data are not that unlikely and fail to reject \(H_{0}\).

In short, a p-value is the probability of getting a set of data, or more extreme, given the null. We compare this to a criterion cut-off, \(\alpha\). If our data are very improbable given the null, so much so that the p-value is less than our proposed cutoff, we say the result is statistically significant.

Power

Whereas p-values rest on the assumption that the null hypothesis is true, the contrary assumption, that the null hypothesis is false (i.e., an effect exists), is important for determining statistical power. Statistical power is defined as the probability of correctly rejecting the null hypothesis, or more formally:

Statistical power is the probability that a study will find \(p < \alpha\) IF an effect of a stated size exists. It’s the probability of rejecting \(H_{0}\) when \(H_{1}\) is true.

(Cumming & Calin-Jageman, 2016)

We can conclude from this definition that if your statistical power is low, you are unlikely to reject \(H_{0}\) even when a population-level effect exists (i.e., \(\rho \ne 0\)). As will be shown, underpowered studies are doomed from the start. I hope to achieve this demonstration through a useful visualization, one I wish I had during my undergraduate studies. Conversely, if a study is overpowered (i.e., has an extremely large sample size), you can get a statistically significant result for what might be an infinitesimal or meaningless effect.
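To make the overpowered scenario concrete, here is a minimal sketch using the pwr package (introduced properly below); the near-zero effect of \(r = .01\) is an arbitrary illustrative choice:

library(pwr)
# Even a near-zero correlation of r = .01 becomes reliably "detectable"
# given a large enough sample; the required n runs into the tens of thousands
pwr.r.test(r = .01, sig.level = .05, power = .8)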

Simulating Data

We will simulate some data for this visualization. Consider a researcher who is interested in the association between substance use (SU) and suicidal behaviors (SB) in Canadian high school students. Let’s assume that the true association between SU and SB is \(\rho = .3\); in the real world, we would not know this true value. Let’s simulate the data under the assumption that there are 100,000 high school students.

Note: I will include my R script for anyone interested in following along or adjusting it themselves.

library(tidyverse)
library(simstudy)

# Setting the seed will allow you to get the same results, if you are following along.
set.seed(2434)

# 2 x 2 correlation matrix with an off-diagonal (population) correlation of .3
mat <- genCorMat(2, c(.3))
means <- c(100, 100)
sds <- c(15, 15)

# Generate 100,000 correlated SU and SB scores
data <- genCorData(100000, mu = means, sigma = sds, corMatrix = mat, cnames = c("SU", "SB"))

Using the above, we have generated a data set of the SU and SB scores of 100,000 high schoolers. Furthermore, we have set the population correlation to .3. First, let’s plot the data to get an idea of how this relationship looks:

ggplot(data, aes(x=SB, y=SU))+
  geom_point(alpha=.05, color="grey31")+
  geom_smooth(method="lm", color="red")+
  labs(x="Suicidal Behaviors", y="Substance Use")+
  coord_cartesian(xlim=c(50, 150), ylim=c(50,150))+
  theme_classic(15)
## `geom_smooth()` using formula 'y ~ x'

So, it appears that as substance use increases, so do suicidal behaviors. We can check that the population correlation actually came out to be \(\rho = .3\), as we intended when we simulated the data:

cor.test(data$SU, data$SB)
## 
##  Pearson's product-moment correlation
## 
## data:  data$SU and data$SB
## t = 99.473, df = 99998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2944180 0.3056978
## sample estimates:
##       cor 
## 0.3000684

So, the correlation in the population is \(\rho = .3000684\) (close enough). Although we aren’t so omniscient in the real world, we will use this known population correlation to determine the probability that our proposed study will correctly reject a null hypothesis, which, as you remember, states there is no relationship between SU and SB.

Before conducting our study, we will conduct a power analysis to determine the sample size required to adequately power our study. We want a good probability of rejecting the null if it is false (here, we know this is the case). To calculate our required sample size, we need: i) our \(\alpha\) criterion level, ii) our desired power (\(1-\beta\)), and iii) our predicted effect size. We will use the standard values for our criterion and power, i) \(\alpha = .05\) and ii) \(1-\beta = .8\), and we know our true effect, iii) \(\rho = .3000684\). Although we know the population correlation here, in real-world research we would need to find a good estimate of the effect size, typically by reviewing the literature for effect sizes in similar populations. We can use the pwr package to conduct our power analysis. This package is very useful; you can insert any three of the four required pieces of information to calculate the missing piece. For our analysis, we write:

library(pwr)
pwr.r.test(r=.3000684, sig.level=.05, power=.8)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 84.03388
##               r = 0.3000684
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided

The results suggest that we need a sample of about 84 people (n = 84.03388) to achieve our desired power (power = 0.8), using our known population correlation (r = 0.3000684) and \(\alpha\) (sig.level = 0.05). What does this mean? It means that if the true population effect/association/relationship is \(\rho = .3\), then 80% of all hypothetical studies using a sample size of n = 84 drawn from this population will yield \(p < \alpha\).

Visualizing Power

We can create a histogram that plots the results of many random studies drawn from our population to determine which meet \(p < \alpha\). First, I will create a function that can easily draw many random samples from our population and provide the resultant correlation value.

sample_cor <- function(x){
  # Draw a random sample of x rows from the population and return the SU-SB correlation
  data1 <- sample_n(data, x)
  cor(data1$SU, data1$SB)
}

This function retrieves a sample of size x from our population data, which we created above. We can use it to draw many samples. In the following, we will draw 2000 samples of n = 84 (evident in the script as rep(84, each = 2000)).

set.seed(5664131)
correlations <- data.frame(correlation = sapply(rep(84, each = 2000), sample_cor))

We now have a dataframe, correlations, that consists of the 2000 correlation coefficients from our 2000 random samples.

head(correlations)
##   correlation
## 1   0.3182940
## 2   0.5499953
## 3   0.2369162
## 4   0.2818166
## 5   0.4269413
## 6   0.3988922

So, sample 1 resulted in a correlation of .318, sample 2 a correlation of .550, sample 3 a correlation of .237, and so on…

We can plot these as a histogram. However, a histogram alone doesn’t tell us anything about p-values. We can calculate the correlation coefficient required for \(p < \alpha = .05\). You can use a critical correlation table to find the associated values, or a website such as here. Once we fill in our values, we see that a correlation of at least \(|r| = .215\) is required to obtain \(p < \alpha = .05\) when \(n=84\).
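Alternatively, the critical correlation can be computed directly from the critical t value, no lookup table required. Here is a minimal helper (the function name is mine):

# The critical r satisfies r_crit = t_crit / sqrt(t_crit^2 + df), with df = n - 2
critical_r <- function(n, alpha = .05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}
critical_r(84)  # approximately .215

The same function reproduces the .349 and .444 cut-offs used below for \(n = 32\) and \(n = 20\).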

First, we will plot a histogram of all of the sample correlation coefficients.

ggplot(correlations, aes(correlation))+
  geom_histogram(fill="grey", color="black", binwidth = .025)+
  coord_cartesian(xlim=c(-1, 1))+
  scale_x_continuous(breaks=seq(-1, 1, by=.2))+
  scale_y_continuous(breaks=seq(0, 250, by=25))+
  geom_vline(xintercept=0.3000684, color="blue", size=1)+
  geom_vline(xintercept=0.215, color="red", size=1, linetype=2)+
  geom_vline(xintercept=-0.215, color="red", size=1, linetype=2)+
  labs(x="Correlation Coefficient", y="Frequency")+
  annotate("rect", xmin = -Inf, xmax = -.215, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  annotate("rect", xmin = .215, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  theme_classic(15)

This graph represents the distribution of correlation coefficients for each of our random samples, which were drawn from the 100,000 high schoolers. Notice how they form a seemingly normal distribution around our true population correlation coefficient, \(\rho=.3000684\). Although it may look normal, it is actually slightly skewed because of the bounds of the correlation coefficient (i.e., -1 to 1). The red lines and shaded regions represent correlation coefficients beyond the \(|r| = .215\) cut-off, which would result in \(p < .05\). Thus, these are statistically significant results in our test of the null hypothesis (i.e., \(\rho = 0\)). The blue line represents the true population correlation coefficient (\(\rho=.3000684\)). Just how many samples scored at or beyond the critical correlation value of \(|r| = .215\)?
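As an aside, this skew is why the pwr output above mentions an “arctangh transformation”: Fisher’s \(z = atanh(r)\) maps correlations onto a scale where their sampling distribution is approximately normal with standard deviation \(1/\sqrt{n-3}\). A quick check:

# Fisher's z transformation of the 2000 sample correlations
z <- atanh(correlations$correlation)
c(mean = mean(z), sd = sd(z), theoretical_sd = 1 / sqrt(84 - 3))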

pwr <- correlations %>% 
  mutate(sig=ifelse(correlation>=.215, 1, 
                    ifelse(correlation<=-.215, 1, 0))) %>% 
  mutate(sig=as.factor(sig)) %>% 
  select(sig) 

table(pwr$sig)
## 
##    0    1 
##  413 1587

This tells us that 1587 correlation coefficients were at or beyond the critical value. Do you have any guess what proportion of the total samples that was? Recall that power is the probability that any given study will have \(p<\alpha\). Approximately eighty percent (79.35%) of these studies met that criterion: this was our power! Why is it 79.35% and not exactly 80%? Remember, power refers to a hypothetically infinite number of samples drawn from the population, for a given sample size, effect size, and \(\alpha\). Had we drawn \(\infty\) samples, 80% would have \(p<\alpha\). Also, our power analysis suggested a sample size of \(n=84.03388\), which we rounded down to 84.
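For the curious, that empirical power can be computed in a single line:

# Proportion of sample correlations at or beyond the critical value
mean(abs(correlations$correlation) >= .215)  # .7935, i.e., 79.35%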

What if we couldn’t recruit 84 participants?

Perhaps we sampled from one high school in a small Canadian city and could only recruit 32 participants. How do you think this would affect our power? Smaller sample sizes give less precise estimates of population parameters (i.e., the histogram above would be more spread out). Sample size also influences our critical correlation coefficient, so the red regions shift outward. Both of these reduce power. Let’s rerun our simulation with 2000 random samples of \(n=32\) to see how it affects our power.

set.seed(372837)
correlations_32 <- data.frame(correlation = sapply(rep(32, each = 2000), sample_cor))

Next we can plot the resultant correlation coefficients. Note that the critical correlation coefficient for \(n = 32\) is larger than for \(n=84\). The new critical value is \(r=.349\). The following script adjusts for these differences.

ggplot(correlations_32, aes(correlation))+
  geom_histogram(fill="grey", color="black", binwidth = .025)+
  coord_cartesian(xlim=c(-1, 1))+
  scale_x_continuous(breaks=seq(-1, 1, by=.2))+
  scale_y_continuous(breaks=seq(0, 250, by=25))+
  geom_vline(xintercept=0.3000684, color="blue", size=1)+
  geom_vline(xintercept=0.349, color="red", size=1, linetype=2)+
  geom_vline(xintercept=-0.349, color="red", size=1, linetype=2)+
  labs(x="Correlation Coefficient", y="Frequency")+
  annotate("rect", xmin = -Inf, xmax = -.349, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  annotate("rect", xmin = .349, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  theme_classic(15)

Hopefully, you can see that despite the distribution still centering around the population correlation of \(.3\), the red regions have shifted outward due to the smaller \(n\) (which affects our critical correlation coefficient). This means that fewer of the samples produce correlation coefficients that fall in the red regions (i.e., power is lower). Would a formal calculation agree? First, let’s find out how many studies resulted in \(p<\alpha=.05\), and then run a formal power analysis to see whether the two match.

pwr_32 <- correlations_32 %>% 
  mutate(sig=ifelse(correlation>=.349, 1, 
                    ifelse(correlation<=-.349, 1, 0))) %>% 
  mutate(sig=as.factor(sig)) %>% 
  select(sig) 

table(pwr_32$sig)
## 
##    0    1 
## 1208  792

So, 792 studies yielded statistically significant results (792/2000 = 39.6%). Would our power align?

pwr.r.test(n=32, sig.level = .05, r=.3)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 32
##               r = 0.3
##       sig.level = 0.05
##           power = 0.3932315
##     alternative = two.sided

Whoa! Close enough for me. Let’s take it one step further and assume we could only get 20 participants.

set.seed(372837)
correlations_20 <- data.frame(correlation = sapply(rep(20, each = 2000), sample_cor))

Next we can plot the resultant correlation coefficients. Note that the critical correlation coefficient for \(n = 20\) is larger than for \(n=32\) (remember that the distribution of sample correlations spreads out more for smaller sample sizes). The new critical value is \(r=.444\). The following script adjusts for these differences.

ggplot(correlations_20, aes(correlation))+
  geom_histogram(fill="grey", color="black", binwidth = .025)+
  coord_cartesian(xlim=c(-1, 1))+
  scale_x_continuous(breaks=seq(-1, 1, by=.2))+
  scale_y_continuous(breaks=seq(0, 250, by=25))+
  geom_vline(xintercept=0.3000684, color="blue", size=1)+
  geom_vline(xintercept=0.444, color="red", size=1, linetype=2)+
  geom_vline(xintercept=-0.444, color="red", size=1, linetype=2)+
  labs(x="Correlation Coefficient", y="Frequency")+
  annotate("rect", xmin = -Inf, xmax = -.444, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  annotate("rect", xmin = .444, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
  theme_classic(15)

Hopefully, you can see that despite the distribution still centering around our population correlation of .3, the red regions are further outward than in the previous two examples. Again, let’s see how many studies resulted in \(p<\alpha=.05\) and then run a formal power analysis to see whether the two match.

pwr_20 <- correlations_20 %>% 
  mutate(sig=ifelse(correlation>=.444, 1, 
                    ifelse(correlation<=-.444, 1, 0))) %>% 
  mutate(sig=as.factor(sig)) %>% 
  select(sig) 

table(pwr_20$sig)
## 
##    0    1 
## 1504  496

So, 496 studies yielded statistically significant results (496/2000 = 24.8%). Would our power align?

pwr.r.test(n=20, sig.level = .05, r=.3)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 20
##               r = 0.3
##       sig.level = 0.05
##           power = 0.2559237
##     alternative = two.sided

So, out of 2000 random samples from our population, 24.8% had \(p<\alpha=.05\), and our power analysis suggests that 25.6% would. Again, given infinite samples, 25.59237…% would have \(p<\alpha=.05\). In sum, as sample size decreases, all other things held constant, our power decreases. If your sample is too small, you will be unlikely to reject \(H_{0}\) even if a true effect exists. Thus, it is important to ensure you have an adequately powered study. Plan ahead. Otherwise, even if a population effect exists, you may not detect it through NHST. Or, maybe a non-NHST approach is best ;).
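To see the full relationship between sample size and power for \(\rho = .3\), we can trace the entire power curve (the grid of sample sizes below is an arbitrary choice):

library(pwr)
library(ggplot2)

ns <- seq(10, 200, by = 5)
power_curve <- data.frame(
  n = ns,
  power = sapply(ns, function(n) pwr.r.test(n = n, r = .3, sig.level = .05)$power)
)

ggplot(power_curve, aes(n, power)) +
  geom_line() +
  geom_hline(yintercept = .8, linetype = 2) +  # the conventional .8 target
  labs(x = "Sample Size (n)", y = "Power") +
  theme_classic(15)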

What if our population effect was larger than \(\rho = .3\)?

If our population correlation were higher, what do you think would happen to power? Specifically, imagine we reran the three simulations above, with \(n = 84\), \(32\), and \(20\), but the population correlation was \(\rho=.6\). The distributions of sample correlation coefficients would still spread out with smaller sample sizes, and the critical correlation coefficients for each \(n\) would stay the same, yet we should have more samples with \(p<\alpha=.05\) because the distributions are centered further away from \(0\). I will quickly re-run all analyses using this different \(\rho\) and report the results and power. My script is almost identical to the above, so it is largely excluded.

\(n = 84, \rho=.6\)

Re-simulating population data
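(The script is excluded above; a minimal sketch, assuming the same approach as before but with \(\rho = .6\), would be:)

set.seed(2434)
mat6 <- genCorMat(2, c(.6))
data <- genCorData(100000, mu = c(100, 100), sigma = c(15, 15),
                   corMatrix = mat6, cnames = c("SU", "SB"))
# sample_cor() draws from `data`, so samples drawn after this point come
# from the rho = .6 population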

True Population Correlation

Drawing Samples and Plotting

Notice anything? Let’s calculate how many studies were statistically significant and how many we would expect through power analysis.

## 
##    1 
## 2000

So, 2000 studies yielded statistically significant results (100%). Would our power align?

pwr.r.test(n=84, sig.level = .05, r=.6)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 84
##               r = 0.6
##       sig.level = 0.05
##           power = 0.9999918
##     alternative = two.sided

Close enough.

\(n = 32, \rho=.6\)

Drawing Samples and Plotting

Let’s calculate how many studies were statistically significant and how many we would expect through power analysis.

## 
##    0    1 
##   70 1930

So, 1930 studies yielded statistically significant results (1930/2000 = 96.5%). Would our power align?

pwr.r.test(n=32, sig.level = .05, r=.6)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 32
##               r = 0.6
##       sig.level = 0.05
##           power = 0.965675
##     alternative = two.sided

\(n = 20, \rho=.6\)

Drawing Samples and Plotting

Let’s calculate how many studies were statistically significant and how many we would expect through power analysis.

## 
##    0    1 
##  310 1690

So, 1690 studies yielded statistically significant results (1690/2000 = 84.5%). Would our power align?

pwr.r.test(n=20, sig.level = .05, r=.6)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 20
##               r = 0.6
##       sig.level = 0.05
##           power = 0.8306364
##     alternative = two.sided

As can be seen, power increases with effect size, when holding sample size and \(\alpha\) constant. Also, larger sample sizes will increase power, holding all else constant. This relationship can be seen in the following table:

| Population \(\rho\) | Sample Size (\(n\)) | \(\alpha\) | Power (\(1-\beta\)) |
|---|---|---|---|
| .1 | 20 | .05 | .0670 |
| .1 | 32 | .05 | .0845 |
| .1 | 84 | .05 | .1482 |
| .1 | 200 | .05 | .2919 |
| .3 | 20 | .05 | .2559 |
| .3 | 32 | .05 | .3932 |
| .3 | 84 | .05 | .7997 |
| .3 | 200 | .05 | .9917 |
| .6 | 20 | .05 | .8306 |
| .6 | 32 | .05 | .9657 |
| .6 | 84 | .05 | .9999 |
| .6 | 200 | .05 | 1.000 |
| .1 | 782 | .05 | .8000 |
| .05 | 3136 | .05 | .8000 |

As the true population effect reduces in magnitude, your power is also reduced, given constant \(n\) and \(\alpha\). So, if the population correlation between substance use and suicidal behaviors were \(\rho = .1\), we would require approximately \(n=782\) to achieve power of \(1-\beta = .8\). In other words, 80% of a hypothetically infinite number of samples of \(n=782\) would give statistically significant results, \(p<\alpha\), when \(\rho=.1\). If the population correlation were \(\rho=.05\), we would require approximately \(n=3136\) to achieve the same power. This is the basis of the argument that large enough sample sizes produce statistically significant results, \(p <\alpha\), for effects that are meaningless. If \(\rho = .05\), which is a very small and potentially meaningless effect, sufficiently large samples will likely detect it and yield statistical significance. Hypothetically, 80% of random samples of \(n=3136\) will result in \(p <\alpha=.05\) for a population correlation of \(\rho=.05\). Despite the statistical significance, there isn’t much clinical significance.
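Both of these sample sizes can be verified with the same pwr call used earlier:

library(pwr)
pwr.r.test(r = .1,  sig.level = .05, power = .8)  # n is approximately 782
pwr.r.test(r = .05, sig.level = .05, power = .8)  # n is approximately 3136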

What happens when we adjust our \(\alpha\) level?

Recall from above that our alpha level is a criterion against which we compare our p-values. Reducing our alpha level results in a larger critical correlation coefficient (i.e., what we deem extreme) and, thus, smaller red areas. In short, reducing alpha decreases power, holding all other things constant. Consider the following graph:

This shows two histograms of hypothetical distributions of correlations, each drawn from 2000 samples. The black represents a distribution where the population effect is zero (\(\rho=0\)), and the blue one where the population effect is .3 (\(\rho=.3\)). If we hold everything constant except our \(\alpha\) level, reducing it, the red regions become smaller and shift further out towards the tails. As the red regions shrink, the proportion of blue samples within them also decreases. This decrease in proportion is the reduction in power.
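We can quantify this drop with our original design (\(n = 84\), \(\rho = .3\)) by comparing the two alpha levels:

library(pwr)
pwr.r.test(n = 84, r = .3, sig.level = .05)$power  # approximately .80
pwr.r.test(n = 84, r = .3, sig.level = .01)$power  # noticeably lower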

We have visited statistical significance, in the context of NHST, and statistical power through the use of visualizations. Using hypothetical data on the correlation between substance use and suicidal behaviors in high school students, I have shown how sample size, population effect size, and alpha influence statistical power. A better understanding of p-values and power will help you better plan your studies. Even so, it is important to be mindful of recent criticisms and flaws of NHST in the psychological sciences.

Companion shiny app can be found here.
