Photo by Tom Hermans on Unsplash

Orthogonal Predictors Influence on Statistical Power

I recently came across a Twitter poll that piqued my interest. The specific poll asked:

Including non confounding covariates (Z) in the regression y~ X + Z increases power to detect association of X with y. (assuming association of Z with y is non-zero).

My immediate response was “No” because the variance predicted by the covariate will not influence the variance explained by the original predictor and, thus, not influence the standard error. However, after reading comments, it may be useful to delve a little deeper to determine if this is ideed the case. We will run a quick simulation to corroborate, or refute, this claim (or leave me more confused).

The question of interest asks if the power to detect significant results for \(\beta1\) increases when we move from regression model:

\(Y_i = \beta_0 + \beta_1X_{1i}+e_i\)

to a model that includes an uncorrelated covariate:

\(Y_i = \beta_0 + \beta_1X_{i}+\beta_2Z_i+e_i\)

Note that \(r_{X,Z} = 0\). If you are unfamiliar with statistical power, please see my (previous post)[https://tylerpritchard.netlify.com/post/visualizing-power-in-correlation-analysis/] on power for a quick primer.

Some Maths

It is useful to understand how significance testing for works in multiple regession to answer our question. I don’t remember the maths by heart, so I will be referring to Kutner et al. (2005), to refresh my memory First, follows the \(t\) distribution and significance tests are completing using the following:

\(t = \frac{b-\beta_{null}}{s_b}\)

where \(b\) is the calculated regression coefficient; \(\beta_{null}\) is the null hypothesis, commonly and here, \(\beta_{null}=0\); and \(s_b\) is the standard error of the regression coefficient estimate. So, as we all learned in our undergrads, we get our t-value, break out of critical t table and compare the values to determine if the results in statistically significant. Power for this test can be understood as:

\(Power = P(|t| > t(1-\alpha/2; n-2 | \delta))\)

Or, the proportion of times that our calculated t will be higher than the critical t value. Here, \(\delta\) is the non-centrality parameter (NCP) and is calculated using:

\(\delta = \frac{\beta_{TRUE} - \beta_{NULL}}{\sigma_{beta}}\)

Becasue the true beta value is unknown, it can be estimated using past research or estimates. After you calculate your NCP you can look up the estimated power using an old-fashioned table. I’m not feeling very old-fashioned, so let’s estimate power in using the pwr package, simulate some data, and see if the numbers align.

Consider a researcher who is testing whether graduate students’ productivity (\(Y\)) is predicted by the average hours reading scholarly articles per day (\(x\)). They hypothesize that there will be a positive relationship between the two such that students who read more articles are more productive, \(\beta_{x}>0\). The researchers would also like to control for two potential covairiates, the average hours per sleep per night (\(z\)) and number of average grams of vegetables eaten per day (\(v\)). Thus, their final regression model is: \(Y_i = \beta_0 + \beta_{articles}x_{i}+\beta_{sleep}z_i+\beta_{veg}v_i+e_i\)

Would their ability to detect a relationship between articles read and productivity be increased by introducing one covariate? We will examine this, assuming that \(r_{xz}=0\), \(r_{xy}\neq0\), and \(r_{zy}\neq0\).

Simulating Popluation Data

I like the simstudy package to simulate data. Note, however, that in practice we do not know population parameters such as correlations, means, and variances. We know these values in this exmaple to facilitate our understanding.

library(simstudy)
library(tidyverse)
library(pwr)
library(apaTables)
library(MBESS)
set.seed(327846)
cormatrix <- genCorMat(4,cors=c(.1, .1, .1,0, 0, 0))
means <- c(100, 3, 7, 200)
sd <- c(15, 0.5, 1, 50)
population <- genCorData(n=500000,
                         mu=means, 
                         sigma=sd, 
                         corMatrix = cormatrix, 
                         cnames = c("y", "x", "z", "v"))

Let’s do a quick check to see what the actual associations between the data are:

pop_mod <- lm(data=population, y~x+z+v)
apa.reg.table(pop_mod)
## 
## 
## Regression results using y as the criterion
##  
## 
##    Predictor       b       b_95%_CI beta  beta_95%_CI sr2 sr2_95%_CI     r
##  (Intercept) 74.40** [73.99, 74.82]                                       
##            x  3.03**   [2.95, 3.11] 0.10 [0.10, 0.10] .01 [.01, .01] .10**
##            z  1.50**   [1.46, 1.54] 0.10 [0.10, 0.10] .01 [.01, .01] .10**
##            v  0.03**   [0.03, 0.03] 0.10 [0.10, 0.10] .01 [.01, .01] .10**
##                                                                           
##                                                                           
##                                                                           
##              Fit
##                 
##                 
##                 
##                 
##      R2 = .030**
##  95% CI[.03,.03]
##                 
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

Look at those beautiful CIs. Thus, the population correlations between our predictors and outcome are both approximately \(.1\) and each uniquely predict about 1% of the variance in y. Formal correlations between the predictors tells us that each predictor is uncorrelated (identifying that the \(\sum(sr_{predictors}^2)=R^2\) also indicates they are uncorrelated, bonus points to you) this:

cor(population)[2:5,2:5]
##            y            x             z             v
## y 1.00000000  0.100695532  0.0998577546  0.0998578679
## x 0.10069553  1.000000000 -0.0020983454 -0.0018468130
## z 0.09985775 -0.002098345  1.0000000000  0.0003026048
## v 0.09985787 -0.001846813  0.0003026048  1.0000000000

The correlation between the predictors, x, z, and v are near-zero.

What would the power of each predictor be for a sample size of \(n=100\) and estimated (in this case, known) effect size of \(\beta = .1\). A formal power analysis, using the constraint that the predictors are uncorrelated, tells us:

pwr.r.test(r=.1, n=100, alternative = "two.sided")
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 100
##               r = 0.1
##       sig.level = 0.05
##           power = 0.168047
##     alternative = two.sided

I’m using this quick and dirty method, which means I’m assuming that the power analysis for a correlation applies to a case of single predictor in a multple regression whena covariate is uncorrelated with said predictor.

Random Sampling

Let’s take a sample of \(n=100\) and run a linear regression:

set.seed(272)
sample <- sample_n(population, size=100, replace=F)
model <- lm(data=sample, y~x+z)
apa.reg.table(model, table.number = 1)
## 
## 
## Table 1 
## 
## Regression results using y as the criterion
##  
## 
##    Predictor       b        b_95%_CI beta   beta_95%_CI sr2  sr2_95%_CI    r
##  (Intercept) 81.66** [54.50, 108.82]                                        
##            x   5.97*   [0.21, 11.73] 0.20  [0.01, 0.40] .04 [-.03, .12] .21*
##            z    0.26   [-2.90, 3.42] 0.02 [-0.18, 0.21] .00 [-.01, .01]  .03
##                                                                             
##                                                                             
##                                                                             
##              Fit
##                 
##                 
##                 
##        R2 = .043
##  95% CI[.00,.13]
##                 
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

The results of our regression inform us that neither the number of articles read (x) or hours slept (z) predict student productivity (from a statistically significant, frequentist perspective). What would happen if draw many sample and tested if our \(x\) is significant? Should we expect ~16.8% of the samples to give significant results both with and without the inclusion of the \(y\), the covariate?

sampling_one_predictor <- function(a){
  func <- sample_n(population, size=a, replace=F)
  mod <- lm(data=func, y~x)
  results <- summary(mod)$coefficient[2, 3:4]
  return(results)
}
set.seed(47827)
data_one_pred <- data.frame(lapply(rep(100, each=10000), sampling_one_predictor))
data_one_pred  <- data.frame(t(data_one_pred))
names(data_one_pred)[1] <- "t"
names(data_one_pred)[2] <- "p"
data_one_pred <- data_one_pred %>% 
  mutate(sig=ifelse(p<.05, 1, 0))
power <- table(data_one_pred$sig)


power
## 
##    0    1 
## 8284 1716

Here, 17.2% (1,654 of 10000) of the \(t\)-tests were statistically significant. This aligns fairly well with our power calculation of 16.8%. We know that our population predictor, \(x\), and covariate, \(z\), are uncorrelated. If including the covariate in the model with the predictor influences power, if should be reflect a change in the proportion of statistically significant resuts. Let’s simulate it.

sampling_two_predictor <- function(a){
  func <- sample_n(population, size=a, replace=F)
  mod <- lm(data=func, y~x+z)
  results <- summary(mod)$coefficient[2, 3:4]
  return(results)
}
set.seed(3278634)
data_two_pred <- data.frame(lapply(rep(100, each=10000), sampling_two_predictor))
data_two_pred  <- data.frame(t(data_two_pred))
names(data_two_pred)[1] <- "t"
names(data_two_pred)[2] <- "p"
data_two_pred <- data_two_pred %>% 
  mutate(sig=ifelse(p<.05, 1, 0))
power_2 <- table(data_two_pred$sig)
power_2
## 
##    0    1 
## 8266 1734

So, it appears that including the covariate, which we know is uncorrelated, decreased our power, but it’s still below, and within the realm, of our calculated power. What is we add a second covariate that is uncorrelated?

sampling_three_predictor <- function(a){
  func <- sample_n(population, size=a, replace=F)
  mod <- lm(data=func, y~x+z+v)
  results <- summary(mod)$coefficient[2, 3:4]
  return(results)
}
set.seed(1561)
data_three_pred <- data.frame(lapply(rep(100, each=10000), sampling_three_predictor))
data_three_pred  <- data.frame(t(data_three_pred))
names(data_three_pred)[1] <- "t"
names(data_three_pred)[2] <- "p"
data_three_pred <- data_three_pred %>% 
  mutate(sig=ifelse(p<.05, 1, 0))
power_3 <- table(data_three_pred$sig)
power_3
## 
##    0    1 
## 8320 1680

The results suggest a small increase in power; 17.3% of the results resulted in a statistically significant regression coefficient. I will repeat each simulation to determine f the pattern holds.

No Covariates

0 1 8304 1696 So, we have 17.7% results with statistical significance. What about our model with the covariate?

set.seed(2425)
data_two_pred2 <- data.frame(lapply(rep(100, each=10000), sampling_two_predictor))
data_two_pred2  <- data.frame(t(data_two_pred2))
names(data_two_pred2)[1] <- "t"
names(data_two_pred2)[2] <- "p"
data_two_pred2 <- data_two_pred2 %>% 
  mutate(sig=ifelse(p<.05, 1, 0))
power_22 <- table(data_two_pred2$sig)
power_22
## 
##    0    1 
## 8293 1707

It’s about the same!

Two Covariates

set.seed(6485)
data_three_pred2 <- data.frame(lapply(rep(100, each=10000), sampling_three_predictor))
data_three_pred2  <- data.frame(t(data_three_pred2))
names(data_three_pred2)[1] <- "t"
names(data_three_pred2)[2] <- "p"
data_three_pred2 <- data_three_pred2 %>% 
  mutate(sig=ifelse(p<.05, 1, 0))
power_32 <- table(data_three_pred2$sig)
power_32
## 
##    0    1 
## 8336 1664

Still in the ballpark I’ll repeat the analysis with four more trials and plot the results.

Plot

…and for you ‘big picture’ people

Thus, there may be an increase in power to detect an effect in a single regression predictor when incorporating uncorrelated covariates. However, the increase may be negligible.

I hope you enjoyed this post. If you see some error in my post, please enlighten me. I’m always happy to learn.

Tyler Pritchard
PhD Student, Sessional Professor, Researcher, and Clinician

My research interests include suicide and theory, research methods and statistics, and online activity.

Related