
Diagnosing Multicollinearity using Variance Inflation Factors (VIF)

In an ideal world, a regression model’s predictors will be uncorrelated with each other and with any omitted predictor that is associated with the outcome variable. When this is the case, the sums of squares accounted for by each predictor will be uninfluenced by any other predictor. That is, if you ran two simple regressions:

Model 1: \(\hat{Y} = \beta_{0} + \beta_{1}X_{1}\)

and

Model 2: \(\hat{Y} = \beta_{0} + \beta_{2}X_{2}\)

where \(\rho_{X_{1}, X_{2}}=0\), then the regression sums of squares (i.e., SSR) of the two simple models will add up to that of a multiple regression model:

Model 3: \(\hat{Y} = \beta_{0} + \beta_{1}X_{1}+\beta_{2}X_{2}\).

Let’s check this out with a quick example. Imagine the following data.
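A data frame matching the values printed below can be built with a short sketch like this (an assumed reconstruction, since the original construction code is not shown):

# Assumed reconstruction of the example data
data <- data.frame(
  id = 1:10,
  y  = c(5, 3, 8, 2, 6, 7, 6, 4, 9, 10),
  x1 = rep(c(2, 3), times = 5),
  x2 = rep(2:6, times = 2)
)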

head(data, n = 10)
##    id  y x1 x2
## 1   1  5  2  2
## 2   2  3  3  3
## 3   3  8  2  4
## 4   4  2  3  5
## 5   5  6  2  6
## 6   6  7  3  2
## 7   7  6  2  3
## 8   8  4  3  4
## 9   9  9  2  5
## 10 10 10  3  6

Here, the correlation between x1 and x2 is zero, \(r_{x1, x2}=0\).
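This can be verified with cor.test():

cor.test(data$x1, data$x2)  # Pearson correlation between the two predictors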

## 
##  Pearson's product-moment correlation
## 
## data:  data$x1 and data$x2
## t = 0, df = 8, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6296263  0.6296263
## sample estimates:
## cor 
##   0

and the results of the regression are:
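Assuming the reconstructed data frame above, a sketch that reproduces the decomposition below is:

model1 <- y ~ x1          # Model 1 formula
aov(model1, data = data)  # ANOVA table: sums of squares for x1 and the residuals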

## Call:
##    aov(formula = model1)
## 
## Terms:
##                   x1 Residuals
## Sum of Squares   6.4      53.6
## Deg. of Freedom    1         8
## 
## Residual standard error: 2.588436
## Estimated effects may be unbalanced

The sum of squares regression (SSR) for model 1 (above) is equal to 6.4. What about model 2?
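Again, a quick sketch:

model2 <- y ~ x2          # Model 2 formula
aov(model2, data = data)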

## Call:
##    aov(formula = model2)
## 
## Terms:
##                 x2 Residuals
## Sum of Squares   5        55
## Deg. of Freedom  1         8
## 
## Residual standard error: 2.622022
## Estimated effects may be unbalanced

For model 2, the SSR is 5.0. Given the above, a model with both x1 and x2 predicting y should have a total SSR of 11.4.
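A sketch for the combined model:

model3 <- y ~ x1 + x2     # Model 3 formula: both predictors
aov(model3, data = data)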

## Call:
##    aov(formula = model3)
## 
## Terms:
##                   x1   x2 Residuals
## Sum of Squares   6.4  5.0      48.6
## Deg. of Freedom    1    1         7
## 
## Residual standard error: 2.63493
## Estimated effects may be unbalanced

Fantastic! What an ideal situation to arise in a study! Hold up - this is rarely the case in the social sciences. Predictor variables are often correlated with each other or with omitted variables that are associated with the outcome. The presence of correlated predictors, at least in extreme cases, is referred to as multicollinearity (Kutner et al., 2013).

So, perhaps we can simply look at a correlation matrix to spot highly correlated predictors and, thus, multicollinearity. Not exactly; high multicollinearity can exist and influence any given \(\beta\) even when that predictor is not highly correlated with any single other predictor.

Multicollinearity in a regression model will influence the resulting statistics and, naturally, their interpretation. First, when predictors are perfectly correlated, there are infinitely many OLS solutions that fit the model equally well. Which one is correct? None and all. Will you ever come across this? Maybe. Regardless, by default you can successfully run lm() in R with linearly dependent variables and get a summary() that looks just like output from a model with uncorrelated or weakly correlated predictors. This underscores the importance of checking for multicollinearity among the variables in the model.

Second, and intuitively, remember that a regression coefficient implies that a one-unit change in \(X\) corresponds to a \(\beta_{x}\)-unit predicted change in \(Y\), while holding all other predictors constant. When predictors are correlated, holding one constant without influencing the others seems implausible.

Last, correlated predictors influence the sampling variability of the \(\beta\)s. That is, the variance of \(\beta\) is a function of multicollinearity. Again, consider the extreme case of perfectly correlated variables: there are infinitely many solutions that could be drawn from the population.

One commonly used diagnostic for multicollinearity is the Variance Inflation Factor (VIF). The VIF is the ratio between the variance of a \(\beta_{x}\) in a regression model and the variance of that same \(\beta_{x}\) given orthogonal predictors (Salmerón et al., 2018):

\(VIF = \frac{var_{\beta_{x}^{multicollinear}}}{var_{\beta_{x}^{orthogonal}}}\)

Mathematically, it is computed from the coefficient of determination (\(R^{2}\)) obtained when \(X_{1}\) is regressed on the remaining predictors, \(X_{2}\) through \(X_{p}\):

\(\hat{X}_{1} = \gamma_{0}+\gamma_{2}X_{2}+\gamma_{3}X_{3}+...+\gamma_{p}X_{p}\)

so that,

\(VIF = \frac{var_{\beta_{x}^{multicollinear}}}{var_{\beta_{x}^{orthogonal}}} = \frac{1}{1-R^{2}}\)
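For context, this ratio falls out of the OLS sampling variance of a coefficient, which can be written as

\(var(\beta_{x}) = \frac{\sigma^{2}}{\sum_{i}(x_{i}-\bar{x})^{2}} \times \frac{1}{1-R^{2}}\)

where \(R^{2}\) comes from regressing \(X\) on the remaining predictors; the right-hand factor is the VIF.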

VIFs in R

We will use the data from January 21st, 2020’s TidyTuesday. Specifically, we will take all the Taylor Swift songs and build a regression model wherein the track’s popularity is predicted using the track’s danceability, energy, and tempo.

library(tidyverse)
taylor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv') %>% 
  filter(track_artist=="Taylor Swift") %>% 
  select(track_name, track_popularity, danceability, energy, tempo)
head(taylor)
## # A tibble: 6 x 5
##   track_name                         track_popularity danceability energy tempo
##   <chr>                                         <dbl>        <dbl>  <dbl> <dbl>
## 1 Lover (Remix) [feat. Shawn Mendes]               85        0.448  0.603 205. 
## 2 Blank Space                                      78        0.76   0.703  96.0
## 3 You Need To Calm Down                            86        0.771  0.671  85.0
## 4 22                                                0        0.661  0.724 104. 
## 5 I Knew You Were Trouble.                         76        0.622  0.469  77.0
## 6 Our Song                                         61        0.667  0.659  89.1

Our regression model can be understood as

\(\hat{Y}_{pop} = \beta_{0}+\beta_{dance}X_{dance}+\beta_{energy}X_{energy}+\beta_{tempo}X_{tempo}\)

and in R we can write (I like the apaTables package output):

library(apaTables)
taylor_model <- lm(track_popularity ~ danceability + energy + tempo, data=taylor)
apa.reg.table(taylor_model)
## 
## 
## Regression results using track_popularity as the criterion
##  
## 
##     Predictor      b         b_95%_CI  beta   beta_95%_CI sr2  sr2_95%_CI    r
##   (Intercept)   9.76 [-82.41, 101.93]                                         
##  danceability 107.02  [-9.44, 223.48]  0.43 [-0.04, 0.89] .13 [-.10, .36]  .26
##        energy -63.67  [-136.18, 8.83] -0.41 [-0.88, 0.06] .12 [-.10, .34] -.24
##         tempo   0.19    [-0.12, 0.49]  0.33 [-0.20, 0.86] .06 [-.10, .22] -.07
##                                                                               
##                                                                               
##                                                                               
##              Fit
##                 
##                 
##                 
##                 
##        R2 = .189
##  95% CI[.00,.38]
##                 
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

To get our VIFs, we can use the car package.

library(car)
vif(taylor_model)
## danceability       energy        tempo 
##     1.420959     1.463243     1.856826

Based on our newfound knowledge of VIFs, the VIF for danceability should be equal to \(\frac{1}{1-R^{2}}\), where \(R^{2}\) comes from regressing danceability on the other two predictors:

\(\hat{X}_{dance} = \gamma_{0}+\gamma_{energy}X_{energy}+\gamma_{tempo}X_{tempo}\)

Let’s run the model.

dance <- lm(danceability~energy+tempo, data=taylor)
apa.reg.table(dance)
## 
## 
## Regression results using danceability as the criterion
##  
## 
##    Predictor       b       b_95%_CI  beta    beta_95%_CI sr2  sr2_95%_CI     r
##  (Intercept)  0.70**   [0.55, 0.86]                                           
##       energy    0.20  [-0.05, 0.45]  0.33  [-0.07, 0.73] .08 [-.09, .26]   .03
##        tempo -0.00** [-0.00, -0.00] -0.62 [-1.03, -0.22] .30  [.01, .58] -.46*
##                                                                               
##                                                                               
##                                                                               
##              Fit
##                 
##                 
##                 
##       R2 = .296*
##  95% CI[.01,.50]
##                 
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

\(R^{2} = .296\), so \(\frac{1}{1-R^{2}} = \frac{1}{1-.296} = \frac{1}{.704} = 1.420\). Exactly! This means that the variance of the original model’s \(\beta_{dance}\) is 1.42 times larger than it would be if the predictors were orthogonal (equivalently, its standard error is inflated by a factor of \(\sqrt{1.42} \approx 1.19\)).
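As a quick check in code (a small sketch using the dance model above):

r2_dance <- summary(dance)$r.squared  # R^2 from danceability ~ energy + tempo
1 / (1 - r2_dance)                    # approximately 1.42, matching car::vif() for danceability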

What VIF is considered large?

I propose no single answer. Rather, you should now have an understanding of how a VIF is calculated and how to interpret it. Regardless, here are some quick ‘rules’ and a counterargument:

  • A single VIF >10 is considered large and may be ‘unduly influencing the least square estimate’ (Kutner et al., 2013)
  • A mean VIF across all predictors that is substantially larger than 1 indicates serious multicollinearity (Kutner et al., 2013)
  • A VIF > 5 indicates highly correlated predictors (Stats How to)
  • Ignore hard and fast rules (O’Brien, 2007)

What can I do about multicollinearity?

There are some proposed ways to help reduce the influence of multicollinearity. I will not comment on the appropriateness of these options but ask readers to consider the theories being tested by their models when choosing next steps. Some options are:

  • Perhaps the simplest solution is to drop one or more of the correlated predictors.
  • Increasing the sample size to ‘break the pattern of multicollinearity’ (p. 413; Kutner et al., 2013).
  • Create a composite variable from the correlated predictors using data reduction techniques.
  • Ridge regression (a brief sketch follows this list).
  • Nothing. I suggest reading O’Brien (2007) for his thoughts on interpreting VIFs, including “Even with VIF values that greatly exceed the rules of 4 or 10, one can often confidently draw conclusions from regression analyses” (p. 681).
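For the ridge option, here is a minimal sketch of what it could look like using the glmnet package (one implementation among several), reusing the taylor data frame from the example above:

library(glmnet)
X <- as.matrix(taylor[, c("danceability", "energy", "tempo")])  # predictor matrix
y <- taylor$track_popularity                                    # outcome
ridge_fit <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 requests the ridge (L2) penalty
coef(ridge_fit, s = "lambda.min")        # coefficients at the cross-validated lambda

Rather than letting correlated predictors trade off wildly, the ridge penalty shrinks their coefficients and stabilizes the estimates, which is why it is often suggested when multicollinearity is a concern.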

I hope you have found this helpful. Please contact me if I have made any errors, or with questions, comments, or concerns.

Tyler Pritchard
Lab Director, Professor, Researcher, and Clinician
