Multicollinearity and GLM

rikh · November 11, 2021, 5:25pm

Below, I show some generated data for a linear regression. With these features (also known as variables, covariates or predictors) A, B, C, D and E, I aim to predict an outcome Y. Would the data shown below be considered multicollinear in the sense that it could become problematic for linear regressions?

I would say yes, and I read that Bayesian models can handle collinear data well, so I expected a huge difference between a Bayesian and Frequentist model. However, I compared a Bayesian to a Frequentist model and they gave the same outcomes, see the figures below. Therefore, I concluded that the Frequentist model did not have any issues with the collinearity.

Might this be because lm from GLM uses QR decomposition? Or, is my data not correlated enough? When would GLM.lm start showing huge variances as mentioned in Wasserman’s lecture notes?

pdeffebach · November 11, 2021, 5:38pm

Your data is not correlated enough. Multicollinearity only becomes a problem for estimation if it is very, very high, i.e. within the margin of error for QR decomposition. A correlation of .82 is not that.
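
To put rough numbers on "very, very high", here is a minimal sketch that assumes just two standardized predictors with correlation r. The helper `corr_matrix` is made up for illustration. It prints the condition number of the 2×2 correlation matrix, which equals (1 + r) / (1 - r), next to 1 / eps(Float64), the rough scale at which Float64 linear algebra stops being trustworthy.

```julia
using LinearAlgebra

# Correlation matrix of two standardized predictors with correlation r
# (hypothetical helper, not part of GLM.jl).
corr_matrix(r) = [1.0 r; r 1.0]

for r in (0.0, 0.7, 0.82, 0.99, 1 - 1e-8)
    println("r = ", r, "   cond = ", cond(corr_matrix(r)))
end

# Rough scale at which a Float64 solve loses essentially all digits.
println("1 / eps(Float64) = ", 1 / eps(Float64))
```

At r = 0.82 the condition number is only about 10, many orders of magnitude away from the point where the decomposition itself would struggle.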

rikh · November 11, 2021, 5:46pm

You mean between the variables? Dormann et al. (2012) talk about degraded performance from correlation coefficients between variables of |r| > 0.7. Maybe Statistics.cor is very different from r. I’ll look into that now.

EDIT: Nope that’s not it. Pearson’s r it is and Statistics.cor calculates the Pearson correlation too.

EDIT2: The correlations between the variables are definitely above 0.7:

julia> cor(df.D, df.E)
0.7750338235759653
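
Pairwise correlations only look at two columns at a time. A common complementary diagnostic is the variance inflation factor (VIF), which for each predictor is the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses simulated data, since the thread's `df` is not shown. The helper `vifs` and the matrix `X` are assumptions for illustration only.

```julia
using LinearAlgebra, Statistics, Random

# VIF of each column: the diagonal of the inverse correlation matrix.
vifs(X) = diag(inv(cor(X)))

Random.seed!(42)
n = 1_000
z = randn(n)                      # shared factor driving all columns
X = hcat(z .+ 0.5 .* randn(n),    # hypothetical predictors, not the thread's df
         z .+ 0.5 .* randn(n),
         z .+ 0.5 .* randn(n))    # pairwise correlations of roughly 0.8

println("pairwise correlations: ", cor(X))
println("VIFs: ", vifs(X))
```

A VIF says by how much the variance of a coefficient estimate is inflated relative to uncorrelated predictors. Rules of thumb often flag values above 5 or 10, although such thresholds are conventions rather than hard limits.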

pdeffebach · November 11, 2021, 5:58pm

I’m not familiar with the issues studied in the paper linked, but I’ve never heard of anyone in econometrics discuss |r| > .7 being a problem.

mcreel · November 11, 2021, 7:42pm

https://github.com/ericqu/LinearRegression.jl has a test for collinearity built in. Collinearity in linear regression models means that the coefficients will be estimated imprecisely. If the priors counter the particular imprecision, then Bayesian methods will help. But, if the priors don’t add information in the dimensions where it’s lacking, they won’t help much.
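
As a rough illustration of the point about priors, the sketch below compares ordinary least squares with the posterior mean under a zero-mean Gaussian prior on the coefficients, which coincides with a ridge estimate. Everything here is an assumption for illustration: the simulated data, the known noise variance `sigma2`, and the prior variance `tau2`. It uses only the standard library rather than GLM.jl or a Bayesian package.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 200
x1 = randn(n)
x2 = x1 .+ 1e-3 .* randn(n)       # nearly a copy of x1
X = hcat(x1, x2)
y = x1 .+ x2 .+ randn(n)          # true coefficients (1, 1), noise variance 1

# Ordinary least squares and its coefficient covariance (sigma^2 = 1).
beta_ols = X \ y
cov_ols = inv(X' * X)

# Posterior under the prior beta ~ N(0, tau2 * I) with known noise variance.
sigma2 = 1.0
tau2 = 1.0
A = X' * X + (sigma2 / tau2) * I
beta_post = A \ (X' * y)
cov_post = sigma2 * inv(A)

println("OLS coefficients:       ", beta_ols)
println("posterior mean:         ", beta_post)
println("OLS std. errors:        ", sqrt.(diag(cov_ols)))
println("posterior std. devs:    ", sqrt.(diag(cov_post)))
```

The data pin down the sum of the two coefficients well but say almost nothing about their difference. This prior adds information in exactly that direction, so the posterior is far tighter than the OLS estimate. A prior that is very wide in that direction would leave the imprecision largely in place.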

rikh · November 11, 2021, 9:29pm

That makes sense. Thanks a lot!
