GLM.jl LogisticRegression errors: matrix is not positive definite; Cholesky factorization failed

The following errors:

using DataFrames, GLM

df = DataFrame(x1=[1,2,3,4], x2=[1,2,3,4], y=[1,1,0,0])

mdl = glm(@formula(y~x1+x2), df, Binomial(), LogitLink())

predict(mdl, df[:, [:x1, :x2]])

ERROR: LoadError: PosDefException: matrix is not positive definite; Cholesky factorization failed.

On the other hand, logistic regression is successfully done with ScikitLearn:


using DataFrames, ScikitLearn

@sk_import linear_model: LogisticRegression

df = DataFrame(x1=[1,2,3,4], x2=[1,2,3,4], y=[1,1,0,0])

mdl = fit!(LogisticRegression(), Matrix(df[:, [:x1, :x2]]), df.y)

ps = predict_proba(mdl, Matrix(df[:, [:x1, :x2]]))

Multi-collinearity cannnot be an issue for logistic regression unlike linear regression.

Then, any idea why GLM fails in this case?

Yes, collinearity can be a problem for nonlinear models, including logistic regression. The coefficients are simply not identified in your case, because x2 is an exact copy of x1. If ScikitLearn gives an answer, it must be eliminating either x1 or x2, or doing something similar to deal with the problem. If the index function is t(x1a+x2b), you are probably getting an estimate of a+b, which may be misleading if you don’t notice the collinearity problem. Or, it might restrict a=b, which will also give a biased estimate. In cases like this, I prefer what GLM does, an error message that helps you find the problem.

2 Likes

Scikit learn glm isn’t vanilla glm. It is glm with L2 penalty with magic number 1 as the hyper parameter

4 Likes

Linear regression entails matrix inversion, and this is the mechanism via which collinearity affects linear regression in case that the matrix is singular.

However in logistic regression, the estimation of coefficients is based on some likelihood function instead of normal equation as is used in linear regression.

Since there is no closed form expression for obtaining the solution, iterative calculation is instead used for the estimation. There is no way collinearity kicks in the process.

I think you need https://github.com/JuliaStats/GLM.jl/pull/314.

2 Likes

Thanks for the information. But it is still unclear how logistic regression can suffer from collinearity unless somehow GLM’s logistic regression algorithm employs matrix calculation like in linear regression( It seems because Cholesky decomposition was mentioned in the error message)

This is mistaken. Estimation of nonlinear models does entail solving first order conditions, numerically, rather than analytically, as can be done for linear models. With perfect collinearity, there is more than one solution to the first order conditions, and the matrix of second derivatives is singular. Most iterative gradient-based optimization methods (e.g., BFGS) will not work. It may be the case that certain procedures can still produce an estimate (e.g., regularization methods, as noted in another response, or using a derivative free optimization method) but the estimate will not be consistent, in general.

7 Likes

The algorithm for fitting generalized linear models in the GLM package is iteratively reweighted least squares (IRLS) which performs a weighted linear least squares fit at each iteration. Thus a rank-deficient model matrix means that the coefficients are unidentifiable in a GLM as well as in an LM.

The identifiability is a property of the linear predictor, Xβ, where X is the model matrix and β is the coefficient vector. It doesn’t really matter whether the linear predictor is also the fitted value, as in a linear model, or if an inverse link function is applied to it, as in logistic regression.

6 Likes

Possibly useful: https://github.com/timholy/PositiveFactorizations.jl

2 Likes

This is a great answer. For such situations - with the consistency caveat applied - I’ve applied the Moore-Penrose Inverse with success…

Thanks a lot. It would be useful if GLM could produce p-value and confidence interval along with model coefficients rather than aborts in case there is a collinearity (like in R) so that one can drop collinear predictors.

I have the same error when I ran the GLMER model: dfAllform = @formula(ys ~ 1 + bio_pd+ (1|species)+ (1|region)) gm1 = fit(MixedModel, dfAllform, dfAll, Binomial(), CloglogLink(), fast = true)
In my model, there is only one fixed effect, so collinearity won’t be the cause of the error. Does anyone know there are other possible reasons?

A similar problem reported in LoadError: PosDefException: matrix is not positive definite; Cholesky factorization failed · Issue #611 · JuliaStats/MixedModels.jl · GitHub turned out to be due to scaling of the response and model matrix - because of weights in that case. That example was a linear mixed model (LMM). With generalized linear mixed models (GLMMs) the problem is even more likely because of the iteratively re-weighted least squares (IRLS) algorithm and its penalized version (PIRLS). Obtaining reasonable starting estimates to avoid this is not easy.

Thanks for your kindly reply. Actually, I still don’t quite understand what should I do to solve this problem. Accroding to your suggestion, the response variable (ys in my model) should not be scaled or the parameters should be set to certain initial values, then I can deal with this error by the two methods? I am sorry that I barely know the specific fitting process of the GLMM model. If I misunderstood, please let me know. Thanks again.

It’s best if you open a new thread because this is getting further and further away from the original GLM.jl issue. The problem with fitting GLMMs is broadly similar to the problem in fitting GLMs, but there are a lot of differences in the details.

It’s also very hard for us to further debug your problem without having access to the data to try to reproduce the problem. If you are unable to publicly share your data, but you could share the data with me or @dmbates, you can contact us via direct message and provide a private link to the data.