GLM.jl LogisticRegression errors: matrix is not positive definite; Cholesky factorization failed

The following code throws an error:

using DataFrames, GLM

df = DataFrame(x1=[1,2,3,4], x2=[1,2,3,4], y=[1,1,0,0])
mdl = glm(@formula(y ~ x1 + x2), df, Binomial(), LogitLink())
predict(mdl, df[:, [:x1, :x2]])

ERROR: LoadError: PosDefException: matrix is not positive definite; Cholesky factorization failed.

On the other hand, the same logistic regression runs successfully with ScikitLearn:

using DataFrames, ScikitLearn

@sk_import linear_model: LogisticRegression

df = DataFrame(x1=[1,2,3,4], x2=[1,2,3,4], y=[1,1,0,0])
mdl = fit!(LogisticRegression(), Matrix(df[:, [:x1, :x2]]), df.y)
ps = predict_proba(mdl, Matrix(df[:, [:x1, :x2]]))

Multicollinearity cannot be an issue for logistic regression, unlike for linear regression.

Then, any idea why GLM fails in this case?

Yes, collinearity can be a problem for nonlinear models, including logistic regression. The coefficients are simply not identified in your case, because x2 is an exact copy of x1. If ScikitLearn gives an answer, it must be eliminating either x1 or x2, or doing something similar to deal with the problem. If the index function is t(a*x1 + b*x2), you are probably getting an estimate of a + b, which may be misleading if you don't notice the collinearity problem. Or it might restrict a = b, which will also give a biased estimate. In cases like this, I prefer what GLM does: an error message that helps you find the problem.
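
To see the identification problem concretely with the data from the original post (a minimal sketch, nothing ScikitLearn-specific): because x2 is an exact copy of x1, the linear predictor depends only on the sum of the two coefficients, so infinitely many coefficient pairs give exactly the same fit.

x1 = [1, 2, 3, 4]
x2 = copy(x1)                               # exact duplicate of x1
η(a, b) = a .* x1 .+ b .* x2                # linear predictor for coefficients (a, b)
η(2.0, 0.0) == η(1.0, 1.0) == η(0.0, 2.0)   # true: only a + b is identified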

Scikit-learn's logistic regression isn't vanilla GLM. It is a GLM fit with an L2 penalty, with the magic number 1 as the hyperparameter.
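
For reference, a sketch of what that means with the ScikitLearn setup above: C is scikit-learn's inverse regularization strength, the default is C = 1.0, and weakening the penalty brings the identification problem back. The variable names below are my own; the behavior with a very large C is what I would expect, not something verified here.

# Same df as above; LogisticRegression() is equivalent to LogisticRegression(C=1.0).
mdl_default = fit!(LogisticRegression(C=1.0), Matrix(df[:, [:x1, :x2]]), df.y)
mdl_weak    = fit!(LogisticRegression(C=1e8), Matrix(df[:, [:x1, :x2]]), df.y)
# With an effectively unpenalized fit, expect much larger (and arbitrarily split)
# coefficients, possibly with a convergence warning.
mdl_default.coef_, mdl_weak.coef_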

Linear regression entails matrix inversion, and that is the mechanism through which collinearity affects linear regression: a collinear design makes the matrix singular.

However, in logistic regression the coefficients are estimated from a likelihood function rather than from the normal equations used in linear regression.

Since there is no closed-form expression for the solution, the estimation is done iteratively instead. There is no way for collinearity to kick in during that process.

I think you need https://github.com/JuliaStats/GLM.jl/pull/314.

Thanks for the information. But it is still unclear how logistic regression can suffer from collinearity, unless GLM's logistic regression algorithm somehow employs matrix computations like those in linear regression (which seems to be the case, since the error message mentions a Cholesky factorization).

This is mistaken. Estimation of nonlinear models does entail solving first order conditions, numerically, rather than analytically, as can be done for linear models. With perfect collinearity, there is more than one solution to the first order conditions, and the matrix of second derivatives is singular. Most iterative gradient-based optimization methods (e.g., BFGS) will not work. It may be the case that certain procedures can still produce an estimate (e.g., regularization methods, as noted in another response, or using a derivative free optimization method) but the estimate will not be consistent, in general.
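
To make the singular-Hessian point concrete, here is a minimal sketch with the data from the original post (evaluating at β = 0 is just a convenient choice): for logistic regression the matrix of second derivatives of the log-likelihood is -X'WX with W = diag(μ(1-μ)), and a duplicated column in X makes it singular.

using LinearAlgebra

X = [ones(4) [1, 2, 3, 4] [1, 2, 3, 4]]   # intercept plus a duplicated predictor
μ = fill(0.5, 4)                           # fitted probabilities at β = 0
H = X' * Diagonal(μ .* (1 .- μ)) * X       # negative Hessian of the log-likelihood
rank(H), size(H, 1)                        # (2, 3): singular, so Newton-type steps are undefined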

The algorithm for fitting generalized linear models in the GLM package is iteratively reweighted least squares (IRLS) which performs a weighted linear least squares fit at each iteration. Thus a rank-deficient model matrix means that the coefficients are unidentifiable in a GLM as well as in an LM.

The identifiability is a property of the linear predictor, Xβ, where X is the model matrix and β is the coefficient vector. It doesn’t really matter whether the linear predictor is also the fitted value, as in a linear model, or if an inverse link function is applied to it, as in logistic regression.
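
For anyone who wants to see where the PosDefException comes from, here is a small illustrative IRLS sketch for the logit case; irls_logit and its fixed iteration cap are my own toy code, not GLM.jl's implementation. Each iteration solves a weighted least-squares problem via a Cholesky factorization of X'WX, and that factorization is exactly what a rank-deficient model matrix breaks.

using LinearAlgebra

# Illustrative IRLS for logistic regression; not the GLM.jl implementation.
function irls_logit(X, y; iters = 25)
    β = zeros(size(X, 2))
    for _ in 1:iters
        η = X * β                          # linear predictor
        μ = @. 1 / (1 + exp(-η))           # inverse logit link
        w = @. μ * (1 - μ)                 # working weights
        z = η .+ (y .- μ) ./ w             # working response
        # Weighted least-squares step: breaks down when X'WX is singular,
        # e.g. with a duplicated column in X.
        β = cholesky(Symmetric(X' * Diagonal(w) * X)) \ (X' * Diagonal(w) * z)
    end
    return β
end

X = [ones(4) [1, 2, 3, 4] [1, 2, 3, 4]]    # rank-deficient model matrix
y = [1, 1, 0, 0]
irls_logit(X, y)   # typically throws PosDefException at the Cholesky step (or returns meaningless values)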

Possibly useful: https://github.com/timholy/PositiveFactorizations.jl
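
If I read that package's README correctly, its entry point is a Positive-tagged cholesky method, so a rough sketch of how it sidesteps the exception might look like this (the matrix below is just a stand-in for a singular X'WX):

using LinearAlgebra, PositiveFactorizations

A = [1.0 2.5 2.5; 2.5 7.5 7.5; 2.5 7.5 7.5]   # singular, positive semidefinite
# Plain cholesky(A) may throw PosDefException here; the Positive variant instead
# returns a factorization of a nearby positive-definite matrix.
F = cholesky(Positive, A)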

This is a great answer. For such situations (with the consistency caveat applied) I've used the Moore-Penrose inverse with success…
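
In case a concrete version helps, here is a minimal sketch of that idea applied to the first IRLS step for the data above (the specific step and numbers are my own illustration, not necessarily what was done): pinv returns the minimum-norm solution instead of erroring, which splits the coefficient equally between the duplicated columns.

using LinearAlgebra

X = [ones(4) [1, 2, 3, 4] [1, 2, 3, 4]]    # same rank-deficient model matrix as above
w = fill(0.25, 4)                           # working weights μ(1 - μ) at β = 0
z = [2.0, 2.0, -2.0, -2.0]                  # working response (y - μ) / w at β = 0
β = pinv(X' * Diagonal(w) * X) * (X' * Diagonal(w) * z)
# β[2] ≈ β[3]: the minimum-norm solution splits the effect of the duplicated
# predictor evenly between x1 and x2.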

Thanks a lot. It would be useful if GLM could produce p-values and confidence intervals along with the model coefficients when there is collinearity (as R does), rather than aborting, so that one can identify and drop the collinear predictors.