Why does GLM.jl fail to fit my logistic regression model while scikit-learn and R have no issues? I think I found the reason


I have been trying to use GLM.jl to fit some logistic regression models and the fit failed. I don’t remember the exact error message, but it was something about singularity.

That annoyed me for a couple of years, but I never figured out why, because using RCall.jl was straightforward enough and I just used it to fit the logistic regression.

Recently, I decided to look into the scikit-learn implementation, and it occurred to me why that might be the case!

The GLM.jl implementation, using Julia’s powerful numerical abilities, implements the EM algorithm, whereas scikit-learn just invokes a numerical solver that doesn’t use EM.

The EM must have some technical condition under which the fit will fail, while scikit-learn’s and R’s implementations just treat it like any other numerical problem and try to find the best coefficients!

Admittedly, R’s fit for the problem I tried to solve contained some NA coefficients which I needed to discard, but I actually prefer that to GLM.jl’s approach.

In Julia, is there an implementation of GLM that uses a solver like they do in R? I can’t seem to find one, but I guess it’s easy to structure the problem as an optimisation problem and invoke a solver.
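For instance, here is a minimal sketch of the idea, assuming Optim.jl and some made-up toy data (just to show the shape of the problem, not how GLM.jl or scikit-learn actually work internally):

```julia
using Optim, Random

# Toy data: model matrix with an intercept and two predictors.
Random.seed!(1)
n = 200
X = [ones(n) randn(n, 2)]
βtrue = [0.5, -1.0, 2.0]
y = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-X * βtrue)))  # Bernoulli 0/1 responses

# Negative log-likelihood of the logistic model:
# -ℓ(β) = Σᵢ [log(1 + exp(ηᵢ)) - yᵢ ηᵢ]  with η = Xβ
function negloglik(β)
    η = X * β
    return sum(log1p.(exp.(η)) .- y .* η)
end

# Hand the problem to a generic optimiser with forward-mode autodiff.
res = optimize(negloglik, zeros(size(X, 2)), BFGS(); autodiff = :forward)
β̂ = Optim.minimizer(res)
```

If X were rank deficient, I’d expect the optimiser to still report convergence at one of the many minimisers, which is the kind of behaviour I’m after.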

1 Like

Not an expert and I don’t have an answer, but I wonder whether it has anything to do with precision/scale. I had a similar problem before between R and Matlab: it turned out that the default scale of Matlab was higher than in R, resulting in a singularity error in R but a result in Matlab.

Edit: I agree that Julia should aim to obtain the same results as R when it comes to statistical analysis. Currently this doesn’t seem to be the case for other models either.

1 Like

I highly doubt it. But I can try to experiment.

1 Like

If you can give a reproducible example and the exact error message, people can give you more helpful advice. Without that, it’s hard for this thread to go anywhere productive. Please read: make it easier to help you

Both the R and the GLM.jl implementation of Generalized Linear Models use Iteratively Reweighted Least Squares (IRLS), not EM. The least squares part will fail if the coefficients are undefined due to a singular model matrix (i.e. the X matrix). A general optimizer is less likely to detect the singularity and, depending on the convergence criteria, may well declare convergence near the subspace of possible solutions.
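To make that concrete, here is a minimal sketch of the IRLS iteration for logistic regression (illustrative only, not GLM.jl’s actual code):

```julia
using LinearAlgebra

function irls_logistic(X, y; maxiter = 25, tol = 1e-8)
    β = zeros(size(X, 2))
    for _ in 1:maxiter
        η = X * β
        μ = @. 1 / (1 + exp(-η))   # mean function (inverse logit)
        w = @. μ * (1 - μ)         # working weights
        z = @. η + (y - μ) / w     # working response
        # Weighted least squares step via a Cholesky of X'WX; cholesky
        # throws a PosDefException when the model matrix is (near-)singular,
        # so the fit fails instead of silently "converging".
        βnew = cholesky(Symmetric(X' * (w .* X))) \ (X' * (w .* z))
        norm(βnew - β) ≤ tol * (1 + norm(β)) && return βnew
        β = βnew
    end
    return β
end
```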

Most analysts, including me, prefer to know when the model is computationally singular. GLM.jl uses a Cholesky factorization of X'WX to solve for the new coefficient vector at each iteration. I have an unregistered package, GLMMng.jl (GitHub: dmbates/GLMMng.jl), an experimental version of GLM and GLMM fitting in Julia that uses a QR decomposition of √W * X, which will be slightly slower but better able to handle near singularity. If you want to try that, I will add some documentation (right now it is a test-bed with perfunctory documentation). Or, if you could provide an MWE, we can check the condition number of the weighted model matrix.
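In the same illustrative style (again, not GLMMng.jl’s actual code), the QR-based step and the condition-number check would look something like:

```julia
using LinearAlgebra

# One IRLS step solved via a pivoted QR of √W*X instead of a Cholesky of
# X'WX; column pivoting copes better with a nearly rank-deficient X.
function qr_step(X, w, z)
    sw = sqrt.(w)
    return qr(sw .* X, ColumnNorm()) \ (sw .* z)
end

# Condition number of the weighted model matrix: values approaching
# 1 / eps() indicate computational singularity.
weighted_cond(X, w) = cond(sqrt.(w) .* X)
```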

9 Likes

This is the same issue I encountered.