This linear regression fails in Julia (GLM) vs Python (sklearn)

SantiagoOrtiz · July 4, 2023, 6:28am

I have not figured out why the Julia and Python implementations yield different results when performing this “simple” linear regression with the following data. While sklearn is consistent with Excel, GLM fails to perform the regression and returns a zero value for the slope.

Julia

using DataFrames, GLM

x = [6.20E-13, 5.83E-11, 1.06E-10, 1.70E-10, 2.64E-10, 3.79E-10, 4.51E-10,
     6.25E-10, 6.92E-10, 5.45E-10, 8.22E-10, 1.11E-09, 1.41E-09]
y = [0, 0.009836193, 0.017599565, 0.025362938, 0.040889682, 0.056416427, 0.071943172,
     0.087469917, 0.102996662, 0.098084522, 0.137400303, 0.175161579, 0.214743511]
data = DataFrame(x=x, y=y)
model = lm(@formula(y ~ x), data)
println(model)

Python

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([6.20E-13, 5.83E-11, 1.06E-10, 1.70E-10, 2.64E-10, 3.79E-10, 4.51E-10,
              6.25E-10, 6.92E-10, 5.45E-10, 8.22E-10, 1.11E-09, 1.41E-09]).reshape(-1, 1)
y = np.array([0, 0.009836193, 0.017599565, 0.025362938, 0.040889682, 0.056416427, 0.071943172,
              0.087469917, 0.102996662, 0.098084522, 0.137400303, 0.175161579, 0.214743511])

model = LinearRegression()
model.fit(x, y)
print(model.coef_)
print(model.intercept_)

Any insights are appreciated!

Note that GLM yields a non-zero slope when the axes are reversed and the results are consistent when compared with Python (or Excel).

xgdgsc · July 4, 2023, 7:01am

using GitHub - JuliaAI/MLJLinearModels.jl: Generalized Linear Regressions Models (penalized regressions, robust regressions, ...) gives same result as sklearn.

nilshg · July 4, 2023, 8:46am

Probably something like this here: PosDefException with simple linear regression when x/y ranges are small · Issue #375 · JuliaStats/GLM.jl · GitHub i.e. a precision issue in Cholesky. Consider:

julia> lm(@formula(y ~ x), (y = y, x = x .* 1e10))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)    Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  0.000773609  0.00275744   0.28    0.7843  -0.00529547  0.00684268
x            0.0154962    0.00042306  36.63    <1e-12   0.014565    0.0164273
──────────────────────────────────────────────────────────────────────────────

which is basically the same as:

julia> [ones(length(x)) x]\y
2-element Vector{Float64}:
 0.0007736085408234451
 1.549615493582457e8

once you account for the scaling of x.

EDITED to add: you can go all the way up to 1e2 here:

julia> lm(@formula(y ~ x), (y = y, x = x .* 1e2))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                   Coef.      Std. Error      t  Pr(>|t|)    Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)  0.000773609      0.00275744   0.28    0.7843  -0.00529547  0.00684268
x            1.54962e6    42306.0         36.63    <1e-12   1.4565e6    1.64273e6
──────────────────────────────────────────────────────────────────────────────────

sijo · July 4, 2023, 8:47am

This looks like the GLM.jl issue Incorrect linear regression results · Issue #426 · JuliaStats/GLM.jl · GitHub. You can get the sklearn result by calling lm(@formula(y ~ x), data, dropcollinear=false).

nilshg · July 4, 2023, 9:05am

I should have also said that QR decomposition is already available on GLM master:

julia> lm(@formula(y ~ x), (;y, x); method = :qr)
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)    Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  0.000773609  0.00275744   0.28    0.7843  -0.00529547  0.00684268
x            1.54962e8    4.2306e6    36.63    <1e-12   1.4565e8    1.64273e8
──────────────────────────────────────────────────────────────────────────────

SantiagoOrtiz · July 4, 2023, 6:22pm

I am getting this error message for this solution:

MethodError: no method matching fit(::Type{LinearModel}, ::Matrix{Float64}, ::Vector{Float64}, ::Nothing; method=:OLS)
Closest candidates are:
  fit(::Type{LinearModel}, ::AbstractMatrix{<:Real}, ::AbstractVector{<:Real}, ::Union{Nothing, Bool}; wts, dropcollinear) at C:\Users\santi\.julia\packages\GLM\6ycOV\src\lm.jl:134 got unsupported keyword argument "method"

I am using GLM v1.8.3 in Julia 1.9.1 – I updated GLM but I guess I am not using the latest version yet.

nilshg · July 4, 2023, 6:59pm

As I said this is on master only currently, it was planned for 2.0 but there is a backport PR for 1.x currently which isn’t merged yet. If you want to use it now you’ll have to add GLM#master

SantiagoOrtiz · July 4, 2023, 7:05pm

Got It, Thanks!

Topic		Replies	Views
Why does GLM.jl fail to fit my logistic regression model but scikit learn and R has no issues? I think I found the reason Modelling & Simulations glm , r	5	309	October 1, 2024
Can someone replicate this GLM problem with linear regression on your computer? Statistics linearalgebra , glm	13	839	April 27, 2021
GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels? Performance glm	25	5092	November 26, 2018
LinearRegressionKit results correctness Statistics	0	462	June 30, 2022
Discrepancy between lme4 and GLM.jl Machine Learning statistics , linear-regression	7	1082	November 1, 2022

This linear regression fails in Julia (GLM) vs Python (sklearn)

Related topics