There I was, happily running thousands of linear models with GLM.jl, but the results didn't seem quite right. For one thing, all of my p-values were exceptionally low (>90% of features significant, even after FDR correction). For another, all of the coefficients were negative (which is unexpected). I ran my data through a standard tool in my field that's built on top of lme4 in R, and got very different results - frankly, more like what I'd expect.
I did some digging, and indeed, the exact same data and the exact same model give different results in the two packages.
First, a verification: here I'm running a very simple model, y ~ x, and both GLM.jl and lme4 give the exact same result.
Code in Julia (using RCall for the R bits):
using CSV
using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall

# load the data
dat = CSV.read("/home/kevin/Downloads/dat.csv", DataFrame)

# simple model in Julia
modjl = lm(@formula(y ~ x), dat)
modjldf = DataFrame(coeftable(modjl))
pretty_table(select(modjldf, 1:5))

# same model in R (base lm; the tool I compared against builds on lme4)
R"library('lme4')"
@rput dat
R"modR <- lm(y ~ x, dat)"
R"summary(modR)"
GLM.jl results:
┌─────────────┬──────────┬────────────┬──────────┬─────────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼──────────┼────────────┼──────────┼─────────────┤
│ (Intercept) │ 104.43 │ 1.64392 │ 63.5247 │ 4.0734e-169 │
│ x │ 0.248086 │ 0.254027 │ 0.976613 │ 0.329598 │
└─────────────┴──────────┴────────────┴──────────┴─────────────┘
lme4 results:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.4295 1.6439 63.525 <2e-16 ***
x 0.2481 0.2540 0.977 0.33
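To be sure the agreement isn't just rounding in the printed tables, here's a quick numeric check (a sketch; rcopy/unname just pull the R coefficients back into Julia, and nothing beyond modjl and modR from above is assumed):

# compare the fitted coefficients numerically, not just visually
coefR = rcopy(Vector{Float64}, R"unname(coef(modR))")
@assert all(isapprox.(coef(modjl), coefR; rtol = 1e-6))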
But when I throw in a covariate, the two diverge:
Code
# add the covariate c1 in Julia
modjl2 = lm(@formula(y ~ x + c1), dat)
modjl2df = DataFrame(coeftable(modjl2))
pretty_table(select(modjl2df, 1:5))

# and the same in R
R"modR2 <- lm(y ~ x + c1, dat)"
R"summary(modR2)"
GLM.jl:
┌─────────────┬─────────────┬────────────┬───────────┬──────────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼─────────────┼────────────┼───────────┼──────────────┤
│ (Intercept) │ 0.0 │ NaN │ NaN │ NaN │
│ x │ -9.48368 │ 0.272352 │ -34.8213 │ 3.99587e-104 │
│ c1 │ -4.08581e-8 │ 2.57778e-7 │ -0.158501 │ 0.874176 │
└─────────────┴─────────────┴────────────┴───────────┴──────────────┘
lme4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.147e+02 7.949e+00 14.425 < 2e-16 ***
x 4.362e-01 7.181e-01 0.607 0.544088
c1 -6.721e-07 2.006e-07 -3.351 0.000915 ***
Notice in the GLM.jl model the inflated (and sign-reversed) coefficient on x, and the wildly low p-value. Also note the intercept going to 0 while all of its other stats are NaN… FWIW, this only happens with certain variables. The data table also contains c2, a categorical variable (encoded with integers and a few missing values), and if I only do y ~ x + c2, there's no discrepancy.
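For what it's worth, the zeroed intercept with NaN standard error looks like GLM.jl treating the model matrix as rank-deficient (lm drops collinear columns by default, controlled by its dropcollinear keyword), so maybe it's a numerical conditioning issue with c1's scale? Here's a sketch of what I'd check next (c1s is just a name I made up for a rescaled copy of c1; this is a guess, not a diagnosis):

using LinearAlgebra
using Statistics

# how ill-conditioned is the design matrix for y ~ x + c1?
X = modelmatrix(@formula(y ~ x + c1), dat)
@show cond(X)

# refit without the automatic collinearity handling
modjl3 = lm(@formula(y ~ x + c1), dat; dropcollinear = false)
pretty_table(select(DataFrame(coeftable(modjl3)), 1:5))

# rescale c1 to see whether conditioning is the culprit
m, s = mean(skipmissing(dat.c1)), std(skipmissing(dat.c1))
dat.c1s = (dat.c1 .- m) ./ s
modjl4 = lm(@formula(y ~ x + c1s), dat)
pretty_table(select(DataFrame(coeftable(modjl4)), 1:5))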
Data table is here if anyone wants to play. I’m stumped…