Discrepancy between lme4 and GLM.jl

There I was, happily running thousands of linear models with GLM.jl, but the results didn’t seem quite right. For one thing, all of my p-values were exceptionally low (>90% of features significant, even after FDR correction). For another, all of the coefficients were negative, which is unexpected. I ran my data through a standard tool in my field, which is built on top of lme4 in R, and got very different results - frankly, more like what I’d expect.

I did some digging, and indeed, the exact same data and the exact same model give different results in the two packages.

First, a verification: here I’m fitting a very simple model, y ~ x, and both GLM.jl and lme4 give exactly the same result:

Code in Julia (using RCall for the R bits)
using CSV
using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall

dat = CSV.read("/home/kevin/Downloads/dat.csv", DataFrame)

modjl = lm(@formula(y ~ x), dat)
modjldf = DataFrame(coeftable(modjl))
pretty_table(select(modjldf, 1:5))

R"library('lme4')"
@rput dat
R"modR <- lm(y ~ x, dat)"
R"summary(modR)"

GLM results:

┌─────────────┬──────────┬────────────┬──────────┬─────────────┐
│        Name │    Coef. │ Std. Error │        t │    Pr(>|t|) │
│      String │  Float64 │    Float64 │  Float64 │     Float64 │
├─────────────┼──────────┼────────────┼──────────┼─────────────┤
│ (Intercept) │   104.43 │    1.64392 │  63.5247 │ 4.0734e-169 │
│           x │ 0.248086 │   0.254027 │ 0.976613 │    0.329598 │
└─────────────┴──────────┴────────────┴──────────┴─────────────┘

lme4 results:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.4295     1.6439  63.525   <2e-16 ***
x             0.2481     0.2540   0.977     0.33

But when I throw in a covariate:

Code
modjl2 = lm(@formula(y ~ x + c1), dat)
modjl2df = DataFrame(coeftable(modjl2))
pretty_table(select(modjl2df, 1:5))

R"modR2 <- lm(y ~ x + c1, dat)"
R"summary(modR2)"

GLM.jl:

┌─────────────┬─────────────┬────────────┬───────────┬──────────────┐
│        Name │       Coef. │ Std. Error │         t │     Pr(>|t|) │
│      String │     Float64 │    Float64 │   Float64 │      Float64 │
├─────────────┼─────────────┼────────────┼───────────┼──────────────┤
│ (Intercept) │         0.0 │        NaN │       NaN │          NaN │
│           x │    -9.48368 │   0.272352 │  -34.8213 │ 3.99587e-104 │
│          c1 │ -4.08581e-8 │ 2.57778e-7 │ -0.158501 │     0.874176 │
└─────────────┴─────────────┴────────────┴───────────┴──────────────┘

lme4:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.147e+02  7.949e+00  14.425  < 2e-16 ***
x            4.362e-01  7.181e-01   0.607 0.544088
c1          -6.721e-07  2.006e-07  -3.351 0.000915 ***

Notice in the GLM model the inflated (and sign-reversed) coefficient for x, and the wildly low p-value. Also note that the intercept goes to 0 and all of its other stats are NaN… FWIW, this only happens with certain variables. The data table also contains c2, which is a categorical variable (encoded with integers and a few missing values), and if I only do y ~ x + c2, there’s no discrepancy.

The data table is here if anyone wants to play. I’m stumped…

Oh FFS, it’s the size of the numbers in c1… (on the order of 1e7). If I just divide them by 1e6, the discrepancy goes away. This is a bug, right?

Here’s an MWE:

Code
using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall
using Distributions

# c1 is on the order of 1e7; c2 is the same values rescaled down by 1e6
df = DataFrame(y = rand(Normal(), 200), x = rand(Normal(), 200), c1 = rand(Normal(1e7, 1e6), 200))
df.c2 = df.c1 ./ 1e6

modjl = DataFrame(coeftable(lm(@formula(y ~ x + c1), df)))
modjl2 = DataFrame(coeftable(lm(@formula(y ~ x + c2), df)))

@rput df
R"modR <- lm(y ~ x + c1, df)"
R"modR2 <- lm(y ~ x + c2, df)"

pretty_table(select(modjl, 1:5))
pretty_table(select(modjl2, 1:5))
R"summary(modR)"
R"summary(modR2)"

GLM:

┌─────────────┬─────────────┬────────────┬───────────┬──────────┐
│        Name │       Coef. │ Std. Error │         t │ Pr(>|t|) │
│      String │     Float64 │    Float64 │   Float64 │  Float64 │
├─────────────┼─────────────┼────────────┼───────────┼──────────┤
│ (Intercept) │         0.0 │        NaN │       NaN │      NaN │
│           x │   0.0885528 │  0.0666665 │   1.32829 │  0.18561 │
│          c1 │ -4.56781e-9 │ 6.54316e-9 │ -0.698105 │ 0.485931 │
└─────────────┴─────────────┴────────────┴───────────┴──────────┘
┌─────────────┬───────────┬────────────┬───────────┬──────────┐
│        Name │     Coef. │ Std. Error │         t │ Pr(>|t|) │
│      String │   Float64 │    Float64 │   Float64 │  Float64 │
├─────────────┼───────────┼────────────┼───────────┼──────────┤
│ (Intercept) │ -0.180284 │   0.663792 │ -0.271597 │ 0.786216 │
│           x │  0.088447 │  0.0668241 │   1.32358 │ 0.187177 │
│          c2 │ 0.0131475 │  0.0655554 │  0.200556 │ 0.841253 │
└─────────────┴───────────┴────────────┴───────────┴──────────┘

lme4:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.803e-01  6.638e-01  -0.272    0.786
x            8.845e-02  6.682e-02   1.324    0.187
c1           1.315e-08  6.556e-08   0.201    0.841

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18028    0.66379  -0.272    0.786
x            0.08845    0.06682   1.324    0.187
c2           0.01315    0.06556   0.201    0.841

I’d say yes for sure. File a bug!

It looks like it’s this issue: https://github.com/JuliaStats/GLM.jl/issues/426

Using dropcollinear=false also fixes it.
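
For reference, a minimal sketch of what that looks like on the MWE above (dropcollinear is a keyword argument of GLM.jl’s lm; this assumes the MWE’s df is still in scope):

Code
# refit the MWE model with the collinearity/pivoting handling turned off
modjl_nodrop = lm(@formula(y ~ x + c1), df; dropcollinear = false)
DataFrame(coeftable(modjl_nodrop))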

The core issue here is the underlying numerics, which is a challenging topic. Numerical stability is unfortunately not scale invariant. I wish more introductory stats courses taught practical aspects of statistical computation, like rescaling, instead of spending time on arcana from pre-computer-era statistical practice. :confused:
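
As a concrete example of the kind of rescaling I mean (just a sketch reusing the MWE’s df; the c1_std column name is made up here): standardize the large covariate before fitting, and back-transform the coefficient if you need it on the original scale.

Code
using Statistics

# standardize c1 so the design-matrix columns are on comparable scales
df.c1_std = (df.c1 .- mean(df.c1)) ./ std(df.c1)
modjl_std = lm(@formula(y ~ x + c1_std), df)

# slope on the original scale of c1 = standardized slope / std(c1)
coef(modjl_std)[3] / std(df.c1)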

Just FYI: you’ve loaded lme4 in the R section, but you’re not actually using it. lm is part of base R (technically it’s in the stats package, but that package is autoloaded at R startup).

lmer and glmer are the key functions from lme4, and the comparable Julia package is MixedModels.jl.
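
For anyone mapping between the two, a rough sketch of the Julia side (g here is a hypothetical grouping column; there is no such column in the data above): the lme4 call lmer(y ~ x + (1 | g), data = dat) corresponds to something like

Code
using MixedModels
# random intercept for the hypothetical grouping column g
fit(MixedModel, @formula(y ~ x + (1 | g)), dat)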

Can you tell I don’t use R much? :flushed: Thanks for the heads-up!

Yes, definitely, with the collinear option.
It is a problem mainly because it makes one think that a numerical issue is the underlying cause here. And while the data could indeed be rescaled more meaningfully, that is not the main issue. In my view, the main issue in your model is the variable x, which does not bring anything to the model (a model with only an intercept and no covariates will do just as well as one with only x; the R² value will give you a hint too).
Other solutions like R, SAS, and Julia (with, for instance, the \ operator; see the sketch below) will give you accurate coefficients, so I don’t think a numerical issue is at play in this specific case.
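
To illustrate the \ route, a minimal sketch on the MWE’s df, building the design matrix by hand:

Code
# design matrix [1  x  c1]; for a rectangular matrix, \ solves the
# least-squares problem via a pivoted QR factorization
X = [ones(nrow(df)) df.x df.c1]
betas = X \ df.y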

@Eric numerical stability is a property of the method and of how it interacts with the data. The other solutions in R, SAS, and Julia use different methods, so saying that they work is a statement about the stability of those methods, not about the Cholesky method that GLM.jl currently uses by default. Collinearity is well defined in theory, but less well defined in practice, where it is sensitive to rounding error and to the numerical tolerances used in, e.g., the pivoting methods that handle it. That’s what’s happening with the intercept in the y ~ x + c1 model: it’s being pivoted out as part of the collinearity handling.

Relatedly, this is why GLM.jl will soon switch to a QR decomposition as its default solve method: we’ve had a number of issues reported by people running into numerical problems with the Cholesky decomposition.
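
If you want to try that today, I believe recent GLM.jl releases already expose a method keyword for choosing the factorization (check that your installed version supports it); something along the lines of:

Code
# ask GLM.jl for a QR-based solve instead of the default Cholesky
modjl_qr = lm(@formula(y ~ x + c1), df; method = :qr)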
