There I was, happily running thousands of linear models with GLM.jl, but the results didn't seem quite right. For one thing, all of my p-values were exceptionally low (>90% of features significant, even after FDR correction). For another, all of the coefficients were negative (which is unexpected). I ran my data through a standard tool in my field that's built on top of lme4 in R, and got very different results - frankly, more like what I'd expect.
I did some digging, and indeed, the exact same data and the exact same model give different results in the two packages.
First, a verification: here I'm running a very simple model, y ~ x, and both GLM.jl and lme4 give the exact same result.
Code in Julia (using RCall for the R bits):
using CSV
using DataFrames
using DataFrames.PrettyTables
using GLM
using RCall

# load the data
dat = CSV.read("/home/kevin/Downloads/dat.csv", DataFrame)

# simple model in Julia
modjl = lm(@formula(y ~ x), dat)
modjldf = DataFrame(coeftable(modjl))
pretty_table(select(modjldf, 1:5))

# same model in R (base lm; the tool I compared against builds on lme4)
R"library('lme4')"
@rput dat
R"modR <- lm(y ~ x, dat)"
R"summary(modR)"
GLM.jl results:
┌─────────────┬──────────┬────────────┬──────────┬─────────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼──────────┼────────────┼──────────┼─────────────┤
│ (Intercept) │ 104.43 │ 1.64392 │ 63.5247 │ 4.0734e-169 │
│ x │ 0.248086 │ 0.254027 │ 0.976613 │ 0.329598 │
└─────────────┴──────────┴────────────┴──────────┴─────────────┘
lme4 results:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.4295 1.6439 63.525 <2e-16 ***
x 0.2481 0.2540 0.977 0.33
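To be sure the agreement isn't just rounding in the printed tables, here's a quick numeric check (a sketch; rcopy/unname just pull the R coefficients back into Julia, and nothing beyond modjl and modR from above is assumed):

# compare the fitted coefficients numerically, not just visually
coefR = rcopy(Vector{Float64}, R"unname(coef(modR))")
@assert all(isapprox.(coef(modjl), coefR; rtol = 1e-6))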
But when I throw in a covariate, the two diverge:
Code
# add the covariate c1 in Julia
modjl2 = lm(@formula(y ~ x + c1), dat)
modjl2df = DataFrame(coeftable(modjl2))
pretty_table(select(modjl2df, 1:5))

# and the same in R
R"modR2 <- lm(y ~ x + c1, dat)"
R"summary(modR2)"
GLM.jl:
┌─────────────┬─────────────┬────────────┬───────────┬──────────────┐
│ Name │ Coef. │ Std. Error │ t │ Pr(>|t|) │
│ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────────────┼─────────────┼────────────┼───────────┼──────────────┤
│ (Intercept) │ 0.0 │ NaN │ NaN │ NaN │
│ x │ -9.48368 │ 0.272352 │ -34.8213 │ 3.99587e-104 │
│ c1 │ -4.08581e-8 │ 2.57778e-7 │ -0.158501 │ 0.874176 │
└─────────────┴─────────────┴────────────┴───────────┴──────────────┘
lme4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.147e+02 7.949e+00 14.425 < 2e-16 ***
x 4.362e-01 7.181e-01 0.607 0.544088
c1 -6.721e-07 2.006e-07 -3.351 0.000915 ***
Notice in the GLM.jl model the inflated (and sign-reversed) coefficient on x, and the wildly low p-value. Also note the intercept going to 0 while all of its other stats are NaN… FWIW, this only happens with certain variables. The data table also contains c2, a categorical variable (encoded with integers and a few missing values), and if I only do y ~ x + c2, there's no discrepancy.
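For what it's worth, the zeroed intercept with NaN standard error looks like GLM.jl treating the model matrix as rank-deficient (lm drops collinear columns by default, controlled by its dropcollinear keyword), so maybe it's a numerical conditioning issue with c1's scale? Here's a sketch of what I'd check next (c1s is just a name I made up for a rescaled copy of c1; this is a guess, not a diagnosis):

using LinearAlgebra
using Statistics

# how ill-conditioned is the design matrix for y ~ x + c1?
X = modelmatrix(@formula(y ~ x + c1), dat)
@show cond(X)

# refit without the automatic collinearity handling
modjl3 = lm(@formula(y ~ x + c1), dat; dropcollinear = false)
pretty_table(select(DataFrame(coeftable(modjl3)), 1:5))

# rescale c1 to see whether conditioning is the culprit
m, s = mean(skipmissing(dat.c1)), std(skipmissing(dat.c1))
dat.c1s = (dat.c1 .- m) ./ s
modjl4 = lm(@formula(y ~ x + c1s), dat)
pretty_table(select(DataFrame(coeftable(modjl4)), 1:5))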
Data table is here if anyone wants to play. I’m stumped…