Failing to fit a multiple linear regression using GLM

Hello, everybody.

I have run into a problem doing multiple regression using GLM.
Here's my dataset (it originally comes from Optimization of critical medium components using response surface methodology for ethanol production from cellulosic biomass by Clostridium thermocellum SS19 - ScienceDirect):

x1 x2 x3 x4 x5 b y
-1 -1 -1 -1 -1 1 9.137
1 -1 -1 -1 -1 1 11.996
-1 1 -1 -1 -1 1 8.887
1 1 -1 -1 -1 1 12.405
-1 -1 1 -1 -1 1 11.274
1 -1 1 -1 -1 1 10.511
-1 1 1 -1 -1 1 11.113
1 1 1 -1 -1 1 11.764
-1 -1 -1 1 -1 1 9.482
1 -1 -1 1 -1 1 10.683
-1 1 -1 1 -1 1 8.277
1 1 -1 1 -1 1 11.243
-1 -1 1 1 -1 1 8.188
1 -1 1 1 -1 1 10.219
-1 1 1 1 -1 1 8.88
1 1 1 1 -1 1 9.309
-1 -1 -1 -1 1 1 10.428
1 -1 -1 -1 1 1 11.118
-1 1 -1 -1 1 1 11.242
1 1 -1 -1 1 1 11.636
-1 -1 1 -1 1 1 8.448
1 -1 1 -1 1 1 11.078
-1 1 1 -1 1 1 11.285
1 1 1 -1 1 1 11.976
-1 -1 -1 1 1 1 8.233
1 -1 -1 1 1 1 8.537
-1 1 -1 1 1 1 9.014
1 1 -1 1 1 1 7.791
-1 -1 1 1 1 1 7.022
1 -1 1 1 1 1 9.74
-1 1 1 1 1 1 9.051
1 1 1 1 1 1 8.709
0 0 0 0 0 1 11.249
0 0 0 0 0 1 9.827
0 0 0 0 0 1 10.43
0 0 0 0 0 1 11.726
0 0 0 0 0 1 10.689
0 0 0 0 0 1 11.033
0 0 0 0 0 1 9.57
0 0 0 0 0 1 11.596
-2 0 0 0 0 2 6.288
2 0 0 0 0 2 11.463
0 -2 0 0 0 2 6.809
0 2 0 0 0 2 11.982
0 0 -2 0 0 2 8.485
0 0 2 0 0 2 8.105
0 0 0 -2 0 2 11.25
0 0 0 2 0 2 6.294
0 0 0 0 -2 2 9.742
0 0 0 0 2 2 10.389
0 0 0 0 0 2 11.115
0 0 0 0 0 2 9.132
0 0 0 0 0 2 11.492
0 0 0 0 0 2 10.861

I tried to fit the data to a model with first-order, second-order, and interaction terms (b is a block term), but the fit failed.
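
Concretely, the model is the standard full second-order response surface model with a block effect (the same terms that appear in the MATLAB fit below):

$$ y = \beta_0 + \beta_b\, b + \sum_{i=1}^{5} \beta_i x_i + \sum_{i=1}^{5} \beta_{ii} x_i^2 + \sum_{1 \le i < j \le 5} \beta_{ij} x_i x_j + \varepsilon $$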

julia> using GLM

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5

julia> ols = lm(fm, data)
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
 [1] checkpositivedefinite at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\factorization.jl:11 [inlined]
 [2] #cholesky!#92(::Bool, ::Function, ::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:153
 [3] cholesky! at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:152 [inlined] (repeats 2 times)
 [4] DensePredChol(::Array{Float64,2}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:107
 [5] cholpred at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:117 [inlined]
 [6] fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:136
 [7] #fit#36(::Dict{Any,Any}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type{LinearModel}, ::Formula, ::DataFrames.DataFrame, ::Bool) at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:72
 [8] fit at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:66 [inlined]
 [9] lm at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:146 [inlined] (repeats 2 times)
 [10] top-level scope at In[33]:1

Using the same data, MATLAB fits the model without complaint.

>> ols = fitlm(data, 'y ~ b + (x1+x2+x3+x4+x5)^2')

ols = 


Linear regression model:
    y ~ 1 + b + x1*x2 + x1*x3 + x2*x3 + x1*x4 + x2*x4 + x3*x4 + x1*x5 + x2*x5 + x3*x5 + x4*x5 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2

Estimated Coefficients:
                   Estimate       SE        tStat        pValue  
                   _________    _______    ________    __________

    (Intercept)       11.804    0.59186      19.944    1.2949e-19
    x1                0.7276    0.17085      4.2586    0.00016848
    x2               0.42085    0.17085      2.4632      0.019334
    x3              -0.05755    0.17085    -0.33684       0.73844
    x4               -0.9958    0.17085     -5.8284     1.785e-06
    x5              -0.16915    0.17085    -0.99003       0.32959
    b               -0.83382    0.34642      -2.407       0.02203
    x1:x2           -0.14331    0.19102    -0.75025       0.45859
    x1:x3           -0.08325    0.19102    -0.43582        0.6659
    x2:x3            0.14769    0.19102     0.77315       0.44511
    x1:x4          -0.080813    0.19102    -0.42306       0.67508
    x2:x4           -0.19213    0.19102     -1.0058       0.32207
    x3:x4          -0.085688    0.19102    -0.44858       0.65676
    x1:x5           -0.21969    0.19102     -1.1501       0.25864
    x2:x5             0.1785    0.19102     0.93445       0.35707
    x3:x5          0.0050625    0.19102    0.026502       0.97902
    x4:x5           -0.25963    0.19102     -1.3591        0.1836
    x1^2            -0.26392    0.18673     -1.4134       0.16719
    x2^2            -0.13392    0.18673    -0.71719       0.47846
    x3^2            -0.40904    0.18673     -2.1906      0.035881
    x4^2            -0.28979    0.18673      -1.552       0.13051
    x5^2            0.033581    0.18673     0.17984       0.85841


Number of observations: 54, Error degrees of freedom: 32
Root Mean Squared Error: 1.08
R-squared: 0.715,  Adjusted R-Squared 0.528
F-statistic vs. constant model: 3.82, p-value = 0.000332

Did I do something wrong?

Weirdly enough, my formula expands differently from yours.

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x1 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x1 + x3 & x2 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x1 + x4 & x2 + x4 & x3 + x4 & x4 + x4 & x5 + x5 & x1 + x5 & x2 + x5 & x3 + x5 & x4 + x5 & x5

You will notice I have an x1 & x2 term as well as an x2 & x1 term. Your formula was smart enough to omit those duplicates. I am on GLM 1.0.2. It's no surprise my formula fails. I wonder if something is going on there?
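
Here is a toy sketch (random data, not the dataset above) of why those duplicates would sink the fit: two identical columns make the model matrix rank-deficient, which is exactly what trips the Cholesky step.

using LinearAlgebra

x1, x2 = randn(10), randn(10)
X = [ones(10) x1 x2 (x1 .* x2) (x2 .* x1)]  # the last two columns are identical
rank(X)  # 4, not 5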

Yours is correct. To see whether the error is related to the formula, I tested another one but got the same result.

fm2 = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5)

I copied and pasted by mistake.

Is it all good or do you still require assistance?

Definitely yes… but I suspect it's a bug in GLM.

I tested R and Python with the same dataset, and both fit the same linear model.

The error seems to be correct:

using CSV, DataFrames, GLM, LinearAlgebra

# `file` is the path to the tab-delimited dataset shown above
data = CSV.File(file, delim = '\t') |>
    DataFrame
fm = @formula(y ~ b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5))
model = fit(LinearModel, fm, data, true) # the positional `true` allows a rank-deficient model matrix

mf = ModelFrame(fm, data)
mm = ModelMatrix(mf)
X = mm.m                                 # the numeric model matrix
F = qr(X, Val(true))                     # column-pivoted QR
count(x -> abs(x) ≤ √eps(), diag(F.R))   # number of linearly dependent columns
count(isnan, stderror(model))            # number GLM itself identifies (NaN standard errors)
svdvals(X)                               # confirm with the SVD: look for near-zero singular values

The matrix is rank-deficient by multiple criteria. I can confirm that Stata and R both fit it without dropping the linearly dependent columns; in this case GLM.jl actually does better. One telltale sign of a mishandled rank-deficient model is that many of the standard errors come out identical.
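
If you want to see which terms are the culprits, here is a rough sketch building on the objects above (the tolerance is an ad-hoc choice of mine, not what GLM uses internally):

# Map the near-zero diagonal entries of the pivoted R factor back to term names.
names = coefnames(mf)
tol = √eps() * abs(F.R[1, 1])
dep = [F.p[i] for i in axes(F.R, 1) if abs(F.R[i, i]) ≤ tol]
println("Linearly dependent columns: ", names[dep])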


Ah, understood! (Indeed, that's a lot of coefficients for 54 samples…)
Currently there is no way to put squared terms directly into a formula, but that is not a great problem; see the workaround sketch below.
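
For the record, the workaround is to precompute the squared columns and list them as ordinary terms; the x1sq etc. column names are just my own convention (DataFrames 0.14-era indexing):

# Add explicit squared columns, then spell out the second-order model
# with plain main effects for the squares plus pairwise interactions.
for v in (:x1, :x2, :x3, :x4, :x5)
    data[Symbol(v, "sq")] = data[v] .^ 2   # e.g. data[:x1sq]
end
fm3 = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 +
                   x1sq + x2sq + x3sq + x4sq + x5sq +
                   x1&x2 + x1&x3 + x1&x4 + x1&x5 + x2&x3 +
                   x2&x4 + x2&x5 + x3&x4 + x3&x5 + x4&x5)
ols2 = lm(fm3, data)

This spells out the same model MATLAB fit above.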

Thank you for the worked example.

Sure thing. The next release of StatsModels will likely make it possible to specify polynomial terms in the formula.
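
For anyone landing here later, the plan is to let plain Julia functions be applied inside @formula, so something like this tentative sketch should express the squares directly (the syntax is an assumption until that release ships):

# Tentative: abs2 applied elementwise inside the formula would give the x^2 terms.
fm4 = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5) +
                   abs2(x1) + abs2(x2) + abs2(x3) + abs2(x4) + abs2(x5))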
