Failing to fit a multiple linear regression using GLM

Hello, everybody.

I have run into a problem doing multiple regression using GLM.
Here's my dataset (it originally comes from Optimization of critical medium components using response surface methodology for ethanol production from cellulosic biomass by Clostridium thermocellum SS19 - ScienceDirect):

x1 x2 x3 x4 x5 b y
-1 -1 -1 -1 -1 1 9.137
1 -1 -1 -1 -1 1 11.996
-1 1 -1 -1 -1 1 8.887
1 1 -1 -1 -1 1 12.405
-1 -1 1 -1 -1 1 11.274
1 -1 1 -1 -1 1 10.511
-1 1 1 -1 -1 1 11.113
1 1 1 -1 -1 1 11.764
-1 -1 -1 1 -1 1 9.482
1 -1 -1 1 -1 1 10.683
-1 1 -1 1 -1 1 8.277
1 1 -1 1 -1 1 11.243
-1 -1 1 1 -1 1 8.188
1 -1 1 1 -1 1 10.219
-1 1 1 1 -1 1 8.88
1 1 1 1 -1 1 9.309
-1 -1 -1 -1 1 1 10.428
1 -1 -1 -1 1 1 11.118
-1 1 -1 -1 1 1 11.242
1 1 -1 -1 1 1 11.636
-1 -1 1 -1 1 1 8.448
1 -1 1 -1 1 1 11.078
-1 1 1 -1 1 1 11.285
1 1 1 -1 1 1 11.976
-1 -1 -1 1 1 1 8.233
1 -1 -1 1 1 1 8.537
-1 1 -1 1 1 1 9.014
1 1 -1 1 1 1 7.791
-1 -1 1 1 1 1 7.022
1 -1 1 1 1 1 9.74
-1 1 1 1 1 1 9.051
1 1 1 1 1 1 8.709
0 0 0 0 0 1 11.249
0 0 0 0 0 1 9.827
0 0 0 0 0 1 10.43
0 0 0 0 0 1 11.726
0 0 0 0 0 1 10.689
0 0 0 0 0 1 11.033
0 0 0 0 0 1 9.57
0 0 0 0 0 1 11.596
-2 0 0 0 0 2 6.288
2 0 0 0 0 2 11.463
0 -2 0 0 0 2 6.809
0 2 0 0 0 2 11.982
0 0 -2 0 0 2 8.485
0 0 2 0 0 2 8.105
0 0 0 -2 0 2 11.25
0 0 0 2 0 2 6.294
0 0 0 0 -2 2 9.742
0 0 0 0 2 2 10.389
0 0 0 0 0 2 11.115
0 0 0 0 0 2 9.132
0 0 0 0 0 2 11.492
0 0 0 0 0 2 10.861

I tried to fit the data to a model with first-order, second-order, and interaction terms (b is a block term), but the fit failed.
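
Concretely, the model is the standard full second-order response surface model with a block effect (the same terms that appear in the MATLAB fit below):

$$ y = \beta_0 + \beta_b\, b + \sum_{i=1}^{5} \beta_i x_i + \sum_{i=1}^{5} \beta_{ii} x_i^2 + \sum_{1 \le i < j \le 5} \beta_{ij} x_i x_j + \varepsilon $$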

julia> using GLM

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5

julia> ols = lm(fm, data)
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
 [1] checkpositivedefinite at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\factorization.jl:11 [inlined]
 [2] #cholesky!#92(::Bool, ::Function, ::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:153
 [3] cholesky! at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:152 [inlined] (repeats 2 times)
 [4] DensePredChol(::Array{Float64,2}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:107
 [5] cholpred at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:117 [inlined]
 [6] fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:136
 [7] #fit#36(::Dict{Any,Any}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type{LinearModel}, ::Formula, ::DataFrames.DataFrame, ::Bool) at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:72
 [8] fit at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:66 [inlined]
 [9] lm at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:146 [inlined] (repeats 2 times)
 [10] top-level scope at In[33]:1

Using the same data, MATLAB fits the model without complaint.

>> ols = fitlm(data, 'y ~ b + (x1+x2+x3+x4+x5)^2')

ols = 


Linear regression model:
    y ~ 1 + b + x1*x2 + x1*x3 + x2*x3 + x1*x4 + x2*x4 + x3*x4 + x1*x5 + x2*x5 + x3*x5 + x4*x5 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2

Estimated Coefficients:
                   Estimate       SE        tStat        pValue  
                   _________    _______    ________    __________

    (Intercept)       11.804    0.59186      19.944    1.2949e-19
    x1                0.7276    0.17085      4.2586    0.00016848
    x2               0.42085    0.17085      2.4632      0.019334
    x3              -0.05755    0.17085    -0.33684       0.73844
    x4               -0.9958    0.17085     -5.8284     1.785e-06
    x5              -0.16915    0.17085    -0.99003       0.32959
    b               -0.83382    0.34642      -2.407       0.02203
    x1:x2           -0.14331    0.19102    -0.75025       0.45859
    x1:x3           -0.08325    0.19102    -0.43582        0.6659
    x2:x3            0.14769    0.19102     0.77315       0.44511
    x1:x4          -0.080813    0.19102    -0.42306       0.67508
    x2:x4           -0.19213    0.19102     -1.0058       0.32207
    x3:x4          -0.085688    0.19102    -0.44858       0.65676
    x1:x5           -0.21969    0.19102     -1.1501       0.25864
    x2:x5             0.1785    0.19102     0.93445       0.35707
    x3:x5          0.0050625    0.19102    0.026502       0.97902
    x4:x5           -0.25963    0.19102     -1.3591        0.1836
    x1^2            -0.26392    0.18673     -1.4134       0.16719
    x2^2            -0.13392    0.18673    -0.71719       0.47846
    x3^2            -0.40904    0.18673     -2.1906      0.035881
    x4^2            -0.28979    0.18673      -1.552       0.13051
    x5^2            0.033581    0.18673     0.17984       0.85841


Number of observations: 54, Error degrees of freedom: 32
Root Mean Squared Error: 1.08
R-squared: 0.715,  Adjusted R-Squared 0.528
F-statistic vs. constant model: 3.82, p-value = 0.000332

Did I do something wrong?

Weirdly enough, my formula expands differently from yours.

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x1 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x1 + x3 & x2 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x1 + x4 & x2 + x4 & x3 + x4 & x4 + x4 & x5 + x5 & x1 + x5 & x2 + x5 & x3 + x5 & x4 + x5 & x5

You will notice I have an x1 & x2 term as well as an x2 & x1 term. Your formula was smart enough to omit those duplicates. I am on GLM 1.0.2. It's no surprise my formula fails. I wonder if something is going on there?
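
Here is a toy sketch (random data, not the dataset above) of why those duplicates would sink the fit: two identical columns make the model matrix rank-deficient, which is exactly what trips the Cholesky step.

using LinearAlgebra

x1, x2 = randn(10), randn(10)
X = [ones(10) x1 x2 (x1 .* x2) (x2 .* x1)]  # the last two columns are identical
rank(X)  # 4, not 5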

Yours is correct. To see whether the error is related to the formula, I tested another one but got the same result.

fm2 = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5)

I copied and pasted by mistake.

Is it all good or do you still require assistance?

Definitely yes… but I suspect it's a bug in GLM.

I tested R and Python with the same dataset, and both fit the same linear model.

The error seems to be correct:

using CSV, DataFrames, GLM, LinearAlgebra

# `file` is the path to the tab-delimited dataset shown above
data = CSV.File(file, delim = '\t') |>
    DataFrame
fm = @formula(y ~ b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5))
model = fit(LinearModel, fm, data, true) # the positional `true` allows a rank-deficient model matrix

mf = ModelFrame(fm, data)
mm = ModelMatrix(mf)
X = mm.m                                 # the numeric model matrix
F = qr(X, Val(true))                     # column-pivoted QR
count(x -> abs(x) ≤ √eps(), diag(F.R))   # number of linearly dependent columns
count(isnan, stderror(model))            # number GLM itself identifies (NaN standard errors)
svdvals(X)                               # confirm with the SVD: look for near-zero singular values

The matrix is rank-deficient by multiple criteria. I can confirm that Stata and R both fit it without dropping the linearly dependent columns; in this case GLM.jl actually does better. One telltale sign of a mishandled rank-deficient model is that many of the standard errors come out identical.
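
If you want to see which terms are the culprits, here is a rough sketch building on the objects above (the tolerance is an ad-hoc choice of mine, not what GLM uses internally):

# Map the near-zero diagonal entries of the pivoted R factor back to term names.
names = coefnames(mf)
tol = √eps() * abs(F.R[1, 1])
dep = [F.p[i] for i in axes(F.R, 1) if abs(F.R[i, i]) ≤ tol]
println("Linearly dependent columns: ", names[dep])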


Ah, understood! (Indeed, that's a lot of coefficients for 54 samples…)
Currently there is no way to put squared terms directly into a formula, but that is not a great problem; see the workaround sketch below.
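
For the record, the workaround is to precompute the squared columns and list them as ordinary terms; the x1sq etc. column names are just my own convention (DataFrames 0.14-era indexing):

# Add explicit squared columns, then spell out the second-order model
# with plain main effects for the squares plus pairwise interactions.
for v in (:x1, :x2, :x3, :x4, :x5)
    data[Symbol(v, "sq")] = data[v] .^ 2   # e.g. data[:x1sq]
end
fm3 = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 +
                   x1sq + x2sq + x3sq + x4sq + x5sq +
                   x1&x2 + x1&x3 + x1&x4 + x1&x5 + x2&x3 +
                   x2&x4 + x2&x5 + x3&x4 + x3&x5 + x4&x5)
ols2 = lm(fm3, data)

This spells out the same model MATLAB fit above.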

Thank you for the worked example.

Sure thing. The next release of StatsModels will likely make it possible to specify polynomial terms in the formula.
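
For anyone landing here later, the plan is to let plain Julia functions be applied inside @formula, so something like this tentative sketch should express the squares directly (the syntax is an assumption until that release ships):

# Tentative: abs2 applied elementwise inside the formula would give the x^2 terms.
fm4 = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5) +
                   abs2(x1) + abs2(x2) + abs2(x3) + abs2(x4) + abs2(x5))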
