Failing to fit a multiple linear regression using GLM


#1

Hello, everybody.

I have run into an error while doing multiple regression using GLM.
Here’s my dataset (it originally comes from https://www.sciencedirect.com/science/article/pii/S1359511305001121):

x1 x2 x3 x4 x5 b y
-1 -1 -1 -1 -1 1 9.137
1 -1 -1 -1 -1 1 11.996
-1 1 -1 -1 -1 1 8.887
1 1 -1 -1 -1 1 12.405
-1 -1 1 -1 -1 1 11.274
1 -1 1 -1 -1 1 10.511
-1 1 1 -1 -1 1 11.113
1 1 1 -1 -1 1 11.764
-1 -1 -1 1 -1 1 9.482
1 -1 -1 1 -1 1 10.683
-1 1 -1 1 -1 1 8.277
1 1 -1 1 -1 1 11.243
-1 -1 1 1 -1 1 8.188
1 -1 1 1 -1 1 10.219
-1 1 1 1 -1 1 8.88
1 1 1 1 -1 1 9.309
-1 -1 -1 -1 1 1 10.428
1 -1 -1 -1 1 1 11.118
-1 1 -1 -1 1 1 11.242
1 1 -1 -1 1 1 11.636
-1 -1 1 -1 1 1 8.448
1 -1 1 -1 1 1 11.078
-1 1 1 -1 1 1 11.285
1 1 1 -1 1 1 11.976
-1 -1 -1 1 1 1 8.233
1 -1 -1 1 1 1 8.537
-1 1 -1 1 1 1 9.014
1 1 -1 1 1 1 7.791
-1 -1 1 1 1 1 7.022
1 -1 1 1 1 1 9.74
-1 1 1 1 1 1 9.051
1 1 1 1 1 1 8.709
0 0 0 0 0 1 11.249
0 0 0 0 0 1 9.827
0 0 0 0 0 1 10.43
0 0 0 0 0 1 11.726
0 0 0 0 0 1 10.689
0 0 0 0 0 1 11.033
0 0 0 0 0 1 9.57
0 0 0 0 0 1 11.596
-2 0 0 0 0 2 6.288
2 0 0 0 0 2 11.463
0 -2 0 0 0 2 6.809
0 2 0 0 0 2 11.982
0 0 -2 0 0 2 8.485
0 0 2 0 0 2 8.105
0 0 0 -2 0 2 11.25
0 0 0 2 0 2 6.294
0 0 0 0 -2 2 9.742
0 0 0 0 2 2 10.389
0 0 0 0 0 2 11.115
0 0 0 0 0 2 9.132
0 0 0 0 0 2 11.492
0 0 0 0 0 2 10.861

I tried to fit the data to a model including first-order, second-order, and interaction terms (b is a block term), but the fit failed.

julia> using GLM

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5

julia> ols = lm(fm, data)
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
 [1] checkpositivedefinite at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\factorization.jl:11 [inlined]
 [2] #cholesky!#92(::Bool, ::Function, ::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:153
 [3] cholesky! at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:152 [inlined] (repeats 2 times)
 [4] DensePredChol(::Array{Float64,2}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:107
 [5] cholpred at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:117 [inlined]
 [6] fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:136
 [7] #fit#36(::Dict{Any,Any}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type{LinearModel}, ::Formula, ::DataFrames.DataFrame, ::Bool) at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:72
 [8] fit at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:66 [inlined]
 [9] lm at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:146 [inlined] (repeats 2 times)
 [10] top-level scope at In[33]:1

Using the same data, MATLAB fits the model without error.

>> ols = fitlm(data, 'y ~ b + (x1+x2+x3+x4+x5)^2')

ols = 


Linear regression model:
    y ~ 1 + b + x1*x2 + x1*x3 + x2*x3 + x1*x4 + x2*x4 + x3*x4 + x1*x5 + x2*x5 + x3*x5 + x4*x5 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2

Estimated Coefficients:
                   Estimate       SE        tStat        pValue  
                   _________    _______    ________    __________

    (Intercept)       11.804    0.59186      19.944    1.2949e-19
    x1                0.7276    0.17085      4.2586    0.00016848
    x2               0.42085    0.17085      2.4632      0.019334
    x3              -0.05755    0.17085    -0.33684       0.73844
    x4               -0.9958    0.17085     -5.8284     1.785e-06
    x5              -0.16915    0.17085    -0.99003       0.32959
    b               -0.83382    0.34642      -2.407       0.02203
    x1:x2           -0.14331    0.19102    -0.75025       0.45859
    x1:x3           -0.08325    0.19102    -0.43582        0.6659
    x2:x3            0.14769    0.19102     0.77315       0.44511
    x1:x4          -0.080813    0.19102    -0.42306       0.67508
    x2:x4           -0.19213    0.19102     -1.0058       0.32207
    x3:x4          -0.085688    0.19102    -0.44858       0.65676
    x1:x5           -0.21969    0.19102     -1.1501       0.25864
    x2:x5             0.1785    0.19102     0.93445       0.35707
    x3:x5          0.0050625    0.19102    0.026502       0.97902
    x4:x5           -0.25963    0.19102     -1.3591        0.1836
    x1^2            -0.26392    0.18673     -1.4134       0.16719
    x2^2            -0.13392    0.18673    -0.71719       0.47846
    x3^2            -0.40904    0.18673     -2.1906      0.035881
    x4^2            -0.28979    0.18673      -1.552       0.13051
    x5^2            0.033581    0.18673     0.17984       0.85841


Number of observations: 54, Error degrees of freedom: 32
Root Mean Squared Error: 1.08
R-squared: 0.715,  Adjusted R-Squared 0.528
F-statistic vs. constant model: 3.82, p-value = 0.000332

Did I do something wrong?


#2

Weirdly enough, my formula expansion looks different from yours.

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5)*(x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x1 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x1 + x3 & x2 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x1 + x4 & x2 + x4 & x3 + x4 & x4 + x4 & x5 + x5 & x1 + x5 & x2 + x5 & x3 + x5 & x4 + x5 & x5

You will notice I have an x1 & x2 term as well as an x2 & x1 term, plus duplicated main effects. Your formula expansion was smart enough to omit those. I am on GLM 1.0.2. It's no surprise that my formula fails; I wonder if something is going on there?


#3

Yours is correct. To see whether this error is related to the formula, I tested another one but got the same result.

fm2 = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5)

I copied and pasted the wrong output in my first post by mistake.


#4

Is it all good or do you still require assistance?


#5

Definitely yes… but I suspect it’s a bug in GLM.

I tested R and Python with the same dataset, and both fit the same linear model.


#6

GLM's behavior seems to be correct:

using CSV, DataFrames, GLM, LinearAlgebra

# `file` is the path to the tab-delimited dataset above
data = CSV.File(file, delim = '\t') |> DataFrame

fm = @formula(y ~ b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5))
model = fit(LinearModel, fm, data, true) # `true` allows a rank-deficient model matrix

mf = ModelFrame(fm, data)
mm = ModelMatrix(mf)
X = mm.m
F = qr(X, Val(true))                       # pivoted QR
count(x -> abs(x) ≤ √eps(), diag(F.R))     # number of linearly dependent columns
count(isnan, stderror(model))              # number of dependent columns identified by GLM
svdvals(X)                                 # confirm with the SVD

The matrix is rank-deficient by multiple criteria. I can confirm that Stata and R both fit the model without dropping the linearly dependent columns, but in this case GLM.jl actually does better. A telltale sign of a mishandled rank-deficient model is that many of the standard errors come out identical.
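
If you want to see which columns are in the dependent set, here is a minimal sketch building on the objects above; it relies on the pivoted QR pushing the (near-)dependent columns to the end of the permutation, so treat it as a heuristic:

colnames = coefnames(mf)                     # model-matrix column names
r = count(x -> abs(x) > √eps(), diag(F.R))   # numerical rank estimate from the pivoted QR
colnames[F.p[r+1:end]]                       # candidates for the dependent columns

That makes it easy to check whether the x & x terms are the culprits.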


#7

Ah, understood! (Actually, there are too many coefficients for 54 samples…)
Currently there is no way to put squared terms directly into a formula, but that is not a big problem.
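
One workaround, for what it's worth: precompute the squared columns and reference them as ordinary variables. A minimal sketch (the x1sq..x5sq names are made up; newer DataFrames versions want data[!, :x1sq] for column assignment):

# add explicit squared columns to the DataFrame
data[:x1sq] = data[:x1] .^ 2
data[:x2sq] = data[:x2] .^ 2
data[:x3sq] = data[:x3] .^ 2
data[:x4sq] = data[:x4] .^ 2
data[:x5sq] = data[:x5] .^ 2

# full second-order model: main effects, squares, and pairwise interactions
fm_sq = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 +
                 x1sq + x2sq + x3sq + x4sq + x5sq +
                 x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 +
                 x2 & x3 + x2 & x4 + x2 & x5 +
                 x3 & x4 + x3 & x5 + x4 & x5)
ols = lm(fm_sq, data)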

Thank you for demonstrating an example.


#8

Sure thing. The next release of StatsModels will likely make it possible to specify polynomial terms in the formula.
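
For anyone finding this later: newer StatsModels versions (0.6+) let you call arbitrary Julia functions inside @formula, so a squared term can be written as a function term. A sketch, not tested against any particular release (abs2 squares each element rowwise):

fm_fn = @formula(y ~ 1 + b + x1 + x2 + x3 + x4 + x5 +
                 abs2(x1) + abs2(x2) + abs2(x3) + abs2(x4) + abs2(x5) +
                 x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 +
                 x2 & x3 + x2 & x4 + x2 & x5 +
                 x3 & x4 + x3 & x5 + x4 & x5)
ols = lm(fm_fn, data)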