Hello, everybody.
I have run into a problem doing multiple regression with GLM.jl.
Here's my dataset `data`
(it originally comes from "Optimization of critical medium components using response surface methodology for ethanol production from cellulosic biomass by Clostridium thermocellum SS19" on ScienceDirect):
x1 | x2 | x3 | x4 | x5 | b | y |
---|---|---|---|---|---|---|
-1 | -1 | -1 | -1 | -1 | 1 | 9.137 |
1 | -1 | -1 | -1 | -1 | 1 | 11.996 |
-1 | 1 | -1 | -1 | -1 | 1 | 8.887 |
1 | 1 | -1 | -1 | -1 | 1 | 12.405 |
-1 | -1 | 1 | -1 | -1 | 1 | 11.274 |
1 | -1 | 1 | -1 | -1 | 1 | 10.511 |
-1 | 1 | 1 | -1 | -1 | 1 | 11.113 |
1 | 1 | 1 | -1 | -1 | 1 | 11.764 |
-1 | -1 | -1 | 1 | -1 | 1 | 9.482 |
1 | -1 | -1 | 1 | -1 | 1 | 10.683 |
-1 | 1 | -1 | 1 | -1 | 1 | 8.277 |
1 | 1 | -1 | 1 | -1 | 1 | 11.243 |
-1 | -1 | 1 | 1 | -1 | 1 | 8.188 |
1 | -1 | 1 | 1 | -1 | 1 | 10.219 |
-1 | 1 | 1 | 1 | -1 | 1 | 8.88 |
1 | 1 | 1 | 1 | -1 | 1 | 9.309 |
-1 | -1 | -1 | -1 | 1 | 1 | 10.428 |
1 | -1 | -1 | -1 | 1 | 1 | 11.118 |
-1 | 1 | -1 | -1 | 1 | 1 | 11.242 |
1 | 1 | -1 | -1 | 1 | 1 | 11.636 |
-1 | -1 | 1 | -1 | 1 | 1 | 8.448 |
1 | -1 | 1 | -1 | 1 | 1 | 11.078 |
-1 | 1 | 1 | -1 | 1 | 1 | 11.285 |
1 | 1 | 1 | -1 | 1 | 1 | 11.976 |
-1 | -1 | -1 | 1 | 1 | 1 | 8.233 |
1 | -1 | -1 | 1 | 1 | 1 | 8.537 |
-1 | 1 | -1 | 1 | 1 | 1 | 9.014 |
1 | 1 | -1 | 1 | 1 | 1 | 7.791 |
-1 | -1 | 1 | 1 | 1 | 1 | 7.022 |
1 | -1 | 1 | 1 | 1 | 1 | 9.74 |
-1 | 1 | 1 | 1 | 1 | 1 | 9.051 |
1 | 1 | 1 | 1 | 1 | 1 | 8.709 |
0 | 0 | 0 | 0 | 0 | 1 | 11.249 |
0 | 0 | 0 | 0 | 0 | 1 | 9.827 |
0 | 0 | 0 | 0 | 0 | 1 | 10.43 |
0 | 0 | 0 | 0 | 0 | 1 | 11.726 |
0 | 0 | 0 | 0 | 0 | 1 | 10.689 |
0 | 0 | 0 | 0 | 0 | 1 | 11.033 |
0 | 0 | 0 | 0 | 0 | 1 | 9.57 |
0 | 0 | 0 | 0 | 0 | 1 | 11.596 |
-2 | 0 | 0 | 0 | 0 | 2 | 6.288 |
2 | 0 | 0 | 0 | 0 | 2 | 11.463 |
0 | -2 | 0 | 0 | 0 | 2 | 6.809 |
0 | 2 | 0 | 0 | 0 | 2 | 11.982 |
0 | 0 | -2 | 0 | 0 | 2 | 8.485 |
0 | 0 | 2 | 0 | 0 | 2 | 8.105 |
0 | 0 | 0 | -2 | 0 | 2 | 11.25 |
0 | 0 | 0 | 2 | 0 | 2 | 6.294 |
0 | 0 | 0 | 0 | -2 | 2 | 9.742 |
0 | 0 | 0 | 0 | 2 | 2 | 10.389 |
0 | 0 | 0 | 0 | 0 | 2 | 11.115 |
0 | 0 | 0 | 0 | 0 | 2 | 9.132 |
0 | 0 | 0 | 0 | 0 | 2 | 11.492 |
0 | 0 | 0 | 0 | 0 | 2 | 10.861 |
I tried to fit the data to a model including first-order, second-order (quadratic), and interaction terms (`b` is a block term), but the fit failed:
```julia
julia> using GLM

julia> fm = @formula(y ~ 1 + b + (x1 + x2 + x3 + x4 + x5) * (x1 + x2 + x3 + x4 + x5))
Formula: y ~ 1 + b + x1 + x2 + x3 + x4 + x5 + x1 & x1 + x1 & x2 + x1 & x3 + x1 & x4 + x1 & x5 + x2 & x2 + x2 & x3 + x2 & x4 + x2 & x5 + x3 & x3 + x3 & x4 + x3 & x5 + x4 & x4 + x4 & x5 + x5 & x5

julia> ols = lm(fm, data)
PosDefException: matrix is not positive definite; Cholesky factorization failed.
Stacktrace:
 [1] checkpositivedefinite at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\factorization.jl:11 [inlined]
 [2] #cholesky!#92(::Bool, ::Function, ::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:153
 [3] cholesky! at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\cholesky.jl:152 [inlined] (repeats 2 times)
 [4] DensePredChol(::Array{Float64,2}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:107
 [5] cholpred at D:\Apps\juliapkg\packages\GLM\0c65q\src\linpred.jl:117 [inlined]
 [6] fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool) at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:136
 [7] #fit#36(::Dict{Any,Any}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type{LinearModel}, ::Formula, ::DataFrames.DataFrame, ::Bool) at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:72
 [8] fit at D:\Apps\juliapkg\packages\StatsModels\AYB2E\src\statsmodel.jl:66 [inlined]
 [9] lm at D:\Apps\juliapkg\packages\GLM\0c65q\src\lm.jl:146 [inlined] (repeats 2 times)
 [10] top-level scope at In[33]:1
```
Using the same data, MATLAB does produce a fit:
```matlab
>> ols = fitlm(data, 'y ~ b + (x1+x2+x3+x4+x5)^2')
ols =
Linear regression model:
    y ~ 1 + b + x1*x2 + x1*x3 + x2*x3 + x1*x4 + x2*x4 + x3*x4 + x1*x5 + x2*x5 + x3*x5 + x4*x5 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2

Estimated Coefficients:
                   Estimate       SE         tStat       pValue
                   _________    _______    ________    __________
    (Intercept)       11.804    0.59186      19.944    1.2949e-19
    x1                0.7276    0.17085      4.2586    0.00016848
    x2               0.42085    0.17085      2.4632      0.019334
    x3              -0.05755    0.17085    -0.33684       0.73844
    x4               -0.9958    0.17085     -5.8284     1.785e-06
    x5              -0.16915    0.17085    -0.99003       0.32959
    b               -0.83382    0.34642      -2.407       0.02203
    x1:x2           -0.14331    0.19102    -0.75025       0.45859
    x1:x3           -0.08325    0.19102    -0.43582        0.6659
    x2:x3            0.14769    0.19102     0.77315       0.44511
    x1:x4          -0.080813    0.19102    -0.42306       0.67508
    x2:x4           -0.19213    0.19102     -1.0058       0.32207
    x3:x4          -0.085688    0.19102    -0.44858       0.65676
    x1:x5           -0.21969    0.19102     -1.1501       0.25864
    x2:x5             0.1785    0.19102     0.93445       0.35707
    x3:x5          0.0050625    0.19102    0.026502       0.97902
    x4:x5           -0.25963    0.19102     -1.3591        0.1836
    x1^2            -0.26392    0.18673     -1.4134       0.16719
    x2^2            -0.13392    0.18673    -0.71719       0.47846
    x3^2            -0.40904    0.18673     -2.1906      0.035881
    x4^2            -0.28979    0.18673      -1.552       0.13051
    x5^2            0.033581    0.18673     0.17984       0.85841

Number of observations: 54, Error degrees of freedom: 32
Root Mean Squared Error: 1.08
R-squared: 0.715, Adjusted R-Squared: 0.528
F-statistic vs. constant model: 3.82, p-value = 0.000332
```
Did I do something wrong?