Removing coefficients in GLM

Hello

I have the following result for my GLM analysis:

julia> gm1 = fit(GeneralizedLinearModel, @formula(compte ~ exp27 + exp26 + exp25+ exp24 + exp27*exp26*exp25*exp24), viewbits, Poisson())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Poisson{Float64}, LogLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

compte ~ 1 + exp27 + exp26 + exp25 + exp24 + exp27 & exp26 + exp27 & exp25 + exp26 & exp25 + exp27 & exp24 + exp26 & exp24 + exp25 & exp24 + exp27 & exp26 & exp25 + exp27 & exp26 & exp24 + exp27 & exp25 & exp24 + exp26 & exp25 & exp24 + exp27 & exp26 & exp25 & exp24

Coefficients:
─────────────────────────────────────────────────────────────────────────────────────────────────
                                      Coef.  Std. Error       z  Pr(>|z|)   Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────────────────────────
(Intercept)                      3.43399       0.179605   19.12    <1e-80    3.08197     3.78601
exp27                            5.54088       0.179957   30.79    <1e-99    5.18818     5.89359
exp26                            2.76246       0.185188   14.92    <1e-49    2.3995      3.12542
exp25                            1.31094       0.20237     6.48    <1e-10    0.914307    1.70758
exp24                            0.755668      0.217737    3.47    0.0005    0.328911    1.18242
exp27 & exp26                   -0.00116308    0.185551   -0.01    0.9950   -0.364836    0.36251
exp27 & exp25                    0.066962      0.202761    0.33    0.7412   -0.330442    0.464366
exp26 & exp25                    0.0401126     0.208609    0.19    0.8475   -0.368753    0.448978
exp27 & exp24                   -0.0658796     0.218173   -0.30    0.7627   -0.493491    0.361732
exp26 & exp24                   -0.0748156     0.224671   -0.33    0.7391   -0.515162    0.365531
exp25 & exp24                   -0.0625204     0.245872   -0.25    0.7993   -0.54442     0.419379
exp27 & exp26 & exp25           -0.0340156     0.209012   -0.16    0.8707   -0.443672    0.375641
exp27 & exp26 & exp24            0.0810225     0.22512     0.36    0.7189   -0.360205    0.52225
exp27 & exp25 & exp24            0.0636278     0.246355    0.26    0.7962   -0.419219    0.546475
exp26 & exp25 & exp24            0.125964      0.253571    0.50    0.6194   -0.371027    0.622955
exp27 & exp26 & exp25 & exp24  -11.3783        0.376127  -30.25    <1e-99  -12.1155    -10.6411
─────────────────────────────────────────────────────────────────────────────────────────────────

As you can see, most of the interaction coefficients can be removed, except for the four-way one.

How do I specify the model (@formula) to keep only the significant parameters? That is, the intercept, the four main effects, and the last (four-way) interaction?

Thanks for your help.

From the way a formula is printed, you can already see that * expands into the main effects plus all of their interactions, with each interaction written using &:

julia> @formula y ~ x1 * x2 * x3
FormulaTerm
Response:
  y(unknown)
Predictors:
  x1(unknown)
  x2(unknown)
  x3(unknown)
  x1(unknown) & x2(unknown)
  x1(unknown) & x3(unknown)
  x2(unknown) & x3(unknown)
  x1(unknown) & x2(unknown) & x3(unknown)

Thus, you can just list the individual interactions you want in your formula. For your example, this would be:

@formula(compte ~ 1 + exp27 + exp26 + exp25+ exp24 + exp27 & exp26 & exp25 & exp24)
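For illustration, here is a minimal sketch of fitting that reduced formula. The data frame below is made up (a small full-factorial table of bits with invented counts standing in for `viewbits`), so the numbers mean nothing; only the formula and the `fit` call mirror the ones in this thread.

```julia
using DataFrames, GLM

# Hypothetical stand-in for `viewbits`: every combination of four bits,
# with invented positive counts (not real data).
combos = vec(collect(Iterators.product(0:1, 0:1, 0:1, 0:1)))
toy = DataFrame(
    exp27 = [c[1] for c in combos],
    exp26 = [c[2] for c in combos],
    exp25 = [c[3] for c in combos],
    exp24 = [c[4] for c in combos],
)
toy.compte = [10 + 20c[1] + 5c[2] + 3c[3] + 2c[4] - 15 * prod(c) for c in combos]

# Reduced model: intercept, four main effects, and only the four-way interaction.
reduced = fit(GeneralizedLinearModel,
              @formula(compte ~ 1 + exp27 + exp26 + exp25 + exp24 +
                       exp27 & exp26 & exp25 & exp24),
              toy, Poisson())

println(coeftable(reduced))   # six coefficients instead of sixteen
fitted_counts = predict(reduced)   # predicted mean count for each row of `toy`
```

Calling `predict(reduced)` without new data returns the fitted means for the rows used in the fit, which is one way to get at a predicted frequency table.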

In general, I would advise against using statistical significance for feature or model selection, especially given that interaction terms tend to be collinear and are estimated with higher variance.
Better alternatives might be sequential feature/model selection methods that optimize cross-validated scores, or Bayesian shrinkage priors.
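As a rough sketch of what a cross-validated comparison could look like, the snippet below scores the full and the reduced formula by held-out Poisson deviance using a hand-rolled k-fold loop. Everything here is an assumption made for illustration: the simulated data, the helper names `poisson_deviance` and `cv_deviance`, the choice of deviance as the score, and the hard-coded response column `compte`. It is not a recommendation of a specific procedure.

```julia
using DataFrames, GLM, Random

# Poisson deviance of observed counts y against predicted means μ (lower is better).
function poisson_deviance(y, μ)
    2 * sum(yi == 0 ? μi : yi * log(yi / μi) - (yi - μi) for (yi, μi) in zip(y, μ))
end

# Hypothetical helper: k-fold cross-validated deviance per observation.
# Assumes the response column is named `compte`.
function cv_deviance(formula, df; k = 5, rng = MersenneTwister(0))
    n = nrow(df)
    folds = shuffle(rng, repeat(1:k, cld(n, k))[1:n])   # random fold labels
    total = 0.0
    for fold in 1:k
        train = df[folds .!= fold, :]
        test = df[folds .== fold, :]
        model = fit(GeneralizedLinearModel, formula, train, Poisson())
        total += poisson_deviance(test.compte, predict(model, test))
    end
    return total / n
end

# Simulated data (invented coefficients), only to make the sketch runnable.
rng = MersenneTwister(42)
n = 400
sim = DataFrame(exp27 = rand(rng, 0:1, n), exp26 = rand(rng, 0:1, n),
                exp25 = rand(rng, 0:1, n), exp24 = rand(rng, 0:1, n))
λ = exp.(1 .+ 0.8 .* sim.exp27 .+ 0.5 .* sim.exp26 .+ 0.3 .* sim.exp25 .+
         0.2 .* sim.exp24 .- 1.0 .* sim.exp27 .* sim.exp26 .* sim.exp25 .* sim.exp24)
sim.compte = [rand(rng, Poisson(l)) for l in λ]

full = @formula(compte ~ exp27 * exp26 * exp25 * exp24)
reduced = @formula(compte ~ 1 + exp27 + exp26 + exp25 + exp24 +
                   exp27 & exp26 & exp25 & exp24)

cv_full = cv_deviance(full, sim)
cv_reduced = cv_deviance(reduced, sim)
println("full: ", cv_full, "  reduced: ", cv_reduced)
```

The formula with the lower held-out deviance predicts unseen rows better under this score; which score is appropriate depends on what the model is used for.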


Thanks @bertschi. I have looked at the GLM documentation again but did not find a mention of "&" anywhere. But it works. Maybe they are assuming experience with R, which I don't have.

The data I am analysing are not experimental measurements. In fact, they are bits that I look at in IEEE float32 numbers. The result I have shown is a very simple example, and it came out as theoretically expected. So I simply need the values of the "important" parameters and the predicted frequency table. I don't need to interpret the coefficients, and some of them may possibly be noise I introduced by the way the numbers were generated. Further, my sample can be very large (although it can't be bigger than 2^32 in the 32-bit case).

The formula mini-language is defined in StatsModels.jl, not GLM, so that's where you'll find the docs.
