Add column names to a GLM.LinearModel estimated in matrix form

I am estimating a bunch of linear models using GLM.lm(X, y), where X is a matrix. I have column names for X from a database, and I would like the result from fit/lm to include them. MWE:

using GLM
X = hcat(ones(100), randn(100, 2))
y = X * [1, 2, 3]
reg = lm(X, y)
colnames = ["intercept", "age", "distance"]

so that

julia> reg
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────────
    Estimate   Std. Error     t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────
x1       1.0  9.41024e-17  1.06267e16    <1e-99        1.0        1.0
x2       2.0  1.09949e-16  1.81903e16    <1e-99        2.0        2.0
x3       3.0  9.99446e-17  3.00166e16    <1e-99        3.0        3.0
─────────────────────────────────────────────────────────────────────

and I would like to replace x1, x2, x3 with column names in colnames.
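
One workaround I can think of is to relabel the table returned by coeftable. This assumes CoefTable keeps its row labels in a rownms vector (an internal detail), and it only changes that one table, not how reg itself prints:

ct = coeftable(reg)
ct.rownms .= colnames   # overwrite "x1", "x2", "x3" with the database names
ct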

I think a better solution is to use the overloading of + to construct a formula programmatically.

julia> using GLM
julia> t = (y = rand(10), var1 = rand(10), var2 = rand(10), var3 = rand(10))
julia> colnames = [:var1, :var2, :var3];

julia> h = Term(:y) ~ mapreduce(Term, +, colnames)
FormulaTerm
Response:
  y(unknown)
Predictors:
  var1(unknown)
  var2(unknown)
  var3(unknown)
julia> lm(h, t)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + var1 + var2 + var3

Coefficients:
────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0290147    0.259379  0.111862    0.9146  -0.605662   0.663692
var1         0.769806     0.368581  2.08857     0.0818  -0.132079   1.67169
var2         0.0925692    0.2603    0.355624    0.7343  -0.544363   0.729501
var3         0.489724     0.249155  1.96554     0.0969  -0.119937   1.09939
────────────────────────────────────────────────────────────────────────────

Neat! Where is Term documented? Sorry but I could not find it.

It’s here. Well, the Term API is afaik quite complicated, but this is the gist.


Or, slightly more readable imho:

Term(:y) ~ sum(Term.(colnames))
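
For instance, going back to the original matrix example, something like this (untested) should do it:

using GLM

X = hcat(ones(100), randn(100, 2))
y = X * [1, 2, 3]

# a column table (NamedTuple of vectors) is enough, no DataFrame needed;
# drop the explicit intercept column and let the formula add it
tbl = (y = y, age = X[:, 2], distance = X[:, 3])

f = Term(:y) ~ sum(Term.([:age, :distance]))   # y ~ 1 + age + distance
lm(f, tbl)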

Is there also documentation on how to populate a TableRegressionModel in full? It seems to be a struct of structs. Some parts are straightforward, while others are more obscure.

(Also, if I may, I believe that such constructions are convenient for end-users, but make life hard for those who try to extend the code or bridge packages.)


There isn’t documentation for that, for two reasons. First, we’re planning (and have been for some time, oops) to replace it with something that has a more hackable API. Second, and relatedly, the existing API is a historical artifact: it’s built around the idea that you’ll just get table/formula support “for free” by defining your model to fit using numeric y and X (it goes back to when the formula machinery lived in DataFrames).

As a general note, constructing a TableRegressionModel isn’t really necessary, as long as you provide methods for the statistical model API. It might be worth looking at how the Table*Model wrappers delegate to the wrapped model here.
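
Roughly, the delegation pattern looks like this (just a sketch; the wrapper type and its fields are made up):

using GLM, StatsBase

# keep the plain matrix fit from lm(X, y), but carry the column names along with it
struct NamedLM{M}
    model::M
    names::Vector{String}
end

# forward the generic API to the wrapped model ...
StatsBase.coef(m::NamedLM)     = coef(m.model)
StatsBase.stderror(m::NamedLM) = stderror(m.model)
StatsBase.nobs(m::NamedLM)     = nobs(m.model)
# ... and override only the pieces that need the names
StatsBase.coefnames(m::NamedLM) = m.names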

You can see an example of a “modern” use of StatsModels in the MixedModels.jl package, which provides custom syntax and also creates its own structs to hold the specified and fitted models. This file has a bunch of the API method definitions, and here is where the special syntax for the grouping terms (like (1 + x | group)) is implemented.

If you have a specific use case and want help interacting with StatsModels, feel free to ping me! Either here or on Slack/Zulip.

If you still feel strongly that you’d like to populate TableRegressionModel manually, you’ll need to (a sketch pulling these steps together follows the list):

  1. create a ModelFrame, which holds the formula, the schema, the data (a table), and the model type. See the constructor here: https://github.com/JuliaStats/StatsModels.jl/blob/master/src/modelframe.jl#L72-L80. That constructor should work in most cases, but if you need to reach in and take control, there’s more documentation on the formula-schema-data pipeline here.
  2. create a ModelMatrix, which holds the generated numerical predictor matrix (you can create that easily by calling ModelMatrix(model_frame::ModelFrame)). This basically just calls modelcols on the final formula (with the schema applied) and then computes the mapping from columns of the resulting matrix to the terms of the formula.
  3. fit your model, however you want, based on the numeric arrays generated by modelcols.
  4. construct the TableRegressionModel by calling its constructor (it’s just a thin container).
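
Putting the four steps together, something like this (an untested sketch; check the linked source for the exact constructor signatures):

using GLM, StatsBase, StatsModels

tbl = (y = rand(100), age = rand(100), distance = rand(100))
f = Term(:y) ~ Term(:age) + Term(:distance)

mf = ModelFrame(f, tbl, model = LinearModel)          # 1. formula + schema + data + model type
mm = ModelMatrix(mf)                                  # 2. numeric predictor matrix
inner = fit(LinearModel, mm.m, response(mf))          # 3. fit on the plain numeric arrays
m = StatsModels.TableRegressionModel(inner, mf, mm)   # 4. thin wrapper around the pieces

coeftable(m)   # coefficient names now come from the formula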

Thanks for the reply. I’ll probably have to take some time to digest it all. For now, let me just reply to

If you still feel strongly that you’d like to populate TableRegressionModel manually,

The reason why I brought this up is that I would like to make several statistics/econometrics packages more open, so they do not require DataFrames and GLM. (Both are very nice, but we should allow for other approaches.)

For instance, CovarianceMatrices.jl (https://github.com/gragusa/CovarianceMatrices.jl; heteroskedasticity- and autocorrelation-consistent covariance matrix estimation) contains a lot of useful methods but is limited to TableRegressionModel.

I am looking for a way to open these and other methods to alternative ways of estimation (even going so far as being compatible with users doing b = X \ y).


Note that these packages do not require DataFrames; they only require a table. For instance, an object like

(y = rand(100), x1 = rand(100), x2 = rand(100))

will work just fine in GLM, despite not being a data frame.

Additionally, TableRegressionModel is simply an abstract type that users write their own implementations of. The reason CovarianceMatrices requires <:TableRegressionModel is so that it can guarantee functions like coef can work, and in Julia these guarantees are enforced by abstract types.

I don’t really see how we could solve the problems of consistency with model fitting that plague R without limiting our focus to an interface defined for an abstract type.


In my view, the most “julian” solution is not to enforce any abstract type but just an implementation of a common set of functions. For example, you can provide a method for StatsBase.coef without being <:TableRegressionModel. This is the approach that Tables.jl takes: you just have to implement the requisite methods and your structure is a table. One of the near-term goals for StatsModels is to split out a StatsModelsBase which has the basic type definitions and the API function stubs. Then packages can depend on that in order to implement the API, without necessarily opting into all the regression-specific (e.g., borrowed from R) stuff that’s in StatsModels right now.
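
For example, something like this (a sketch; the type and its fields are made up) is enough for generic code written against coef / coefnames / nobs to work, with no wrapper type and no formula in sight:

using StatsBase

struct MatrixOLS
    names::Vector{String}
    beta::Vector{Float64}
    X::Matrix{Float64}
    y::Vector{Float64}
end

fit_ols(X, y, names) = MatrixOLS(names, X \ y, X, y)   # b = X \ y, as in the post above

# implement the generic API for the new type
StatsBase.coef(m::MatrixOLS)        = m.beta
StatsBase.coefnames(m::MatrixOLS)   = m.names
StatsBase.nobs(m::MatrixOLS)        = length(m.y)
StatsBase.residuals(m::MatrixOLS)   = m.y - m.X * m.beta
StatsBase.modelmatrix(m::MatrixOLS) = m.X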

I’ll also note that TableRegressionModel is not, alas, an abstract type. It’s just a container that wraps a model which is <:RegressionModel (and I’m now seeing that that restriction isn’t enforced in the struct itself, but rather in the fact that we define methods for fit(<:RegressionModel, ::FormulaTerm, table)).
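
Roughly, from memory (check the StatsModels source for the real definition), it looks like:

struct TableRegressionModel{M,T} <: RegressionModel
    model::M            # the wrapped fit, e.g. a GLM.LinearModel
    mf::ModelFrame      # formula + schema + data
    mm::ModelMatrix{T}  # numeric matrix + column-to-term mapping
end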


Sorry for my ill-informed reply.

Given the success of Tables.jl, I think that this is a good path forward. I would gladly help with documentation / testing to make sure it’s easy to write your own implementation and all the methods work.


No worries 🙂 I had to go back and double-check the source, so you’re forgiven for not knowing it exactly off the top of your head either. And it’s a genuinely confusing system.
