How to fit a GLM where the formula is programmatically generated

I have a DataFrame and need to build a model where the predictors follow some naming scheme.

For the example data below, suppose the scheme is “every column other than y” (i.e. “!=y”):

using DataFrames
using GLM

data = DataFrame(y=[22.1,20.1,7.1,9.1,1000,200],
                 x1=[1.1,2.1,3.1,4.1,10,100.2],
                 x2=[1,2,3,4.0,11.2,100.1])

Can anyone suggest a modification to Ex.2 below that would make the models in Ex.1 (ols1) and Ex.2 (ols2) equivalent?

Please note that while ols2 does not run, I’m looking for something of comparable terseness, if possible.

Ex.1

ols1 = GLM.lm(@formula(y ~ x1 + x2), data)
y ~ 1 + x1 + x2

Coefficients:
────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)    84.4784     4.76541   17.7274    0.0004    69.3127     99.644
x1           -745.249      8.33275  -89.4361    <1e-5   -771.767    -718.73
x2            747.144      8.34614   89.5196    <1e-5    720.582     773.705

Ex.2

preds = Symbol.(names(data)[findall(names(data) .!= "y")])
2-element Array{Symbol,1}:
 :x1
 :x2

ols2 = GLM.lm(@formula(y ~ preds), data)
ERROR: type NamedTuple has no field preds

You can probably make use of Term objects from StatsModels, as described here.


Cameron is exactly right. These two forms are exactly equivalent:

julia> using StatsModels

julia> (Term(:y) ~ Term(:x1) + Term(:x2)) == @formula(y ~ x1 + x2)
true

As @nilshg pointed out in this post, this is actually the expression that’s generated by the formula macro:

julia> @macroexpand @formula(y ~ x1 + x2)
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2))
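
Applied to the original question, the same idea might look like the sketch below. It assumes a DataFrames version where `Symbol.(names(data))` yields the column names as symbols, and it relies on `sum` folding the `Term`s together with `+`; the `filter` call is just one way to express the "every column except y" scheme.

```julia
using DataFrames, GLM, StatsModels

data = DataFrame(y=[22.1, 20.1, 7.1, 9.1, 1000, 200],
                 x1=[1.1, 2.1, 3.1, 4.1, 10, 100.2],
                 x2=[1, 2, 3, 4.0, 11.2, 100.1])

# Predictor names: every column except the response.
preds = filter(!=(:y), Symbol.(names(data)))

# Build Term(:y) ~ Term(:x1) + Term(:x2) without the macro.
# sum(...) combines the individual Terms with +.
f = Term(:y) ~ sum(Term.(preds))

ols2 = lm(f, data)
```

Because `f` is built from the same `Term`s the macro would generate, `ols2` should be the same fit as `ols1`.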

@nilshg just added support for creating a Term from a string (available in StatsModels v0.6.14, which should automerge soon), so you can now do something like

terms = term.(names(data))
f = terms[1] ~ sum(terms[2:end]) # assuming first column is "y"
ols2 = lm(f, data)
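
If the response is not guaranteed to be the first column, one option is to pick it by name instead of by position. The following is only a sketch: `formula_from_columns` is a hypothetical helper (not part of StatsModels or GLM), and it assumes a StatsModels version where `term` accepts strings.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical helper: regress `response` on every other column of `df`.
function formula_from_columns(df::AbstractDataFrame, response::AbstractString)
    # String.(...) keeps this working whether names(df) returns Strings or Symbols.
    predictors = filter(!=(response), String.(names(df)))
    isempty(predictors) && throw(ArgumentError("no predictor columns left"))
    return term(response) ~ sum(term.(predictors))
end

data = DataFrame(y=[22.1, 20.1, 7.1, 9.1, 1000, 200],
                 x1=[1.1, 2.1, 3.1, 4.1, 10, 100.2],
                 x2=[1, 2, 3, 4.0, 11.2, 100.1])

f = formula_from_columns(data, "y")
ols2 = lm(f, data)
```

Swapping the `filter` predicate for something like `startswith("x")` would cover other naming schemes in the same way.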