Using all independent variables with @formula in a multiple linear model

Hi, I’m trying to learn how to do linear models and wanted to know if there something like in R, where you can use a dot (“.”) instead of writing all the indepent variables in a linear regression model, for example in R:

LinearRegressor = lm(Y ~ ., data = dataset)

in Julia, I’m using GLM like this

fm = @formula(Y ~ X1 + X2 + … + Xn)
LinearRegressor = lm(fm, df)

and I had no problems until the moment, but now I need to do a LM with 28 independent variables and wanted to know if theres is a simple way to do it.

Thanks.

1 Like

The dot notation does not exist at the moment in StatsModels, so the best bet is to programmatically build up the formula using the Term constructor directly:

lm(Term(:Y) ~ sum(Term.(Symbol.(names(df[:, Not(:Y)])))), df)
8 Likes

Edit2: Ok, is working now but I need to use your solution in the in the lm() directly, instead of being able to create a @formula variable. would be nice to be able to do a @formula variable, specially for cross validation and other that need to work with that. But for the time being, thank you again.

Thanks, it was give me an error:

LoadError: ArgumentError: non-call expression encountered: Term.(Symbol.(names(df[:, Not(:Y])))

but I solve it creating a variable with the sum of the independent variables like this

IndependVars = sum(Term.(Symbol.(names(df[:, Not(:Y)]))))

and using it as

fm = @formula(Y ~ IndependVars)

edit: my solution is not working, the lm thinks that IndependVars is a column in my df…

You are on the right track, but you don’t need the @formula call when constructing names programmatically.

julia> df = DataFrame(y = rand(100), x1 = rand(100), x2 = rand(100));

julia> lm(term(:y) ~ sum(term.(Symbol.(names(df, Not(:y))))), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x1 + x2

Coefficients:
───────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────────
(Intercept)   0.64731     0.0719799   8.99292     <1e-13   0.504449   0.79017
x1            0.0392958   0.0932668   0.421327    0.6744  -0.145813   0.224405
x2           -0.259494    0.100676   -2.57752     0.0115  -0.459308  -0.0596804
───────────────────────────────────────────────────────────────────────────────

This is definitely not easy syntax. But overall the system is a bit more flexible than R, where using the rest of the names is easy but using a specific list of names for independent variables is very hard.

2 Likes

To slightly expand on Peter’s answer, what you tried originally was essentially constructing the Terms twice, as the @formula macro converts its arguments to Terms already. It being a macro, it ultimately has to generate some code that you could have written yourself by hand. To see this:

julia> using MacroTools

julia> @macroexpand @formula(Y ~ X1 + X2)
:(StatsModels.Term(:Y) ~ StatsModels.Term(:X1) + StatsModels.Term(:X2))

julia> @formula(Y ~ X1 + X2) == (Term(:Y) ~ Term(:X1) + Term(:X2))
true
3 Likes

As I still get likes on my old answer above but can’t edit it, I’ll just add here that I’ve since added a term method for strings, so one can do term.(names(df)) instead of Term.(Symbol.(names(df))).

1 Like

Hi
Using term() is indeed quite useful to programmatically run regressions.

covariates =["x1", "x2"]
lm(term(:y) ~ sum(term.(covariates)), df)

This works perfectly in any linearly separable specification!

However, I maybe missing something in using term() when I want to incorporate interactions and fixed effects. In particular, it would be great to specify a vector of covariates which has interactions and fixed effects like:

#DataFrame df contains columns names y, x1, x2, z1, and z2
covariates = ["x1*x2", "x3", "fe(z1)", "fe(z2)", "fe(z1)&fe(z2)"]  

# run the fixed effects estimation as:
using FixedEffectModels

reg(df, (term(:y) ~ sum(term.(covariates)) ))

Even without the fixed effects, including “x1*x2” or “x1&x2” in covariates produces error when used inside of term() because it thinks that :x1*x2 exists as a column in the dataframe.

Would appreciate any help on this! Thanks!

That’s because term does not implement the full formula parser implemented in the @formula macro, you’d have to implement that yourself. Unfortunately the output you’re targeting especially with package specific extensions of the base StatsModels syntax is a bit more complicated :sweat_smile:

julia> @macroexpand(@formula(y ~ x1*x2 + fe(x3)*fe(x4)))
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2) + 
StatsModels.capture_call(fe, ((x3,)->fe(x3)), (:x3,), $(Expr(:copyast, :($(QuoteNode(:(fe(x3))))))), 
[StatsModels.Term(:x3)]) + StatsModels.capture_call(fe, ((x4,)->fe(x4)), (:x4,), $(Expr(:copyast, :($(QuoteNode(:(fe(x4))))))), 
[StatsModels.Term(:x4)]) + StatsModels.Term(:x1) & StatsModels.Term(:x2) + 
StatsModels.capture_call(fe, ((x3,)->fe(x3)), (:x3,), $(Expr(:copyast, :($(QuoteNode(:(fe(x3))))))), [StatsModels.Term(:x3)]) & StatsModels.capture_call(fe, ((x4,)->fe(x4)), (:x4,), $(Expr(:copyast, :($(QuoteNode(:(fe(x4))))))), [StatsModels.Term(:x4)]))

If you look at the definition of term in the source, it doesn’t quite match that complexity…

term(s::Symbol) = Term(s)
term(s::AbstractString) = term(Symbol(s))

Maybe @dave.f.kleinschmidt has an idea, or views on whether term should be able to parse more complicated expressions?

2 Likes

views on whether term should be able to parse more complicated expressions?

Like @nilshg pointed out, term is (by design) extremely simple, and basically just to make it easy to do things like term.([1, :a, :b, :c]) (there’s another method that creats a ConstantTerm from non-string/non-symbol arguments). If you really, truly, actually need parsing, then you’re back in macro land and you should do something like (I’m very sorry) @eval @formula $(parse(my_formula_string)) but you probably do not actually want or need to do that (that should always be the first thing you think when you find yourself reaching for @eval :grimacing: )

I’m not sure about whether fe works on Terms or not but that’s generally the recommended path (we talked a bit about this in a JuliaCon talk from last year on RegressionFormulae.jl. That is, we now recommend that if you provide “special syntax” like fe(x) for use inside a @formula, you should also provide methods for fe(::Term) that allows you to do things like fe(term(:z1)).

So, if FixedEffects.jl works like this, then you would do something like

term(:y) ~ term(1) + term(:x1) + term(:x2) + term(:x1) & term(:x2) + term(:x3) + fe(term(:z1)) ... # etc.

I’ll add that while we’ve meant for quite a while now to add methods like Base.:(*)(a::Term, b::Term) = a + b + a&b, that work has gotten a bit bogged down. If/when this PR lands, IIRC it includes a method like that, so you would then be able to write

term(:y) ~ term(1) + term(:x1) * term(:x2) # ...
1 Like

Hello! Could you kindly give me advice for one problem:

I want to make models with @formula from vector of variables vars (vars is Vector{Symbol}, length of vars can be any):

models = [@eval @formula(log($i) ~ $(formulation) + $(period) + $(sequence)) for i in vars]

And I got (when try to work with GLM or MixedModels):

ERROR: MethodError: no method matching (::Main.MetidaBioeq.var"#42#43")(::Float64)
The applicable method may be too new: running in world age 55033, while current world is 55034.
Closest candidates are:
  (::Main.MetidaBioeq.var"#42#43")(::Any) at none:0 (method too new to be called from this world context.)
Stacktrace:

How to make automatic function term formula correctly?

Without log works fine:

models = [@eval @formula($i ~ $(formulation) + $(period) + $(sequence)) for i in vars]

but I don’t want to modyfy DataFrame and make additional variables like log_var ets.

Is it possible to solve it?

Yes, I think as step 1, allowing for * and & itself will be very helpful !

As step 2, even if estimations like Fixed effects which require special syntax for them, instead of allowing them within Term is something like the following possible where you allow term to take into account the literal special syntax that will go inside @formula

covariates = ["x1*x2", "x3"] # such that modified term can work through the interactions
#and then something like 
special_syntax = "fe(z1) + fe(z1)&fe(z2)"
reg(df, 
          (term(:y) ~ sum(term.(covariates)) , special_syntax )
)

Just to be clear: & works on Terms as of right now. It’s just * that doesn’t work directly on Terms yet! So while it’s slightly inconvenient you can still do something like

x1, x2, x3 = term.((:x1, :x2, :x3))
covariates = [x1, x2, x1 & x2, x3]
term(:y) ~ sum(covariates)

which is equivalent to

@formula(y ~ x1 * x2 + x3)

What is not supported and will almost certainly never be supported is parsing any kind of function calls from strings when creating terms at ‘run time’ (e.g., outside of a macro). So that rules out term("x1 * x2"). If you really, truly need to parse strings into expressions and splice those expressions into a formula, then the way to do it is

covariate_strings = ["x1 * x2", "x3", "fe(z1) + fe(z1) & fe(z2)"]
covariate_expr = Meta.parse(join(covariate_strings, " + "))
formula = @eval @formula y ~ $(covariate_expr)

But as I said above, this is discouraged and shouldn’t (in theory) be necessary.

1 Like

currently no, not without @eval. The main goal of FunctionTerm is dead, long live FunctionTerm by kleinschmidt · Pull Request #183 · JuliaStats/StatsModels.jl · GitHub is to make eval unnecessary in these cases; with that, you could do

julia> ff = FunctionTerm(log, [term(:a)], :(log(a)))
(a)->log(a)

julia> d = (; a=1:3)
(a = 1:3,)

julia> modelcols(ff, d)
3-element Vector{Float64}:
 0.0
 0.6931471805599453
 1.0986122886681098

This is equivalent to a FunctionTerm created inside a formula:

julia> f = @formula(0 ~ log(a)).rhs
(a)->log(a)

julia> f == ff
true

julia> modelcols(f, d)
3-element Vector{Float64}:
 0.0
 0.6931471805599453
 1.0986122886681098
1 Like

Thank you for reply!
Can you tell when this PR will be merged?

Well, on the one hand, it’s been open for…2 and a half years. On the other hand, I pushed more commits yesterday that addressed nearly all of @nalimilan’s feedback so…I’d estimate anywhere from today to another year :grin:

1 Like

I’m looking forward to it. :smile: Me (and maybe some other developers) strongly waiting for 0.7 with this PR because it’s blocking factor for some features in dev plans… :face_with_monocle:

Hi

I wanted to ask a very basic question on trying to create term objects but in a for loop.

julia> vars =["A", "B"]
2-element Vector{String}:
 "A"
 "B"

julia> [i = term.(Symbol("$i")) for i in vars]
2-element Vector{Term}:
 A(unknown)
 B(unknown)

julia> B
ERROR: UndefVarError: B not defined

So I am confused that the loop seems to work fine, but when I call B I get the error of it being undefined. However if I do the same but not in a loop it works perfect:

julia> B = term.(Symbol("B"))
B(unknown)

julia> B
B(unknown)

Maybe I am missing something very basic.

Yes, this is a misunderstanding about the way Julia works.

Things inside vectors don’t have names and they don’t create variables in the local environment. In the second line, A(unknown) and B(unknown) are not names variables. They are elements in the vector.

I see. Thanks! This solves it:

julia> covs = [i = term.(Symbol("$i")) for i in vars]
2-element Vector{Term}:
 A(unknown)
 B(unknown)

julia> covs
2-element Vector{Term}:
 A(unknown)
 B(unknown)