Using all independent variables with @formula in a multiple linear model

pauloat · July 26, 2020, 1:22am

Hi, I’m trying to learn how to do linear models and wanted to know if there something like in R, where you can use a dot (“.”) instead of writing all the indepent variables in a linear regression model, for example in R:

LinearRegressor = lm(Y ~ ., data = dataset)

in Julia, I’m using GLM like this

fm = @formula(Y ~ X1 + X2 + … + Xn)
LinearRegressor = lm(fm, df)

and I had no problems until the moment, but now I need to do a LM with 28 independent variables and wanted to know if theres is a simple way to do it.

Thanks.

nilshg · July 26, 2020, 1:32pm

The dot notation does not exist at the moment in StatsModels, so the best bet is to programmatically build up the formula using the Term constructor directly:

lm(Term(:Y) ~ sum(Term.(Symbol.(names(df[:, Not(:Y)])))), df)

pauloat · July 26, 2020, 9:19pm

Edit2: Ok, is working now but I need to use your solution in the in the lm() directly, instead of being able to create a @formula variable. would be nice to be able to do a @formula variable, specially for cross validation and other that need to work with that. But for the time being, thank you again.

Thanks, it was give me an error:

LoadError: ArgumentError: non-call expression encountered: Term.(Symbol.(names(df[:, Not(:Y])))

~~but I solve it creating a variable with the sum of the independent variables like this~~

~~IndependVars = sum(Term.(Symbol.(names(df[:, Not(:Y)]))))~~

~~and using it as~~

~~fm = @formula(Y ~ IndependVars)~~

edit: my solution is not working, the lm thinks that IndependVars is a column in my df…

pdeffebach · July 26, 2020, 9:43pm

You are on the right track, but you don’t need the @formula call when constructing names programmatically.

julia> df = DataFrame(y = rand(100), x1 = rand(100), x2 = rand(100));

julia> lm(term(:y) ~ sum(term.(Symbol.(names(df, Not(:y))))), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x1 + x2

Coefficients:
───────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────────
(Intercept)   0.64731     0.0719799   8.99292     <1e-13   0.504449   0.79017
x1            0.0392958   0.0932668   0.421327    0.6744  -0.145813   0.224405
x2           -0.259494    0.100676   -2.57752     0.0115  -0.459308  -0.0596804
───────────────────────────────────────────────────────────────────────────────

This is definitely not easy syntax. But overall the system is a bit more flexible than R, where using the rest of the names is easy but using a specific list of names for independent variables is very hard.

nilshg · July 27, 2020, 8:01am

To slightly expand on Peter’s answer, what you tried originally was essentially constructing the Terms twice, as the @formula macro converts its arguments to Terms already. It being a macro, it ultimately has to generate some code that you could have written yourself by hand. To see this:

julia> using MacroTools

julia> @macroexpand @formula(Y ~ X1 + X2)
:(StatsModels.Term(:Y) ~ StatsModels.Term(:X1) + StatsModels.Term(:X2))

julia> @formula(Y ~ X1 + X2) == (Term(:Y) ~ Term(:X1) + Term(:X2))
true

nilshg · December 12, 2022, 3:56pm

As I still get likes on my old answer above but can’t edit it, I’ll just add here that I’ve since added a term method for strings, so one can do term.(names(df)) instead of Term.(Symbol.(names(df))).

moshi · January 19, 2023, 2:13pm

Hi
Using term() is indeed quite useful to programmatically run regressions.

covariates =["x1", "x2"]
lm(term(:y) ~ sum(term.(covariates)), df)

This works perfectly in any linearly separable specification!

However, I maybe missing something in using term() when I want to incorporate interactions and fixed effects. In particular, it would be great to specify a vector of covariates which has interactions and fixed effects like:

#DataFrame df contains columns names y, x1, x2, z1, and z2
covariates = ["x1*x2", "x3", "fe(z1)", "fe(z2)", "fe(z1)&fe(z2)"]  

# run the fixed effects estimation as:
using FixedEffectModels

reg(df, (term(:y) ~ sum(term.(covariates)) ))

Even without the fixed effects, including “x1*x2” or “x1&x2” in covariates produces error when used inside of term() because it thinks that :x1*x2 exists as a column in the dataframe.

Would appreciate any help on this! Thanks!

nilshg · January 19, 2023, 2:28pm

That’s because term does not implement the full formula parser implemented in the @formula macro, you’d have to implement that yourself. Unfortunately the output you’re targeting especially with package specific extensions of the base StatsModels syntax is a bit more complicated

julia> @macroexpand(@formula(y ~ x1*x2 + fe(x3)*fe(x4)))
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2) + 
StatsModels.capture_call(fe, ((x3,)->fe(x3)), (:x3,), $(Expr(:copyast, :($(QuoteNode(:(fe(x3))))))), 
[StatsModels.Term(:x3)]) + StatsModels.capture_call(fe, ((x4,)->fe(x4)), (:x4,), $(Expr(:copyast, :($(QuoteNode(:(fe(x4))))))), 
[StatsModels.Term(:x4)]) + StatsModels.Term(:x1) & StatsModels.Term(:x2) + 
StatsModels.capture_call(fe, ((x3,)->fe(x3)), (:x3,), $(Expr(:copyast, :($(QuoteNode(:(fe(x3))))))), [StatsModels.Term(:x3)]) & StatsModels.capture_call(fe, ((x4,)->fe(x4)), (:x4,), $(Expr(:copyast, :($(QuoteNode(:(fe(x4))))))), [StatsModels.Term(:x4)]))

If you look at the definition of term in the source, it doesn’t quite match that complexity…

term(s::Symbol) = Term(s)
term(s::AbstractString) = term(Symbol(s))

Maybe @dave.f.kleinschmidt has an idea, or views on whether term should be able to parse more complicated expressions?

dave.f.kleinschmidt · January 21, 2023, 1:45am

views on whether term should be able to parse more complicated expressions?

Like @nilshg pointed out, term is (by design) extremely simple, and basically just to make it easy to do things like term.([1, :a, :b, :c]) (there’s another method that creats a ConstantTerm from non-string/non-symbol arguments). If you really, truly, actually need parsing, then you’re back in macro land and you should do something like (I’m very sorry) @eval @formula $(parse(my_formula_string)) but you probably do not actually want or need to do that (that should always be the first thing you think when you find yourself reaching for @eval )

I’m not sure about whether fe works on Terms or not but that’s generally the recommended path (we talked a bit about this in a JuliaCon talk from last year on RegressionFormulae.jl. That is, we now recommend that if you provide “special syntax” like fe(x) for use inside a @formula, you should also provide methods for fe(::Term) that allows you to do things like fe(term(:z1)).

So, if FixedEffects.jl works like this, then you would do something like

term(:y) ~ term(1) + term(:x1) + term(:x2) + term(:x1) & term(:x2) + term(:x3) + fe(term(:z1)) ... # etc.

I’ll add that while we’ve meant for quite a while now to add methods like Base.:(*)(a::Term, b::Term) = a + b + a&b, that work has gotten a bit bogged down. If/when this PR lands, IIRC it includes a method like that, so you would then be able to write

term(:y) ~ term(1) + term(:x1) * term(:x2) # ...

PharmCat · January 21, 2023, 9:19am

Hello! Could you kindly give me advice for one problem:

I want to make models with @formula from vector of variables vars (vars is Vector{Symbol}, length of vars can be any):

models = [@eval @formula(log($i) ~ $(formulation) + $(period) + $(sequence)) for i in vars]

And I got (when try to work with GLM or MixedModels):

ERROR: MethodError: no method matching (::Main.MetidaBioeq.var"#42#43")(::Float64)
The applicable method may be too new: running in world age 55033, while current world is 55034.
Closest candidates are:
  (::Main.MetidaBioeq.var"#42#43")(::Any) at none:0 (method too new to be called from this world context.)
Stacktrace:

How to make automatic function term formula correctly?

Without log works fine:

models = [@eval @formula($i ~ $(formulation) + $(period) + $(sequence)) for i in vars]

but I don’t want to modyfy DataFrame and make additional variables like log_var ets.

Is it possible to solve it?

moshi · January 22, 2023, 4:06pm

Yes, I think as step 1, allowing for * and & itself will be very helpful !

As step 2, even if estimations like Fixed effects which require special syntax for them, instead of allowing them within Term is something like the following possible where you allow term to take into account the literal special syntax that will go inside @formula

covariates = ["x1*x2", "x3"] # such that modified term can work through the interactions
#and then something like 
special_syntax = "fe(z1) + fe(z1)&fe(z2)"
reg(df, 
          (term(:y) ~ sum(term.(covariates)) , special_syntax )
)

dave.f.kleinschmidt · January 22, 2023, 6:35pm

Just to be clear: & works on Terms as of right now. It’s just * that doesn’t work directly on Terms yet! So while it’s slightly inconvenient you can still do something like

x1, x2, x3 = term.((:x1, :x2, :x3))
covariates = [x1, x2, x1 & x2, x3]
term(:y) ~ sum(covariates)

which is equivalent to

@formula(y ~ x1 * x2 + x3)

What is not supported and will almost certainly never be supported is parsing any kind of function calls from strings when creating terms at ‘run time’ (e.g., outside of a macro). So that rules out term("x1 * x2"). If you really, truly need to parse strings into expressions and splice those expressions into a formula, then the way to do it is

covariate_strings = ["x1 * x2", "x3", "fe(z1) + fe(z1) & fe(z2)"]
covariate_expr = Meta.parse(join(covariate_strings, " + "))
formula = @eval @formula y ~ $(covariate_expr)

But as I said above, this is discouraged and shouldn’t (in theory) be necessary.

dave.f.kleinschmidt · January 22, 2023, 6:55pm

currently no, not without @eval. The main goal of FunctionTerm is dead, long live FunctionTerm by kleinschmidt · Pull Request #183 · JuliaStats/StatsModels.jl · GitHub is to make eval unnecessary in these cases; with that, you could do

julia> ff = FunctionTerm(log, [term(:a)], :(log(a)))
(a)->log(a)

julia> d = (; a=1:3)
(a = 1:3,)

julia> modelcols(ff, d)
3-element Vector{Float64}:
 0.0
 0.6931471805599453
 1.0986122886681098

This is equivalent to a FunctionTerm created inside a formula:

julia> f = @formula(0 ~ log(a)).rhs
(a)->log(a)

julia> f == ff
true

julia> modelcols(f, d)
3-element Vector{Float64}:
 0.0
 0.6931471805599453
 1.0986122886681098

PharmCat · January 23, 2023, 11:58am

Thank you for reply!
Can you tell when this PR will be merged?

dave.f.kleinschmidt · January 23, 2023, 3:01pm

Well, on the one hand, it’s been open for…2 and a half years. On the other hand, I pushed more commits yesterday that addressed nearly all of @nalimilan’s feedback so…I’d estimate anywhere from today to another year

PharmCat · January 23, 2023, 3:19pm

I’m looking forward to it. Me (and maybe some other developers) strongly waiting for 0.7 with this PR because it’s blocking factor for some features in dev plans…

moshi · January 28, 2023, 9:49pm

Hi

I wanted to ask a very basic question on trying to create term objects but in a for loop.

julia> vars =["A", "B"]
2-element Vector{String}:
 "A"
 "B"

julia> [i = term.(Symbol("$i")) for i in vars]
2-element Vector{Term}:
 A(unknown)
 B(unknown)

julia> B
ERROR: UndefVarError: B not defined

So I am confused that the loop seems to work fine, but when I call B I get the error of it being undefined. However if I do the same but not in a loop it works perfect:

julia> B = term.(Symbol("B"))
B(unknown)

julia> B
B(unknown)

Maybe I am missing something very basic.

pdeffebach · January 29, 2023, 10:58am

Yes, this is a misunderstanding about the way Julia works.

Things inside vectors don’t have names and they don’t create variables in the local environment. In the second line, A(unknown) and B(unknown) are not names variables. They are elements in the vector.

moshi · January 29, 2023, 4:56pm

I see. Thanks! This solves it:

julia> covs = [i = term.(Symbol("$i")) for i in vars]
2-element Vector{Term}:
 A(unknown)
 B(unknown)

julia> covs
2-element Vector{Term}:
 A(unknown)
 B(unknown)

Topic		Replies	Views
How to use all variable in @formula like R language? New to Julia	4	658	September 30, 2022
Using GLM programmatically General Usage question , metaprogramming , glm	8	948	October 8, 2024
How to get a GLM where formula is programmatically generated New to Julia dataframes , glm	3	1309	September 1, 2020
How to use lm inside a for loop for different variables Statistics	2	642	June 8, 2022
How to define a formula with a vector of string names for GLM.jl? General Usage	3	1255	August 14, 2020

Using all independent variables with @formula in a multiple linear model

Related topics