I have a dataframe with 5 covariates X1, X2, X3, X4, X5 and a dependent variable Y. I have to perform a univariate regression of Y on each covariate. I don’t want to do
using GLM
lm(@formula(Y~X1), data)
lm(@formula(Y~X2), data)
lm(@formula(Y~X3), data)
lm(@formula(Y~X4), data)
lm(@formula(Y~X5), data)
because it’s counterproductive, especially if I have a large number of covariates. I’m trying to build a loop, but I don’t know how to refer to each covariate from the dataframe with @formula. I tried
for i=1:5
lm(@formula(Y~data[:,i]), data)
end
but it does not work. Any hint?
See the StatsModels docs on constructing a formula programmatically.
In particular, you don’t want to use the @formula macro here.
# `data` is the dataframe from the question; select the columns whose names match X
for covariate in names(data, r"X")
    lm(term(:Y) ~ term(covariate), data)
end
Pinging @dave.f.kleinschmidt in case there are any further developments that would make this process easier.
Where does the covariate appear in the loop? And there is no name for term(name)… Is it term(covariate) instead? Finally, is it possible in that loop to keep the iteration number so that I can save the results of each iteration?
Apologies, I didn’t test the code enough. I edited the example and now it should work.
You can use enumerate so that the iterator returns a tuple of the counter and the item.
You can also do term(Symbol(:X, i)) inside the loop to construct the variable name programmatically.
One thing you can’t do is use term(1) to construct the covariate. That does something else (not really sure what).
Yes, I just edited my question too, because I figured out that it must be term(covariate). And for the iteration, thank you! It worked. I did
for (index, covariate) in enumerate(names(data, r"X"))
...
end
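For anyone who wants the whole loop in one place, here is a minimal sketch of keeping the result of each iteration. The toy data, the `models` vector, and the `slopes` vector are made up for illustration and are not from the thread. I wrap the column name in `Symbol(...)` because `names` returns strings in recent DataFrames versions.

```julia
using DataFrames, GLM, StatsModels, Random

# Hypothetical toy data: 5 covariates X1..X5 and a response Y
Random.seed!(1)
data = DataFrame(randn(100, 5), [:X1, :X2, :X3, :X4, :X5])
data.Y = 2 .* data.X1 .+ randn(100)

covariates = names(data, r"^X")                  # column names starting with X
models = Vector{Any}(undef, length(covariates))  # one slot per fitted model

for (index, covariate) in enumerate(covariates)
    # Build the formula programmatically instead of using @formula
    f = term(:Y) ~ term(Symbol(covariate))
    models[index] = lm(f, data)
end

# Slope estimate for each covariate (second coefficient, after the intercept)
slopes = [coef(m)[2] for m in models]
```

Afterwards `models[3]` is the fit of Y on the third matching column, and `coeftable(models[3])` shows the usual summary.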
That’s about what I’d recommend! The term(1) constructs the same thing that 1 in a formula does (e.g., @formula(y ~ 1 + x) is the same as term(:y) ~ term(1) + term(:x)). Same with other numbers (0 is the only one that has special meaning).
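A quick way to see this for yourself is sketched below. The variable names `f_macro` and `f_terms` are just for illustration, and I print the two formulas rather than assert that they compare equal.

```julia
using StatsModels

# term(1) builds a constant term, not a reference to the first column
println(term(1) isa StatsModels.ConstantTerm)

# The same formula written with the macro and with term(...)
f_macro = @formula(y ~ 1 + x)
f_terms = term(:y) ~ term(1) + term(:x)

println(f_macro)
println(f_terms)
```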
If you get into a situation where it’s going to be expensive to keep recomputing the schema for the data, you can precompute it and provide a concrete formula to lm, like:
sch = schema(data)  # compute the schema once, outside the loop
for covariate in names(data, r"X")
    f = apply_schema(term(:Y) ~ term(covariate), sch, LinearModel)
    lm(f, data)
end
(You need the third argument of apply_schema to be LinearModel to get the “implicit intercept” behavior that you probably expect.)
But that shouldn’t make a huge difference unless your data is really big, because there’s minimal overlap between the schema that’s required for each model (just Y). But if you were re-fitting models that have more overlapping terms and the dataset is very large, then you might benefit from precomputing the schema.