Univariate linear regression with each covariate of dataframe

I have a dataframe with 5 covariates X1, X2, X3, X4, X5 and a dependent variable Y. I have to perform a univariate regression of Y on each covariate. I don’t want to do

using GLM
lm(@formula(Y~X1), data)
lm(@formula(Y~X2), data)
lm(@formula(Y~X3), data)
lm(@formula(Y~X4), data)
lm(@formula(Y~X5), data)

because it doesn’t scale, especially if I have a large number of covariates. I’m trying to build a loop, but I don’t know how to refer to each covariate of the dataframe inside @formula. I tried

for i = 1:5
     lm(@formula(Y ~ data[:, i]), data)
end

but it does not work. Any hint?

See the StatsModels docs on constructing a formula programmatically.

In particular, you don’t want to use the @formula macro here.

for covariate in names(data[!, r"X"])
     lm(term(:Y) ~ term(covariate), data)
end
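If you also want to keep the fitted models around, a comprehension does the same thing. A minimal sketch, assuming a dataframe called `data` with columns `Y` and `X1`–`X5` (the synthetic data here is just a stand-in):

```julia
using DataFrames, GLM, StatsModels

# Synthetic stand-in for the real dataframe (assumption: columns Y, X1..X5)
data = DataFrame(randn(100, 5), [:X1, :X2, :X3, :X4, :X5])
data.Y = randn(100)

# Fit one univariate model per X column and keep them all
models = [lm(term(:Y) ~ term(c), data) for c in names(data[!, r"X"])]
```

Each element of `models` is a fitted `lm` with an intercept and one slope, so you can call `coef`, `r2`, etc. on them afterwards.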

Pinging @dave.f.kleinschmidt in case there are any further developments that would make this process easier.


Where does the covariate appear in the loop? And there is no `name` defined for term(name)… should it be term(covariate) instead? Finally, is it possible in that loop to keep the iteration number, so that I can save the results of each iteration?

apologies, I didn’t test the code enough. I edited the example and now it should work.

You can use enumerate so the iterator returns a tuple of the counter and the item.

You can also use term(Symbol(:X, i)) inside the loop to construct the variable name programmatically.
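Putting those two together, a sketch of saving each fit’s coefficients by iteration number. The dataframe here is synthetic stand-in data, and `results` is a name I made up for the output container:

```julia
using DataFrames, GLM, StatsModels

# Synthetic stand-in for the real dataframe (assumption: columns Y, X1..X5)
data = DataFrame(randn(100, 5), [:X1, :X2, :X3, :X4, :X5])
data.Y = randn(100)

results = Vector{Vector{Float64}}(undef, 5)
for (i, covariate) in enumerate(names(data[!, r"X"]))
    # term(covariate) and term(Symbol(:X, i)) build the same term here
    model = lm(term(:Y) ~ term(covariate), data)
    results[i] = coef(model)  # save intercept and slope for iteration i
end
```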

One thing you can’t do is use term(1) to construct the covariate. That does something else (not really sure what)


Yes, I just edited my question too because I figured out that it must be term(covariate). And for the iteration, thank you! It worked. I did

for (index, covariate) in enumerate(names(data[!, r"X"]))

That’s about what I’d recommend! term(1) constructs the same thing that 1 in a formula does (e.g., @formula(y ~ 1 + x) is the same as term(:y) ~ term(1) + term(:x)). Same with other numbers (0 is the only one that has special meaning).
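For example, a small sketch of the equivalence (both formulas describe the same model: response y, an intercept, and predictor x):

```julia
using StatsModels

f1 = @formula(y ~ 1 + x)
f2 = term(:y) ~ term(1) + term(:x)  # term(1) is the explicit intercept term
# f1 and f2 print identically as "y ~ 1 + x"
```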

If you get into a situation where it’s going to be expensive to keep computing the schema for the data you can precompute it and provide a concrete formula to lm, like:

sch = schema(data)
for covariate in names(data[!, r"X"])
    f = apply_schema(term(:Y) ~ term(covariate), sch, LinearModel)
    lm(f, data)
end

(You need the third argument of apply_schema to be LinearModel to get the “implicit intercept” behavior that you probably expect)

But that shouldn’t make a huge difference unless your data is really big, because there’s minimal overlap between the schemas required for each model (just Y). If you were re-fitting models with more overlapping terms on a very large dataset, then you might benefit from precomputing the schema.