 # Univariate linear regression with each covariate of dataframe

I have a dataframe with 5 covariates `X1, X2, X3, X4, X5` and a dependent variable `Y`. I have to perform a univariate regression of `Y` on each covariate. I don’t want to do

``````
using GLM
lm(@formula(Y ~ X1), data)
lm(@formula(Y ~ X2), data)
lm(@formula(Y ~ X3), data)
lm(@formula(Y ~ X4), data)
lm(@formula(Y ~ X5), data)
``````

because it’s counterproductive, especially if I have a large number of covariates. I’m trying to build a loop, but I don’t know how to refer to each covariate of the dataframe inside `@formula`. I tried

``````
for i = 1:5
    lm(@formula(Y ~ data[:, i]), data)
end
``````

but it does not work. Any hint?

See the section of the StatsModels docs on constructing a formula programmatically.

In particular, you don’t want to use the `@formula` macro here.

``````
for covariate in names(data[!, r"X"])
    lm(term(:Y) ~ term(covariate), data)
end
``````

Pinging @dave.f.kleinschmidt in case there are any further developments that would make this process easier.


Where does the `covariate` appear in the loop? And there is no `name` for `term(name)`… Is it `term(covariate)` instead? Finally, is it possible in that loop to keep the iteration number so that I can save the results of each iteration?

Apologies, I didn’t test the code enough. I edited the example and now it should work.

You can use `enumerate` so that the iterator yields a tuple of the counter and the item.

You can also do `term(Symbol(:X, i))` inside the loop to construct the variable name programmatically.

One thing you can’t do is use `term(1)` to construct the covariate. That does something else (not really sure what).


Yes, I just edited my question too, because I figured out that it must be `term(covariate)`. And for the iteration, thank you! It worked. I did

``````
for (index, covariate) in enumerate(names(data[!, r"X"]))
    ...
end
``````
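For anyone landing here later, this is a minimal sketch of what saving the result of each iteration could look like. The toy `data`, the `models` vector and the `slopes` table are made up for illustration; they are not from the original question.

```julia
using DataFrames, GLM, StatsModels, Random

# Hypothetical toy data standing in for the poster's `data`.
Random.seed!(1)
n = 100
data = DataFrame(X1 = randn(n), X2 = randn(n), X3 = randn(n))
data.Y = 1 .+ 2 .* data.X1 .+ randn(n)

covariates = names(data, r"^X")   # column names starting with "X"
models = Vector{Any}(undef, length(covariates))

for (index, covariate) in enumerate(covariates)
    # Symbol(...) keeps this working whether `names` returns strings or symbols.
    models[index] = lm(term(:Y) ~ term(Symbol(covariate)), data)
end

# Slope of each univariate fit; coef(m)[1] is the implicit intercept.
slopes = DataFrame(covariate = covariates,
                   slope = [coef(m)[2] for m in models])
```

Indexing `models` by `index` keeps the fits in the same order as the covariates, so `models[2]` is the regression of `Y` on the second matching column.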

That’s about what I’d recommend! The `term(1)` constructs the same thing that `1` in a formula does (e.g., `@formula(y ~ 1 + x)` is the same as `term(:y) ~ term(1) + term(:x)`). Same with other numbers (`0` is the only one that has special meaning).
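A quick way to check this at the REPL, as a small sketch (the names `f_macro` and `f_terms` are just for illustration):

```julia
using StatsModels

# The same formula written with the macro and built from terms.
f_macro = @formula(y ~ 1 + x)
f_terms = term(:y) ~ term(1) + term(:x)

# Compare the two formulas and look at what `term` builds for each argument.
println(f_macro == f_terms)
println(typeof(term(1)))    # a number becomes a constant term
println(typeof(term(:x)))   # a symbol becomes a variable term
```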

If you get into a situation where it’s going to be expensive to keep recomputing the schema for the data, you can precompute it and provide a concrete formula to `lm`, like:

``````
sch = schema(data)
for covariate in names(data[!, r"X"])
    f = apply_schema(term(:Y) ~ term(covariate), sch, LinearModel)
    lm(f, data)
end
``````

(You need the third argument of `apply_schema` to be `LinearModel` to get the “implicit intercept” behavior that you probably expect)
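To see the effect of that third argument, here is a rough sketch on a tiny made-up column table (the values and the names `f_plain`, `f_lm` are hypothetical). It compares the number of design-matrix columns with and without the `LinearModel` context; the `LinearModel` version should have one extra column for the intercept.

```julia
using GLM, StatsModels

# Hypothetical toy column table (a NamedTuple of vectors) standing in for `data`.
data = (Y = [1.0, 2.1, 2.9, 4.2], X1 = [0.0, 1.0, 2.0, 3.0])

sch = schema(data)
f = term(:Y) ~ term(:X1)

f_plain = apply_schema(f, sch)               # no model context
f_lm    = apply_schema(f, sch, LinearModel)  # LinearModel context

# Number of columns in the design matrix each right-hand side produces.
ncols_plain = size(modelcols(f_plain.rhs, data), 2)
ncols_lm    = size(modelcols(f_lm.rhs, data), 2)
println((ncols_plain, ncols_lm))
```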

But that shouldn’t make a huge difference unless your data is really big, because there’s minimal overlap between the schema that’s required for each model (just `Y`). But if you were re-fitting models that have more overlapping terms and the dataset is very large then you might benefit from precomputing the schema.
