Univariate linear regression with each covariate of dataframe

I have a dataframe with 5 covariates X1, X2, X3, X4, X5 and a dependent variable Y. I have to perform a univariate regression of Y on each covariate. I don’t want to do

using GLM
lm(@formula(Y~X1), data)
lm(@formula(Y~X2), data)
lm(@formula(Y~X3), data)
lm(@formula(Y~X4), data)
lm(@formula(Y~X5), data)

because it doesn’t scale, especially if I have a large number of covariates. I’m trying to build a loop, but I don’t know how to refer to each covariate of the dataframe inside @formula. I tried

for i = 1:5
     lm(@formula(Y ~ data[:, i]), data)
end

but it does not work. Any hint?

See the StatsModels docs on constructing a formula programmatically.

In particular, you don’t want to use the @formula macro here.

for covariate in names(data[!, r"X"])
     lm(term(:Y) ~ term(covariate), data)
end
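If you also want to keep the fitted models around, a comprehension does the same thing. A minimal sketch, assuming a dataframe called `data` with columns `Y` and `X1`–`X5` (the synthetic data here is just a stand-in):

```julia
using DataFrames, GLM, StatsModels

# Synthetic stand-in for the real dataframe (assumption: columns Y, X1..X5)
data = DataFrame(randn(100, 5), [:X1, :X2, :X3, :X4, :X5])
data.Y = randn(100)

# Fit one univariate model per X column and keep them all
models = [lm(term(:Y) ~ term(c), data) for c in names(data[!, r"X"])]
```

Each element of `models` is a fitted `lm` with an intercept and one slope, so you can call `coef`, `r2`, etc. on them afterwards.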

Pinging @dave.f.kleinschmidt in case there are any further developments that would make this process easier.


Where does the covariate appear in the loop? And there is no `name` defined for term(name)… should it be term(covariate) instead? Finally, is it possible in that loop to keep the iteration number, so that I can save the results of each iteration?

apologies, I didn’t test the code enough. I edited the example and now it should work.

You can use enumerate so the iterator returns a tuple of the counter and the item.

You can also use term(Symbol(:X, i)) inside the loop to construct the variable name programmatically.
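Putting those two together, a sketch of saving each fit’s coefficients by iteration number. The dataframe here is synthetic stand-in data, and `results` is a name I made up for the output container:

```julia
using DataFrames, GLM, StatsModels

# Synthetic stand-in for the real dataframe (assumption: columns Y, X1..X5)
data = DataFrame(randn(100, 5), [:X1, :X2, :X3, :X4, :X5])
data.Y = randn(100)

results = Vector{Vector{Float64}}(undef, 5)
for (i, covariate) in enumerate(names(data[!, r"X"]))
    # term(covariate) and term(Symbol(:X, i)) build the same term here
    model = lm(term(:Y) ~ term(covariate), data)
    results[i] = coef(model)  # save intercept and slope for iteration i
end
```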

One thing you can’t do is use term(1) to construct the covariate. That does something else (not really sure what)


Yes, I just edited my question too because I figured out that it must be term(covariate). And for the iteration, thank you! It worked. I did

for (index, covariate) in enumerate(names(data[!, r"X"]))

That’s about what I’d recommend! term(1) constructs the same thing that 1 in a formula does (e.g., @formula(y ~ 1 + x) is the same as term(:y) ~ term(1) + term(:x)). Same with other numbers (0 is the only one that has special meaning).
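For example, a small sketch of the equivalence (both formulas describe the same model: response y, an intercept, and predictor x):

```julia
using StatsModels

f1 = @formula(y ~ 1 + x)
f2 = term(:y) ~ term(1) + term(:x)  # term(1) is the explicit intercept term
# f1 and f2 print identically as "y ~ 1 + x"
```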

If you get into a situation where it’s going to be expensive to keep computing the schema for the data you can precompute it and provide a concrete formula to lm, like:

sch = schema(data)
for covariate in names(data[!, r"X"])
    f = apply_schema(term(:Y) ~ term(covariate), sch, LinearModel)
    lm(f, data)
end

(You need the third argument of apply_schema to be LinearModel to get the “implicit intercept” behavior that you probably expect)

But that shouldn’t make a huge difference unless your data is really big, because there’s minimal overlap between the schemas required for each model (just Y). If you were re-fitting models with more overlapping terms on a very large dataset, then you might benefit from precomputing the schema.