How do I do a regression using programatically defined column names?

I have a dataframe with many columns. For example,

df = DataFrame(A = randn(10), B = randn(10), C = randn(10), D = randn(10))

nms = names(df)

I want to do a regression of column A on one or more of the other columns.

Suppose I want to use the second and third columns as the regressors. I could do

ols = lm(@formula(A ~ B + C), df)

This works fine.

But I do not know which columns I want to use as regressors in advance. They are determined as the program runs. I tried several things but none of them worked. For example, I tried

ols = lm(@formula(A ~ nms[2] + nms[3]), df)

ols = lm(@formula(A ~ $(nms[2]) + $(nms[3]), df))

and other variations using strings and eval. Nothing worked.

Any suggestions on how to do this?

a = :A; b = :B
f = term(a) ~ term(b)

See here

3 Likes

Explicitly:

ols = lm(term(:A) ~ term(nms[2]) + term(nms[3]), df)
1 Like

I don’t understand anything about the merits of the matter, but I have an idea that it might be useful for you to adopt such a scheme for your calculations.

dfmulticols=DataFrame(rand(100,100),:auto)

indvar=filter(n->endswith(n,'0'),names(dfmulticols))

ols = lm(term(:x1) ~ sum(term.(indvar)), dfmulticols)

julia> ols = lm(term(:x1) ~ sum(term.(indvar)), dfmulticols)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

x1 ~ 1 + x10 + x20 + x30 + x40 + x50 + x60 + x70 + x80 + x90 + x100

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.480743     0.184909   2.60    0.0109   0.113334    0.848153
x10           0.0122329    0.11747    0.10    0.9173  -0.221178    0.245644
x20          -0.188598     0.114481  -1.65    0.1030  -0.41607     0.038873
x30          -0.123767     0.12706   -0.97    0.3327  -0.376233    0.128698
x40          -0.0337328    0.107398  -0.31    0.7542  -0.24713     0.179664
x50           0.0502589    0.109816   0.46    0.6483  -0.167944    0.268462
x60          -0.0251403    0.106903  -0.24    0.8146  -0.237554    0.187273
x70           0.0458067    0.111152   0.41    0.6813  -0.175051    0.266664
x80           0.112373     0.118723   0.95    0.3464  -0.123527    0.348273
x90           0.033742     0.116282   0.29    0.7724  -0.197308    0.264792
x100          0.12444      0.106621   1.17    0.2463  -0.0874146   0.336294
───────────────────────────────────────────────────────────────────────────
1 Like

I got three good answers. Had to choose one. I did something slightly different.

ts = term.([1; [eval(indep_var_cols[i]) for i in n_indep_vars]])
frm = term(:LogReturns) ~ foldl(+, ts) + term(factor_col)
ols = lm(frm, df1)