I have a dataframe with many columns. For example,
df = DataFrame(A = randn(10), B = randn(10), C = randn(10), D = randn(10))
nms = names(df)
I want to do a regression of column A on one or more of the other columns.
Suppose I want to use the second and third columns as the regressors. I could do
ols = lm(@formula(A ~ B + C), df)
This works fine.
But I do not know which columns I want to use as regressors in advance. They are determined as the program runs. I tried several things but none of them worked. For example, I tried
ols = lm(@formula(A ~ nms[2] + nms[3]), df)
ols = lm(@formula(A ~ $(nms[2]) + $(nms[3]), df))
and other variations using strings and eval. Nothing worked.
Any suggestions on how to do this?
a = :A; b = :B
f = term(a) ~ term(b)
See here
3 Likes
Explicitly:
ols = lm(term(:A) ~ term(nms[2]) + term(nms[3]), df)
1 Like
I donβt understand anything about the merits of the matter, but I have an idea that it might be useful for you to adopt such a scheme for your calculations.
dfmulticols=DataFrame(rand(100,100),:auto)
indvar=filter(n->endswith(n,'0'),names(dfmulticols))
ols = lm(term(:x1) ~ sum(term.(indvar)), dfmulticols)
julia> ols = lm(term(:x1) ~ sum(term.(indvar)), dfmulticols)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
x1 ~ 1 + x10 + x20 + x30 + x40 + x50 + x60 + x70 + x80 + x90 + x100
Coefficients:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.480743 0.184909 2.60 0.0109 0.113334 0.848153
x10 0.0122329 0.11747 0.10 0.9173 -0.221178 0.245644
x20 -0.188598 0.114481 -1.65 0.1030 -0.41607 0.038873
x30 -0.123767 0.12706 -0.97 0.3327 -0.376233 0.128698
x40 -0.0337328 0.107398 -0.31 0.7542 -0.24713 0.179664
x50 0.0502589 0.109816 0.46 0.6483 -0.167944 0.268462
x60 -0.0251403 0.106903 -0.24 0.8146 -0.237554 0.187273
x70 0.0458067 0.111152 0.41 0.6813 -0.175051 0.266664
x80 0.112373 0.118723 0.95 0.3464 -0.123527 0.348273
x90 0.033742 0.116282 0.29 0.7724 -0.197308 0.264792
x100 0.12444 0.106621 1.17 0.2463 -0.0874146 0.336294
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 Like
I got three good answers. Had to choose one. I did something slightly different.
ts = term.([1; [eval(indep_var_cols[i]) for i in n_indep_vars]])
frm = term(:LogReturns) ~ foldl(+, ts) + term(factor_col)
ols = lm(frm, df1)