The complicating factor is that the stat.id* columns have a random ID suffix, which is different every time the program runs and therefore must be determined at runtime. To make matters worse, the column names contain . characters.
Given this limitation, how would I do a linear regression using the GLM.jl package? The canonical example below does not work with dynamic field names like I have.
I am not sure what your response variable is named, but if you just want to include every stat.id* variable as an explanatory variable, you can build the formula programmatically:
using DataFrames, GLM
# Some Values
data = DataFrame(rand(100, 3))
# This is what you have
names!(data, Symbol.(["a.3", "a.4", "y"]))
# Fix names
names!(data, Symbol.(replace.(string.(names(data)), Ref("." => "_"))))
# Find all explanatory variables
# Find all explanatory variables
model = filter(x -> occursin(r"^a_", x), string.(names(data))) |>
    # Have them all as main effects
    (x -> reduce((x, y) -> string(x, " + ", y), x)) |>
    # Add response (here named y)
    (rhs -> string("y ~ ", rhs)) |>
    # Create the Expr
    Meta.parse |>
    # Use it to fit the model
    (fm -> fit(LinearModel,
               @eval(@formula($fm)),
               data))
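On StatsModels 0.6 and later, the string / Meta.parse / @eval round-trip can be avoided entirely by composing terms directly with term(). A sketch, assuming a recent DataFrames where the matrix constructor takes a name vector and the columns have already been renamed as above:

```julia
using DataFrames, GLM, StatsModels

# Same setup as above, with the dots already replaced by underscores
data = DataFrame(rand(100, 3), [:a_3, :a_4, :y])

# Pick out the explanatory variables at runtime
predictors = filter(n -> occursin(r"^a_", n), names(data))

# term() turns a symbol into a formula term; sum(...) joins them with +
fm = term(:y) ~ sum(term.(Symbol.(predictors)))

model = fit(LinearModel, fm, data)
```

This keeps everything as ordinary values (no macros or eval), so it works even when the set of columns is only known at runtime.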
The advantage of this method is that the OP can then play with different regression specifications. For a very easy way of including all variables in the DataFrame as regressors, I use colwise and vec:
y = data[:y]
# Exclude the response column before building the design matrix
Z = hcat(ones(size(data, 1)), colwise(vec, data[setdiff(names(data), [:y])])...)
lm(Z, y)
The reason to prefer @formula is that it preserves the column names in the fitted model. See the StatsModels.jl docs for more on constructing formulas programmatically like this. (You didn't say which columns are the predictors and which is the response in your model, so I made the column named y the response and the rest the predictors.)
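To see the name-preservation point concretely, here is a sketch comparing the two approaches, reusing the renamed a_3/a_4/y columns from above:

```julia
using DataFrames, GLM

data = DataFrame(rand(100, 3), [:a_3, :a_4, :y])

# Formula-based fit: coefficient names come from the columns
fm_model = lm(@formula(y ~ a_3 + a_4), data)
coefnames(fm_model)   # "(Intercept)", "a_3", "a_4"

# Matrix-based fit: the same regression, but column names are lost,
# so the coefficient table falls back to generic x1, x2, ... labels
Z = hcat(ones(size(data, 1)), data.a_3, data.a_4)
mat_model = lm(Z, data.y)
coeftable(mat_model)
```

Both models give identical coefficient estimates; only the labeling in the output differs.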