GLM.jl with unknown column names

statistics
regression

#1

Hello,

I’m using Julia for some basic statistical analysis. I’d like to run a simple linear regression on a small dataset using the GLM.jl package.

My data looks something like this:

│ Row │ stat.id12312.somemetadata │ stat.id43.othermetadata │ timestamp  │
│     │ Float64                   │ Float64                 │ Int64      │
├─────┼───────────────────────────┼─────────────────────────┼────────────┤
│ 1   │ 22443.6                   │ 22453.8                 │ 1549942644 │
│ 2   │ 10423.1                   │ 12918.1                 │ 1513421321 │
 ⋮
│ 30  │ 22443.6                   │ 22453.8                 │ 1491231819 │

The complicating factor is that the stat.id* columns contain a random ID, which will be different every time the program runs and therefore must be determined at runtime. To make matters worse, the column names have . in them.

Given this limitation, how would I do a linear regression using the GLM.jl package? The canonical example below does not work with dynamic field names like I have.

lm(@formula(Y ~ X), data)

Any guidance is greatly appreciated. Thank you.


#2

You can pass a “design matrix” X and a response vector y instead of using @formula:

using GLM

# Simulate two predictors and a response with known coefficients
x2 = rand(100)
x3 = rand(100)
y = 5 .+ 2 .* x2 .+ 3 .* x3
# A column of ones gives the intercept
X = hcat(ones(100), x2, x3)

lm(X, y)
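Applied to the question, a minimal sketch of the same idea (the column names below are placeholders, and treating the first stat.id* column as the response is an assumption): discover the stat.id* columns by pattern at runtime, pull them out as vectors, and build the design matrix by hand so the dots in the names never have to appear in a formula. This assumes a recent DataFrames version with string column names.

```julia
using DataFrames, GLM

# Toy frame mimicking the runtime-named columns (names are made up)
data = DataFrame("stat.id12312.somemetadata" => rand(30),
                 "stat.id43.othermetadata"   => rand(30),
                 "timestamp" => collect(1:30))

# Pick the relevant columns by pattern at runtime
cols = filter(n -> startswith(n, "stat.id"), names(data))
y = data[!, cols[1]]                                   # assumed response
X = hcat(ones(nrow(data)), (data[!, c] for c in cols[2:end])...)

lm(X, y)
```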

#3

I am not sure what your response variable is named, but if you just want to include every stat.id* variable as an explanatory variable:

using DataFrames, GLM
# Some values
data = DataFrame(rand(100, 3), :auto)
# This is what you have
rename!(data, ["a.3", "a.4", "y"])
# Fix names (replace . with _ so the formula parser accepts them)
rename!(data, replace.(names(data), Ref("." => "_")))
# Find all explanatory variables
model = filter(x -> occursin(r"^a_", x), names(data)) |>
        # Have them all as main effects
        (x -> reduce((x, y) -> string(x, " + ", y), x)) |>
        # Add the response, WLOG y
        (rhs -> string("y ~ ", rhs)) |>
        # Create the Expr
        Meta.parse |>
        # Use it to fit the model
        (fm -> fit(LinearModel,
                   @eval(@formula($fm)),
                   data))


#4

The advantage of this method is that the OP can then play with regression specifications. For a very easy way of including all columns of the DataFrame as regressors, convert them to a matrix (current DataFrames versions use Matrix; the old colwise approach is deprecated):

Z = hcat(ones(size(data, 1)), Matrix(data))
lm(Z, y)
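Putting the pieces together for the original data, a sketch (which stat.id* column is the response is an assumption; here it is picked by a name pattern at runtime, and everything else becomes a regressor):

```julia
using DataFrames, GLM

# Toy stand-in for the OP's frame (names are made up)
data = DataFrame("stat.id12312.somemetadata" => rand(30),
                 "stat.id43.othermetadata"   => rand(30),
                 "timestamp" => collect(1:30))

resp = only(filter(n -> occursin("id12312", n), names(data)))  # assumed response
y = data[!, resp]
# Intercept column plus all remaining columns as regressors
Z = hcat(ones(nrow(data)), Matrix(select(data, Not(resp))))
lm(Z, y)
```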