GLM.jl with unknown column names

Hello,

I’m using Julia for some basic statistical analysis. I’d like to run a simple linear regression on a small dataset using the GLM.jl package.

My data looks something like this:

│ Row │ stat.id12312.somemetadata │ stat.id43.othermetadata │ timestamp  │
│     │ Float64                   │ Float64                 │ Int64      │
├─────┼───────────────────────────┼─────────────────────────┼────────────┤
│ 1   │ 22443.6                   │ 22453.8                 │ 1549942644 │
│ 2   │ 10423.1                   │ 12918.1                 │ 1513421321 │
 ⋮
│ 30  │ 22443.6                   │ 22453.8                 │ 1491231819 │

The complicating factor is that the stat.id* have a random ID, which will be different every time the program runs and therefore must be determined at runtime. To make matters worse, the column names have . in them.

Given this limitation, how would I do a linear regression using the GLM.jl package? The canonical example below does not work with dynamic field names like I have.

lm(@formula(Y ~ X), data)

Any guidance is greatly appreciated. Thank you.

You can specify a “design matrix” X and a response vector y, instead of the @formula:

using GLM

x2 = rand(100)
x3 = rand(100)
y = 5 .+ 2 .* x2 .+ 3 .* x3
X = hcat(ones(100), x2, x3)

lm(X, y)

I am not sure what your response is named, but if you just wanted to include every stat.id variable as an explanatory variable,

using DataFrames, GLM
# Some Values
data = DataFrame(rand(100, 3))
# This is what you have
names!(data, Symbol.(["a.3", "a.4", "y"]))
# Fix names
names!(data, Symbol.(replace.(string.(names(data)), Ref("." => "_"))))
# Find all explanatory variables
model = filter(x -> occursin(r"^a_", x), string.(names(data))) |>
        # Have them all as main effects
        (x -> reduce((x, y) -> string(x, " + ", y), x)) |>
        # Add response WLOG y
        (rhs -> string("y ~ ", rhs)) |>
        # Create the Expr
        Meta.parse |>
        # Use it to fit the model
        (fm -> fit(LinearModel,
                   @eval(@formula($fm)),
                   data))

The advantage of this method is that OP can then play with regression specifications. For a very easy way of including all variables in the DataFrame as regressors I use colwise and vec:

Z = hcat(ones(size(data, 1)), colwise(vec, data)...)
lm(Z, y)

The reason to use @formula is that it will preserve the names of the columns in the fitted model. See the StatsModels.jl docs for more info on splicing in column names like this (you didn’t say what the predictors and response variable are in your model so I just made the first column the response and the rest the predictors):

julia> using DataFrames, GLM

julia> data = DataFrame(Symbol("stat.id1234") => rand(10), Symbol("stat.id2345") => rand(10), :timestamp => 1:10)
10×3 DataFrame
│ Row │ stat.id1234 │ stat.id2345 │ timestamp │
│     │ Float64     │ Float64     │ Int64     │
├─────┼─────────────┼─────────────┼───────────┤
│ 1   │ 0.159338    │ 0.803518    │ 1         │
│ 2   │ 0.451714    │ 0.295973    │ 2         │
│ 3   │ 0.201918    │ 0.674945    │ 3         │
│ 4   │ 0.887095    │ 0.642161    │ 4         │
│ 5   │ 0.449446    │ 0.0822832   │ 5         │
│ 6   │ 0.200409    │ 0.408909    │ 6         │
│ 7   │ 0.80776     │ 0.551713    │ 7         │
│ 8   │ 0.71635     │ 0.339787    │ 8         │
│ 9   │ 0.504568    │ 0.475694    │ 9         │
│ 10  │ 0.176176    │ 0.00159093  │ 10        │

julia> names(data)
3-element Array{Symbol,1}:
 Symbol("stat.id1234")
 Symbol("stat.id2345")
 :timestamp

julia> predictors = names(data)[2:end]
2-element Array{Symbol,1}:
 Symbol("stat.id2345")
 :timestamp

julia> response = names(data)[1]
Symbol("stat.id1234")

julia> f = @eval(@formula($response ~ (+)(1, $(predictors...))))
Formula: stat.id1234 ~ 1 + stat.id2345 + timestamp

julia> lm(f, data)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: stat.id1234 ~ 1 + stat.id2345 + timestamp

Coefficients:
              Estimate Std.Error  t value Pr(>|t|)
(Intercept)    0.16048  0.372155 0.431219   0.6793
stat.id2345     0.3059  0.458779 0.666769   0.5263
timestamp    0.0298503  0.038875 0.767854   0.4677

Once StatsModels.jl#71 is merged, constructing a formula programmatically will be much easier:


(tmp) pkg> st
    Status `/tmp/Project.toml`
  [a93c6f00] DataFrames v0.17.1
  [38e38edf] GLM v1.1.0
  [3eaba693] StatsModels v0.5.0+ [`~/.julia/dev/StatsModels`]

julia> f = term(resp) ~ term(1) + term.((predictors...))
FormulaTerm
Response:
  stat.id1234(unknown)
Predictors:
  1
  stat.id2345(unknown)
  timestamp(unknown)

julia> lm(f, data)
StatsModels.TableRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

stat.id1234 ~ 1 + stat.id2345 + timestamp

Coefficients:
               Estimate Std.Error   t value Pr(>|t|)
(Intercept)     0.69988  0.298233   2.34676   0.0513
stat.id2345   0.0269236  0.400148 0.0672841   0.9482
timestamp    -0.0239801 0.0385167  -0.62259   0.5533
1 Like