GLM.jl with unknown column names

statistics
regression
#1

Hello,

I’m using Julia for some basic statistical analysis. I’d like to run a simple linear regression on a small dataset using the GLM.jl package.

My data looks something like this:

│ Row │ stat.id12312.somemetadata │ stat.id43.othermetadata │ timestamp  │
│     │ Float64                   │ Float64                 │ Int64      │
├─────┼───────────────────────────┼─────────────────────────┼────────────┤
│ 1   │ 22443.6                   │ 22453.8                 │ 1549942644 │
│ 2   │ 10423.1                   │ 12918.1                 │ 1513421321 │
 ⋮
│ 30  │ 22443.6                   │ 22453.8                 │ 1491231819 │

The complicating factor is that the `stat.id*` columns have random IDs that change on every run, so the column names must be determined at runtime. To make matters worse, the column names contain `.` characters.

Given this limitation, how can I run a linear regression with the GLM.jl package? The canonical example below does not work with dynamic column names like mine.

lm(@formula(Y ~ X), data)

Any guidance is greatly appreciated. Thank you.


#2

You can specify a “design matrix” X and a response vector y instead of the @formula:

using GLM

x2 = rand(100)
x3 = rand(100)
y = 5 .+ 2 .* x2 .+ 3 .* x3
X = hcat(ones(100), x2, x3)  # the column of ones gives the intercept

lm(X, y)
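Applied to the OP's situation, the design matrix can be assembled from whatever column names `names(data)` returns at runtime. A minimal sketch, assuming a current DataFrames/GLM API and using made-up column names and the first column as the response:

```julia
using DataFrames, GLM

# Toy data standing in for the runtime-named columns
data = DataFrame(Symbol("stat.id12312.somemetadata") => rand(30),
                 Symbol("stat.id43.othermetadata")   => rand(30))

cols = names(data)                        # discovered at runtime
y = data[!, cols[1]]                      # treat the first column as the response
X = hcat(ones(nrow(data)),               # intercept column
         (data[!, c] for c in cols[2:end])...)

lm(X, y)
```

Note that with this approach the fitted coefficients are labeled x1, x2, … rather than with the original column names.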

#3

I am not sure what your response variable is named, but if you just want to include every `stat.id*` variable as an explanatory variable:

using DataFrames, GLM
# Some Values
data = DataFrame(rand(100, 3))
# This is what you have
names!(data, Symbol.(["a.3", "a.4", "y"]))
# Fix names
names!(data, Symbol.(replace.(string.(names(data)), Ref("." => "_"))))
# Find all explanatory variables
model = filter(x -> occursin(r"^a_", x), string.(names(data))) |>
        # Have them all as main effects
        (x -> reduce((x, y) -> string(x, " + ", y), x)) |>
        # Add response WLOG y
        (rhs -> string("y ~ ", rhs)) |>
        # Create the Expr
        Meta.parse |>
        # Use it to fit the model
        (fm -> fit(LinearModel,
                   @eval(@formula($fm)),
                   data))
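The string-building steps above can also be written more compactly with `join`, which is equivalent to the `reduce` in the pipeline. A sketch assuming the same `a_*` names after the dot-to-underscore fix:

```julia
using DataFrames, GLM

# Toy data with already-sanitized names
data = DataFrame(a_3 = rand(100), a_4 = rand(100), y = rand(100))

# Join all a_* columns into the right-hand side of the formula
rhs = join(filter(n -> occursin(r"^a_", n), string.(names(data))), " + ")
fm  = Meta.parse(string("y ~ ", rhs))

model = fit(LinearModel, @eval(@formula($fm)), data)
```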


#4

The advantage of the design-matrix approach is that the OP can then play with regression specifications. For a very easy way of including all variables in the DataFrame as regressors, I use colwise and vec:

# y is your response vector; Z is an intercept column plus every column of data
Z = hcat(ones(size(data, 1)), colwise(vec, data)...)
lm(Z, y)

#5

The reason to use @formula is that it preserves the column names in the fitted model. See the StatsModels.jl docs for more info on splicing in column names like this. (You didn’t say which columns are the predictors and which is the response, so I made the first column the response and the rest the predictors.)

julia> using DataFrames, GLM

julia> data = DataFrame(Symbol("stat.id1234") => rand(10), Symbol("stat.id2345") => rand(10), :timestamp => 1:10)
10×3 DataFrame
│ Row │ stat.id1234 │ stat.id2345 │ timestamp │
│     │ Float64     │ Float64     │ Int64     │
├─────┼─────────────┼─────────────┼───────────┤
│ 1   │ 0.159338    │ 0.803518    │ 1         │
│ 2   │ 0.451714    │ 0.295973    │ 2         │
│ 3   │ 0.201918    │ 0.674945    │ 3         │
│ 4   │ 0.887095    │ 0.642161    │ 4         │
│ 5   │ 0.449446    │ 0.0822832   │ 5         │
│ 6   │ 0.200409    │ 0.408909    │ 6         │
│ 7   │ 0.80776     │ 0.551713    │ 7         │
│ 8   │ 0.71635     │ 0.339787    │ 8         │
│ 9   │ 0.504568    │ 0.475694    │ 9         │
│ 10  │ 0.176176    │ 0.00159093  │ 10        │

julia> names(data)
3-element Array{Symbol,1}:
 Symbol("stat.id1234")
 Symbol("stat.id2345")
 :timestamp

julia> predictors = names(data)[2:end]
2-element Array{Symbol,1}:
 Symbol("stat.id2345")
 :timestamp

julia> response = names(data)[1]
Symbol("stat.id1234")

julia> f = @eval(@formula($response ~ (+)(1, $(predictors...))))
Formula: stat.id1234 ~ 1 + stat.id2345 + timestamp

julia> lm(f, data)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: stat.id1234 ~ 1 + stat.id2345 + timestamp

Coefficients:
              Estimate Std.Error  t value Pr(>|t|)
(Intercept)    0.16048  0.372155 0.431219   0.6793
stat.id2345     0.3059  0.458779 0.666769   0.5263
timestamp    0.0298503  0.038875 0.767854   0.4677

Once StatsModels.jl#71 is merged, constructing a formula programmatically will be much easier:


(tmp) pkg> st
    Status `/tmp/Project.toml`
  [a93c6f00] DataFrames v0.17.1
  [38e38edf] GLM v1.1.0
  [3eaba693] StatsModels v0.5.0+ [`~/.julia/dev/StatsModels`]

julia> f = term(response) ~ term(1) + term.((predictors...))
FormulaTerm
Response:
  stat.id1234(unknown)
Predictors:
  1
  stat.id2345(unknown)
  timestamp(unknown)

julia> lm(f, data)
StatsModels.TableRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

stat.id1234 ~ 1 + stat.id2345 + timestamp

Coefficients:
               Estimate Std.Error   t value Pr(>|t|)
(Intercept)     0.69988  0.298233   2.34676   0.0513
stat.id2345   0.0269236  0.400148 0.0672841   0.9482
timestamp    -0.0239801 0.0385167  -0.62259   0.5533