How can I perform GLM regression on column names with spaces?

bertulli · March 22, 2023, 11:18am

Hi all,

very noob question but I can’t find an answer. I’m trying to do a linear regression on some data I have:

julia> df_under_test[:, [instruction_sym, measure_power_sym, binary_weight_sym]]
1165×3 DataFrame
  Row │ Instruction          Base power (W)     Binary weight 
      │ String               Quantity…          Int64         
──────┼───────────────────────────────────────────────────────
    1 │ add_r0_r0_r0_ror_23  0.108844±7.4e-5 W             12
    2 │ add_r0_r0_r0_ror_21  0.108616±7.4e-5 W             11
    3 │ add_r0_r0_r0_ror_22  0.108496±7.3e-5 W             11
    4 │ add_r0_r0_r0_ror_25  0.108466±7.5e-5 W             11
    5 │ add_r0_r0_r0_ror_27  0.108389±7.4e-5 W             12
    6 │ add_r0_r0_r0_ror_18  0.108376±7.6e-5 W             10
    7 │ add_r0_r0_r0_ror_24  0.108354±7.5e-5 W             10
    8 │ add_r0_r0_r0_ror_7   0.108344±7.5e-5 W             11
    9 │ add_r0_r0_r0_ror_9   0.108283±7.6e-5 W             10
   10 │ add_r0_r0_r0_ror_26  0.108269±7.4e-5 W             11
   11 │ add_r0_r0_r0_ror_6   0.108252±7.4e-5 W             10
   12 │ add_r0_r0_r0_ror_20  0.108202±7.5e-5 W             10
   13 │ add_r0_r0_r0_ror_19  0.108196±7.5e-5 W             11
   14 │ add_r0_r0_r0_ror_14   0.10818±7.6e-5 W             11
  ⋮   │          ⋮                   ⋮                ⋮
 1152 │ add_r6_r6_r1         0.073308±6.3e-5 W              5
 1153 │ add_r2_r5_r2         0.073302±6.0e-5 W              5
 1154 │ add_r1_r5_r1         0.073237±6.2e-5 W              5
 1155 │ add_r3_r3_r1         0.073213±6.3e-5 W              5
 1156 │ add_r4_r1_r4         0.073161±6.0e-5 W              4
 1157 │ add_r0_r0_r0          0.07305±6.2e-5 W              2
 1158 │ add_r4_r4_r1         0.072969±6.1e-5 W              4
 1159 │ add_r2_r1_r2         0.072941±5.9e-5 W              4
 1160 │ add_r0_r0_r5         0.072929±5.9e-5 W              4
 1161 │ add_r1_r1_r1         0.072881±6.1e-5 W              4
 1162 │ add_r2_r2_r1         0.072863±6.2e-5 W              4
 1163 │ add_r0_r5_r0         0.072772±6.0e-5 W              4
 1164 │ add_r0_r0_r1         0.072522±6.1e-5 W              3
 1165 │ add_r0_r1_r0         0.072425±6.1e-5 W              3
                                             1137 rows omitted

I want to correlate the power with the binary weight, like this:

ols = lm(@formula(measure_power_sym ~ 1 + binary_weight_sym), df_under_test)

where measure_power_sym and binary_weight_sym are the Symbols corresponding to the columns, as you can see in the first listing.

However, this throws

julia> ols = lm(@formula(measure_power_sym ~ 1 + binary_weight_sym), df_under_test)
ERROR: ArgumentError: There isn't a variable called 'measure_power_sym' in your data; the nearest names appear to be: 
Stacktrace:
 [1] ModelFrame(f::FormulaTerm{Term, Tuple{ConstantTerm{Int64}, Term}}, data::NamedTuple{(:Instruction, Symbol("Base power (W)"), Symbol("Is conditional"), Symbol("Barrel shift amount"), Symbol("Has immediate operand"), Symbol("Immediate amount"), Symbol("Dest reg == source reg"), Symbol("Binary encoding"), Symbol("Binary weight"), :mnemonic), Tuple{SubArray{String, 1, Vector{String}, Tuple{Vector{Int64}}, false}, SubArray{Quantity{Measurement{Float64}, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}, 1, Vector{Quantity{Measurement{Float64}, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{Signed, 1, Vector{Signed}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{String15, 1, Vector{String15}, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}, SubArray{String, 1, Vector{String}, Tuple{Vector{Int64}}, false}}}; model::Type{LinearModel}, contrasts::Dict{Symbol, Any})
   @ StatsModels ~/.julia/packages/StatsModels/G1ClG/src/modelframe.jl:78
 [2] fit(::Type{LinearModel}, f::FormulaTerm{Term, Tuple{ConstantTerm{Int64}, Term}}, data::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StatsModels ~/.julia/packages/StatsModels/G1ClG/src/statsmodel.jl:85
 [3] fit
   @ ~/.julia/packages/StatsModels/G1ClG/src/statsmodel.jl:78 [inlined]
 [4] #lm#5
   @ ~/.julia/packages/GLM/4A2DM/src/lm.jl:157 [inlined]
 [5] lm (repeats 2 times)
   @ ~/.julia/packages/GLM/4A2DM/src/lm.jl:157 [inlined]
 [6] top-level scope
   @ REPL[34]:1

So, how should I do regression if the DataFrame’s column names have spaces in them?

Thanks!

tbeason · March 22, 2023, 1:16pm

Well, the first thing I’d do is to never have a column name with a space.

Otherwise you can try

lm(@formula(y ~ 1 + x), (y = df_under_test[:, measure_power_sym], x = df_under_test[:, binary_weight_sym]))

pdeffebach · March 22, 2023, 1:47pm

You can do term("outcome var") ~ term("Independent var")

See here
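To make the term(...) suggestion concrete, here is a minimal sketch. It assumes a small stand-in DataFrame with plain Float64 values (the data and the name `df` are made up for illustration). The real column in the question holds Unitful quantities with uncertainties, which may need converting to plain numbers first; the thread does not say either way.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical stand-in data: plain Float64 power values, no units or uncertainties.
df = DataFrame("Base power (W)" => [0.0725, 0.0729, 0.0733, 0.1083, 0.1088],
               "Binary weight"  => [3, 4, 5, 10, 12])

# Column names held in variables, as in the question.
measure_power_sym = Symbol("Base power (W)")
binary_weight_sym = Symbol("Binary weight")

# term(...) builds the formula at run time, so the *values* of the variables
# are used as column names. @formula would instead take the variable names literally.
f = term(measure_power_sym) ~ term(1) + term(binary_weight_sym)
ols = lm(f, df)

coef(ols)  # intercept and slope
```

This also explains the original error: @formula is a macro, so it sees the literal name measure_power_sym and looks for a column with that name rather than looking up the variable.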

bertulli · March 22, 2023, 3:59pm

I’m starting to realize how important that is. The data source had those names and I wanted to keep the naming scheme, but I can patch it up.

Great suggestion!

Great suggestion as well, it’s what I ended up doing after looking at the docs more thoroughly.

Thank you very much to both!

nilshg · March 22, 2023, 4:09pm

If you are getting your data from a CSV file, there’s a normalizenames kwarg (false by default) which you can set to true to automatically fix the column names up into something usable.
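A small sketch of that option, assuming the file is read with CSV.jl. The CSV text below is made up to mirror the headers in the question, and is read from an in-memory buffer so the example is self-contained.

```julia
using CSV, DataFrames

# Hypothetical CSV content with the same headers as in the question.
csv_text = """
Instruction,Base power (W),Binary weight
add_r0_r0_r1,0.072522,3
add_r0_r1_r0,0.072425,3
"""

# Default behaviour: column names are kept verbatim, spaces included.
raw = CSV.read(IOBuffer(csv_text), DataFrame)

# With normalizenames=true, names are rewritten into valid Julia identifiers.
clean = CSV.read(IOBuffer(csv_text), DataFrame; normalizenames=true)

names(raw)    # original headers
names(clean)  # normalized headers, usable directly inside @formula
```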

tbeason · March 22, 2023, 4:13pm

If the original names are important, save them early in the analysis and rename the columns to something usable; later on, for displaying etc., you can restore the original names. I’ve done stuff like that before. You might also be able to use DataFrames metadata for this (see the Metadata section of the DataFrames.jl docs).
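One way this could look, as a sketch rather than a recommended recipe. The short names `power` and `weight` and the metadata key "label" are arbitrary choices made for this example, and the metadata calls assume DataFrames 1.4 or later.

```julia
using DataFrames

# Hypothetical stand-in data with the original, space-containing names.
df = DataFrame("Base power (W)" => [0.0725, 0.0729], "Binary weight" => [3, 4])

# Mapping from original names to short, identifier-friendly names.
short_names = Dict("Base power (W)" => "power", "Binary weight" => "weight")

# Work with the short names during the analysis,
# e.g. lm(@formula(power ~ 1 + weight), df) with GLM loaded.
rename!(df, short_names)

# Optionally keep the original name as column-level metadata.
colmetadata!(df, "power", "label", "Base power (W)"; style=:note)
label = colmetadata(df, "power", "label")

# Invert the mapping and restore the original names for display.
original_names = Dict(short => long for (long, short) in short_names)
rename!(df, original_names)
```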

bertulli · March 22, 2023, 5:06pm

Wonderful suggestions, thanks!
