How can I perform GLM regression on column names with spaces?

Hi all,

This is very much a beginner question, but I can’t find an answer. I’m trying to run a linear regression on some data I have:

julia> df_under_test[:, [instruction_sym, measure_power_sym, binary_weight_sym]]
1165×3 DataFrame
  Row │ Instruction          Base power (W)     Binary weight 
      │ String               Quantity…          Int64         
──────┼───────────────────────────────────────────────────────
    1 │ add_r0_r0_r0_ror_23  0.108844±7.4e-5 W             12
    2 │ add_r0_r0_r0_ror_21  0.108616±7.4e-5 W             11
    3 │ add_r0_r0_r0_ror_22  0.108496±7.3e-5 W             11
    4 │ add_r0_r0_r0_ror_25  0.108466±7.5e-5 W             11
    5 │ add_r0_r0_r0_ror_27  0.108389±7.4e-5 W             12
    6 │ add_r0_r0_r0_ror_18  0.108376±7.6e-5 W             10
    7 │ add_r0_r0_r0_ror_24  0.108354±7.5e-5 W             10
    8 │ add_r0_r0_r0_ror_7   0.108344±7.5e-5 W             11
    9 │ add_r0_r0_r0_ror_9   0.108283±7.6e-5 W             10
   10 │ add_r0_r0_r0_ror_26  0.108269±7.4e-5 W             11
   11 │ add_r0_r0_r0_ror_6   0.108252±7.4e-5 W             10
   12 │ add_r0_r0_r0_ror_20  0.108202±7.5e-5 W             10
   13 │ add_r0_r0_r0_ror_19  0.108196±7.5e-5 W             11
   14 │ add_r0_r0_r0_ror_14   0.10818±7.6e-5 W             11
  ⋮   │          ⋮                   ⋮                ⋮
 1152 │ add_r6_r6_r1         0.073308±6.3e-5 W              5
 1153 │ add_r2_r5_r2         0.073302±6.0e-5 W              5
 1154 │ add_r1_r5_r1         0.073237±6.2e-5 W              5
 1155 │ add_r3_r3_r1         0.073213±6.3e-5 W              5
 1156 │ add_r4_r1_r4         0.073161±6.0e-5 W              4
 1157 │ add_r0_r0_r0          0.07305±6.2e-5 W              2
 1158 │ add_r4_r4_r1         0.072969±6.1e-5 W              4
 1159 │ add_r2_r1_r2         0.072941±5.9e-5 W              4
 1160 │ add_r0_r0_r5         0.072929±5.9e-5 W              4
 1161 │ add_r1_r1_r1         0.072881±6.1e-5 W              4
 1162 │ add_r2_r2_r1         0.072863±6.2e-5 W              4
 1163 │ add_r0_r5_r0         0.072772±6.0e-5 W              4
 1164 │ add_r0_r0_r1         0.072522±6.1e-5 W              3
 1165 │ add_r0_r1_r0         0.072425±6.1e-5 W              3
                                             1137 rows omitted

I want to correlate the power with the binary weight, like this:

ols = lm(@formula(measure_power_sym ~ 1 + binary_weight_sym), df_under_test)

where measure_power_sym and binary_weight_sym are the Symbols corresponding to the columns, as you can see in the first listing.

However, this throws

julia> ols = lm(@formula(measure_power_sym ~ 1 + binary_weight_sym), df_under_test)
ERROR: ArgumentError: There isn't a variable called 'measure_power_sym' in your data; the nearest names appear to be: 
Stacktrace:
 [1] ModelFrame(f::FormulaTerm{Term, Tuple{ConstantTerm{Int64}, Term}}, data::NamedTuple{(:Instruction, Symbol("Base power (W)"), Symbol("Is conditional"), Symbol("Barrel shift amount"), Symbol("Has immediate operand"), Symbol("Immediate amount"), Symbol("Dest reg == source reg"), Symbol("Binary encoding"), Symbol("Binary weight"), :mnemonic), Tuple{SubArray{String, 1, Vector{String}, Tuple{Vector{Int64}}, false}, SubArray{Quantity{Measurement{Float64}, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}, 1, Vector{Quantity{Measurement{Float64}, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{Signed, 1, Vector{Signed}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}, SubArray{Bool, 1, BitVector, Tuple{Vector{Int64}}, false}, SubArray{String15, 1, Vector{String15}, Tuple{Vector{Int64}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}, SubArray{String, 1, Vector{String}, Tuple{Vector{Int64}}, false}}}; model::Type{LinearModel}, contrasts::Dict{Symbol, Any})
   @ StatsModels ~/.julia/packages/StatsModels/G1ClG/src/modelframe.jl:78
 [2] fit(::Type{LinearModel}, f::FormulaTerm{Term, Tuple{ConstantTerm{Int64}, Term}}, data::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ StatsModels ~/.julia/packages/StatsModels/G1ClG/src/statsmodel.jl:85
 [3] fit
   @ ~/.julia/packages/StatsModels/G1ClG/src/statsmodel.jl:78 [inlined]
 [4] #lm#5
   @ ~/.julia/packages/GLM/4A2DM/src/lm.jl:157 [inlined]
 [5] lm (repeats 2 times)
   @ ~/.julia/packages/GLM/4A2DM/src/lm.jl:157 [inlined]
 [6] top-level scope
   @ REPL[34]:1

So, how should I do regression if the DataFrame’s column names have spaces in them?

Thanks!

Well, the first thing I’d do is to never have a column name with a space 🙂

Otherwise you can try

lm(@formula(y ~ 1 + x), (y = df_under_test[:, measure_power_sym], x = df_under_test[:, binary_weight_sym]))

You can do term("outcome var") ~ term("Independent var")

See here
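
To make the `term` suggestion concrete, here is a minimal sketch. The small DataFrame is made-up illustration data (plain `Float64` values, not the original measurements with units), and it assumes DataFrames, GLM and StatsModels are installed:

```julia
using DataFrames, GLM, StatsModels

# Hypothetical toy data with the same kind of column names (spaces, parentheses).
df = DataFrame(
    "Base power (W)" => [0.073, 0.081, 0.094, 0.101, 0.109],
    "Binary weight"  => [2, 4, 7, 9, 12],
)

# Build the formula programmatically instead of with @formula,
# so the column names can be arbitrary strings.
f = term("Base power (W)") ~ term(1) + term("Binary weight")

ols = lm(f, df)
println(coef(ols))  # intercept and slope
```

The same pattern works when the names are stored in variables, e.g. `term(measure_power_sym) ~ term(1) + term(binary_weight_sym)`, because `term` receives the value of the variable rather than its literal name.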


I’m starting to realize how important that is. The data source had spaces in its column names and I wanted to keep the naming scheme, but I can patch it up.

Great suggestion!

Great suggestion as well, it’s what I ended up doing after reading the docs more thoroughly.

Thank you very much to both!

If you are reading your data from a CSV file with CSV.jl, there’s a normalizenames keyword argument (false by default) that you can set to true to automatically convert the column names into valid identifiers.
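
A small sketch of what that looks like, assuming CSV.jl and DataFrames are installed. The CSV text is an inline, made-up sample rather than the original data file:

```julia
using CSV, DataFrames

# Hypothetical inline CSV with the same style of header as in the question.
raw = """
Instruction,Base power (W),Binary weight
add_r0_r0_r0,0.07305,2
add_r0_r0_r1,0.072522,3
"""

# normalizenames=true asks CSV.jl to turn headers into valid Julia identifiers.
df = CSV.read(IOBuffer(raw), DataFrame; normalizenames=true)

println(names(df))  # inspect the normalized names before using them in @formula
```

Printing `names(df)` first is worthwhile, since the exact replacement characters are chosen by CSV.jl.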


If the original names are important, save them early in the analysis and rename the columns to something usable; later on, for display and so on, you can restore the original names. I’ve done things like that before. You might also be able to use DataFrames metadata for this: Metadata · DataFrames.jl
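
The save-and-restore idea could look roughly like this. The data and the short names `:power` and `:weight` are made up for illustration:

```julia
using DataFrames

# Hypothetical toy data with awkward column names.
df = DataFrame(
    "Base power (W)" => [0.073, 0.081, 0.094],
    "Binary weight"  => [2, 4, 7],
)

# Keep the original names around before renaming.
original_names = names(df)

# Short, identifier-friendly names for the analysis (usable directly in @formula).
rename!(df, [:power, :weight])

# ... run the analysis here ...

# Restore the original names for display or export.
rename!(df, original_names)
println(names(df))
```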


Wonderful suggestions, thanks!