What is the difference between `:` and `!` indexing in DataFrames with Unitful?

Hi all!

I wanted to decorate some data I have in a DataFrame with a unit of measure from Unitful. See:

julia> df_results_master = DataFrame(CSV.File(f))
468×3 DataFrame
 Row │ Instruction        Base power mean (W)  Base power std (W) 
     │ String31           Float64              Float64            
─────┼────────────────────────────────────────────────────────────
   1 │ add_r0_r0_0                  0.0821032          6.01459e-5
   2 │ add_r0_r0_1                  0.0833665          6.42208e-5
   3 │ add_r0_r0_10                 0.0845937          6.07616e-5
   4 │ add_r0_r0_100                0.0858542          6.14341e-5
   5 │ add_r0_r0_110                0.0866378          6.27856e-5
   6 │ add_r0_r0_120                0.0854615          5.98497e-5
   7 │ add_r0_r0_130                0.085026           5.94211e-5
   8 │ add_r0_r0_140                0.0857897          6.00773e-5
   9 │ add_r0_r0_150                0.0870989          6.08059e-5
  10 │ add_r0_r0_160                0.0852618          5.9398e-5
  11 │ add_r0_r0_170                0.0869853          6.09971e-5
  12 │ add_r0_r0_180                0.086765           6.23517e-5
  13 │ add_r0_r0_190                0.0868934          6.12408e-5
  14 │ add_r0_r0_20                 0.0848691          6.14883e-5
  ⋮  │         ⋮                   ⋮                   ⋮
 455 │ teq_r0_r5                    0.0838562          6.1484e-5
 456 │ tst_r0_0                     0.082128           6.00686e-5
 457 │ tst_r0_10                    0.0828279          6.11928e-5
 458 │ tst_r0_255                   0.0848542          6.00921e-5
 459 │ tst_r0_r0                    0.0706668          6.36107e-5
 460 │ tst_r0_r1                    0.0709795          6.26681e-5
 461 │ tst_r0_r2                    0.0711447          6.12671e-5
 462 │ tst_r0_r3                    0.0714723          6.46041e-5
 463 │ tst_r0_r4                    0.0710547          6.23303e-5
 464 │ tst_r0_r5                    0.0713467          6.54201e-5
 465 │ umlal_r1_r2_r4_r2            0.0886796          6.6934e-5
 466 │ umlal_r1_r5_r4_r2            0.0888859          6.66033e-5
 467 │ umull_r1_r5_r4_r0            0.0841324          6.91616e-5
 468 │ umull_r1_r5_r4_r2            0.0845281          6.81085e-5
                                                  440 rows omitted

julia> df_results_master[!,mean_power_sym] = df_results_master[:,mean_power_sym] .* u"W"
468-element Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}:
 0.08210324232296967 W
 0.08336646810436063 W
 0.08459367066671744 W
 0.08585415776092814 W
 0.08663776307205824 W
 0.08546146136499712 W
 0.08502599612798299 W
   0.085789709515833 W
 0.08709888469094718 W
 0.08526179889220828 W
  0.0869853359596325 W
 0.08676498037041373 W
 0.08689338616983189 W
 0.08486912612099981 W
 0.08581189837734816 W
 0.08688021452099513 W
                     ⋮
 0.08378983035530244 W
 0.08352483082288234 W
 0.08385624601375233 W
  0.0821280456260354 W
 0.08282787622372532 W
 0.08485424308684404 W
 0.07066680131581576 W
 0.07097953591863063 W
 0.07114470404972599 W
     0.0714722791075 W
 0.07105470125324514 W
  0.0713467156844716 W
 0.08867955558597908 W
 0.08888591524844881 W
 0.08413238639453503 W
 0.08452812752398081 W

As you can see, using the DF on the LHS requires indexing it with !. If instead I do:

julia> df_results_master[:,mean_power_sym] = df_results_master[:,mean_power_sym] .* u"W"
ERROR: DimensionError:  and W are not dimensionally compatible.
Stacktrace:
  [1] #s81#159
    @ ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:12 [inlined]
  [2] var"#s81#159"(::Any, s::Any, t::Any)
    @ Unitful ./none:0
  [3] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:582
  [4] uconvert(a::Unitful.FreeUnits{(), NoDims, nothing}, x::Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}})
    @ Unitful ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:78
  [5] convert(#unused#::Type{Float64}, y::Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}})
    @ Unitful ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:145
  [6] setindex!
    @ ./array.jl:966 [inlined]
  [7] macro expansion
    @ ./multidimensional.jl:946 [inlined]
  [8] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [9] _unsafe_setindex!(#unused#::IndexLinear, A::Vector{Float64}, x::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, I::Base.Slice{Base.OneTo{Int64}})
    @ Base ./multidimensional.jl:941
 [10] _setindex!
    @ ./multidimensional.jl:930 [inlined]
 [11] setindex!(A::Vector{Float64}, v::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, I::Function)
    @ Base ./abstractarray.jl:1344
 [12] setindex!(df::DataFrame, v::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, row_inds::Colon, col_ind::Symbol)
    @ DataFrames ~/.julia/packages/DataFrames/bza1S/src/dataframe/dataframe.jl:725
 [13] top-level scope
    @ REPL[159]:1

Why is that? From the DataFrames documentation and this SO answer, I think the point is that ! is a special argument that instructs getindex() to return the underlying data structure. That makes sense for use on the LHS, but if that’s true, why does the following (without the unit of measure) work?

julia> df_results_master[:,mean_power_sym] = 2 .* df_results_master[:,mean_power_sym] #.* u"W"
468-element Vector{Float64}:
 0.16420648464593934
 0.16673293620872126
 0.16918734133343488
 0.17170831552185628
 0.17327552614411648
 0.17092292272999424
 0.17005199225596598
 0.171579419031666
 0.17419776938189435
 0.17052359778441656
 0.173970671919265
 0.17352996074082747
 0.17378677233966378
 0.16973825224199962
 0.17162379675469633
 0.17376042904199027
 ⋮
 0.16757966071060487
 0.16704966164576468
 0.16771249202750466
 0.1642560912520708
 0.16565575244745065
 0.16970848617368808
 0.14133360263163153
 0.14195907183726125
 0.14228940809945198
 0.142944558215
 0.1421094025064903
 0.1426934313689432
 0.17735911117195816
 0.17777183049689763
 0.16826477278907007
 0.16905625504796162

Thanks!

It does not work for the same reason the following does not work for matrices:

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:, 1] = x[:, 1] / 2
ERROR: InexactError: Int64(0.5)

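If you use : on the LHS, you indicate that you want an in-place operation. This means the target must be able to store the source. That is why your 2 .* example works: the result is still Float64, so it fits into the existing column, whereas a Quantity cannot be converted to Float64.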
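As a minimal Base-only sketch of the "target must be able to store the source" rule (the vector `x` and the flag `failed` are made up for illustration): values that convert to the element type are accepted, and anything else throws.

```julia
# `x[:] = v` writes into the existing vector, so every element of `v`
# must be convertible to `eltype(x)`.
x = [1.0, 2.0]                 # Vector{Float64}
x[:] = [3, 4]                  # Ints convert to Float64, so this works
@assert x == [3.0, 4.0]
@assert eltype(x) == Float64   # the element type did not change

# A value that cannot be converted makes the in-place write throw.
failed = try
    x[:] = ["a", "b"]
    false
catch err
    err isa MethodError
end
@assert failed
```

The Unitful case in the question is the same situation: the column is a Vector{Float64}, and a quantity in watts cannot be converted to a plain Float64.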

If you use ! on LHS as a selector you indicate that you want to replace the column with new data, overwriting the type of the column:

julia> using DataFrames

julia> df = DataFrame(a=1:2, b=["a", "b"])
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b

julia> df[:, :a] = df[:, :b] # fails, as it is in place
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[!, :a] = df[:, :b] # works, as it is a replace operation
2-element Vector{String}:
 "a"
 "b"

julia> df
2×2 DataFrame
 Row │ a       b
     │ String  String
─────┼────────────────
   1 │ a       a
   2 │ b       b

Note that writing df[!, :a] is the same as writing df.a.

All this is explained in the blog post “DataFrames.jl indexing rules” by Bogumił Kamiński.
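Applied to the original question, a rough sketch could look like this. The two-row table and the column names `instr` and `power` are hypothetical stand-ins for the real data, not the original file.

```julia
using DataFrames, Unitful

# Hypothetical stand-in for the original table.
df = DataFrame(instr = ["add", "tst"], power = [0.082, 0.071])

# `!` replaces the column, so its element type may change
# from Float64 to a Unitful quantity.
df[!, :power] = df[:, :power] .* u"W"
@assert eltype(df.power) <: Unitful.Quantity

# Once the column already holds watts, `:` works again,
# because the new values fit the existing element type.
df[:, :power] = 2 .* df[:, :power]
@assert df.power[1] ≈ 0.164u"W"

# Stripping the unit is another type change, so it needs `!` again.
df[!, :power] = ustrip.(df.power)
@assert eltype(df.power) == Float64
```

So the choice between : and ! comes down to whether the element type of the column has to change.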


Thanks! Now it’s clearer. A minor oddity I can see is that ! on the LHS forces a copy, whereas on the RHS it forces NOT to copy. But the blog post you linked explains everything, IIUC. My Google-fu failed.
Thank you!

It does not. ! on LHS does not copy (it REPLACES but WITHOUT copy):

julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> b = [2, 3]
2-element Vector{Int64}:
 2
 3

julia> df[!, :b] = b
2-element Vector{Int64}:
 2
 3

julia> df.b === b
true

To force a copy you need to broadcast:

julia> c = [4, 5]
2-element Vector{Int64}:
 4
 5

julia> df[!, :c] .= c
2-element Vector{Int64}:
 4
 5

julia> df.c === c
false

The point is that there are three (NOT TWO) possible behaviors that we need to handle:

  • in-place (: on LHS)
  • replace without copy (! on LHS with assignment =)
  • replace with copy (! on LHS with broadcasted assignment .=)
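The three cases can be checked side by side with `===`. This is only a sketch on a made-up one-column data frame; the names `old`, `v` and `w` are illustrative.

```julia
using DataFrames

df = DataFrame(a = [1, 2])
old = df.a                 # handle on the vector currently stored in df

# 1. In place (`:` with `=`): the stored vector is reused, values are overwritten.
df[:, :a] = [10, 20]
@assert df.a === old
@assert old == [10, 20]

# 2. Replace without copy (`!` with `=`): the column becomes the RHS object itself.
v = [1.5, 2.5]
df[!, :a] = v
@assert df.a === v
@assert df.a !== old       # the old vector is no longer part of df

# 3. Replace with copy (`!` with `.=`): a freshly allocated column with equal values.
w = ["x", "y"]
df[!, :a] .= w
@assert df.a == w
@assert df.a !== w
```

Note that cases 2 and 3 also changed the element type of the column (Int64 to Float64 to String), which case 1 cannot do.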

Ah ok sorry, I think I missed the fact that it behaves differently when broadcasting. I should have said “it (re)creates a new piece of memory” (so “copying” in the sense you end up with a different memory chunk, tho I recognize “allocating” would have been a better term). So:

  • Using : in an expression (or RHS) performs a copy
julia> df_imm[:, 2] == eachcol(df_imm)[2]
true

julia> df_imm[:, 2] === eachcol(df_imm)[2]
false
  • Using : on the LHS performs the operation in place (the existing memory chunk is kept and only its contents change). If it is used to add a new column, a copy of the RHS is made, and obviously new memory is allocated for that column.
  • Using ! in an expression (or RHS) doesn’t perform any copy, but returns the underlying vector (or view)
  • Using ! on the LHS depends on the type of assignment. With normal assignment it replaces the chunk of memory without allocating: the column simply becomes the RHS. With broadcast it takes the RHS, makes a copy of it, and stores that copy as the column (thus allocating new memory)

If I get all of this right, then I need to better investigate it to grasp the rationale behind it. I mean,

  • on RHS, : performs a copy; ! doesn’t
  • on LHS, : works in place; ! doesn’t

I’d ask “why?”, but I think my confusion arises from the fact that I haven’t yet grasped the idea behind the “in-place/not-in-place” behavior in Julia Base.

The reason is exactly what you commented. In Base Julia : copies on RHS and does in-place assignment on LHS, so we need to respect this behavior.

Since DataFrames.jl also needed other behaviors, we introduced ! to handle them.
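For reference, the Base behavior mentioned above can be seen on a plain vector. This is a small sketch with made-up names (`x`, `y`, `alias`):

```julia
x = [1, 2, 3]

# `:` on the RHS: a new array with the same contents.
y = x[:]
@assert y == x
@assert y !== x
y[1] = 100
@assert x[1] == 1          # x is unaffected by changes to the copy

# `:` on the LHS: the existing array is written to in place.
alias = x
x[:] = [4, 5, 6]
@assert alias === x
@assert alias == [4, 5, 6]

# Plain assignment, by contrast, only rebinds the name `x`.
x = [7, 8, 9]
@assert alias !== x
@assert alias == [4, 5, 6]
```

Base has no indexing syntax for "give me the stored object without copying" on a column of a table, which is the gap that ! fills in DataFrames.jl.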

Also note the following logic. Both:

df[!, col]

and

df[!, col] = something

perform no memory copying and no allocation. Both are operations on pointers only (so they are both very fast, i.e. nanoseconds, even if a data frame has millions of columns).
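One practical consequence of "pointers only" is aliasing: whatever you get from or pass to ! is the same object the data frame holds. A small sketch on a made-up data frame (the names `col`, `snap` and `new` are illustrative):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# `df[!, :a]` hands back the stored vector itself, so edits show up in df.
col = df[!, :a]
@assert col === df.a
col[1] = 100
@assert df[1, :a] == 100

# `df[:, :a]` is a copy, so edits to it leave df alone.
snap = df[:, :a]
snap[2] = -1
@assert df[2, :a] == 2

# `df[!, :a] = new` stores `new` itself, so later edits to `new` are visible in df.
new = [7, 8, 9]
df[!, :a] = new
new[3] = 0
@assert df[3, :a] == 0
```

If that sharing is not wanted, use : on the RHS or broadcast with .= on the LHS to get an independent vector.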
