What is the difference between `:` and `!` indexing in DataFrames with Unitful?

bertulli · November 19, 2022, 2:25pm

Hi all!

I wanted to decorate some data I have in a DataFrame with a unit of measure from Unitful. See:

julia> df_results_master = DataFrame(CSV.File(f))
468×3 DataFrame
 Row │ Instruction        Base power mean (W)  Base power std (W) 
     │ String31           Float64              Float64            
─────┼────────────────────────────────────────────────────────────
   1 │ add_r0_r0_0                  0.0821032          6.01459e-5
   2 │ add_r0_r0_1                  0.0833665          6.42208e-5
   3 │ add_r0_r0_10                 0.0845937          6.07616e-5
   4 │ add_r0_r0_100                0.0858542          6.14341e-5
   5 │ add_r0_r0_110                0.0866378          6.27856e-5
   6 │ add_r0_r0_120                0.0854615          5.98497e-5
   7 │ add_r0_r0_130                0.085026           5.94211e-5
   8 │ add_r0_r0_140                0.0857897          6.00773e-5
   9 │ add_r0_r0_150                0.0870989          6.08059e-5
  10 │ add_r0_r0_160                0.0852618          5.9398e-5
  11 │ add_r0_r0_170                0.0869853          6.09971e-5
  12 │ add_r0_r0_180                0.086765           6.23517e-5
  13 │ add_r0_r0_190                0.0868934          6.12408e-5
  14 │ add_r0_r0_20                 0.0848691          6.14883e-5
  ⋮  │         ⋮                   ⋮                   ⋮
 455 │ teq_r0_r5                    0.0838562          6.1484e-5
 456 │ tst_r0_0                     0.082128           6.00686e-5
 457 │ tst_r0_10                    0.0828279          6.11928e-5
 458 │ tst_r0_255                   0.0848542          6.00921e-5
 459 │ tst_r0_r0                    0.0706668          6.36107e-5
 460 │ tst_r0_r1                    0.0709795          6.26681e-5
 461 │ tst_r0_r2                    0.0711447          6.12671e-5
 462 │ tst_r0_r3                    0.0714723          6.46041e-5
 463 │ tst_r0_r4                    0.0710547          6.23303e-5
 464 │ tst_r0_r5                    0.0713467          6.54201e-5
 465 │ umlal_r1_r2_r4_r2            0.0886796          6.6934e-5
 466 │ umlal_r1_r5_r4_r2            0.0888859          6.66033e-5
 467 │ umull_r1_r5_r4_r0            0.0841324          6.91616e-5
 468 │ umull_r1_r5_r4_r2            0.0845281          6.81085e-5
                                                  440 rows omitted

julia> df_results_master[!,mean_power_sym] = df_results_master[:,mean_power_sym] .* u"W"
468-element Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}:
 0.08210324232296967 W
 0.08336646810436063 W
 0.08459367066671744 W
 0.08585415776092814 W
 0.08663776307205824 W
 0.08546146136499712 W
 0.08502599612798299 W
   0.085789709515833 W
 0.08709888469094718 W
 0.08526179889220828 W
  0.0869853359596325 W
 0.08676498037041373 W
 0.08689338616983189 W
 0.08486912612099981 W
 0.08581189837734816 W
 0.08688021452099513 W
                     ⋮
 0.08378983035530244 W
 0.08352483082288234 W
 0.08385624601375233 W
  0.0821280456260354 W
 0.08282787622372532 W
 0.08485424308684404 W
 0.07066680131581576 W
 0.07097953591863063 W
 0.07114470404972599 W
     0.0714722791075 W
 0.07105470125324514 W
  0.0713467156844716 W
 0.08867955558597908 W
 0.08888591524844881 W
 0.08413238639453503 W
 0.08452812752398081 W

As you can see, using the DF on the LHS requires indexing it with !. If instead I do:

julia> df_results_master[:,mean_power_sym] = df_results_master[:,mean_power_sym] .* u"W"
ERROR: DimensionError:  and W are not dimensionally compatible.
Stacktrace:
  [1] #s81#159
    @ ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:12 [inlined]
  [2] var"#s81#159"(::Any, s::Any, t::Any)
    @ Unitful ./none:0
  [3] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any})
    @ Core ./boot.jl:582
  [4] uconvert(a::Unitful.FreeUnits{(), NoDims, nothing}, x::Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}})
    @ Unitful ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:78
  [5] convert(#unused#::Type{Float64}, y::Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}})
    @ Unitful ~/.julia/packages/Unitful/ApCuY/src/conversion.jl:145
  [6] setindex!
    @ ./array.jl:966 [inlined]
  [7] macro expansion
    @ ./multidimensional.jl:946 [inlined]
  [8] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [9] _unsafe_setindex!(#unused#::IndexLinear, A::Vector{Float64}, x::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, I::Base.Slice{Base.OneTo{Int64}})
    @ Base ./multidimensional.jl:941
 [10] _setindex!
    @ ./multidimensional.jl:930 [inlined]
 [11] setindex!(A::Vector{Float64}, v::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, I::Function)
    @ Base ./abstractarray.jl:1344
 [12] setindex!(df::DataFrame, v::Vector{Quantity{Float64, 𝐋^2 𝐌 𝐓^-3, Unitful.FreeUnits{(W,), 𝐋^2 𝐌 𝐓^-3, nothing}}}, row_inds::Colon, col_ind::Symbol)
    @ DataFrames ~/.julia/packages/DataFrames/bza1S/src/dataframe/dataframe.jl:725
 [13] top-level scope
    @ REPL[159]:1

Why is that? From the DataFrames documentation and this SO answer, I think the point is that ! is a special argument that instructs getindex() to return the underlying data structure. This makes sense (for use on an LHS), but if that’s true, then why does the following (without the measurement unit) work?

julia> df_results_master[:,mean_power_sym] = 2 .* df_results_master[:,mean_power_sym] #.* u"W"
468-element Vector{Float64}:
 0.16420648464593934
 0.16673293620872126
 0.16918734133343488
 0.17170831552185628
 0.17327552614411648
 0.17092292272999424
 0.17005199225596598
 0.171579419031666
 0.17419776938189435
 0.17052359778441656
 0.173970671919265
 0.17352996074082747
 0.17378677233966378
 0.16973825224199962
 0.17162379675469633
 0.17376042904199027
 ⋮
 0.16757966071060487
 0.16704966164576468
 0.16771249202750466
 0.1642560912520708
 0.16565575244745065
 0.16970848617368808
 0.14133360263163153
 0.14195907183726125
 0.14228940809945198
 0.142944558215
 0.1421094025064903
 0.1426934313689432
 0.17735911117195816
 0.17777183049689763
 0.16826477278907007
 0.16905625504796162

Thanks!

bkamins · November 19, 2022, 2:33pm

It does not work for the same reason that the following does not work for matrices:

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:, 1] = x[:, 1] / 2
ERROR: InexactError: Int64(0.5)

If you use : on the LHS you indicate that you want an in-place operation. This means that the target must be able to store the source, i.e. the values on the RHS must be convertible to the element type of the existing column.
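A minimal Base-only sketch of that rule, with made-up values (the names `v`, `m` and `err` are just for illustration): in-place assignment succeeds when the new values fit the existing element type, which is presumably why the `2 .* ...` case in the question works, and fails otherwise.

```julia
# In-place assignment must fit the element type of the existing container.
v = [1.5, 2.5]               # Vector{Float64}
v[:] = 2 .* v                # fine: the result is still Float64
@assert v == [3.0, 5.0]

m = [1 2; 3 4]               # Matrix{Int64}
err = try
    m[:, 1] = m[:, 1] / 2    # 0.5 cannot be stored as an Int64
    nothing
catch e
    e
end
@assert err isa InexactError
@assert m == [1 2; 3 4]      # the failed assignment left m untouched
```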

If you use ! on LHS as a selector you indicate that you want to replace the column with new data, overwriting the type of the column:

julia> using DataFrames

julia> df = DataFrame(a=1:2, b=["a", "b"])
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b

julia> df[:, :a] = df[:, :b] # fails, as it is in place
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[!, :a] = df[:, :b] # works, as it is a replace operation
2-element Vector{String}:
 "a"
 "b"

julia> df
2×2 DataFrame
 Row │ a       b
     │ String  String
─────┼────────────────
   1 │ a       a
   2 │ b       b

Note that writing df[!, :a] is the same as writing df.a.

All this is explained in DataFrames.jl indexing rules | Blog by Bogumił Kamiński.
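Applied to the Unitful case from the question, this could look roughly as follows. It is only a sketch: the small data frame and its column names are invented, and it assumes both DataFrames.jl and Unitful.jl are installed.

```julia
using DataFrames, Unitful

# Hypothetical stand-in for the power measurements in the question.
df = DataFrame(instr=["add", "tst"], power=[0.082, 0.071])

# `:` on the LHS is in place: the Float64 column cannot hold quantities in W.
err = try
    df[:, :power] = df[:, :power] .* u"W"
    nothing
catch e
    e
end
@assert err isa Unitful.DimensionError
@assert eltype(df.power) == Float64      # column type is unchanged

# `!` on the LHS replaces the column, so the element type may change.
df[!, :power] = df[:, :power] .* u"W"
@assert eltype(df.power) <: Unitful.Quantity

# Stripping the unit again yields plain Float64 values.
watts = ustrip.(u"W", df.power)
@assert watts == [0.082, 0.071]
```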

bertulli · November 19, 2022, 2:45pm

Thanks! Now it’s clearer. A minor oddity I can see is that ! on the LHS forces a copy, whereas on the RHS it forces NOT to copy. But the blog post you linked explains everything IIUC. My Google-fu failed.
Thank you!

bkamins · November 19, 2022, 4:28pm

It does not. ! on LHS does not copy (it REPLACES but WITHOUT copy):

julia> df = DataFrame(a=[1, 2])
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> b = [2, 3]
2-element Vector{Int64}:
 2
 3

julia> df[!, :b] = b
2-element Vector{Int64}:
 2
 3

julia> df.b === b
true

To force a copy you need to broadcast:

julia> c = [4, 5]
2-element Vector{Int64}:
 4
 5

julia> df[!, :c] .= c
2-element Vector{Int64}:
 4
 5

julia> df.c === c
false

The point is that there are three (NOT TWO) possible behaviors that we need to handle:

in-place (: on LHS)
replace without copy (! on LHS with assignment =)
replace with copy (! on LHS with broadcasted assignment .=)
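The three behaviors can be checked side by side with `===`. This is a small sketch assuming DataFrames.jl 1.x; the variable names are only illustrative.

```julia
using DataFrames

df = DataFrame(a=[1, 2])
col = df.a                   # the vector currently stored in the data frame

# 1. In-place: `:` on the LHS writes into the existing vector.
df[:, :a] = [10, 20]
@assert df.a === col && col == [10, 20]

# 2. Replace without copy: `!` with `=` stores the RHS vector itself.
b = [2, 3]
df[!, :a] = b
@assert df.a === b && df.a !== col

# 3. Replace with copy: `!` with `.=` stores a freshly allocated vector.
c = [4, 5]
df[!, :a] .= c
@assert df.a == c && df.a !== c
```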

bertulli · November 19, 2022, 5:14pm

Ah OK, sorry, I think I missed the fact that it behaves differently when broadcasting. I should have said “it (re)creates a new piece of memory” (“copying” in the sense that you end up with a different memory chunk, though I recognize “allocating” would have been a better term). So:

Using : in an expression (or RHS) performs a copy

julia> df_imm[:, 2] == eachcol(df_imm)[2]
true

julia> df_imm[:, 2] === eachcol(df_imm)[2]
false

Using : on an LHS performs the operation in place (the column keeps the same memory chunk). If it is used to add a new column, a copy of the RHS is made, and obviously new memory is allocated for that column.
Using ! in an expression (or RHS) doesn’t perform any copy, but returns the underlying vector (or view).
Using ! on an LHS depends on the type of assignment. With normal assignment it replaces the column’s chunk of memory without allocating anything: the column simply becomes the RHS object itself. With broadcasted assignment it allocates a new vector, fills it with the values of the RHS, and installs that vector as the column.
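The RHS half of this summary can be illustrated by mutating what each selector returns. Again a sketch assuming DataFrames.jl, with invented names:

```julia
using DataFrames

df = DataFrame(a=[1, 2])

copied = df[:, :a]           # `:` on the RHS returns a copy
shared = df[!, :a]           # `!` on the RHS returns the stored vector

copied[1] = 100
@assert df.a[1] == 1         # the data frame does not see the change

shared[1] = 100
@assert df.a[1] == 100       # the data frame sees the change
@assert shared === df.a
```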

If I get all of this right, then I need to better investigate it to grasp the rationale behind it. I mean,

on RHS, : performs a copy; ! doesn’t
on LHS, : works in place; ! doesn’t

I’d ask “why?”, but I think my confusion arises from the fact that I haven’t yet grasped the idea behind the “in-place/not-in-place” behavior in Julia Base.

bkamins · November 19, 2022, 6:37pm

The reason is exactly what you commented. In Base Julia : copies on RHS and does in-place assignment on LHS, so we need to respect this behavior.

Since DataFrames.jl also needed other behaviors, we introduced ! to handle them.
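For reference, the Base behavior that DataFrames.jl mirrors can be seen with a plain vector (a short sketch, no packages needed):

```julia
v = [1, 2, 3]

# `:` on the RHS copies.
w = v[:]
@assert w == v && w !== v
w[1] = 99
@assert v[1] == 1            # v is unaffected by changes to the copy

# `:` on the LHS assigns in place: v is still the same array object.
alias = v
v[:] = [4, 5, 6]
@assert alias === v && alias == [4, 5, 6]
```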

Also note the following logic. Both:

df[!, col]

and

df[!, col] = something

do not copy or allocate any memory. Both are operations on pointers only (so they are very fast, i.e. nanoseconds, even if a data frame has millions of columns).
