Release announcements for DataFrames.jl

The rules are described here. Be warned though: people say that clicking this link is like playing chess with Mikhail Tal, whose motto was :smiley: :

“You must take your opponent into a deep, dark forest where 2+2=5 and the path leading out is only wide enough for one.”

Now back to business. There are two layers to the issue.

Layer one is the mental model. If you see df.col you should be able to know with confidence that it does exactly the same thing as df[!, :col]. It is a basic principle that these two operations should be the same. They were (and under Julia 1.6 still are) inconsistent, which means that users have to learn the exceptions where they differ.
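To make the aliasing concrete, here is a minimal sketch (the data frame and column name are made up for illustration) that checks that both forms hand back the very same vector rather than a copy:

```julia
using DataFrames

df = DataFrame(a=1:3)

# Both forms return the stored column itself (no copy is made),
# so the two results are the identical object.
same_object = df.a === df[!, :a]

# For contrast, df[:, :a] returns a copy: equal values, different object.
is_copy = (df[:, :a] !== df.a) && (df[:, :a] == df.a)
```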

Layer two is that for indexing a data frame is a collection of columns (similarly to e.g. select/transform/subset/combine, but as opposed to operations like sort/filter/dropmissing/unique, where we tend to look at it as a collection of rows. I have warned you that this is a deep dark forest. The short story is that for some operations people tend to find the column-oriented view more natural, and for others the row-oriented one). Clearly, for indexing, if you write df.col this is column oriented. Why? Because e.g. if you write:

df.col .= 1

you would like this operation to work unconditionally. In particular, if df is missing column :col you want it created (which is clearly not in-place) - and I hope you agree that most people will want it to work. So think of df.col .= 1 as broadcasting into df, not into the column :col of this data frame (so essentially you are broadcasting into a vector of vectors, as this is the underlying structure that holds the columns of a DataFrame).
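As a quick sketch of the "create the column if it is missing" behavior, assuming Julia 1.7 or later and a DataFrames.jl release that supports it (the column name b is made up for the example):

```julia
using DataFrames

df = DataFrame(a=1:3)

# :b does not exist yet; under the assumptions above the broadcast
# assignment allocates a new column filled with the value 1.
df.b .= 1

names(df)  # expected to list both "a" and "b"
```

Under Julia 1.6 I would expect the same line to fail, because there df.b is looked up as an existing column before the broadcast happens.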

Now what is the benefit? Before moving forward think of what result you would expect from the following operation:

df = DataFrame(a=1:3)
df.a .= 'x'
df

Now scroll down:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= 'x'
3-element Vector{Int64}:
 120
 120
 120

julia> df
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │   120
   2 │   120
   3 │   120

Although this is consistent with the broadcasting rules of Julia Base for vectors, I assume that this is not what most people want when they write df.col .= 'x'. I bet that the majority expected a vector of 'x'. Similarly you have:

julia> df = DataFrame(a='a':'c')
3×1 DataFrame
 Row │ a    
     │ Char 
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> df.a .= 1
3-element Vector{Char}:
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)

julia> df
3×1 DataFrame
 Row │ a    
     │ Char 
─────┼──────
   1 │ \x01
   2 │ \x01
   3 │ \x01

Sadly, we have just failed to create a constant-term column for e.g. a linear regression model (although this is consistent with Julia broadcasting rules).

Also most likely we do not want an error thrown in this case:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

if you are in the middle of a 10-step @chain pipeline.

Such considerations are the second layer of why, in Julia 1.7, we prefer to make df.col .= 1 replace the column rather than update it in place.

We are aware that making this a replace operation rather than an in-place one sacrifices speed (which I bet 99% of users will never notice), but the benefit is lower surprise (you are sure to get what you most likely expect, and sure that the operation will not error) and higher consistency (you know that df.col and df[!, :col] are just aliases).

Finally, we have made sure you can still do in-place broadcasting if you want: just write df[:, :col] .= value.
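Here is a small side-by-side sketch of the two forms, assuming a DataFrames.jl 1.x release (the variable names are only illustrative). The : form writes into the existing column and so keeps its element type, while the ! form allocates a new column:

```julia
using DataFrames

df_inplace = DataFrame(a=1:3)
# In-place: each value is converted to the existing element type (Int64),
# so 'x' ends up stored as its code point.
df_inplace[:, :a] .= 'x'

df_replace = DataFrame(a=1:3)
# Replace: a fresh column is allocated with the element type of the value.
df_replace[!, :a] .= 'x'

(eltype(df_inplace.a), eltype(df_replace.a))
```

So if you care about keeping the column type (or about avoiding the allocation), use the : form and accept that it can convert or throw. If you care about getting what you wrote on the right-hand side, use the ! form.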
