The rules are described here. Be warned, though: people say that clicking this link is like playing chess with Mikhail Tal, whose motto was:
“You must take your opponent into a deep, dark forest where 2+2=5 and the path leading out is only wide enough for one.”
Now back to business. There are two layers to the issue.
Layer one is the mental model. If you see df.col you should be able to know with confidence that it does exactly the same thing as df[!, :col]. It is a basic principle that these two operations should be the same. They were (and under Julia 1.6 are) inconsistent, which means that users have to learn the exceptions where they differ.
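As a minimal sketch of the alias for reading a column (assuming DataFrames.jl is installed; the data frame here is made up for illustration), both forms hand back the stored vector itself, while row indexing with : makes a copy:

```julia
using DataFrames

df = DataFrame(a=1:3)

# Both forms return the stored column itself, not a copy.
@show df.a === df[!, :a]

# By contrast, `:` row indexing returns a copy of the column.
@show df.a === df[:, :a]
```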
Layer two is that, for indexing, a data frame is a collection of columns (similarly to e.g. select/transform/subset/combine, but as opposed to operations like sort/filter/dropmissing/unique, where we tend to look at it as a collection of rows. I have warned you that this is a deep dark forest; the short story is that for some operations people find the column-oriented view more natural and for others the row-oriented one). Clearly, for indexing, if you write df.col this is column-oriented. Why? Because e.g. if you write:
df.col .= 1
you would like this operation to work unconditionally. In particular, if df is missing column :col you want it created (which is clearly not in-place), and I hope you agree that most people will want this to work. So think of df.col .= 1 as broadcasting into df, not into column :col of this data frame (so essentially you are broadcasting into a vector of vectors, as this is the underlying structure that holds the columns of a DataFrame).
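To make the "column gets created" case concrete, here is a small sketch using the explicit df[!, :col] form, which this post treats as equivalent to df.col under Julia 1.7 (the column name :b is just an illustrative choice):

```julia
using DataFrames

df = DataFrame(a=1:3)

# Column :b does not exist yet; broadcasting through `!` creates it,
# which cannot be an in-place operation since there is nothing to update.
df[!, :b] .= 1

@show names(df)
@show df.b
```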
Now what is the benefit? Before moving forward think of what result you would expect from the following operation:
df = DataFrame(a=1:3)
df.a .= 'x'
df
Now scroll down:
julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= 'x'
3-element Vector{Int64}:
 120
 120
 120

julia> df
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │   120
   2 │   120
   3 │   120
Although this is consistent with the broadcasting rules of Julia Base for vectors, I assume it is not what most people want when they write df.col .= 'x'. I bet the majority expected a vector of 'x'. Similarly you have:
julia> df = DataFrame(a='a':'c')
3×1 DataFrame
 Row │ a
     │ Char
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> df.a .= 1
3-element Vector{Char}:
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)

julia> df
3×1 DataFrame
 Row │ a
     │ Char
─────┼──────
   1 │ \x01
   2 │ \x01
   3 │ \x01
Sadly, we have just failed to create a constant-term column for e.g. a linear regression model (although this is consistent with Julia broadcasting rules).
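For reference, the Base behaviour that both examples inherit can be reproduced without DataFrames.jl at all. This is only a sketch with made-up variable names; it shows that in-place broadcasting converts each value to the element type of the destination vector:

```julia
# Plain Base vectors, no DataFrames involved.
ints = [1, 2, 3]
ints .= 'x'        # each 'x' is converted to the integer element type
@show ints

chars = ['a', 'b', 'c']
chars .= 1         # 1 is converted to Char(1), a control character
@show chars
```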
Also most likely we do not want an error thrown in this case:
julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.a .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
if you are in the middle of a 10-step @chain pipeline.
Such considerations are the second layer of why, in Julia 1.7, we prefer to make df.col .= 1 replace columns rather than update them in place.
We are aware that replacing the column rather than updating it in place sacrifices some speed (which I bet 99% of users will never notice), but in exchange you get lower surprise (you get what you most likely expect, and the operation will not error) and higher consistency (you know that df.col and df[!, :col] are just aliases).
Finally, we have made sure you can do in-place broadcasting if you want: just write df[:, :col] .= value.
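A short sketch contrasting the two forms side by side (assuming DataFrames.jl is installed; the variable old is introduced here only to track the original column vector):

```julia
using DataFrames

df = DataFrame(a=1:3)
old = df.a                  # reference to the original column vector

# In-place: values are converted to the existing element type.
df[:, :a] .= 'x'
@show df.a === old          # still the same vector object
@show eltype(df.a)

# Replace: a freshly allocated column takes the place of the old one.
df[!, :a] .= 'x'
@show df.a === old
@show eltype(df.a)
```

The in-place form keeps the integer column (now holding 120), while the replacing form ends up with a Char column and leaves the old vector untouched.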