df.column is allowed and is the same as df[!, :column].
Now the reasons why df[:column] and df[columns_vector] are disallowed (currently deprecated) are the following (various points were raised by various people; I will comment below on which are most important for me):
- consistency 1: a data frame is a two-dimensional container; therefore technically it should be indexed into using both row and column indices; in particular, in Base, using a single index like matrix[idx] performs linear indexing (as opposed to column selection, which is what data frames did)
- consistency 2: even if we wanted to allow single-index indexing, using this index to select columns goes against the “collection of rows” understanding of a data frame in other functions, like filter or sort
- safety 1: ultimately df[:col] and df[col_vector] are unsafe and discouraged syntax; the reason is that they give you access to the “raw” underlying columns (without copying them); this is a major source of the bug reports we get from users; e.g. they do df2 = df[col_vector], then push! a row to df2 or sort df2, and in consequence corrupt the consistency of df. Sometimes this unsafe operation is desirable (it is very fast, because it is non-copying), therefore we want to keep allowing it, but we decided that a clear visual signal should be given so that the user is immediately warned that this is an unsafe operation; using ! is a standard in Julia to indicate an operation that might lead to mutation of the argument (and this is essentially what we do here: by writing df[!, :col] we extract a “private” column of df)
- safety 2: also, a very common pattern that led to bugs was writing something like df[df.x1 .< 0.25] where people thought they were selecting rows (this kind of bug was also reported to happen); it actually used to select columns of a data frame, while df[df.x1 .< 0.25, :] was the intended syntax
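The two safety points can be sketched as follows. This is only a minimal illustration, assuming a DataFrames.jl 1.x release; the column names and values are made up for the example.

```julia
using DataFrames

# Hypothetical data, chosen only for illustration.
df = DataFrame(x1 = [0.1, 0.5, 0.2], x2 = [1, 2, 3], x3 = ["a", "b", "c"])

# Safety 2: selecting rows needs both indices, a Bool mask for the rows
# and `:` for "all columns".
small = df[df.x1 .< 0.25, :]        # keeps rows 1 and 3

# Safety 1: `!` selects columns without copying, so df2 shares its
# column vectors with df.
df2 = df[!, [:x1, :x2]]
shared = df2.x1 === df.x1           # same vector object

# `:` copies the selected columns instead.
safe = df[:, [:x1, :x2]]
independent = safe.x1 !== df.x1     # a different vector object

# Pushing a row to df2 grows the shared vectors, so the columns of df
# no longer have equal lengths: df is corrupted.
push!(df2, (0.7, 4))
lengths = (length(df.x1), length(df.x3))
```

The copy made with : is unaffected by the push!, which is why it is the safe default when you intend to modify the result.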
For me, personally, the “safety” reasons were more important than the “consistency” reasons, but both are valid. Simply put, we want the syntax to help users write fewer bugs. Still, we provide the df.col syntax, as we acknowledge it is a convenient way to pick a column (and unless you programmatically generate :col you do not need to write df[!, :col] if you do not want to).
Finally, we have decided that df[!, :col] is not that much more verbose than df[:col]; it is only 2 characters more. We could have left it out and deprecated df[:col] in favor of getproperty(df, :col) and df[col_vector] in favor of select(df, col_vector, copycols=false), but this would be overkill, so we have decided to add a special syntax using !.
I also think that having the ! syntax will help new users better understand the nature of a data frame. As opposed to a Base container like a Matrix, a data frame is a nested structure, so you have to clearly distinguish:
- df[!, col], which “accesses” the internal structure of a data frame
- df[:, col], which is essentially the same, but copying, so we treat a data frame as “a whole” (not mutating its internal structure)
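A small sketch of this distinction when reading a single column, assuming DataFrames.jl 1.x (the data and variable names are invented for the example):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

internal = df[!, :a]    # the very vector stored inside df, no copy
copied   = df[:, :a]    # a freshly allocated copy of that vector

is_same_object = internal === df.a                   # identical to the stored column
is_equal_copy  = copied == df.a && copied !== df.a   # equal values, different object

copied[1] = 100         # changes only the copy
after_copy_write = df.a[1]

internal[1] = -1        # writes straight into df
after_internal_write = df.a[1]
```

Writing through the vector obtained with ! changes df itself, while the one obtained with : can be modified freely.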
What I want to say is best seen with an example. This is the old (deprecated) behavior:
julia> df = DataFrame()
0×0 DataFrame

julia> df[:, :a] = [1,2,3]
┌ Warning: `setindex!(df::DataFrame, v::AbstractVector, ::Colon, col_ind::ColumnIndex)` is deprecated, use `begin
│     df[!, col_ind] = v
│     df
│ end` instead.
│   caller = top-level scope at none:0
└ @ Core none:0
3-element Array{Int64,1}:
 1
 2
 3

julia> df
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
As you can see, you could add a column to a data frame using :, which was inconsistent: you mutated the “internals” of the data frame using :, which should not allow this. Now it is clear that if you want to do such an operation you should use !, not :, which a) is consistent and b) warns you that you are going to significantly influence the internal structure of the data frame.
With the ! and : distinction we were able to write a consistent set of rules for what each operation does here, which is very easy to remember (at least this is what I think). The only thing you have to know is that : works like it does for matrices and ! is mutating.
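For assignment, the rule can be sketched like this. The example assumes DataFrames.jl 1.x and only uses : on a column that already exists; the vector names are made up.

```julia
using DataFrames

df = DataFrame()
v = [1, 2, 3]

# `!` changes the structure of df: v itself becomes column :a (no copy).
df[!, :a] = v
stored_as_is = df.a === v

# `:` works like for matrices: it writes values into the existing column
# in place, so the column is still the same vector object.
df[:, :a] = [10, 20, 30]
still_same_vector = df.a === v
v_after_inplace = copy(v)

# `!` can also replace an existing column, even with a different eltype.
df[!, :a] = [1.5, 2.5, 3.5]
new_eltype = eltype(df.a)
replaced = df.a !== v
```

After the last assignment v is no longer part of df, which is the kind of structural change the ! is meant to flag.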
This also has the consequence that we will be able (after the deprecation period) to significantly reduce the size of the code base used for defining indexing (far fewer methods need to be specified, as the rules are simpler).