Regarding `view` vs `!`, the difference is the following: `!` allows you to mutate the bindings in the internal structure of `df`. Therefore, when you write:
df[!, :col] = vector
you can create the column if it does not exist and replace it if it does. The same holds for broadcasting assignment, like:
df[!, :col] .= x
Using a `view` would not allow this.
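To make the write side concrete, here is a minimal sketch, assuming a recent DataFrames.jl release (1.x). The data frame and the column names `:a`, `:b`, `:c` are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# Column :b does not exist yet, so `!` assignment creates it.
v = [10.0, 20.0, 30.0]
df[!, :b] = v
created_without_copy = df[!, :b] === v   # the data frame stores `v` itself

# Column :b exists now, so the same syntax replaces it (a new eltype is fine).
df[!, :b] = ["x", "y", "z"]

# Broadcasting assignment with `!` also creates or replaces whole columns.
df[!, :c] .= 0       # creates :c
df[!, :a] .= 1.5     # replaces the Int column :a with a Float64 column

# A view only wraps the existing vector: writes go through in place,
# but the column keeps its element type and cannot be swapped out.
vw = @view df[:, :c]
vw .= 7
```

The view can change values inside an existing column, but it has no way to rebind the column stored in `df` or to add a new one.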
Now for the `getindex` side, there is an ongoing discussion about what we should do (see Make `getproperty(df, col)` return a full length view of the column · Issue #1844 · JuliaData/DataFrames.jl · GitHub). First, the current major difference is speed (this probably could be improved):
julia> @benchmark df[!, 1]
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.746 ns (0.00% GC)
median time: 15.948 ns (0.00% GC)
mean time: 18.275 ns (0.00% GC)
maximum time: 157.673 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark @view df[:, 1]
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 1
--------------
minimum time: 308.907 ns (0.00% GC)
median time: 312.146 ns (0.00% GC)
mean time: 358.267 ns (0.66% GC)
maximum time: 5.262 μs (93.72% GC)
--------------
samples: 10000
evals/sample: 247
Second, a `view` and raw access to the vector are not always the same thing type-wise (this may affect method dispatch downstream; it does, for example, for `CategoricalArray`).
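The type difference can be seen with a plain `Vector` column. This is a sketch assuming DataFrames.jl 1.x; `storage_kind` is a hypothetical helper defined here only to make dispatch visible, not part of any package.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

col = df[!, :a]        # the vector stored in the data frame
vw  = @view df[:, :a]  # a view wrapping that vector

# Hypothetical helper: two methods, so the argument type decides which one runs.
storage_kind(::Vector) = "Vector"
storage_kind(::SubArray) = "SubArray"

kind_of_col = storage_kind(col)
kind_of_vw  = storage_kind(vw)

# Both refer to the same memory, so a write through the view is visible in `col`.
vw[1] = 100
```

Any downstream method that is specialized on the concrete array type (as happens with `CategoricalArray`) can therefore behave differently for the two forms even though they hold the same data.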
Finally, people already complained that `df[!, :col]` is longer to write than `df[:col]`, and `@view df[:, :col]` would be even longer to write.
In summary, we needed `!` for the “write” side of the `getindex`/`setindex!`/broadcasting assignment combo anyway. Then we had to define what it does for the “read” side to be consistent. As of today we have decided that it will do the same as what `df[:col]` did (as removing an equivalent of `df[:col]` for reading data from a data frame would hurt performance and lead to code breakage).
Note that with what we have implemented you simply add `!,` to your code and you know it will work as it used to without having to think about it. If we had switched to the `@view` approach, which was considered, a lot of code would be broken. Actually, the deprecation of `df[:col]` would be `parent(@view df[:, col])`, but we considered that this would not be acceptable. With a package like DataFrames.jl we had to consider the fact that people have 5+ years of accumulated code using it, and if we were going to be breaking (which we decided to do) the “fixing” should be easy and not visually noisy.
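For comparison, the two candidate replacements for the old `df[:col]` read can be placed side by side. This is a sketch assuming DataFrames.jl 1.x, with a made-up one-column data frame; the old `df[:col]` spelling itself is not run here because it has since been deprecated.

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])

# The replacement that was adopted: add `!,` in front of the column.
adopted = df[!, :col]

# What a view-based design would have required to get the raw vector back.
view_based = parent(@view df[:, :col])

# Both expressions hand back the very same stored vector.
same_object = adopted === view_based
```

The results are identical, so the choice between them comes down to how much visual noise the fix adds to existing code.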