I am a bit puzzled that the following code allocates. According to the docs, a DataFrameRow should be a view, and setting a Dict entry from one of its fields should not allocate IMHO. Am I using DataFrames wrong? A NamedTuple behaves as expected.
using DataFrames
using BenchmarkTools
df = DataFrame(a = 1.0, b = 0.0)
# should be a view according to the docs
row = df[1,:]
nt = (a = 1.0, b = 0.0)
# target structure
target = Dict{Symbol, Float64}()
set_from_row!(target, row) = (target[:out] = row.a)
set_from_named_tuple!(target, nt) = (target[:out] = nt.a)
@benchmark set_from_row!($target, $row)
@benchmark set_from_named_tuple!($target, $nt)
If you want to perform a single operation then it probably does not matter;
If you want to do millions of such operations then:
either use higher-level functions provided by DataFrames.jl like select or combine and they will be efficient;
if you want to use low-level operations, like loops, then:
if your data frame is not wide then convert it to a NamedTuple with Tables.columntable - this operation is cheap, and everything you do with it afterwards is type stable;
if your data frame is very wide but you do not need to process all columns then drop the unneeded columns and proceed as in the point above;
if your data frame is very wide and you need all columns then you have a problem - this is the case when writing type stable code is hard and you should rather consider using combine or select as they are optimized to efficiently handle such cases.
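For the narrow-table case, the Tables.columntable approach mentioned above can be sketched like this (a minimal example; the column names just mirror the a/b frame from the question):

```julia
using DataFrames, Tables

# Illustrative column names only, mirroring the `a`/`b` frame above.
df = DataFrame(a = rand(100), b = rand(100))

# One cheap conversion: `cols` is a NamedTuple of vectors, so all
# element types are known to the compiler from here on.
cols = Tables.columntable(df)

# A function barrier: inside `total` the loop is fully type stable.
function total(cols)
    s = 0.0
    for i in eachindex(cols.a)
        s += cols.a[i] - cols.b[i]^2
    end
    return s
end

total(cols)
```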
In summary - being type stable is not a free lunch, as it heavily burdens the Julia compiler. DataFrames.jl was designed to be maximally flexible, but this means that it must be type unstable (otherwise you would not be able to e.g. dynamically add columns to a data frame). Also, the functions provided by DataFrames.jl are optimized to automatically "enable" type stability of operations. Finally - as I have said - if your data is narrow then turning it into a type-stable NamedTuple is cheap.
Thanks! For my application the Tables.rowtable / namedtupleiterator seems to be a good solution.
One remark though: as an "end-user" of DataFrames it is confusing for me that directly accessing a field (e.g. df[1, :a]) seems to be type stable, whereas accessing it the way shown above is not. I understand your points, but I find it extremely difficult to use DataFrames in performance-critical applications. It is quite a narrow edge between extremely fast operations in DataFrames.jl and extremely slow ones.
DataFrames.jl is not intended for performance-critical work. It is meant to:
be a flexible package for pre- and post-processing of data
provide efficient implementations of common data transformation patterns (split-apply-combine, joins, reshaping etc.)
For performance-critical work Julia has dozens of specialized packages optimized for various use cases (like static arrays, GPU computing, etc.). It is impossible to cover all of these in DataFrames.jl - therefore we decided to specialize it for non-performance-critical operations plus common transformations.
Also let me comment on what is performance critical in the case of DataFrames.jl: the operation that will be slow is processing data row by row when your data frame has millions of rows. In such a case use Tables.namedtupleiterator or similar. But if your operation can be performed columnwise then DataFrames.jl will be fast. E.g. if you want to apply a function fun to all elements of column :a, just do fun.(df.a) and this will be fast. What will be slow is [fun(df[i, :a]) for i in 1:nrow(df)]. The former is fast because you create a function barrier. The latter is slow because df[i, :a] is type unstable.
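To make the function-barrier point concrete, here is a small sketch (fun is just a placeholder function; both lines compute the same result):

```julia
using DataFrames

df = DataFrame(a = rand(10^5))
fun(x) = x^2 + 1.0

# Fast: `df.a` pulls out the concrete Vector{Float64} once, and the
# broadcast runs behind that function barrier.
fast = fun.(df.a)

# Slow: every `df[i, :a]` lookup is type unstable, so each iteration
# pays for dynamic dispatch.
slow = [fun(df[i, :a]) for i in 1:nrow(df)]

fast == slow   # identical results, very different cost
```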
And this is with vectorized operations, where one cannot reuse the original elementwise f(x) function definition. sum(f, eachrow(df)) is much slower still.
In my past experience, applying vectorized functions to dataframes does get you high performance, but only when the tables themselves are large. For repeated operations on small or medium-sized tables, the overhead is often very significant, and can dominate the total runtime.
Accessing individual values/rows is pretty convenient for many algorithms, and there is a wide set of type-stable tables to choose from in Julia, including Vector{NamedTuple}, which doesn't even require any packages.
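For example, a plain vector of named tuples needs no packages at all and iterates type-stably (a minimal sketch):

```julia
# No packages needed: each element is a concrete NamedTuple, so both
# iteration and field access are type stable.
rows = [(a = 1.0 * i, b = 0.5 * i) for i in 1:1000]

sum(r -> r.a - r.b^2, rows)
```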
This is benchmarking a different thing than what you did above, since:
the code you use for Vector{NamedTuple} does not allocate, while the broadcasting in your data frame example does;
you use a different memory layout: in a data frame the memory layout is column-wise, while in Vector{NamedTuple} it is row-wise. Of course both layouts have their pros and cons in different situations; in your example the row-wise layout is more CPU-cache friendly.
Just to give a complete picture of the situation:
julia> @btime sum(x -> x[1] - x[2]^2, zip($df.a, $df.b)) # no broadcasting cost, but cost of dynamic dispatch
228.680 ns (4 allocations: 112 bytes)
9.914290936797833
julia> a, b = df.a, df.b;
julia> @btime sum(x -> x[1] - x[2]^2, zip($a, $b)) # no cost of dynamic dispatch, but memory layout cost
78.920 ns (0 allocations: 0 bytes)
9.914290936797833
julia> z = collect(zip(a, b));
julia> @btime sum(x -> x[1] - x[2]^2, $z) # improved memory layout
27.867 ns (0 allocations: 0 bytes)
9.914290936797839
In conclusion - Julia is a very good language for writing high-performance code. DataFrames.jl for sure will not solve all these problems, but in many cases it is quite efficient, especially for e.g. split-apply-combine or joins, and if your data is big.
However, if you care about nanoseconds in your code then different data structures are preferable.
Actually, the columnwise layout is more efficient here, as evidenced by StructArrays being ~1.5x faster than a vector of named tuples.
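For reference, the StructArrays variant can be sketched like this (the ~1.5x figure will of course vary by machine; column names are illustrative):

```julia
using StructArrays

# Column-wise storage with row-like access: one vector per field,
# but iteration yields NamedTuple elements.
sa = StructArray(a = rand(1000), b = rand(1000))

sum(x -> x.a - x.b^2, sa)
```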
Sure, it uses broadcasting. I'm not aware of a better way using idiomatic DataFrames.jl operations.
All the more performant solutions in your post also lose the column names when doing computations. That's more error-prone even for two variables, and even more so for 3 or 5.
The key point here is "if your table is big".
Maybe I'm doing something wrong, but the difference for simple split-apply-combine operations is quite large - not nanoseconds, but tens of microseconds:
I'm not following the details here too closely, but DataFramesMeta also has some utility functions for fast row iteration.
The @with macro constructs an anonymous function and passes columns to that function, so it is type stable. Similarly, the @eachrow macro uses the same tricks as @with to make it seem like you are doing eachrow(df), but with faster performance.
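A short sketch of both macros (column names are illustrative; both compute the same quantity):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a = rand(100), b = rand(100))

# @with passes the :a and :b columns into an anonymous function, so
# the body runs behind a function barrier.
s1 = @with df begin
    sum(:a .- :b .^ 2)
end

# @eachrow looks like plain row iteration but compiles to the same
# trick; @newcol adds a typed output column.
df2 = @eachrow df begin
    @newcol :c::Vector{Float64}
    :c = :a - :b^2
end
s2 = sum(df2.c)

s1 ≈ s2
```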