Accessing a column value from DataFrameRow allocates

Hi,

I am a bit puzzled that the following code allocates. According to the docs, a DataFrameRow is a view, so setting a Dict entry from one of its fields should not allocate, IMHO. Am I using DataFrames wrong? A NamedTuple behaves as expected.

using DataFrames
using BenchmarkTools

df = DataFrame(a = 1.0, b = 0.0)

# should be a view according to the docs
row = df[1,:]

nt = (a = 1.0, b = 0.0)

# target structure
target = Dict{Symbol, Float64}()

set_from_row!(target, row) = (target[:out] = row.a)
set_from_named_tuple!(target, nt) = (target[:out] = nt.a)

@benchmark set_from_row!($target, $row)

@benchmark set_from_named_tuple!($target, $nt)

Output:

BenchmarkTools.Trial: 10000 samples with 932 evaluations.
 Range (min … max):  107.207 ns …  2.561 ΞΌs  β”Š GC (min … max): 0.00% … 94.44%
 Time  (median):     108.146 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   109.610 ns Β± 34.763 ns  β”Š GC (mean Β± Οƒ):  0.44% Β±  1.34%

  β–β–‡β–ˆβ–‡β–†β–„β–‚β–‚β–‚β–‚β–‚β–ƒβ–„β–ƒβ–‚β–β–β–β–β–β–                                        β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–‡β–…β–…β–…β–…β–„β–…β–„β–…β–ƒβ–…β–„β–…β–…β–β–„β–„β–…β–…β–…β–ƒβ–†β–ƒβ–…β–…β–†β–…β–ƒβ–…β–… β–ˆ
  107 ns        Histogram: log(frequency) by time       126 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  8.925 ns … 48.298 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.009 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.093 ns Β±  1.060 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–…   β–ˆ   β–…   β–„   β–„   β–‚                                   ▁
  β–‡β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–ˆβ–β–β–β–†β–β–β–β–‡β–β–β–β–†β–β–β–β–†β–β–β–β–†β–β–β–β–†β–β–β–… β–ˆ
  8.92 ns      Histogram: log(frequency) by time     9.55 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

DataFrameRow is not type stable, so in target[:out] = row.a you have dynamic dispatch and boxing.
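A minimal way to see this instability for yourself (reusing the df, row, and nt from the original post; the inferred types shown in comments are what inference typically reports):

```julia
using DataFrames

df = DataFrame(a = 1.0, b = 0.0)
row = df[1, :]
nt = (a = 1.0, b = 0.0)

# A DataFrameRow does not carry the column element types in its type,
# so the compiler can only infer `Any` for field access:
Base.return_types(r -> r.a, (typeof(row),))  # typically [Any]

# A NamedTuple encodes the field types in its type, so inference is exact:
Base.return_types(n -> n.a, (typeof(nt),))   # [Float64]
```

The `Any` result is exactly where the dynamic dispatch and boxing come from.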

Ok, any solution to get this type stable? Or do I have to drop DataFrames once again…

It depends what you want to do:

  1. If you want to perform a single operation then it probably does not matter;
  2. If you want to do millions of such operations then:
    • either use higher-level functions provided by DataFrames.jl like select or combine and they will be efficient;
    • if you want to use low-level operations, like loops, then:
      • if your data frame is not wide then convert it to a NamedTuple with Tables.columntable - this operation is cheap and everything you do with the result afterwards is type stable;
      • if your data frame is very wide but you do not need to process all columns then drop the unneeded columns and do what I described in the point above;
      • if your data frame is very wide and you need all columns then you have a problem - this is the case when writing type stable code is hard and you should rather consider using combine or select as they are optimized to efficiently handle such cases.
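For the narrow-table case, a sketch of the Tables.columntable approach applied to the example from the original post (set_from_cols! is an illustrative name, not an API):

```julia
using DataFrames, Tables

df = DataFrame(a = 1.0, b = 0.0)

# Tables.columntable returns a NamedTuple of column vectors. Its type
# encodes the column names and element types, so code that works on it
# is type stable.
cols = Tables.columntable(df)  # (a = [1.0], b = [0.0])

target = Dict{Symbol, Float64}()

# cols.a infers as Vector{Float64}, so this assignment does not box:
set_from_cols!(target, cols, i) = (target[:out] = cols.a[i])
set_from_cols!(target, cols, 1)
```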

In summary - being type stable is not a free lunch as it heavily burdens the Julia compiler. DataFrames.jl was designed to be maximally flexible, but this means that it must be type unstable (otherwise you would not be able to e.g. dynamically add columns to a data frame). Also functions provided by DataFrames.jl were optimized to automatically β€œenable” type-stability of operations. Finally - as I have said - if your data is narrow then turning it to a type-stable NamedTuple is cheap.


Thanks! For my application the Tables.rowtable / namedtupleiterator seems to be a good solution.

One remark though: As an β€œend-user” of DataFrames it is confusing for me that directly accessing the field (e.g. df[1,:a]) seems to be type stable, whereas accessing in the way shown above is not. I understand your points, but I find it extremely difficult to use DataFrames in performance-critical applications. It is a quite narrow edge between extremely fast operations in DataFrames.jl and extremely slow ones.

df[1,:a] is not type stable.

I find it extremely difficult to use DataFrames in performance-critical applications.

In performance critical applications, if what you want to do cannot be achieved with the provided higher-order functions like combine or select then do not use DataFrames.jl. Either perform a conversion to a NamedTuple or use a barrier function. See Julia-DataFrames-Tutorial/11_performance.ipynb at master Β· bkamins/Julia-DataFrames-Tutorial Β· GitHub.
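A sketch of the function-barrier pattern mentioned above (the function names are illustrative, not from the linked tutorial):

```julia
using DataFrames

# The outer function does the type-unstable column extraction once;
# the inner "kernel" receives concretely typed vectors, so the hot
# loop compiles to efficient, specialized code.
total(df::DataFrame) = _total(df.a, df.b)  # barrier: types become known here

function _total(a::AbstractVector, b::AbstractVector)
    s = zero(eltype(a))
    @inbounds for i in eachindex(a, b)
        s += a[i] - b[i]^2
    end
    return s
end

df = DataFrame(a = rand(100), b = rand(100))
total(df)
```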

DataFrames.jl is not intended for performance-critical work. It is meant to:

  • be a flexible package for pre- and post- processing the data
  • provide efficient implementations of common data transformation patterns (split-apply-combine, joins, reshaping etc.)

For performance critical work Julia has dozens of specialized packages optimized for various use-cases (like static arrays, GPU computing etc.). It is impossible to cover all of these in DataFrames.jl - therefore we decided to specialize it for non-performance-critical operations + common transformations.

Also let me comment on what is performance critical in the case of DataFrames.jl: the operation that will be slow is processing data row by row when your data frame has millions of rows. In such a case use Tables.namedtupleiterator or similar. But if your operation can be performed columnwise then DataFrames.jl will be fast. E.g. if you want to apply a function fun to all elements of column :a just do fun.(df.a) and this will be fast. What will be slow is [fun(df[i, :a]) for i in 1:nrow(df)]. The former is fast because you create a function barrier. The latter is slow because df[i, :a] is type unstable.
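A concrete sketch of this columnwise vs. rowwise contrast (fun here is just a placeholder function):

```julia
using DataFrames

df = DataFrame(a = rand(1000))
fun(x) = x^2 + 1.0

# Fast: df.a extracts the Vector{Float64} once, and the broadcast
# loop over it is fully type stable.
fast = fun.(df.a)

# Slow: every df[i, :a] lookup is type unstable, so each iteration
# pays for dynamic dispatch and boxing.
slow = [fun(df[i, :a]) for i in 1:nrow(df)]

fast == slow  # same result, very different cost
```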


Often, one doesn’t even need any specialized packages to achieve close-to-optimal performance.
Pure Base Julia tables perform quite well:

julia> f(x) = x.a - x.b^2
julia> tbl = [(a=rand(), b=rand()) for _ in 1:100]
julia> @btime sum(f, $tbl)
  30.805 ns (0 allocations: 0 bytes)

Here, it’s within a factor of 2 from a somewhat more specialized (but still very general) StructArrays package:

julia> tbl = [(a=rand(), b=rand()) for _ in 1:100] |> StructArray;
julia> @btime sum(f, $tbl)
  18.739 ns (0 allocations: 0 bytes)

For comparisons, DataFrames:

julia> df = DataFrame(tbl);
julia> @btime sum($df[!, :a] .- $df[!, :b].^2)
  870.204 ns (8 allocations: 1.06 KiB)

And this is with vectorized operations, where one cannot reuse the original elementwise f(x) function definition. sum(f, eachrow(df)) is much slower still.

In my past experience, applying vectorized functions to dataframes does get you high performance, but only when the tables themselves are large. For repeated operations on small or medium-sized tables, the overhead is often very significant, and can dominate the total runtime.

Accessing individual values/rows is pretty convenient for many algorithms, and there is a wide set of type-stable tables to choose in Julia. Including Vector{NamedTuple} that don’t even require any packages.

This is benchmarking a different thing than what you did above, since:

  • the code you use for Vector{NamedTuple} does not allocate, while your data frame example used broadcasting, which allocates intermediate arrays;
  • you use a different memory layout: a data frame stores data column-wise, while Vector{NamedTuple} is row-wise. Both layouts have their pros and cons in different situations; in your example the row-wise layout is more CPU-cache friendly.

Just to give a complete picture of the situation:

julia> @btime sum(x -> x[1] - x[2]^2, zip($df.a, $df.b)) # no broadcasting cost, but cost of dynamic dispatch
  228.680 ns (4 allocations: 112 bytes)
9.914290936797833

julia> a, b = df.a, df.b;

julia> @btime sum(x -> x[1] - x[2]^2, zip($a, $b)) # no cost of dynamic dispatch, but memory layout cost
  78.920 ns (0 allocations: 0 bytes)
9.914290936797833

julia> z = collect(zip(a, b));

julia> @btime sum(x -> x[1] - x[2]^2, $z) # improved memory layout
  27.867 ns (0 allocations: 0 bytes)
9.914290936797839

In conclusion - Julia is a very good language for writing high performance code. DataFrames.jl for sure will not solve all these problems, but in many cases it is quite efficient, especially for e.g. split-apply-combine or joins, and if your data is big.

However, if you care about nanoseconds in your code then different data structures are preferable.

Actually, the columnwise layout is more efficient here, as evidenced by StructArrays being ~1.5x faster than vector of namedtuples.

Sure, it uses broadcasting. I’m not aware of a better way, using idiomatic dataframes.jl operations.

All the more performant solutions in your post also lose the column names when doing computations. That is more error-prone even for two variables, and even more so for 3 or 5.

The key point here is β€œif your table is big”.

Maybe I’m doing something wrong, but the difference for simple split-apply-combine operations is quite large - not nanoseconds, but tens of microseconds:

# StructArrays + SplitApplyCombine:
julia> tbl = [(a=rand(1:1000), b=rand()) for _ in 1:100] |> StructArray;
julia> @btime map(gr -> sum(>(0.5), gr.b), groupview(x -> x.a % 3, $tbl))
  2.048 ΞΌs (18 allocations: 2.12 KiB)
3-element Dictionaries.Dictionary{Int64, Int64}
 1 β”‚ 21
 2 β”‚ 20
 0 β”‚ 18

# DataFrames:
julia> fdf(df) = let
       df[!, :key] = df[!, :a] .% 3
       combine(groupby(df, :key), :b => bs -> sum(>(0.5), bs))
       end
julia> @btime fdf($df)  setup=(df=DataFrame(tbl))
  58.462 ΞΌs (367 allocations: 21.99 KiB)
3Γ—2 DataFrame
 Row β”‚ key    b_function 
     β”‚ Int64  Int64      
─────┼───────────────────
   1 β”‚     0          18
   2 β”‚     1          21
   3 β”‚     2          20

Yes - because setting up a DataFrame object is expensive.

I’m not following the details here too closely, but DataFramesMeta also has some utility functions for fast row iteration.

The @with macro constructs an anonymous function and passes the columns to that function, so it is type stable. Similarly, the @eachrow macro uses the same tricks as @with to make it seem like you are doing eachrow(df), but with faster performance.
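A hedged sketch of those two macros (check the DataFramesMeta.jl docs for the full semantics):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a = rand(100), b = rand(100))

# @with builds an anonymous function that takes the referenced columns
# as arguments, so its body compiles against concrete column types
# (i.e. it acts as a function barrier):
s = @with df begin
    sum(:a .- :b .^ 2)
end

# @eachrow uses the same trick for row-style code; it returns a
# transformed copy (@eachrow! mutates the data frame in place):
df2 = @eachrow df begin
    :a = :a - :b^2
end
```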
