How much performance potential does DataFrames have?

This benchmark shows that DataFrames is much slower and more verbose than data.table:
https://h2oai.github.io/db-benchmark/

I was wondering if DataFrames could reach data.table’s performance in the future.

From what I understand, a ton (of potential). First of all, it isn’t always that far behind in those benchmarks. Second of all, last I remember, the DataFrames.jl team was thinking about how best to optimize grouping operations, including when it might be appropriate to parallelize automatically. For example, if you are just taking the sum or mean of a column within each group, as most of those benchmarks do, the groups are independent of each other, so the work can be done in parallel.
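To make the parallelization idea concrete, here is a minimal sketch of a threaded grouped sum in plain Julia. This is not how DataFrames.jl implements grouping; the function name `grouped_sum_threaded` and the assumption that group ids are integers in `1:ngroups` are made up for illustration.

```julia
using Base.Threads

# Hypothetical helper (not a DataFrames.jl function): sum `values` within each group.
# Assumes every entry of `groups` is an integer in 1:ngroups.
function grouped_sum_threaded(groups::Vector{Int}, values::Vector{Float64}, ngroups::Int)
    length(groups) == length(values) || throw(ArgumentError("length mismatch"))
    nchunks = nthreads()
    # One partial accumulator per chunk, so tasks never write to the same memory.
    partials = [zeros(Float64, ngroups) for _ in 1:nchunks]
    chunksize = max(1, cld(length(values), nchunks))
    chunks = collect(Iterators.partition(eachindex(values), chunksize))
    @threads for c in eachindex(chunks)
        acc = partials[c]
        for i in chunks[c]
            acc[groups[i]] += values[i]
        end
    end
    # Combine the per-chunk partial sums into one vector of group sums.
    return reduce(+, partials)
end

groups = [1, 2, 1, 2, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
sums = grouped_sum_threaded(groups, values, 3)
```

Start Julia with several threads (for example `julia -t 4`) for the loop to actually run in parallel; with a single thread it still gives the same result.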

Given what happened with CSV.jl (it was no good on benchmarks → now it is the fastest CSV reader out there), I fully expect it is only a matter of time before DataFrames.jl steps out ahead. But the team isn’t solely focused on performance; usability is important too, so lots of thought and planning takes place before implementing new things just because they might speed it up.

This benchmark site is quite good imho, but it only covers two data frame functionalities: group by and join.
For the first, DataFrames.jl is already quite good (usually top 3), but data.table (which has received far more resources for optimization) is still a bit ahead. Join performance is something the DataFrames.jl developers are actively working on.
Other important data frame functionalities not tested by this benchmark are:

  1. File IO: here CSV.jl is top of class, and Arrow.jl should also be quite competitive.
  2. Vector operations (e.g. df.c = df.a .* df.b): these are pure Julia array operations in DataFrames.jl, and they are very fast (at least on par with NumPy, see NumpyJuliaPerformance.jl on GitHub).
  3. Row-wise operations (iterating over rows): if made type-stable, they are also very fast. In Pandas (and probably data.table), row-wise operations, potentially with dependencies between rows, cannot be done efficiently (i.e. at C-like speed) without relying on code compilation (like Numba / Cython).
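
As a rough illustration of points 2 and 3, the sketch below does a vectorized column operation and then a row-wise recurrence where each row depends on the previous one. The helper `ewma` (a simple exponential smoothing) and the column names are hypothetical and chosen only for the example; passing the column into a function is what keeps the loop type-stable.

```julia
using DataFrames

df = DataFrame(a = [1.0, 2.0, 3.0, 4.0], b = [0.5, 0.5, 2.0, 1.0])

# Point 2: a plain broadcast over two column vectors.
df.c = df.a .* df.b

# Point 3: a row-wise loop where each output depends on the previous row.
# Hypothetical helper; inside it `x` has a concrete type, so the loop compiles to tight code.
function ewma(x::AbstractVector{<:Real}, alpha::Real)
    out = similar(x, Float64)
    isempty(x) && return out
    out[firstindex(x)] = x[firstindex(x)]
    for i in firstindex(x)+1:lastindex(x)
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    end
    return out
end

df.smooth = ewma(df.c, 0.5)
```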

Note that faster joins have already started to arrive on master: https://github.com/JuliaData/DataFrames.jl/pull/2612

Why are Julia arrays fast but DataFrames not so fast?

I’m not sure what you mean by this, but DataFrames are just collections of arrays. Everything implemented in DataFrames is implemented on Base Julia arrays. All the operations in DataFrames, like

transform(df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)

will be just as fast as doing

df.x .+ df.y .- cor(df.x, df.y)
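
For anyone who wants to try this, here is a small self-contained sketch that runs both forms and checks that they produce the same values. The data is random and made up for the example, and note that `cor` comes from the Statistics standard library.

```julia
using DataFrames, Statistics

# Made-up example data.
df = DataFrame(x = rand(1000), y = rand(1000))

# Through the DataFrames.jl mini-language: returns a new data frame with column :z.
via_transform = transform(df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)

# Directly on the underlying column vectors.
direct = df.x .+ df.y .- cor(df.x, df.y)

same = via_transform.z == direct
```

To compare timings, call each form once to warm up compilation and then wrap it in `@time`, or use `@btime` from BenchmarkTools.jl. Keep in mind that `transform` also builds a new data frame, so it does slightly more than the bare broadcast.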

If you want speed, you need to use TypedTables, where the compiler has knowledge of the type of each column when iterating over rows.

Note that if you have a lot of columns, forcing the type information of each column through the compiler will make things slower, because compilation then becomes the limiting factor.
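
A minimal sketch of what that looks like, assuming the TypedTables.jl package is installed. The function name `rowsum` and the data are made up for the example.

```julia
using TypedTables

# The column names and element types are part of the table's type.
t = Table(x = [1.0, 2.0, 3.0], y = [10.0, 20.0, 30.0])

# Hypothetical example function: sum x + y over all rows.
function rowsum(t)
    s = 0.0
    for row in t          # each row is a NamedTuple with concretely typed fields
        s += row.x + row.y
    end
    return s
end

total = rowsum(t)
```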

This isn’t strictly correct. If you want to have functions like

function f(df)
    s = 0.0
    for i in 1:nrow(df)
        s += df.x[i] + df.y[i]
    end
    s
end

then yes, this would benefit from TypedTables, since inside f the compiler doesn’t know what types df.x and df.y are.

But if you add any function barrier, you will recover performance.

function add(x, y)
    s = 0.0
    for i in 1:length(x)
        s += x[i] + y[i]
    end
    s  # return the accumulated sum (a bare for loop evaluates to nothing)
end

add(df.x, df.y)

DataFramesMeta does this automatically with @with. The following will be just as fast as the above:

@with df begin 
    s = 0.0
    for i in 1:length(:x)
        s += :x[i] + :y[i]
    end 
    s 
end

So yes, TypedTables is faster in one sense, but it’s very easy to work with DataFrames so that you recover all performance losses.
