How much performance potential does DataFrames have?

Yifan_Liu · February 17, 2021, 5:28am

This benchmark shows that DataFrames is much slower and more verbose than data.table:
https://h2oai.github.io/db-benchmark/

I was wondering if DataFrames could reach data.table’s performance in the future.

tbeason · February 17, 2021, 5:43am

From what I understand, a ton (of potential). First of all, it isn’t always that far behind in those benchmarks. Second of all, last I remember the DataFrames.jl squad were thinking about how to best optimize grouping operations, including when it might be appropriate to parallelize automatically. For example, if you are just taking the sum or mean of a column within a group, as most of those benchmarks do, that can obviously just be done in parallel.
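To make the "sum within a group can be done in parallel" point concrete, here is a minimal sketch using only Base Julia threads. It is not how DataFrames.jl implements grouping; `grouped_sum_parallel` and its arguments are hypothetical names, and group ids are assumed to be integers in `1:ngroups`.

```julia
using Base.Threads: @threads, nthreads

# Sum `vals` within each group id in `groups` (ids assumed to lie in 1:ngroups).
# Every chunk accumulates into its own buffer, so no locking is needed.
function grouped_sum_parallel(groups::Vector{Int}, vals::Vector{Float64}, ngroups::Int)
    n = length(vals)
    nchunks = max(1, min(nthreads(), n))
    partials = [zeros(ngroups) for _ in 1:nchunks]
    @threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1   # first row of this chunk
        hi = div(c * n, nchunks)             # last row of this chunk
        acc = partials[c]
        for i in lo:hi
            acc[groups[i]] += vals[i]
        end
    end
    return reduce(+, partials)               # combine the per-chunk sums
end

groups = [1, 2, 1, 3, 2, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grouped_sum_parallel(groups, vals, 3)  # per-group sums: 10.0, 7.0, 4.0
```

Start Julia with several threads (for example `julia -t 4`) to see the chunks actually run in parallel; with one thread the function still returns the same result.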

Given what happened with CSV.jl (it was no good on benchmarks → now it is the fastest CSV reader out there), I completely think it is only a matter of time before DataFrames.jl steps out ahead. But the team isn’t solely focused on performance. Usability is important too, so lots of thought and planning goes into new features rather than blindly implementing things that might speed it up.

lungben · February 17, 2021, 8:50am

This benchmark site is quite good imho, but it only covers two data frame functionalities, group by and join.
For the first, DataFrames.jl is already quite good (usually top 3), but data.table (which has received many more resources for optimization) is still a bit ahead. Join performance is something the DataFrames.jl developers are actively working on.
Other important data frame functionalities not tested by this benchmark are:

File IO - here CSV.jl is top of class, and Arrow.jl should also be quite competitive.
Vector operations (e.g. df.c = df.a .* df.b) - these are pure Julia array operations in DataFrames.jl, and they are very fast (at least on par with NumPy, see the NumpyJuliaPerformance.jl repository on GitHub).
Row-wise operations (iterating over rows) - if made type-stable, they are also very fast. In Pandas (and probably data.table), row-wise operations, potentially with dependencies between rows, cannot be done efficiently (i.e. at C-like speed), at least not without relying on code compilation (like Numba / Cython).
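As an illustration of the last two points, here is a small sketch on plain vectors, which is what DataFrame columns are. The `ewma` function is a hypothetical example of a row-wise computation where each row depends on the previous one, so it cannot be written as a single broadcast.

```julia
# Exponentially weighted moving average: each output row depends on the
# previous output row, so a plain loop is the natural way to write it.
function ewma(x::Vector{Float64}, alpha::Float64)
    out = Vector{Float64}(undef, length(x))
    isempty(x) && return out
    out[1] = x[1]
    for i in 2:length(x)
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    end
    return out
end

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = a .* b               # vectorized column operation, like df.c = df.a .* df.b
smoothed = ewma(c, 0.5)  # type-stable loop with a dependency between rows
```

Because `ewma` receives a concretely typed vector, the loop is compiled to specialized machine code without any extra tooling.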

nilshg · February 17, 2021, 9:23am

Note that faster joins have already started to arrive on master: https://github.com/JuliaData/DataFrames.jl/pull/2612

Yifan_Liu · February 18, 2021, 11:39pm

Why are Julia arrays fast but DataFrames not so fast?

pdeffebach · February 18, 2021, 11:41pm

I’m not sure what you mean by this, but DataFrames are just collections of arrays. Everything implemented in DataFrames is implemented on Base Julia arrays. Operations in DataFrames, like

transform(df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)

will be just as fast as doing

df.x .+ df.y .- cor(df.x, df.y)
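A minimal runnable version of that comparison, assuming the DataFrames.jl package is installed (the column values are made up for illustration):

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0], y = [2.0, 1.0, 4.0, 3.0])

# Column-level transformation through the DataFrames.jl source => function => target form.
via_transform = transform(df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)

# The same computation written directly on the underlying vectors.
direct = df.x .+ df.y .- cor(df.x, df.y)

via_transform.z == direct  # both routes run the same array code on the same columns
```

In both cases the anonymous function or the broadcast expression receives ordinary vectors, so the heavy lifting is done by the same compiled array code.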

jling · February 18, 2021, 11:42pm

If you want speed, you need to use TypedTables where the compiler has knowledge of the type of each column when iterating over rows.

Note that if you have a lot of columns, forcing the type information of every column through the compiler will make things slower, because by then the compiler is the limiting factor.
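A rough sketch of the idea, using a Base `NamedTuple` of vectors as a stand-in rather than the actual TypedTables API (`typed_cols` and `rowsum` are hypothetical names):

```julia
# Columns held in a NamedTuple: the element type of every column is part of
# the container's type, so the compiler can specialize the loop below.
typed_cols = (x = [1.0, 2.0, 3.0], y = [10, 20, 30])

function rowsum(cols)
    s = 0.0
    for i in eachindex(cols.x, cols.y)
        s += cols.x[i] + cols.y[i]
    end
    return s
end

rowsum(typed_cols)

# The column names and types appear in the type itself. With many columns
# this type grows, which is where the extra compilation cost comes from.
typeof(typed_cols)
```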

pdeffebach · February 18, 2021, 11:48pm

This isn’t strictly correct. If you want to have functions like

function f(df)
    s = 0.0
    for i in 1:nrow(df)
        s += df.x[i] + df.y[i]
    end
    s
end

then yes, this would benefit from TypedTables, since the compiler doesn’t know what types df.x and df.y are.

But if you add any function barrier, you will recover performance.

function add(x, y)
    s = 0.0
    for i in 1:length(x)
        s += x[i] + y[i]
    end
    s  # return the accumulated sum
end

add(df.x, df.y)
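The effect can be reproduced without any packages. In this sketch, `UntypedTable` is a hypothetical stand-in for a DataFrame (columns stored behind the abstract type `AbstractVector`), and the timings are only indicative:

```julia
# Hypothetical stand-in for a DataFrame: column element types are hidden
# from the compiler behind the abstract type AbstractVector.
struct UntypedTable
    cols::Dict{Symbol, AbstractVector}
end

# Type-unstable: every t.cols[:x][i] has to be dispatched at run time.
function sum_unstable(t::UntypedTable)
    s = 0.0
    for i in 1:length(t.cols[:x])
        s += t.cols[:x][i] + t.cols[:y][i]
    end
    return s
end

# Kernel behind the function barrier: compiled for the concrete vector types.
function sum_kernel(x, y)
    s = 0.0
    for i in eachindex(x, y)
        s += x[i] + y[i]
    end
    return s
end

# One dynamic dispatch at the call site, then a fully specialized loop.
sum_barrier(t::UntypedTable) = sum_kernel(t.cols[:x], t.cols[:y])

t = UntypedTable(Dict{Symbol, AbstractVector}(:x => rand(10^6), :y => rand(10^6)))
sum_unstable(t); sum_barrier(t)   # warm up so compilation is not timed
t_unstable = @elapsed sum_unstable(t)
t_barrier = @elapsed sum_barrier(t)
println("unstable: ", t_unstable, " s, barrier: ", t_barrier, " s")
```

Both functions compute the same sum; the only difference is whether the inner loop sees concrete column types.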

DataFramesMeta does this automatically with @with. The following will be just as fast as the function-barrier version above:

@with df begin 
    s = 0.0
    for i in 1:length(:x)
        s += :x[i] + :y[i]
    end 
    s 
end

So yes, TypedTables is faster in one sense, but it’s very easy to work with DataFrames in a way that recovers all of the performance loss.
