DataFrames.jl data engineering performance compared with other software

Julia fares pretty well in these cross-software comparisons.

As you can see from the dates on which the benchmarks were run, they are quite outdated. Unfortunately, this benchmark is no longer maintained by its owners.


Does someone have a server that can run them again?

The dates in the screenshot I shared are from ~5 months ago. Is that outdated already?

DataFrames.jl has certainly seen a lot of updates in those last few months.


Taking the top 5 from the screenshot (polars, data.table, DataFrames.jl, ClickHouse, cuDF): all have had releases since the benchmark was run. Since performance improvements are most likely shipped as patch or minor releases (they do not change the API), the top of the table might have changed. In particular, Polars is now at 0.10.18, so I would expect significant improvements there, as it seems to have received a lot of development effort.

One of my students is now working on re-running the H2O benchmarks independently, but it is not that easy (as usual, when you have multiple technologies things get complicated very quickly). In the long term I think I will try to develop (probably in 2022), together with JuliaLab@MIT, a benchmark that is easier to set up and run than the H2O benchmarks.


Also note that, in general, while we are still working on performance improvements, we are at the stage where in most cases the top frameworks (polars, data.table, and DataFrames.jl) are fast enough that you will not notice any latency. E.g. the tests above are for 1,000,000 rows, which is almost the maximum one would plan to handle in memory. The timings at the top are the total times of running all experiments. If you look below at e.g. the groupby operations, all frameworks have roughly sub-second processing times. In particular, in Julia we are at the point where compilation time starts to affect the benchmarks noticeably.
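To make the compilation-time point concrete, here is a minimal sketch using only Base. The helper `rowsum` and the matrix size are my own choices for illustration, not part of the H2O benchmark. The first call to a function includes JIT compilation for the given argument types, while the second call runs already compiled code, so for sub-second operations the first number can be dominated by compilation.

```julia
# `rowsum` is a hypothetical helper used only for this illustration.
rowsum(m) = vec(sum(m, dims=2))

m = rand(1_000, 100)

# First call: includes compiling `rowsum` for Matrix{Float64}.
t_first = @elapsed rowsum(m)

# Second call: runs the already compiled method.
t_second = @elapsed rowsum(m)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The exact numbers depend on the machine and the Julia version, so I do not quote any here. This is also why `@btime` from BenchmarkTools.jl, which runs the expression many times and reports the minimum, is used in micro-benchmarks.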

What you can expect to improve in DataFrames.jl are operations that were not fast in the past. E.g. here is a benchmark of an operation that will be fast in DataFrames.jl 1.3 (which will be released very soon; we are waiting for the Julia 1.7 release to happen first to make sure we are in sync) and is not fast currently. Consider row-wise summation of data:

julia> using BenchmarkTools, DataFrames

julia> mat = rand(10_000, 10_000);

julia> df = DataFrame(mat, :auto);

julia> @btime sum($mat, dims=2); # this is the best you can reasonably expect
  47.351 ms (6 allocations: 78.28 KiB)

julia> @btime combine($df, AsTable(:) => ByRow(sum)); # this is what we have now
  51.300 ms (19652 allocations: 1.14 MiB)

julia> @btime sum(eachcol($mat)); # this is a rough equivalent in Base, as df stores each column as a separate object
  278.262 ms (19999 allocations: 763.63 MiB)

Still, as you can see, everything is just fast, even though the data I process takes ~1GB.
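As a side note on the 763.63 MiB reported for `sum(eachcol(mat))` in the transcript above: my reading is that every `+` between two column vectors allocates a fresh 10,000-element result (about 78 KiB), and roughly 10,000 such additions add up to about 763 MiB. Below is a minimal sketch, using only Base and a smaller matrix, that checks the three ways of computing row sums agree and shows an in-place accumulation that avoids the intermediate vectors. `rowsum_inplace` is a hypothetical helper written for this post, not a DataFrames.jl function and not necessarily how DataFrames.jl implements it.

```julia
# Smaller than the 10_000 x 10_000 case above, so it runs quickly.
mat = rand(1_000, 200)

# Reference: row-wise sum computed on the matrix itself.
by_matrix = vec(sum(mat, dims=2))

# Column-by-column sum: each `+` between vectors allocates a new result vector.
by_cols = sum(eachcol(mat))

# Hypothetical alternative: accumulate into a single preallocated vector.
function rowsum_inplace(cols, n)
    acc = zeros(n)
    for c in cols
        acc .+= c   # broadcasted in-place update, no new vector per column
    end
    return acc
end

by_inplace = rowsum_inplace(eachcol(mat), size(mat, 1))

# All three approaches should agree up to floating-point rounding.
@assert by_matrix ≈ by_cols ≈ by_inplace
```

The results are compared with `≈` rather than `==` because the additions happen in a different order in each variant, which can change the last few bits of the floating-point result.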