DataFrames.jl data engineering performance compared with other software

laut · November 9, 2021, 10:06pm

Julia fares pretty well in these cross-software comparisons:
https://h2oai.github.io/db-benchmark/

bkamins · November 9, 2021, 10:11pm

As you can see from the dates on which the benchmarks were run, they are quite outdated. Unfortunately, this benchmark is no longer maintained by its owners.

laut · November 9, 2021, 10:18pm

Does anyone have a server that could run them again?
https://github.com/h2oai/db-benchmark

laut · November 9, 2021, 10:24pm

The dates in the screenshot I shared are from ~5 months ago. Is that outdated already?

viralbshah · November 10, 2021, 3:09am

DataFrames.jl has certainly seen a lot of updates in those last few months.

bkamins · November 10, 2021, 6:45am

Taking the top 5 from the screenshot (polars, data.table, DataFrames.jl, ClickHouse, cuDF): all of them have had releases since the benchmark was run. Since performance improvements are most likely shipped as patch or minor releases (they do not change the API), the top of the table might have changed. In particular, Polars is now at 0.10.18, so I would expect significant improvements there, as a lot of development effort seems to have gone into it.

One of my students is now working on re-running the H2O benchmarks independently, but it is not that easy (as usual, when you have multiple technologies things get complicated very quickly). In the long term, I think I will try to develop, at some point (probably in 2022) and together with JuliaLab@MIT, a benchmark that is easier to set up and run than the H2O benchmarks.

bkamins · November 10, 2021, 7:26am

Also note that, in general, while we are still working on performance improvements, we are at the stage where in most cases the top frameworks (polars, data.table, and DataFrames.jl) are fast enough not to cause noticeable latency. E.g. the tests above are for 1,000,000 rows, which is almost the maximum one would plan to handle in memory. The timings at the top are total times for running all experiments. If you look further down, e.g. at the groupby operations, all frameworks have roughly sub-second processing times. In Julia in particular, this is the point where compilation time starts to affect the benchmarks noticeably.
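To illustrate the compilation-time remark, here is a minimal sketch that times the same grouped aggregation twice in a fresh session. The column names `id` and `v`, the group count, and the row count are made up for illustration and are not taken from the H2O benchmark; the first call includes compilation, the second mostly does not.

```julia
using DataFrames

# Hypothetical data: 1,000,000 rows, 100 groups (not the H2O data set).
df = DataFrame(id = rand(1:100, 1_000_000), v = rand(1_000_000))

# The first call pays for compiling the groupby/combine code path.
first_call = @elapsed combine(groupby(df, :id), :v => sum => :v_sum)

# The second call reuses the compiled code, so it reflects processing time.
second_call = @elapsed res = combine(groupby(df, :id), :v => sum => :v_sum)

println("first call (includes compilation): ", first_call, " s")
println("second call: ", second_call, " s")
```

How large the gap is depends on the Julia and DataFrames.jl versions and on the machine, so no numbers are quoted here.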

What you can expect to improve in DataFrames.jl are operations that were not fast in the past. E.g. here is a benchmark of an operation that is not fast currently but will be fast in DataFrames.jl 1.3 (which will be released very soon; we are waiting for the Julia 1.7 release to happen first to make sure we are in sync). Consider row-wise summation of data:

julia> using BenchmarkTools, DataFrames

julia> mat = rand(10_000, 10_000);

julia> df = DataFrame(mat, :auto);

julia> @btime sum($mat, dims=2); # this is the best you can reasonably expect
  47.351 ms (6 allocations: 78.28 KiB)

julia> @btime combine($df, AsTable(:) => ByRow(sum)); # this is what we have now
  51.300 ms (19652 allocations: 1.14 MiB)

julia> @btime sum(eachcol($mat)); # this is a rough equivalent in Base, as df stores each column as a separate object
  278.262 ms (19999 allocations: 763.63 MiB)

Still, as you can see, everything is just fast, and the data I process is ~1 GB in size.
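As a sanity check that the three timed expressions compute the same row sums, here is a small sketch on a much smaller matrix. The sizes and the output column name `rowsum` are chosen only for illustration; the benchmark above does not name the result column.

```julia
using DataFrames

# Small sizes so the check runs quickly; the benchmark above uses 10_000 x 10_000.
mat = rand(1_000, 50)
df = DataFrame(mat, :auto)

# Row sums on the matrix; sum(...; dims=2) returns a 1_000 x 1 matrix, so flatten it.
by_matrix = vec(sum(mat, dims=2))

# Row sums on the data frame, written to a column named :rowsum (name is arbitrary).
by_df = combine(df, AsTable(:) => ByRow(sum) => :rowsum).rowsum

# Adding the columns one by one, which allocates a new vector per addition.
by_cols = sum(eachcol(mat))

# All three agree up to floating-point rounding.
@assert by_matrix ≈ by_df ≈ by_cols
```

The comparison uses `≈` rather than `==` because the three approaches may add the numbers in different orders.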
