The state of DataFrames.jl H2O benchmark

bkamins · May 8, 2021, 12:43pm

The H2O benchmarks for DataFrames.jl 1.1.0 are out; for reference, see the Database-like ops benchmark. Expect slightly better results the next time they are run, as the current 1.1.1 release has some performance improvements.

Overall: we have a good starting point for improvements after the 1.0 release, but there is more work to do, especially with joins (although at least we are not super bad any more).

Details:

groupby
- 0.5GB: compilation cost kills the comparisons; apart from this we are good
- 5GB: we are already very good; if we excluded compilation cost we would be the fastest solution (and I expect some small improvements once the 1.1.1 release is benchmarked)
- 50GB: we are one of the few that pass these tests; we fail on only one operation. In general we could improve performance in cases with very many groups, which is a known issue, but there may be a tradeoff in the design, as we are fast for few groups (@nalimilan - we need to investigate it)
join
- 0.5GB: we are OK (not super good but acceptable)
- 5GB: we are acceptable, but there is still a lot of work to be done here (we clearly do not scale well when moving from 0.5GB to 5GB), especially by adding more multi-threading support to the operations (@quinnj is also looking into this issue currently). However, what is clear is that we have huge variability in timing: the second run can take much longer than the first, which clearly shows that we spend way too much time in GC. We knew this would hurt us in the benchmarks, and it was recently discussed with @jameson and @oxinabox; hopefully we can find a solution for this
- 50GB: we run out of memory (as do most solutions), but maybe we could do something about it
