The state of DataFrames.jl H2O benchmark

The H2O benchmarks for DataFrames.jl 1.1.0 are out; see the Database-like ops benchmark for reference. The current 1.1.1 release includes some performance improvements, so expect slightly better results the next time the benchmarks are run.

Overall: we have a good starting point for improvements after the 1.0 release, but there is more work to do, especially with joins (although at least we are no longer really bad there).

Details:

  • groupby
    • 0.5GB: compilation cost dominates the timings and hurts us in comparisons; apart from this we are good
    • 5GB: we are already very good; if compilation cost were excluded we would be the fastest solution (and I expect some small improvements once the benchmarks are run on the 1.1.1 release)
    • 50GB: we are one of the few solutions that pass these tests; we fail on only one operation. In general we could improve performance when there are very many groups, which is a known issue, but there may be a design tradeoff here, as we are fast when there are few groups (@nalimilan - we need to investigate it)
  • join
    • 0.5GB: we are OK (not super good but acceptable)
    • 5GB: we are acceptable, but there is still a lot of work to be done here, as we clearly do not scale well when moving from 0.5GB to 5GB. The main thing to do is to add more multi-threading support to the operations (@quinnj is also looking into this issue currently). What is also clear is that we have huge variability in timing: the second run can be much longer than the first, which clearly shows that we spend way too much time in GC. We knew this would hurt us in the benchmarks, and it was recently discussed with @jameson and @oxinabox; hopefully we can find a solution for this
    • 50GB: we run out of memory (as most solutions) - but maybe we could do something about it
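
To make the two effects mentioned above (compilation cost in the first run, and GC time causing run-to-run variability) easier to check locally, here is a minimal sketch using only Base Julia. `time_runs` and `workload` are hypothetical names introduced for illustration; they are not part of DataFrames.jl or of the H2O benchmark scripts, and the sketch assumes Julia 1.5 or later, where `@timed` returns a named tuple with `time` and `gctime` fields.

```julia
# Hypothetical helper (not part of DataFrames.jl or the H2O benchmark scripts):
# call `f` several times and record wall time and GC time for each run.
function time_runs(f; runs::Integer = 2)
    return map(1:runs) do i
        # `@timed` returns a named tuple; `time` and `gctime` are in seconds.
        stats = @timed f()
        (run = i,
         seconds = stats.time,
         gc_seconds = stats.gctime,
         gc_fraction = stats.time > 0 ? stats.gctime / stats.time : 0.0)
    end
end

# Stand-in workload that allocates many small arrays;
# replace it with the operation you actually care about.
workload() = sum(length, [rand(1_000) for _ in 1:10_000])

timings = time_runs(workload; runs = 3)
for t in timings
    println("run ", t.run, ": ", round(t.seconds; digits = 4), " s, GC share ",
            round(100 * t.gc_fraction; digits = 1), "%")
end
```

With DataFrames.jl loaded, the same helper could be applied to a join, for example `time_runs(() -> innerjoin(left, right; on = :id))`, where `left` and `right` are data frames you create yourself. Depending on how Julia compiles the call, the first run may include compilation time, so a large gap between runs 1 and 2 points to compilation, while a large and unstable GC share in later runs points to the GC problem described above.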