The state of DataFrames.jl H2O benchmark

@jangorecki has released the newest Database-like ops benchmark (by the way - thank you Jan for jour fantastic work!). I have decided to write this post as for the first time with DataFrames.jl 0.21.4 + CSV.jl 0.7.1 we have managed to successfully pass all groupby tests (and it is really selective, for 50GB size; e.g. pandas or dplyr do not pass it). Passing this last hurdle was possible mainly due to work of @quinnj so :clap: (of course many others have worked hard for years to make this happen).

What are the key take-aways for groupby tests:

  • if we ignore compilation time, we are roughly on par with the fastest options if number of groups is not very large;
  • if there are many groups the results are mixed, but there are cases when we are very bad; @nalimilan has been recently working on improving it (and in particular taking advantage of multi threading) - so I am anxiously waiting to see what he came up with;
  • when pandas or dplyr work (these I would say are typical packages regular users work with) we are either on par or significantly better

What we learn from join tests:

  • we are very, very bad here (in terms of both: performance and memory usage) and this is the key think we should focus on improving (we knew it, but it is just confirmed again)
53 Likes

Congrats for jumping over the 50G hurdle! It’s huge!

Just curious, do we know the reasons why joins don’t perform well yet?

4 Likes

This is a great display of progress.

do we know the reasons why joins don’t perform well yet?

The “low hanging fruit” AFAICT is that if left table is large and right table is small we still match right to left, not the other way around. This should be a relatively easy fix reusing the current code we have.

In general - efficient joins are hard and many cases have to be considered. The code doing joins in DataFrames.jl has not been touched for 3 years, and this is ages in case of Julia as you know. My ideal solution would be to have a general package that would perform row matching for joins getting some type stable input, and in DataFrames.jl we would just delegate this part of work to it and would concentrate only on composing the result.

9 Likes

A side question, is JuliaDB still under maintenance? If yes, what will be the relationship between DataFrames and JuliaDB?

1 Like

yes, and maintained by Julia computing; well, JuliaDB is ~ for persistent storage and DataFrames are in-memory data structure.

1 Like

Hi @bkamins,

How does it compare with data.table? Whenever I need to hit 50GBs of data in R, I reach for data.table instead of dplyr because it is so much faster.

1 Like

It’s in the benchmarks. In general, data.table is faster for almost all standard things.

However, for highly customised code, I think Julia has an edge. But for me 90%+ of uses are standard stuff.

Just curious, thank you!

Isn’t a big use case of JuliaDB, with IndexTables in particular, also in-memory filtering, grouping, and joins? The API is very similar to DataFrames.jl.

Just to add - data.table is using multi-threading, while currently DataFrames.jl is single threaded. @nalimilan has recently been investigating adding threading to DataFrames.jl.

7 Likes

Hello.

Do this benchmarks
https://h2oai.github.io/db-benchmark/

already include multithreading and the following optimizations…?

https://github.com/JuliaData/DataFrames.jl/pull/2736
https://github.com/JuliaData/DataFrames.jl/pull/2733
https://github.com/JuliaData/DataFrames.jl/issues/2660

The first line in the script code…

and

instructs Julia to use 20 threads but I’m not sure the benchmark was performed with that option because the results are quite slow compared to data.table and Polars.

No they do not.

These benchmarks fail because the server on which H2O benchmarks are run does not support -S option as we learned.

We are waiting for running of the benchmarks with a workaround.

The current benchmarks are not reflecting the state of the DataFrames.jl package (actually they are worse than if we run it on a single thread due to a bug in testing code).

H2O team promised to run correct benchmark this week.

6 Likes

H2O benchmarks for DataFrames.jl 1.1.0 are out (so expect that the next time they will be run we will be a bit better as current 1.1.1 release has some performance improvements), here is a link for a reference Database-like ops benchmark.

Overall: we have a good starting point for improvements post 1.0 release, but there is more work to do especially with joins (although at least we are not super bad any more).

Details:

  • groupby
    • 0.5 GB: compilation cost kills comparisons; apart from this we are good
    • 5GB: we are already very good; if we excluded compilation cost we would be the fastest solution (and I expect some small improvements when 1.1.1 release is out)
    • 50GB: we are one of few that pass these tests; we fail only on one operation + in general we could improve performance in cases where there are very many groups which is a known issue, but maybe there is a tradeoff in the design as we are fast for few groups (@nalimilan - we need to investigate it)
  • join
    • 0.5GB: we are OK (not super good but acceptable)
    • 5GB: we are acceptable but still a lot of work to be done here (we clearly do not scale well when moving from 0.5GB do 5GB) - especially by adding more multi-threading support to the operations (@quinnj is also looking into this issue currently - however, what is clear that we have huge variability in timing; second run can be much longer than the first, which clearly shows that we spend way too much time in GC - a thing that we knew would hurt us in the benchmarks and was already recently discussed with @jameson and @oxinabox; hopefully we can find a solution for this)
    • 50GB: we run out of memory (as most solutions) - but maybe we could do something about it
35 Likes

I wonder what happened here, where the second invocation was slower. GC?

GC would be my guess as well.

Yes - exactly. So we know that this is at least that much GC, but it can be more (still the gap to polars has other reasons than just GC, but GC is huge).

Se also eg. (note that compilation time is at most 1-2 seconds)


and

2 Likes

Sorry, I missed that :slight_smile:

1 Like

I’m glad to see that dataframes.jl has just moved to the first places in the H2O benchmarks.
The joins still need improvement.

The joins are slow mainly due to GC issues, which are outside DataFrames.jl reach. We managed to reduce GC time by 50% (thus in 5GB case it is down from 800 seconds to 400 seconds). Hopefully in 1 month we can have a fix (most likely in CSV.jl - @quinnj is working on it currently) that will put less strain on Julia GC, and I expect to have a timing at around 100 seconds. similar to data.table. If the machine that is used to run the tests had more RAM the comparison would be similar to how it looks for 0.5GB case.

@nilshg recently tested how switching from using String type to ShortStrings.jl significantly boosted his joins in actual production code.

So in summary: we are around data.table speed both in groupby and in join tasks, except that joins put very high pressure on GC in Julia due to the fact how String type is implemented (in e.g. R strings are handled differently).

12 Likes