The state of DataFrames.jl H2O benchmark

@jangorecki has released the newest Database-like ops benchmark (by the way - thank you Jan for your fantastic work!). I have decided to write this post because, for the first time, with DataFrames.jl 0.21.4 + CSV.jl 0.7.1 we have managed to pass all groupby tests (and the 50GB size is really selective; e.g. pandas or dplyr do not pass it). Passing this last hurdle was possible mainly due to the work of @quinnj so :clap: (of course many others have worked hard for years to make this happen).

What are the key take-aways for groupby tests:

  • if we ignore compilation time, we are roughly on par with the fastest options if the number of groups is not very large;
  • if there are many groups the results are mixed, but there are cases when we are very bad; @nalimilan has recently been working on improving it (and in particular taking advantage of multithreading) - so I am anxiously waiting to see what he comes up with;
  • when pandas or dplyr work (these I would say are the typical packages regular users work with) we are either on par or significantly better.

What we learn from join tests:

  • we are very, very bad here (in terms of both performance and memory usage) and this is the key thing we should focus on improving (we knew it, but it is just confirmed again).

Congrats for jumping over the 50G hurdle! It’s huge!

Just curious, do we know the reasons why joins don’t perform well yet?


This is a great display of progress.

do we know the reasons why joins don’t perform well yet?

The “low hanging fruit” AFAICT is that if the left table is large and the right table is small we still match right to left, not the other way around. This should be a relatively easy fix reusing the code we currently have.
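
To illustrate the point about matching direction, here is a minimal sketch in plain Python of an inner hash join that always builds its lookup table on the smaller input and probes it with the larger one. This is not DataFrames.jl code and not a description of how its join is implemented; the function name `inner_join_indices` and the "list of keys" representation are assumptions made purely for illustration.

```python
from collections import defaultdict

def inner_join_indices(left_keys, right_keys):
    """Return sorted (left_row, right_row) pairs whose keys are equal."""
    # Build the hash index on the smaller side, probe with the larger one.
    swap = len(left_keys) < len(right_keys)
    build, probe = (left_keys, right_keys) if swap else (right_keys, left_keys)

    index = defaultdict(list)
    for i, key in enumerate(build):
        index[key].append(i)

    pairs = []
    for j, key in enumerate(probe):
        for i in index.get(key, ()):
            # Restore (left, right) orientation regardless of which side was indexed.
            pairs.append((i, j) if swap else (j, i))
    return sorted(pairs)
```

The result is the same whichever side is indexed; only the size of the hash table changes, which is why picking the smaller side is a cheap win for memory use.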

In general - efficient joins are hard and many cases have to be considered. The code doing joins in DataFrames.jl has not been touched for 3 years, which is ages in Julia terms, as you know. My ideal solution would be to have a general package that performs row matching for joins given some type-stable input; DataFrames.jl would then delegate this part of the work to it and concentrate only on composing the result.
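
The split described above can be sketched as two separate steps. The following plain Python example is only an illustration of the idea, not an existing package or the DataFrames.jl design: `match_rows` and `compose_inner_join` are hypothetical names, tables are assumed to be dictionaries of equal-length column lists, and type stability (which matters in Julia) is ignored here.

```python
def match_rows(left_keys, right_keys):
    """Row-matching step: knows nothing about tables, only about keys."""
    positions = {}
    for j, key in enumerate(right_keys):
        positions.setdefault(key, []).append(j)
    left_idx, right_idx = [], []
    for i, key in enumerate(left_keys):
        for j in positions.get(key, ()):
            left_idx.append(i)
            right_idx.append(j)
    return left_idx, right_idx

def compose_inner_join(left, right, on):
    """Composition step: gathers columns using the matched row indices."""
    left_idx, right_idx = match_rows(left[on], right[on])
    result = {name: [col[i] for i in left_idx] for name, col in left.items()}
    for name, col in right.items():
        if name != on:
            # Assumes non-key column names do not clash between the two tables.
            result[name] = [col[j] for j in right_idx]
    return result

left = {"id": [1, 2, 3], "x": ["a", "b", "c"]}
right = {"id": [3, 1, 1], "y": [30.0, 10.0, 11.0]}
joined = compose_inner_join(left, right, "id")
```

With this separation, the matching step could be optimized or replaced on its own, while the composing step only ever sees two index vectors.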


A side question, is JuliaDB still under maintenance? If yes, what will be the relationship between DataFrames and JuliaDB?

Yes, and it is maintained by Julia Computing. Roughly speaking, JuliaDB is for persistent storage, while DataFrames are an in-memory data structure.

Hi @bkamins,

How does it compare with data.table? Whenever I need to hit 50GBs of data in R, I reach for data.table instead of dplyr because it is so much faster.

It’s in the benchmarks. In general, data.table is faster for almost all standard things.

However, for highly customised code, I think Julia has an edge. But for me 90%+ of uses are standard stuff.

Just curious, thank you!

Isn’t a big use case of JuliaDB, with IndexTables in particular, also in-memory filtering, grouping, and joins? The API is very similar to DataFrames.jl.

Just to add - data.table is using multi-threading, while currently DataFrames.jl is single threaded. @nalimilan has recently been investigating adding threading to DataFrames.jl.