@jangorecki has released the newest Database-like ops benchmark (by the way - thank you Jan for jour fantastic work!). I have decided to write this post as for the first time with DataFrames.jl 0.21.4 + CSV.jl 0.7.1 we have managed to successfully pass all groupby tests (and it is really selective, for 50GB size; e.g. pandas or dplyr do not pass it). Passing this last hurdle was possible mainly due to work of @quinnj so (of course many others have worked hard for years to make this happen).
What are the key take-aways for groupby tests:
- if we ignore compilation time, we are roughly on par with the fastest options if number of groups is not very large;
- if there are many groups the results are mixed, but there are cases when we are very bad; @nalimilan has been recently working on improving it (and in particular taking advantage of multi threading) - so I am anxiously waiting to see what he came up with;
- when pandas or dplyr work (these I would say are typical packages regular users work with) we are either on par or significantly better
What we learn from join tests:
- we are very, very bad here (in terms of both: performance and memory usage) and this is the key think we should focus on improving (we knew it, but it is just confirmed again)