The state of DataFrames.jl H2O benchmark

bkamins · July 14, 2020, 9:10pm

@jangorecki has released the newest Database-like ops benchmark (by the way - thank you Jan for jour fantastic work!). I have decided to write this post as for the first time with DataFrames.jl 0.21.4 + CSV.jl 0.7.1 we have managed to successfully pass all groupby tests (and it is really selective, for 50GB size; e.g. pandas or dplyr do not pass it). Passing this last hurdle was possible mainly due to work of @quinnj so (of course many others have worked hard for years to make this happen).

What are the key take-aways for groupby tests:

if we ignore compilation time, we are roughly on par with the fastest options if number of groups is not very large;
if there are many groups the results are mixed, but there are cases when we are very bad; @nalimilan has been recently working on improving it (and in particular taking advantage of multi threading) - so I am anxiously waiting to see what he came up with;
when pandas or dplyr work (these I would say are typical packages regular users work with) we are either on par or significantly better

What we learn from join tests:

we are very, very bad here (in terms of both: performance and memory usage) and this is the key think we should focus on improving (we knew it, but it is just confirmed again)

tk3369 · July 14, 2020, 9:43pm

Congrats for jumping over the 50G hurdle! It’s huge!

Just curious, do we know the reasons why joins don’t perform well yet?

tbeason · July 14, 2020, 9:47pm

This is a great display of progress.

bkamins · July 14, 2020, 10:30pm

do we know the reasons why joins don’t perform well yet?

The “low hanging fruit” AFAICT is that if left table is large and right table is small we still match right to left, not the other way around. This should be a relatively easy fix reusing the current code we have.

In general - efficient joins are hard and many cases have to be considered. The code doing joins in DataFrames.jl has not been touched for 3 years, and this is ages in case of Julia as you know. My ideal solution would be to have a general package that would perform row matching for joins getting some type stable input, and in DataFrames.jl we would just delegate this part of work to it and would concentrate only on composing the result.

Yifan_Liu · July 15, 2020, 2:37am

A side question, is JuliaDB still under maintenance? If yes, what will be the relationship between DataFrames and JuliaDB?

jling · July 15, 2020, 5:58am

yes, and maintained by Julia computing; well, JuliaDB is ~ for persistent storage and DataFrames are in-memory data structure.

Fredo_XVII · July 19, 2020, 12:57am

Hi @bkamins,

How does it compare with data.table? Whenever I need to hit 50GBs of data in R, I reach for data.table instead of dplyr because it is so much faster.

xiaodai · July 19, 2020, 2:02am

It’s in the benchmarks. In general, data.table is faster for almost all standard things.

However, for highly customised code, I think Julia has an edge. But for me 90%+ of uses are standard stuff.

Fredo_XVII · July 19, 2020, 2:13am

Just curious, thank you!

robsmith11 · July 19, 2020, 4:07am

Isn’t a big use case of JuliaDB, with IndexTables in particular, also in-memory filtering, grouping, and joins? The API is very similar to DataFrames.jl.

bkamins · July 19, 2020, 6:09am

Just to add - data.table is using multi-threading, while currently DataFrames.jl is single threaded. @nalimilan has recently been investigating adding threading to DataFrames.jl.

Juan · May 6, 2021, 3:08pm

Hello.

Do this benchmarks
https://h2oai.github.io/db-benchmark/

already include multithreading and the following optimizations…?

https://github.com/JuliaData/DataFrames.jl/pull/2736
https://github.com/JuliaData/DataFrames.jl/pull/2733
https://github.com/JuliaData/DataFrames.jl/issues/2660

The first line in the script code…

github.com

h2oai/db-benchmark/blob/master/juliadf/join-juliadf.jl

#!/usr/bin/env julia

print("# join-juliadf.jl\n"); flush(stdout);

using DataFrames;
using CSV;
using Printf;

# Precompile methods for common patterns
DataFrames.precompile(true)

include("$(pwd())/_helpers/helpers.jl");

pkgmeta = getpkgmeta("DataFrames");
ver = pkgmeta["version"];
git = pkgmeta["git-tree-sha1"];
task = "join";
solution = "juliadf";
fun = "join";
cache = true;

This file has been truncated. show original

and

github.com

h2oai/db-benchmark/blob/master/juliadf/groupby-juliadf.jl

#!/usr/bin/env julia

print("# groupby-juliadf.jl\n"); flush(stdout);

using DataFrames;
using CSV;
using Statistics; # mean function
using Printf;

# Precompile methods for common patterns
DataFrames.precompile(true)

include("$(pwd())/_helpers/helpers.jl");

pkgmeta = getpkgmeta("DataFrames");
ver = pkgmeta["version"];
git = pkgmeta["git-tree-sha1"];
task = "groupby";
solution = "juliadf";
fun = "by";

This file has been truncated. show original

instructs Julia to use 20 threads but I’m not sure the benchmark was performed with that option because the results are quite slow compared to data.table and Polars.

bkamins · May 6, 2021, 4:25pm

No they do not.

These benchmarks fail because the server on which H2O benchmarks are run does not support -S option as we learned.

We are waiting for running of the benchmarks with a workaround.

The current benchmarks are not reflecting the state of the DataFrames.jl package (actually they are worse than if we run it on a single thread due to a bug in testing code).

H2O team promised to run correct benchmark this week.

bkamins · May 8, 2021, 12:43pm

H2O benchmarks for DataFrames.jl 1.1.0 are out (so expect that the next time they will be run we will be a bit better as current 1.1.1 release has some performance improvements), here is a link for a reference Database-like ops benchmark.

Overall: we have a good starting point for improvements post 1.0 release, but there is more work to do especially with joins (although at least we are not super bad any more).

Details:

groupby
- 0.5 GB: compilation cost kills comparisons; apart from this we are good
- 5GB: we are already very good; if we excluded compilation cost we would be the fastest solution (and I expect some small improvements when 1.1.1 release is out)
- 50GB: we are one of few that pass these tests; we fail only on one operation + in general we could improve performance in cases where there are very many groups which is a known issue, but maybe there is a tradeoff in the design as we are fast for few groups (@nalimilan - we need to investigate it)
join
- 0.5GB: we are OK (not super good but acceptable)
- 5GB: we are acceptable but still a lot of work to be done here (we clearly do not scale well when moving from 0.5GB do 5GB) - especially by adding more multi-threading support to the operations (@quinnj is also looking into this issue currently - however, what is clear that we have huge variability in timing; second run can be much longer than the first, which clearly shows that we spend way too much time in GC - a thing that we knew would hurt us in the benchmarks and was already recently discussed with @jameson and @oxinabox; hopefully we can find a solution for this)
- 50GB: we run out of memory (as most solutions) - but maybe we could do something about it

jzr · May 12, 2021, 8:44pm

I wonder what happened here, where the second invocation was slower. GC?

Oscar_Smith · May 12, 2021, 8:47pm

GC would be my guess as well.

bkamins · May 12, 2021, 9:35pm

Yes - exactly. So we know that this is at least that much GC, but it can be more (still the gap to polars has other reasons than just GC, but GC is huge).

Se also eg. (note that compilation time is at most 1-2 seconds)

and

jzr · May 12, 2021, 10:31pm

Sorry, I missed that

Juan · May 17, 2021, 10:26am

I’m glad to see that dataframes.jl has just moved to the first places in the H2O benchmarks.
The joins still need improvement.

bkamins · May 17, 2021, 11:01am

The joins are slow mainly due to GC issues, which are outside DataFrames.jl reach. We managed to reduce GC time by 50% (thus in 5GB case it is down from 800 seconds to 400 seconds). Hopefully in 1 month we can have a fix (most likely in CSV.jl - @quinnj is working on it currently) that will put less strain on Julia GC, and I expect to have a timing at around 100 seconds. similar to data.table. If the machine that is used to run the tests had more RAM the comparison would be similar to how it looks for 0.5GB case.

@nilshg recently tested how switching from using String type to ShortStrings.jl significantly boosted his joins in actual production code.

So in summary: we are around data.table speed both in groupby and in join tasks, except that joins put very high pressure on GC in Julia due to the fact how String type is implemented (in e.g. R strings are handled differently).

Topic		Replies	Views
Julia performs poorly on group-by benchmarks Data performance	48	5792	January 23, 2019
How is the data ecosystem right now for large datasets? Data	35	6696	July 13, 2017
[ANN] DataFrameDBs.jl Data package , announcement	60	4047	May 2, 2020
DataFrames.jl data engineering performance compared with other softwares Performance performance	6	946	November 10, 2021
Who does "better" than DataFrames? Performance dataframes	43	2015	April 6, 2023

The state of DataFrames.jl H2O benchmark

Related topics