The state of DataFrames.jl H2O benchmark

nilshg · May 17, 2021, 11:53am

Just to elaborate a little bit on this: one of my current projects involves wrangling a number of large-ish in-memory tables, with sizes of between 20 and 60 millions rows, and between 5 and 25 columns, most of which are string columns.

The analysis requires quite a lot of join operations as well as groupbys on subsets of those tables, and when I originally wrote the code, it took over an hour to run from top to bottom, as well as requiring writing out intermediate results multiple times in order to restart the Julia process and free up memory.

@oxinabox then told me to use ShortString types for my string data, and convert them into PooledArrays. The results were nothing short of magical, processing times on almost all operations went down between 50 and 90%, and overall I can run the analysis now in about 15 minutes, without having to do any restarts.

As Bogumil mentions there is ongoing work as well as discussions around how to improve GC issues in the data ecosystems as well as potentially more widely, but in the meantime I’d highly recommend trying out ShortStrings and PooledArrays for string-heavy DataFrame workflows!

Storopoli · May 17, 2021, 2:54pm

@nilshg could you share something generic using PooledArrays and ShortString so that whenever I have something as harrier as your analysis I can pull this magic out?

tbeason · May 17, 2021, 3:03pm

I had never heard of this package and would also be interested in learning more.

nilshg · May 17, 2021, 3:44pm

It turns out it’s actually quite hard to produce the same magnitude of speed gains I’m seeing in my application, but here’s an example that roughly shows what I mean:

using DataFrames, Random

string1 = [randstring(4) for _ ∈ 1:5e4]
string2 = [randstring(6)*string(rand(1:10, 3)) for _ ∈ 1:16e6]

df = DataFrame(
    col1 = [rand(string1) for _ ∈ 1:20e6],
    col2 = shuffle!([string2; [missing for _ ∈ 1:4e6]]),
    col3 = shuffle!([string2; [missing for _ ∈ 1:4e6]]), 
    col4 = shuffle!([string2; [missing for _ ∈ 1:4e6]]), 
    col5 = shuffle!([string2; [missing for _ ∈ 1:4e6]]))

to_join = DataFrame(col1 = rand(string1, 500_000),
                    colx = rand(string2, 500_000),
                    val = rand(500_000))

@time out = rename(leftjoin(df, to_join, on = [:col1, :col2 => :colx], matchmissing = :equal), 
                    :val => :val1)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col3 => :colx], matchmissing = :equal), 
                    :val => :val2)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col4 => :colx], matchmissing = :equal), 
                    :val => :val3)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col5 => :colx], matchmissing = :equal), 
                    :val => :val4)

That gives me

 21.597744 seconds (3.03 M allocations: 2.389 GiB, 22.05% gc time, 20.13% compilation time)
 22.779038 seconds (241.43 k allocations: 2.566 GiB, 42.94% gc time, 1.80% compilation time)
 21.833655 seconds (532 allocations: 2.888 GiB, 40.68% gc time)
 53.875572 seconds (558 allocations: 3.223 GiB, 78.24% gc time)

Note that the results are highly variable, depending on when GC gets triggered. Also this is much lower in terms of GC than what I saw in my application, where GC was consistently above 90%.

Now change the example as follows, add:

using ShortStrings, PooledArrays

construct the string vectors as

string1 = ShortString7.([randstring(4) for _ ∈ 1:5e4])
string2 = ShortString15.([randstring(6)*string(rand(1:9, 3)) for _ ∈ 1:16e6])

and then in the DataFrames use PooledArrays like this:

col2 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4e6]]))

that gives me:

  5.884730 seconds (4.46 M allocations: 1.880 GiB, 16.65% gc time, 53.45% compilation time)
  3.354185 seconds (241.75 k allocations: 1.972 GiB, 25.28% gc time)
  3.153659 seconds (581 allocations: 2.293 GiB, 25.81% gc time)
  3.591859 seconds (611 allocations: 2.629 GiB, 28.89% gc time)

I haven’t done a whole load of exploration as to how much of this is PooledArray vs ShortStrings, and whether e.g. for some larger vectors with many unique entries, PooledArrays become less beneficial, but as I said I’ve only seen massive improvements in speed and memory pressure across all parts of my code that touches large DataFrames with many strings.

oxinabox · May 17, 2021, 3:51pm

ShortStrings.jl is a package that @xiaodai made which encoded strings inside bit-integers (like UInt128).
I believe it was initially made so could have very fast sorting: by comparing the integers.

I found the need to avoid reference types to avoid hurting the GC performance.
And so revamped it a bit for that purprose.
and Scott P Jones came on board and implemented even more.

The basic idea is that you can put all the bits needed to represent a string into a interger and then use bitshifts etc to get them back out as needed.

Juan · August 5, 2021, 11:45am

Would you update the benchmarks with Julia v1.7 and Dataframes.jl v1.2?

tbeason · August 5, 2021, 12:26pm

Julia 1.7 isn’t even released yet! I’m sure the benchmarks will be updated in due time, but we want the DataFrames crew doing the work that will actually lead to improvements in the benchmarks instead.

bkamins · August 5, 2021, 12:56pm

We will update the benchmarks when CSV.jl 1.0 is out, as this is currently the biggest bottleneck (I expect that join performance will be at least 3x better than what we have now with this change for 5GB test). See How to make your joins faster in DataFrames.jl? | Blog by Bogumił Kamiński for the explanation of the underlying difference.

In general - H2O benchmarks are very expensive to run. Respecting the money and carbon footprint I do not want to ask H2O team to re-run the benchmarks with every release. This is essentially along @tbeason comment.

As a general roadmap (you can check the detailed roadmap on GitHub, as we label every issue/PR) in the upcoming 1.3 release we concentrate on API improvements. At some point we will go back to performance, but first we need to add functionalities that users want.

nilshg · August 5, 2021, 1:16pm

I had never noticed this at the bottom of the page:

Benchmark run took around 163.3 hours

Palli · August 5, 2021, 11:48pm

That’s for all 13 competing packages (right?). So could run just one in 1/13 the time, or since Julia is about fastest (well depends on the benchmark) better than that… still seemingly hours.

Juan · August 8, 2021, 4:07pm

I didn’t know it was so expensive to run the tests.
Like you, I also meant to run just the Julia benchmark.
In fact, many of the other languages’ tests are updated much often.

bkamins · August 8, 2021, 9:27pm

OK - I will make a small PR to H2O and ask to re-run the benchmarks.

bkamins · August 8, 2021, 10:51pm

https://github.com/h2oai/db-benchmark/pull/232

xinchin · October 26, 2021, 2:01am

I found a bug in julia code in that website which its fix can double the performance of DataFrames.jl,
If I understand it correctly -t parameter is for parallel computing, currently DataFrames.jl only use 20 out of 40!

Oscar_Smith · October 26, 2021, 3:03am

Also Dataframes 1.3 and Julia 1.7 will speed things up a bunch. Unfortunately, it seems mostly unmaintained currently.

bkamins · October 26, 2021, 6:49am

This is a tricky decision. Using more logical cores than there is physical cores does not have to lead to performance boost. Such things would need to be benchmarked, but as @Oscar_Smith noted - the benchmark is currently not maintained. Hopefully it would change in the future.

xinchin · October 27, 2021, 4:37am

I think the maintainer wouldn’t like to spend too much time on it since data.table is not that much impressive as it used to be few years ago.

Oscar_Smith · October 27, 2021, 4:38am

That’s uncharitable. They were really good benchmarks. People just get busy.

Juan · June 4, 2022, 12:54am

The new DLMReader packages is supposed to improve the benchmarks: [ANN] DLMReader: the most versatile Julia package for reading delimited files yet - #13 by sl-solution

Topic		Replies	Views
Julia performs poorly on group-by benchmarks Data performance	48	5776	January 23, 2019
Julia's DataFrames.jl performance on join benchmark Community dataframes	1	1341	November 6, 2019
A minor group-by benchmark - DataFrames.jl plenty fast General Usage	5	460	August 27, 2020
DataFrames.jl data engineering performance compared with other softwares Performance performance	6	941	November 10, 2021
How much performance potential does DataFrames have? Offtopic question	7	4415	February 18, 2021

The state of DataFrames.jl H2O benchmark

Related topics