The state of DataFrames.jl H2O benchmark

Just to elaborate a little bit on this: one of my current projects involves wrangling a number of large-ish in-memory tables, with between 20 and 60 million rows and between 5 and 25 columns, most of which are string columns.

The analysis requires quite a lot of join operations as well as groupbys on subsets of those tables. When I originally wrote the code, it took over an hour to run from top to bottom, and I had to write out intermediate results multiple times so I could restart the Julia process and free up memory.

@oxinabox then told me to use ShortString types for my string data, and to convert the columns into PooledArrays. The results were nothing short of magical: processing times on almost all operations went down by between 50 and 90%, and I can now run the whole analysis in about 15 minutes, without having to do any restarts.

As Bogumil mentions, there is ongoing work as well as discussion around how to improve GC issues in the data ecosystem, and potentially more widely, but in the meantime I'd highly recommend trying out ShortStrings and PooledArrays for string-heavy DataFrame workflows!
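Not from the thread itself, but a minimal sketch of the kind of conversion being recommended here (the table and column names are made up for illustration; ShortString7 holds strings of up to 7 bytes):

```julia
using DataFrames, PooledArrays, ShortStrings

# Hypothetical table with plain String columns
df = DataFrame(id = ["AB", "CD", "AB", "EF"],
               code = ["X1", "X2", "X1", "X3"])

# Pool every string column, storing the values as ShortString7s
for col in names(df)
    if eltype(df[!, col]) <: AbstractString
        df[!, col] = PooledArray(ShortString7.(df[!, col]))
    end
end
```

For longer strings, pick a wider type such as ShortString15, and note that pooling pays off most when columns have many repeated values.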

11 Likes

@nilshg could you share something generic using PooledArrays and ShortStrings so that whenever I have something as hairy as your analysis I can pull this magic out? :grinning:

4 Likes

I had never heard of this package and would also be interested in learning more.

2 Likes

It turns out it's actually quite hard to produce the same magnitude of speed gains I'm seeing in my application, but here's an example that roughly shows what I mean:

using DataFrames, Random

string1 = [randstring(4) for _ ∈ 1:50_000]
string2 = [randstring(6)*string(rand(1:10, 3)) for _ ∈ 1:16_000_000]

df = DataFrame(
    col1 = [rand(string1) for _ ∈ 1:20_000_000],
    col2 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col3 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col4 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col5 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]))

to_join = DataFrame(col1 = rand(string1, 500_000),
                    colx = rand(string2, 500_000),
                    val = rand(500_000))

@time out = rename(leftjoin(df, to_join, on = [:col1, :col2 => :colx], matchmissing = :equal), 
                    :val => :val1)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col3 => :colx], matchmissing = :equal), 
                    :val => :val2)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col4 => :colx], matchmissing = :equal), 
                    :val => :val3)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col5 => :colx], matchmissing = :equal), 
                    :val => :val4)

That gives me

 21.597744 seconds (3.03 M allocations: 2.389 GiB, 22.05% gc time, 20.13% compilation time)
 22.779038 seconds (241.43 k allocations: 2.566 GiB, 42.94% gc time, 1.80% compilation time)
 21.833655 seconds (532 allocations: 2.888 GiB, 40.68% gc time)
 53.875572 seconds (558 allocations: 3.223 GiB, 78.24% gc time)

Note that the results are highly variable, depending on when GC gets triggered. The GC share here is also much lower than what I saw in my application, where GC was consistently above 90% of run time.

Now change the example as follows. First, add:

using ShortStrings, PooledArrays

construct the string vectors as

string1 = ShortString7.([randstring(4) for _ ∈ 1:50_000])
# 1:9 rather than 1:10, so each string stays within ShortString15's 15-byte limit
string2 = ShortString15.([randstring(6)*string(rand(1:9, 3)) for _ ∈ 1:16_000_000])

and then in the DataFrame constructor wrap the columns in PooledArrays like this:

col2 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4_000_000]]))

That gives me:

  5.884730 seconds (4.46 M allocations: 1.880 GiB, 16.65% gc time, 53.45% compilation time)
  3.354185 seconds (241.75 k allocations: 1.972 GiB, 25.28% gc time)
  3.153659 seconds (581 allocations: 2.293 GiB, 25.81% gc time)
  3.591859 seconds (611 allocations: 2.629 GiB, 28.89% gc time)

I haven't done a whole lot of exploration into how much of this is PooledArrays vs ShortStrings, and whether e.g. for larger vectors with many unique entries PooledArrays become less beneficial, but as I said I've only seen massive improvements in speed and memory pressure across all the parts of my code that touch large DataFrames with many strings.
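Not from the thread, but one quick way to look at the PooledArrays part of the effect in isolation is to compare memory footprints with Base.summarysize, which walks the whole object graph (the sizes and vector lengths below are arbitrary):

```julia
using PooledArrays, Random

# A million entries drawn from 10_000 unique 8-character strings
vals   = [randstring(8) for _ in 1:10_000]
plain  = rand(vals, 1_000_000)
pooled = PooledArray(plain)

# summarysize counts both the references and the underlying storage;
# the pooled version stores each unique string once plus integer refs
println(Base.summarysize(plain))
println(Base.summarysize(pooled))
```

With few unique values relative to the vector length, the pooled version should come out substantially smaller; as the unique count approaches the vector length, that advantage shrinks.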

10 Likes

ShortStrings.jl is a package that @xiaodai made which encodes strings inside bits-type integers (like UInt128).
I believe it was initially made so one could have very fast sorting, by comparing the integers.

I found the need to avoid reference types in order to avoid hurting GC performance, and so revamped it a bit for that purpose.
Then Scott P Jones came on board and implemented even more.

The basic idea is that you can put all the bits needed to represent a short string into an integer and then use bitshifts etc. to get them back out as needed.
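A toy sketch of that idea in plain Julia (an illustration only, not ShortStrings.jl's actual encoding or layout): pack the bytes of a short ASCII string into a UInt64, keeping the length in the low byte, and unpack with shifts and masks.

```julia
function pack7(s::AbstractString)
    bytes = codeunits(s)
    length(bytes) <= 7 || error("string too long for 7-byte packing")
    u = UInt64(length(bytes))        # length lives in the lowest byte
    for (i, b) in enumerate(bytes)
        u |= UInt64(b) << (8 * i)    # byte i goes in the next slot up
    end
    return u
end

function unpack7(u::UInt64)
    n = Int(u & 0xff)                # recover the length
    bytes = [UInt8((u >> (8 * i)) & 0xff) for i in 1:n]
    return String(bytes)
end
```

Because the whole string lives in one machine integer, there is no heap allocation and nothing for the GC to track.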

11 Likes

Would you update the benchmarks with Julia v1.7 and DataFrames.jl v1.2?

Julia 1.7 isn't even released yet! I'm sure the benchmarks will be updated in due time, but we want the DataFrames crew doing the work that will actually lead to improvements in the benchmarks instead.

2 Likes

We will update the benchmarks when CSV.jl 1.0 is out, as this is currently the biggest bottleneck (I expect that join performance will be at least 3x better than what we have now with this change for the 5 GB test). See How to make your joins faster in DataFrames.jl? | Blog by Bogumił Kamiński for an explanation of the underlying difference.

In general, the H2O benchmarks are very expensive to run. Out of respect for the money and carbon footprint involved, I do not want to ask the H2O team to re-run the benchmarks with every release. This is essentially in line with @tbeason's comment.

As for the general roadmap (you can check the detailed roadmap on GitHub, as we label every issue/PR): in the upcoming 1.3 release we are concentrating on API improvements. At some point we will go back to performance, but first we need to add the functionality that users want.

14 Likes

I had never noticed this at the bottom of the page:

Benchmark run took around 163.3 hours

:flushed: :flushed:

9 Likes

That's for all 13 competing packages (right?). So one could run just one in 1/13th of the time, or, since Julia is about the fastest (well, it depends on the benchmark), better than that… still seemingly hours.

I didn't know it was so expensive to run the tests.
Like you, I also meant to run just the Julia benchmark.
In fact, many of the other languages' tests are updated much more often.

OK - I will make a small PR to H2O and ask to re-run the benchmarks.

1 Like

https://github.com/h2oai/db-benchmark/pull/232

1 Like

I found a bug in the Julia code on that website; fixing it could double the performance of DataFrames.jl. :cool:
If I understand it correctly, the -t parameter controls parallelism, and currently DataFrames.jl only uses 20 out of the 40 available threads!
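For reference (not from the benchmark code itself), you can check what a Julia session actually sees like this; the thread count is set at startup with the -t flag, e.g. `julia -t 40` or `julia -t auto`:

```julia
# Thread count Julia was started with vs. logical cores the OS reports
println("Julia threads: ", Threads.nthreads())
println("Logical cores: ", Sys.CPU_THREADS)
```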

Also, DataFrames.jl 1.3 and Julia 1.7 will speed things up a bunch. Unfortunately, the benchmark seems mostly unmaintained at the moment.

This is a tricky decision. Using more logical cores than there are physical cores does not necessarily lead to a performance boost. Such things would need to be benchmarked, but as @Oscar_Smith noted, the benchmark is currently not maintained. Hopefully that will change in the future.

1 Like

I think the maintainer wouldn't like to spend too much time on it, since data.table is not as impressive as it used to be a few years ago. :thinking:

That's uncharitable. They were really good benchmarks. People just get busy.

6 Likes

The new DLMReader package is supposed to improve the benchmarks: [ANN] DLMReader: the most versatile Julia package for reading delimited files yet - #13 by sl-solution

3 Likes