The state of DataFrames.jl H2O benchmark

It turns out it’s actually quite hard to reproduce the magnitude of speed gains I’m seeing in my application in a small example, but here’s one that roughly shows what I mean:

using DataFrames, Random

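# 50,000 short keys and 16 million longer keys such as "aB3dE9[4, 10, 7]"
string1 = [randstring(4) for _ ∈ 1:50_000]
string2 = [randstring(6)*string(rand(1:10, 3)) for _ ∈ 1:16_000_000]

# 20 million rows; col2 to col5 each hold the 16 million keys plus 4 million missings, shuffled
df = DataFrame(
    col1 = [rand(string1) for _ ∈ 1:20_000_000],
    col2 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col3 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col4 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col5 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]))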

to_join = DataFrame(col1 = rand(string1, 500_000),
                    colx = rand(string2, 500_000),
                    val = rand(500_000))

# Four successive left joins; :val is renamed after each one so the next join can add it again
@time out = rename(leftjoin(df, to_join, on = [:col1, :col2 => :colx], matchmissing = :equal),
                    :val => :val1)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col3 => :colx], matchmissing = :equal),
                    :val => :val2)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col4 => :colx], matchmissing = :equal),
                    :val => :val3)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col5 => :colx], matchmissing = :equal),
                    :val => :val4)

That gives me:

 21.597744 seconds (3.03 M allocations: 2.389 GiB, 22.05% gc time, 20.13% compilation time)
 22.779038 seconds (241.43 k allocations: 2.566 GiB, 42.94% gc time, 1.80% compilation time)
 21.833655 seconds (532 allocations: 2.888 GiB, 40.68% gc time)
 53.875572 seconds (558 allocations: 3.223 GiB, 78.24% gc time)

Note that the results are highly variable, depending on when GC gets triggered. Also, the GC share here is much lower than what I saw in my application, where GC time was consistently above 90%.
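Since the timings depend on when GC kicks in, one way to make runs more comparable is to force a full collection before each measurement and read the GC share from `@timed` instead of the printed `@time` line. A minimal sketch using only Base; `timed_after_gc` is a hypothetical helper name, and the workload is just a stand-in that allocates many small strings, not the joins above:

```julia
# Hypothetical helper: run `f` after a forced full collection, so garbage left
# over from earlier steps is less likely to be charged to this measurement.
function timed_after_gc(f)
    GC.gc()                       # full collection before the measurement starts
    stats = @timed f()            # NamedTuple with fields value, time, bytes, gctime, ...
    gc_share = stats.time > 0 ? stats.gctime / stats.time : 0.0
    return (value = stats.value, seconds = stats.time,
            bytes = stats.bytes, gc_share = gc_share)
end

# Stand-in workload (an assumption for illustration): one million small strings
r = timed_after_gc(() -> [string(i) for i in 1:1_000_000])

println("elapsed:  ", round(r.seconds; digits = 3), " s")
println("GC share: ", round(100 * r.gc_share; digits = 1), " %")
```

The first call still includes compilation time, so a warm-up call before the measured one is usually worthwhile.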

Now change the example as follows. Add:

using ShortStrings, PooledArrays

then construct the string vectors as

string1 = ShortString7.([randstring(4) for _ ∈ 1:5e4])
string2 = ShortString15.([randstring(6)*string(rand(1:9, 3)) for _ ∈ 1:16e6])

and then use PooledArrays for the DataFrame columns like this:

col2 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4e6]]))

That gives me:

  5.884730 seconds (4.46 M allocations: 1.880 GiB, 16.65% gc time, 53.45% compilation time)
  3.354185 seconds (241.75 k allocations: 1.972 GiB, 25.28% gc time)
  3.153659 seconds (581 allocations: 2.293 GiB, 25.81% gc time)
  3.591859 seconds (611 allocations: 2.629 GiB, 28.89% gc time)
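A side note on the `rand(1:9, 3)` in the modified `string2`, where the original used `rand(1:10, 3)`. A plausible reason for the change (an assumption, the post does not say) is that `ShortString15` holds at most 15 bytes, and a 6-character prefix plus a suffix like `[10, 10, 10]` would not fit. The lengths can be checked with Base and Random alone:

```julia
using Random

# string(::Vector{Int}) prints like "[1, 2, 3]"
suffix_small = string([9, 9, 9])       # longest possible suffix from rand(1:9, 3)
suffix_large = string([10, 10, 10])    # longest possible suffix from rand(1:10, 3)

# randstring produces ASCII characters, so each one takes one byte
key_small = randstring(6) * suffix_small
key_large = randstring(6) * suffix_large

println(sizeof(key_small))   # 15, the most a 15-byte short string can hold
println(sizeof(key_large))   # 18, too long for 15 bytes
```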

I haven’t explored how much of this is due to PooledArrays versus ShortStrings, or whether PooledArrays become less beneficial for, e.g., larger vectors with many unique entries. But as I said, I’ve only seen massive improvements in speed and memory pressure across all parts of my code that touch large DataFrames with many strings.
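To get a first idea of which of the two changes matters for memory, one could build the same column in all four combinations and compare footprints. A rough sketch with small, made-up sizes (`n_unique` and `n` are hypothetical and far below the 20 million rows above); it only looks at memory via `Base.summarysize`, not at join speed:

```julia
using Random, ShortStrings, PooledArrays

n_unique, n = 1_000, 100_000     # hypothetical sizes, chosen so this runs quickly
uniq = [randstring(6) * string(rand(1:9, 3)) for _ in 1:n_unique]

plain        = rand(uniq, n)           # Vector{String}
short        = ShortString15.(plain)   # Vector{ShortString15}
pooled_plain = PooledArray(plain)      # integer refs into a pool of Strings
pooled_short = PooledArray(short)      # integer refs into a pool of ShortString15s

# Whether the element type is a plain-bits type
println("isbitstype(String):        ", isbitstype(String))
println("isbitstype(ShortString15): ", isbitstype(ShortString15))

# Memory reachable from each variant, in bytes
for (name, v) in [("plain", plain), ("short", short),
                  ("pooled_plain", pooled_plain), ("pooled_short", pooled_short)]
    println(rpad(name, 14), Base.summarysize(v), " bytes")
end
```

`isbitstype` is printed because vectors of isbits elements store their data inline, so the GC has no per-element heap objects to trace. That is a likely, but here unverified, contributor to the lower GC share; timing the joins for each variant would be needed to confirm it.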