It turns out it’s actually quite hard to reproduce the same magnitude of speed gains I’m seeing in my application, but here’s an example that roughly shows what I mean:
using DataFrames, Random

string1 = [randstring(4) for _ ∈ 1:50_000]                             # 50k unique 4-character keys
string2 = [randstring(6)*string(rand(1:10, 3)) for _ ∈ 1:16_000_000]   # 16M strings like "abcdef[3, 7, 10]"

df = DataFrame(
    col1 = [rand(string1) for _ ∈ 1:20_000_000],
    col2 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),          # 16M strings plus 4M missings, shuffled
    col3 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col4 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]),
    col5 = shuffle!([string2; [missing for _ ∈ 1:4_000_000]]))

to_join = DataFrame(col1 = rand(string1, 500_000),
                    colx = rand(string2, 500_000),
                    val  = rand(500_000))
@time out = rename(leftjoin(df, to_join, on = [:col1, :col2 => :colx], matchmissing = :equal),
                   :val => :val1)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col3 => :colx], matchmissing = :equal),
                   :val => :val2)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col4 => :colx], matchmissing = :equal),
                   :val => :val3)
@time out = rename(leftjoin(out, to_join, on = [:col1, :col5 => :colx], matchmissing = :equal),
                   :val => :val4)
That gives me:
21.597744 seconds (3.03 M allocations: 2.389 GiB, 22.05% gc time, 20.13% compilation time)
22.779038 seconds (241.43 k allocations: 2.566 GiB, 42.94% gc time, 1.80% compilation time)
21.833655 seconds (532 allocations: 2.888 GiB, 40.68% gc time)
53.875572 seconds (558 allocations: 3.223 GiB, 78.24% gc time)
Note that the results are highly variable, depending on when GC gets triggered. The GC share here is also much lower than what I saw in my application, where it was consistently above 90%.
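To get more stable numbers, one option is to force a collection before each timed call; a minimal sketch (not part of the benchmark above) would be:

GC.gc()   # full collection so the next @time starts from a clean slate
@time out = rename(leftjoin(df, to_join, on = [:col1, :col2 => :colx], matchmissing = :equal),
                   :val => :val1)

# With BenchmarkTools.jl installed you could use @btime instead, though on data this
# size repeating the join many times gets expensive:
# using BenchmarkTools
# @btime leftjoin($df, $to_join, on = [:col1, :col2 => :colx], matchmissing = :equal);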
Now change the example as follows. Add:
using ShortStrings, PooledArrays
construct the string vectors as
string1 = ShortString7.([randstring(4) for _ ∈ 1:5e4])
string2 = ShortString15.([randstring(6)*string(rand(1:9, 3)) for _ ∈ 1:16e6])
(the switch from 1:10 to 1:9 keeps every string at 15 characters or fewer, the maximum a ShortString15 can hold), and then wrap the columns in PooledArray when constructing the DataFrame, like this:
col2 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4e6]]))
That gives me:
5.884730 seconds (4.46 M allocations: 1.880 GiB, 16.65% gc time, 53.45% compilation time)
3.354185 seconds (241.75 k allocations: 1.972 GiB, 25.28% gc time)
3.153659 seconds (581 allocations: 2.293 GiB, 25.81% gc time)
3.591859 seconds (611 allocations: 2.629 GiB, 28.89% gc time)
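For reference, this is how I read the full modified setup; the example above only shows col2 being pooled, so applying PooledArray to col1 and the remaining columns the same way is my extrapolation:

using DataFrames, Random, ShortStrings, PooledArrays

string1 = ShortString7.([randstring(4) for _ ∈ 1:50_000])
string2 = ShortString15.([randstring(6)*string(rand(1:9, 3)) for _ ∈ 1:16_000_000])

df = DataFrame(
    col1 = PooledArray([rand(string1) for _ ∈ 1:20_000_000]),   # only 50k unique values, so pooling this is my guess
    col2 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4_000_000]])),
    col3 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4_000_000]])),
    col4 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4_000_000]])),
    col5 = PooledArray(shuffle!([string2; [missing for _ ∈ 1:4_000_000]])))

# to_join and the four leftjoin/rename calls are unchanged from the first example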
I haven’t done a whole lot of exploration into how much of this is PooledArray vs ShortStrings, or whether, for example, PooledArrays become less beneficial for larger vectors with many unique entries. But as I said, so far I’ve only seen massive improvements in speed and memory pressure across all the parts of my code that touch large DataFrames with many strings.