Julia performs poorly on group-by benchmarks

performance

#1

H2O has published a set of benchmarks showing that Julia’s DataFrames.jl has, in general, the worst group-by performance of the data packages tested. JuliaDB.jl was not benchmarked, so it may be a good addition. I have done some work on optimising some of these benchmarks before, but I put it off until the release of v1.0. Now that v1.0.1 is out, it’s time for me to pick up the work again using FastGroupBy.jl.


What functions/packages should I use to sort and "group by" as fast as possible...?
#2

Yeah, the benchmarks leave a lot to be desired. It’s being discussed on Discourse here. There are a few soon-to-be-merged PRs that should improve things.


#3

And yes I will get back to this at some point and report on progress with trying out some of the suggestions in that thread…


#4

I created a nextjournal group for this:
https://nextjournal.com/julia-data

Everyone in that group can edit the articles and publish new ones, and we can even edit an article together at the same time! I plan to add a minimal Julia version for the sum v1 by id1 benchmark, so collaborative editing will be pretty useful! @xiaodai, it seems like you’re in pole position to add a minimal version? I think it would really help to have more people tuning the performance!

The sum v1 by id1 article contains all the languages and multiple branches of the same packages, in isolated runtimes that run on the same hardware. That makes it pretty easy to see where we are right now.

I hope we can fill in the performance gaps from here on! I put some instructions into sum v1 by id1 on how to add new benchmarks. Let me know if you have any problems with that!

If you want to be part of the group to add a new version, sign up and let me know so I can add you to julia-data.


#5

The benchmarks could be redone with the new DataFrames v0.15.


Another good option is


#6

It needs a PR, as their code is old.


#7

I actually tried to update the article, but I ran out of memory (15 GB of RAM). That seems weird, so I’m not sure what’s happening!


#8

With what code exactly? I can’t reproduce locally.


#9

The original H2O benchmark should also pick up the new DataFrames automatically whenever it updates itself. It could still use a PR, though, because it uses the do-block syntax that the DataFrames docs specifically recommend avoiding when performance matters.
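To illustrate the difference between the two call styles, here is a minimal sketch. It assumes the DataFrames.jl 1.x API (`groupby` plus `combine`), which postdates this thread; at the time of v0.15 the grouping function was `by`. The data and the names `via_do` and `via_pair` are made up for the example.

```julia
using DataFrames

# Tiny made-up table in the shape of the benchmark's "sum v1 by id1" question.
df = DataFrame(id1 = ["id001", "id002", "id001", "id003", "id002"],
               v1  = [1, 2, 3, 4, 5])
gd = groupby(df, :id1)

# do-block form: the anonymous function receives each group as a SubDataFrame.
via_do = combine(gd) do sdf
    (v1_sum = sum(sdf.v1),)
end

# Column-pair form: names the input column, the function, and the output column.
via_pair = combine(gd, :v1 => sum => :v1_sum)

println(sort(via_pair, :id1))
```

Both forms return the same table; the column-pair form tells DataFrames which column and which function are involved, which is what lets it take a specialised path for reductions such as `sum`.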


#10

The PR is already merged so we now have to wait and see how it goes (there are some issues with CSV.jl that might need fixing).


#11

And the results are in. Not too good on the performance side, but the new syntax is lovely and it’s nice to see that it could also handle the 50GB file (unlike, for example, pandas).


#12

These benchmarks are informative, but data.table/pandas/dplyr heavily optimize sum/mean. For any other function Julia may have similar or better performance (because repeatedly calling a function is so much slower in R than in Julia).

On the other hand, running the benchmarks on data with missing values may make Julia slower compared to these packages.


#13

There are still performance optimizations that we can add to speed it up. But at least in all the benchmarks we got closer to the fastest options.


#14

Agreed. Because InternedStrings.jl is so much better than before, I tried an interned-string approach on 100 million rows, which yielded the following:

using SortingAlgorithms, InternedStrings, DataFrames

function createSynDataFrame(N::Int,K::Int)
    pool = "id".*string.(1:K, pad=3)
    pool1 = "id".*string.(1:N÷K,pad=10)
    nums = round.(rand(100).*100, digits = 4)

    df = DataFrame(
        id1 = intern.(rand(pool,N)),
        id2 = intern.(rand(pool,N)),
        id3 = intern.(rand(pool1,N)),
        id4 = rand(1:K,N),
        id5 = rand(1:K,N),
        id6 = rand(1:(N÷K),N),
        v1 = rand(1:5,N),
        v2 = rand(1:5,N),
        v3 = rand(nums,N))
    return df
end

@time df1 = createSynDataFrame(100_000_000, 100)


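# Interned strings with equal content share one object, so equal keys have
# equal pointers. Returns the pointers as UInt (in original row order) and
# the permutation that sorts them, which places equal keys next to each other.
function sortandperm(x::Vector{String})
    ap = UInt.(pointer.(x))
    ai = sortperm(ap, alg = RadixSort)
    ap, ai
end

using BenchmarkTools
# Interpolate the column with $ so the global lookup is not part of the timing.
@benchmark sortandperm($(df1.id1))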

BenchmarkTools.Trial:
memory estimate: 1.49 GiB
allocs estimate: 7

minimum time: 757.153 ms (5.08% GC)
median time: 913.535 ms (21.46% GC)
mean time: 1.079 s (32.91% GC)
maximum time: 2.023 s (64.62% GC)

samples: 5
evals/sample: 1

As you can see, on my laptop the most expensive part, the grouping, takes at most about 2 seconds, versus the 15 seconds that the first group-by took for Julia (5G). Computing the sum afterwards should be a piece of cake and take hardly any time at all. I think Julia should finish within 4 seconds if we use InternedStrings.jl.
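To show why the sum is cheap once a sort permutation is available, here is a minimal Base-only sketch. It is not the code used for the timings above: the function name `sum_by_sorted` is hypothetical, and a plain `sortperm` stands in for the radix sort on pointers.

```julia
# Sum `vals` within each group of `ids` by walking the rows in sorted order.
function sum_by_sorted(ids::AbstractVector, vals::AbstractVector)
    perm = sortperm(ids)               # stand-in for the radix-sort step
    out_ids = eltype(ids)[]
    out_sums = eltype(vals)[]
    isempty(perm) && return out_ids, out_sums

    current = ids[perm[1]]
    acc = zero(eltype(vals))
    for i in perm
        id = ids[i]
        if id != current
            # A new key starts: emit the finished group and reset the accumulator.
            push!(out_ids, current)
            push!(out_sums, acc)
            current = id
            acc = zero(eltype(vals))
        end
        acc += vals[i]
    end
    # Emit the last group.
    push!(out_ids, current)
    push!(out_sums, acc)
    return out_ids, out_sums
end

ks, ss = sum_by_sorted(["id002", "id001", "id002", "id003", "id001"], [1, 2, 3, 4, 5])
println(ks, " => ", ss)
```

After the sort, the reduction is a single linear pass with one comparison and one addition per row, which is why it should add little on top of the grouping time.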


#15

Where are the results?


#16

The link is in the very first post of this thread.


#17

Will InternedStrings.jl be included by default on DataFrames.jl for string columns?


#18

I don’t think it will because it takes time to intern the strings.


#19

CSV.jl already interns strings, the problem is that we have no way to know whether a Vector{String} column only contains interned strings or not. The best solution we have for now is to use CategoricalArrays (using categorical=true or categorical=0.1, etc.), or PooledArrays (not yet supported). That’s actually even more efficient than interned strings, since we have group indices from 1 to N rather than pointers to strings (which still need to be hashed or sorted).

The H2O benchmark shows poor results for DataFrames.jl because it doesn’t use CategoricalArrays right now, but hopefully it will very soon.
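The advantage of group indices from 1 to N can be sketched in plain Julia. This is only an illustration of the idea, not how CategoricalArrays or PooledArrays are implemented; the helpers `pool_codes` and `sum_by_code` are hypothetical.

```julia
# Map each distinct value to an integer code in 1:length(levels).
function pool_codes(x::AbstractVector{T}) where {T}
    index = Dict{T,Int}()
    levels = T[]
    codes = Vector{Int}(undef, length(x))
    for (i, v) in enumerate(x)
        # On first sight of `v`, register it as a new level and store its code.
        codes[i] = get!(index, v) do
            push!(levels, v)
            length(levels)
        end
    end
    return codes, levels
end

# With codes in hand, grouping needs no hashing or sorting: the code is the slot.
function sum_by_code(codes::Vector{Int}, ngroups::Int, vals::AbstractVector{V}) where {V}
    sums = zeros(V, ngroups)
    for i in eachindex(codes, vals)
        sums[codes[i]] += vals[i]
    end
    return sums
end

codes, levels = pool_codes(["b", "a", "b", "c"])
sums = sum_by_code(codes, length(levels), [10, 20, 30, 40])
println(levels, " => ", sums)
```

The pooling cost is paid once when the column is built, after which every group-by on that column is a direct array index per row.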


#20

As far as I can see (/h2oai/db-benchmark/juliadf/setup-juliadf.sh):

  • it is a Julia v1.0.0 benchmark (v1.0.2 is out now and v1.1.0 is expected)
  • there is no precompilation step.

IMHO the effect may be small, but it matters for a fresh benchmark.
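A related effect that is easy to check locally is the cost of the first call to a function, which includes JIT compilation. The sketch below uses a hypothetical `sum_by_id` function on made-up data and is not taken from the benchmark scripts; it only times the same call twice.

```julia
# Hypothetical group-sum over integer ids in 1:ngroups.
function sum_by_id(ids::Vector{Int}, vals::Vector{Float64}, ngroups::Int)
    sums = zeros(ngroups)
    for i in eachindex(ids, vals)
        sums[ids[i]] += vals[i]
    end
    return sums
end

ids = rand(1:100, 1_000_000)
vals = rand(1_000_000)

first_run = @elapsed sum_by_id(ids, vals, 100)    # includes compiling sum_by_id
second_run = @elapsed sum_by_id(ids, vals, 100)   # runs already-compiled code

println("first call:  ", first_run, " s")
println("second call: ", second_run, " s")
```

Whether a benchmark should report the first or the second number is a design choice, but it helps if the report states which one it is.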