Julia performs poorly on group-by benchmarks

xiaodai · October 18, 2018, 1:29am

H2O has published a set of benchmarks showing that Julia’s DataFrames.jl has, in general, the worst group-by performance of the data packages tested. JuliaDB.jl was not benchmarked, so it may be a good addition. I have done some work before on optimising some benchmarks, and I had been putting it off until the release of v1.0. Now that v1.0.1 is out, it’s time for me to pick up the work again using FastGroupBy.jl.

pdeffebach · October 18, 2018, 3:09am

Yeah the benchmarks leave a lot to be desired. It’s being discussed on discourse here. There are a few soon-to-be-merged PRs that should improve things.

nilshg · October 18, 2018, 8:56am

And yes I will get back to this at some point and report on progress with trying out some of the suggestions in that thread…

sdanisch · October 18, 2018, 2:22pm

I created a nextjournal group for this:

Everyone in that group can edit the articles and publish new ones, and we can even edit an article together at the same time! I plan to add a minimal Julia version for the sum v1 by id1 benchmark, so collaborative editing will be pretty useful. @xiaodai, it seems like you’re in pole position to add a minimal version? I think it would really help to have more people tune the performance!

The sum v1 by id1 contains all the languages & multiple branches of the same packages in the same article in isolated runtimes which run on the same hardware. That makes it pretty easy to see where we are right now:

I hope we can fill in the performance gaps from here on! I put some instructions into sum v1 by id1 on how to add new benchmarks. Let me know if you have any problems with that!

If you want to be part of the group to add a new version, sign up and let me know so I can add you to julia-data.

Juan · December 4, 2018, 12:21pm

The benchmarks could be redone with the new DataFrames v0.15.

Another good option is

xiaodai · December 4, 2018, 12:35pm

It needs a PR, as their code is old.

sdanisch · December 4, 2018, 2:41pm

I actually tried to update the article, but I ran out of 15 GB of RAM. That seems weird, so I’m not sure what’s happening!

nalimilan · December 4, 2018, 5:14pm

With what code exactly? I can’t reproduce locally.

ValdarT · December 4, 2018, 6:57pm

The original H2O benchmark should also pick up the new DataFrames automatically whenever it updates itself, but it could use a PR: it uses the do syntax, which the DataFrames docs specifically recommend avoiding if performance matters.
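To illustrate the difference, here is a minimal sketch of the two styles. It is only an illustration under some assumptions: it uses the `combine`/`groupby` API from later DataFrames.jl releases rather than the `by` function the benchmark used at the time, and the toy data is made up.

```julia
using DataFrames

# Made-up toy data, not the benchmark data set
df = DataFrame(id1 = ["id001", "id002", "id001", "id002"], v1 = [1, 2, 3, 4])
gdf = groupby(df, :id1)

# Column => function pair: the column and the reduction are stated explicitly,
# so the package can specialise on them
res_pair = combine(gdf, :v1 => sum => :v1)

# do-block: an anonymous function is called on each sub-data-frame
res_do = combine(gdf) do sdf
    (v1 = sum(sdf.v1),)
end
```

Both forms should give the same per-group sums; the pair form is the one to prefer when performance matters.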

bkamins · December 4, 2018, 7:11pm

The PR is already merged so we now have to wait and see how it goes (there are some issues with CSV.jl that might need fixing).

ValdarT · December 8, 2018, 9:44am

And the results are in. Not too good on the performance side, but the new syntax is lovely, and it’s nice to see that it could also handle the 50GB file (unlike, for example, Pandas).

matthieu · December 8, 2018, 3:21pm

These benchmarks are informative, but datatable/pandas/dplyr heavily optimize sum/mean. For any other function Julia may have similar or better performance (because repeatedly calling a function is so much slower in R than in Julia).

On the other hand, running the benchmarks on data with missing values may make Julia slower compared to these packages.

bkamins · December 8, 2018, 3:38pm

There are still performance optimizations that we can add to speed it up. But at least in all the benchmarks we got closer to the fastest options.

xiaodai · December 9, 2018, 3:45am

Agree. Because InternedStrings.jl is so much better than before, I tried an InternedStrings approach on 100 million rows, which yielded this:

using SortingAlgorithms, InternedStrings, DataFrames

function createSynDataFrame(N::Int,K::Int)
    pool = "id".*string.(1:K, pad=3)
    pool1 = "id".*string.(1:N÷K,pad=10)
    nums = round.(rand(100).*100, digits = 4)

    df = DataFrame(
        id1 = intern.(rand(pool,N)),
        id2 = intern.(rand(pool,N)),
        id3 = intern.(rand(pool1,N)),
        id4 = rand(1:K,N),
        id5 = rand(1:K,N),
        id6 = rand(1:(N÷K),N),
        v1 = rand(1:5,N),
        v2 = rand(1:5,N),
        v3 = rand(nums,N))
    return df
end

@time df1 = createSynDataFrame(100_000_000, 100)


function sortandperm(x::Vector{String})
    ap = UInt.(pointer.(x))
    ai = sortperm(ap, alg = RadixSort)
    ap, ai
end

using BenchmarkTools
@benchmark sortedid1, ai = sortandperm(df1.id1)

BenchmarkTools.Trial:
memory estimate: 1.49 GiB
allocs estimate: 7

minimum time: 757.153 ms (5.08% GC)
median time: 913.535 ms (21.46% GC)
mean time: 1.079 s (32.91% GC)
maximum time: 2.023 s (64.62% GC)

samples: 5
evals/sample: 1

As you can see, on my laptop the most expensive part, the grouping, takes at most about 2 seconds, versus the 15 seconds the first group-by took for Julia (5 GB). Computing a sum afterwards should be a piece of cake and take hardly any time. I think Julia should finish within 4 seconds if we use InternedStrings.jl.
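For what it’s worth, the "sum afterwards" step can be sketched roughly as below. This is a sketch under assumptions, not the FastGroupBy.jl implementation: `sum_by_sorted` is a hypothetical helper name, and a plain `sortperm` on the keys stands in for the radix sort on pointers shown above.

```julia
# Sum `vals` within each group of `ks`, given that a sorting permutation
# brings equal keys next to each other. Hypothetical helper, base Julia only.
function sum_by_sorted(ks::Vector{K}, vals::Vector{V}) where {K, V<:Number}
    perm = sortperm(ks)            # stand-in for the RadixSort sortperm on pointers
    out_keys = K[]
    out_sums = V[]
    isempty(perm) && return out_keys, out_sums

    current = ks[perm[1]]
    acc = zero(V)
    for i in perm                  # one linear pass over the sorted order
        k = ks[i]
        if k != current            # a new run of equal keys starts here
            push!(out_keys, current)
            push!(out_sums, acc)
            current = k
            acc = zero(V)
        end
        acc += vals[i]
    end
    push!(out_keys, current)       # flush the last group
    push!(out_sums, acc)
    return out_keys, out_sums
end

ids = ["id002", "id001", "id002", "id003", "id001"]
v1  = [1, 2, 3, 4, 5]
group_keys, group_sums = sum_by_sorted(ids, v1)
```

Once the permutation is known, the aggregation is a single pass over the data, which is why the sort dominates the cost.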

datnamer · December 9, 2018, 4:26am

Where are the results?

ValdarT · December 9, 2018, 11:10am

The link is in the very first post of this thread.

Juan · December 9, 2018, 11:22am

Will InternedStrings.jl be included by default on DataFrames.jl for string columns?

xiaodai · December 9, 2018, 11:29am

I don’t think it will because it takes time to intern the strings.

nalimilan · December 9, 2018, 2:31pm

CSV.jl already interns strings; the problem is that we have no way to know whether a Vector{String} column only contains interned strings or not. The best solution we have for now is to use CategoricalArrays (using categorical=true or categorical=0.1, etc.), or PooledArrays (not yet supported). That’s actually even more efficient than interned strings, since we have group indices from 1 to N rather than pointers to strings (which still need to be hashed or sorted).
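A rough base-Julia sketch of why group indices from 1 to N help. It does not use the actual CategoricalArrays or PooledArrays internals, and `encode` and `sum_by_code` are hypothetical names: once each string is replaced by an integer code, a grouped sum becomes a direct array accumulation with no hashing or sorting in the hot loop.

```julia
# Replace each distinct string by a code 1..N, roughly what a pooled or
# categorical column stores. Hypothetical helper, not a package API.
function encode(x::Vector{String})
    pool = Dict{String,Int}()
    levels = String[]
    codes = Vector{Int}(undef, length(x))
    for (i, s) in enumerate(x)
        codes[i] = get!(pool, s) do
            push!(levels, s)       # first time this string is seen
            length(levels)         # its code is its position in `levels`
        end
    end
    return codes, levels
end

# With codes in 1..ngroups, the grouped sum indexes straight into a vector
function sum_by_code(codes::Vector{Int}, vals::Vector{Int}, ngroups::Int)
    sums = zeros(Int, ngroups)
    for i in eachindex(codes, vals)
        sums[codes[i]] += vals[i]
    end
    return sums
end

ids = ["id002", "id001", "id002", "id003", "id001"]
v1  = [1, 2, 3, 4, 5]
codes, levels = encode(ids)
sums = sum_by_code(codes, v1, length(levels))
```

The encoding cost is paid once when the column is built, after which every group-by on that column can reuse the codes.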

The H2O benchmark shows poor results for DataFrames.jl because it doesn’t use CategoricalArrays right now, but hopefully it will very soon.

ImreSamu · December 9, 2018, 6:24pm

As far as I can see (/h2oai/db-benchmark/juliadf/setup-juliadf.sh):

it is a Julia v1.0.0 benchmark (v1.0.2 is out now and v1.1.0 is expected).
There is no precompilation.

IMHO: maybe the effect is small … but it matters for a fresh benchmark.
