Julia performs poorly on group-by benchmarks

What are the eltypes of the columns? Bits types, possibly basic ones like Int and Float64, or perhaps a composite bits type?

Can you use narrower types, eg Int16 instead of Int64?
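To illustrate the point about narrower types: a column whose values fit in a small range can be stored in a quarter of the memory by converting it. A minimal sketch (the column name and range here are hypothetical):

```julia
using DataFrames

# Hypothetical column that only ever holds values in 1:100
df = DataFrame(id = rand(1:100, 10^6))

# Int16 covers the range at a quarter of the memory of Int64
df.id = Int16.(df.id)

Base.summarysize(df.id)  # roughly 2 MB instead of roughly 8 MB
```

Narrower keys also make grouping and sorting more cache-friendly, which is where group-by benchmarks tend to win or lose.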

HDF5 should be more usable; it is a pretty well-maintained package.

For short-term storage on the same machine (I understand this is your use case) you can use serialize/deserialize from the built-in Serialization module. This is what I normally use, as it is the most reliable approach (provided the limitations of serialize/deserialize are acceptable — for instance, the format is not guaranteed to be readable across Julia versions).

Here are example timings for 100x smaller data:

julia> using DataFrames, Serialization

julia> df = DataFrame(rand(10^7, 10));

julia> @time open("tmp.bin", "w") do f
           serialize(f, df)
       end
  1.743215 seconds (2.04 M allocations: 97.468 MiB, 3.76% gc time)

julia> @time open("tmp.bin", "w") do f
           serialize(f, df)
       end
  0.920587 seconds (5.12 k allocations: 260.501 KiB)

julia> @time df2 = open(deserialize, "tmp.bin");
  1.072732 seconds (715.56 k allocations: 798.918 MiB, 17.02% gc time)

julia> @time df2 = open(deserialize, "tmp.bin");
  0.687661 seconds (139 allocations: 762.947 MiB, 38.90% gc time)

julia> df2 == df
true

You might also consider wrapping it with e.g. https://github.com/bicycle1885/TranscodingStreams.jl for compression (sometimes it helps the performance, but it depends on the data you have).
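A sketch of what that wrapping can look like, assuming the CodecZlib package (which is built on TranscodingStreams.jl; any other codec package would work the same way) is installed:

```julia
using DataFrames, Serialization, CodecZlib

df = DataFrame(a = rand(10^5), b = rand(1:10, 10^5))

# Serialize through a compressing stream; close() flushes the codec
open("tmp.bin.gz", "w") do f
    stream = GzipCompressorStream(f)
    serialize(stream, df)
    close(stream)
end

# Deserialize through the matching decompressing stream
df2 = open("tmp.bin.gz") do f
    deserialize(GzipDecompressorStream(f))
end

df2 == df  # true
```

Whether this is a net win depends on the data: highly repetitive columns compress well and can be faster to read from disk, while already-dense random floats may just add CPU overhead.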


Hey, that looks useful. I added this to query.jl a while ago, might as well put it there.

Duplicating stuff in DataBench.jl

I tried to follow the instructions; Nextjournal seems great, but the interface feels too busy and cluttered at the moment.

Anyway, these are the instructions I followed:

  1. Remix
  2. Click left of the cell
  3. Change Runtime
  4. Add New Runtime and select the base image (e.g. julia-1.0)
  5. Then just add the right packages + branches to set up your benchmark!

How do I do step 5? I tried Pkg.add("MyPackage"), but it timed out and complained about running out of memory. I assume I am meant to change the settings for each runtime and give it a name, but I can't figure out how.

Just updated FastGroupBy.jl for Julia v1 and revisited the benchmarks. For string-based IDs DataFrames.jl is still faster, but for integer IDs FastGroupBy is faster. FastGroupBy uses radix sort as the grouping mechanism, so potentially that's something DataFrames.jl could adopt for performance reasons.

[figure: ID4_ID6.png — grouped-bar chart comparing FastGroupBy and DataFrames timings]

using Revise
using FastGroupBy
using DataFrames

N = 100_000_000;
K = 100;

# faster string sort
#svec = rand("id".*string.(1:N÷K, pad=10), N);
#svec = rand("id".*string.(1:K, pad=3), N);
v1 = rand(1:5, N)
id4 = rand(1:100, N)
id6 = rand(1:N÷K, N)

df = DataFrame(v1 = v1, id4 = id4, id6 = id6)

using BenchmarkTools
fastby_id4 = @belapsed fastby(sum, df, :id4, :v1)
by_id4 = @belapsed by(df, :id4, v1 = :v1 => sum)

fastby_id6 = @belapsed fastby(sum, df, :id6, :v1)
by_id6 = @belapsed by(df, :id6, v1 = :v1 => sum)

using Plots
using StatPlots  # renamed to StatsPlots in later releases
groupedbar(
    repeat(["ID4", "ID6"], inner=2),
    [fastby_id4, by_id4, fastby_id6, by_id6],
    group = repeat(["FastGroupBy","DataFrames"], outer = 2),
    title = "FastGroupBy performance (100m rows)")
savefig("benchmark/ID4_ID6.png")

It should be relatively easy to add optimized grouping methods for specific types: we just need to add more methods for row_group_slots (currently we have one CategoricalArrays method and one generic fallback). Please feel free to file PRs so that we can take advantage of your work in FastGroupBy. We could also use the CategoricalArray method for vectors of small integers.
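To make the idea concrete, here is a minimal sketch (not DataFrames internals; the function name is made up) of why small-integer keys can skip hashing entirely: each key value maps directly to a dense slot index, just like CategoricalArray level codes.

```julia
# Assign each row a group slot by direct indexing on the key's value range.
# No hash table is needed when the keys are small integers in a known range.
function group_slots_smallint(keys::Vector{<:Integer}, lo::Integer, hi::Integer)
    ngroups = hi - lo + 1
    groups = Vector{Int}(undef, length(keys))
    @inbounds for i in eachindex(keys)
        groups[i] = keys[i] - lo + 1  # slot is just an offset into 1:ngroups
    end
    return groups, ngroups
end

groups, n = group_slots_smallint([3, 1, 3, 2], 1, 3)
# groups == [3, 1, 3, 2], n == 3
```

This is O(N) with one pass and perfect cache behavior on the slot array, which is why a specialized row_group_slots method for small-integer columns pays off.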

BTW, we’ve just merged my PR adding optimized methods for grouped reductions, making sums about 6-10 times faster (excluding time to perform grouping).
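The core idea behind such fused grouped reductions can be sketched like this (a simplified illustration, not the actual DataFrames implementation): given precomputed group indices, one pass accumulates directly into per-group accumulators instead of materializing a sub-frame per group.

```julia
# Single-pass grouped sum: accumulate each value into its group's slot.
function grouped_sum(groups::Vector{Int}, ngroups::Int, v::Vector{Float64})
    sums = zeros(ngroups)
    @inbounds for i in eachindex(v)
        sums[groups[i]] += v[i]
    end
    return sums
end

grouped_sum([1, 2, 1, 2], 2, [1.0, 2.0, 3.0, 4.0])  # [4.0, 6.0]
```

Avoiding the per-group allocation and function-call overhead is where most of the 6-10x comes from.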

There seem to have been massive improvements to the group-bys! Congrats!
