Julia performs poorly on group-by benchmarks

What are the eltypes of the columns? Bits types, possibly basic ones like Int and Float64, or perhaps a composite bits type?

Can you use narrower types, eg Int16 instead of Int64?
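To illustrate the point about narrower types: a column whose values fit in a small range can be stored in a quarter of the memory by converting it. A minimal sketch (the column name and range here are hypothetical):

```julia
using DataFrames

# Hypothetical column that only ever holds values in 1:100
df = DataFrame(id = rand(1:100, 10^6))

# Int16 covers the range at a quarter of the memory of Int64
df.id = Int16.(df.id)

Base.summarysize(df.id)  # roughly 2 MB instead of roughly 8 MB
```

Narrower keys also make grouping and sorting more cache-friendly, which is where group-by benchmarks tend to win or lose.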

HDF5 should be more usable; it is a pretty well-maintained package.

For short-term storage on the same machine (I understand this is your use case) you can use serialize/deserialize from the built-in Serialization module. This is what I normally use, as it is the most reliable approach (provided the limitations of serialize/deserialize are acceptable — for instance, the format is not guaranteed to be readable across Julia versions).

Here are example timings for 100x smaller data:

julia> using DataFrames, Serialization

julia> df = DataFrame(rand(10^7, 10));

julia> @time open("tmp.bin", "w") do f
           serialize(f, df)
       end
  1.743215 seconds (2.04 M allocations: 97.468 MiB, 3.76% gc time)

julia> @time open("tmp.bin", "w") do f
           serialize(f, df)
       end
  0.920587 seconds (5.12 k allocations: 260.501 KiB)

julia> @time df2 = open(deserialize, "tmp.bin");
  1.072732 seconds (715.56 k allocations: 798.918 MiB, 17.02% gc time)

julia> @time df2 = open(deserialize, "tmp.bin");
  0.687661 seconds (139 allocations: 762.947 MiB, 38.90% gc time)

julia> df2 == df
true

You might also consider wrapping it with e.g. https://github.com/bicycle1885/TranscodingStreams.jl for compression (sometimes it helps the performance, but it depends on the data you have).
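A sketch of what that wrapping can look like, assuming the CodecZlib package (which is built on TranscodingStreams.jl; any other codec package would work the same way) is installed:

```julia
using DataFrames, Serialization, CodecZlib

df = DataFrame(a = rand(10^5), b = rand(1:10, 10^5))

# Serialize through a compressing stream; close() flushes the codec
open("tmp.bin.gz", "w") do f
    stream = GzipCompressorStream(f)
    serialize(stream, df)
    close(stream)
end

# Deserialize through the matching decompressing stream
df2 = open("tmp.bin.gz") do f
    deserialize(GzipDecompressorStream(f))
end

df2 == df  # true
```

Whether this is a net win depends on the data: highly repetitive columns compress well and can be faster to read from disk, while already-dense random floats may just add CPU overhead.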


Hey, that looks useful. I added this to query.jl a while ago, might as well put it there.

Duplicating stuff in DataBench.jl

I tried to follow the instructions; Nextjournal seems great, but the interface feels too busy and cluttered at the moment.

Anyway, these are the instructions I followed:

  1. Remix
  2. Click left of the cell
  3. Change Runtime
  4. Add New Runtime and select the base image (e.g. julia-1.0)
  5. Then just add the right packages + branches to set up your benchmark!

How do I do step 5? I tried Pkg.add("MyPackage"), but it timed out and complained about running out of memory. I assume I am meant to change the settings for each runtime and give it a name, but I can't figure out how.

Just updated FastGroupBy.jl for Julia v1 and revisited the benchmarks. For string-based IDs DataFrames.jl is still faster, but for integer IDs FastGroupBy is faster. FastGroupBy uses radix sort as the grouping mechanism, so potentially that's something DataFrames.jl could adopt for performance reasons.

[figure: ID4_ID6.png — grouped-bar chart comparing FastGroupBy and DataFrames timings]

using Revise
using FastGroupBy
using DataFrames

N = 100_000_000;
K = 100;

# faster string sort
#svec = rand("id".*string.(1:N÷K, pad=10), N);
#svec = rand("id".*string.(1:K, pad=3), N);
v1 = rand(1:5, N)
id4 = rand(1:100, N)
id6 = rand(1:N÷K, N)

df = DataFrame(v1 = v1, id4 = id4, id6 = id6)

using BenchmarkTools
fastby_id4 = @belapsed fastby(sum, df, :id4, :v1)
by_id4 = @belapsed by(df, :id4, v1 = :v1 => sum)

fastby_id6 = @belapsed fastby(sum, df, :id6, :v1)
by_id6 = @belapsed by(df, :id6, v1 = :v1 => sum)

using Plots
using StatPlots  # renamed to StatsPlots in later releases
groupedbar(
    repeat(["ID4", "ID6"], inner=2),
    [fastby_id4, by_id4, fastby_id6, by_id6],
    group = repeat(["FastGroupBy","DataFrames"], outer = 2),
    title = "FastGroupBy performance (100m rows)")
savefig("benchmark/ID4_ID6.png")

It should be relatively easy to add optimized grouping methods for specific types: we just need to add more methods for row_group_slots (currently we have one CategoricalArrays method and one generic fallback). Please feel free to file PRs so that we can take advantage of your work in FastGroupBy. We could also use the CategoricalArray method for vectors of small integers.
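To make the idea concrete, here is a minimal sketch (not DataFrames internals; the function name is made up) of why small-integer keys can skip hashing entirely: each key value maps directly to a dense slot index, just like CategoricalArray level codes.

```julia
# Assign each row a group slot by direct indexing on the key's value range.
# No hash table is needed when the keys are small integers in a known range.
function group_slots_smallint(keys::Vector{<:Integer}, lo::Integer, hi::Integer)
    ngroups = hi - lo + 1
    groups = Vector{Int}(undef, length(keys))
    @inbounds for i in eachindex(keys)
        groups[i] = keys[i] - lo + 1  # slot is just an offset into 1:ngroups
    end
    return groups, ngroups
end

groups, n = group_slots_smallint([3, 1, 3, 2], 1, 3)
# groups == [3, 1, 3, 2], n == 3
```

This is O(N) with one pass and perfect cache behavior on the slot array, which is why a specialized row_group_slots method for small-integer columns pays off.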

BTW, we’ve just merged my PR adding optimized methods for grouped reductions, making sums about 6-10 times faster (excluding time to perform grouping).
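The core idea behind such fused grouped reductions can be sketched like this (a simplified illustration, not the actual DataFrames implementation): given precomputed group indices, one pass accumulates directly into per-group accumulators instead of materializing a sub-frame per group.

```julia
# Single-pass grouped sum: accumulate each value into its group's slot.
function grouped_sum(groups::Vector{Int}, ngroups::Int, v::Vector{Float64})
    sums = zeros(ngroups)
    @inbounds for i in eachindex(v)
        sums[groups[i]] += v[i]
    end
    return sums
end

grouped_sum([1, 2, 1, 2], 2, [1.0, 2.0, 3.0, 4.0])  # [4.0, 6.0]
```

Avoiding the per-group allocation and function-call overhead is where most of the 6-10x comes from.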

There seem to have been massive improvements to the group-bys! Congrats!
