Timings for different groupby approaches

I am running some simulations at work, and I wanted to experiment with different approaches for performing a groupby and then running the simulation. The idea is to run a simulation for each group (identified by an id column) using multiple datasets all keyed by the same id column. There are about 5 DataFrames, and only one of them has one row per id, so running 5 joins seemed wasteful.

Using some advice from the Slack #Data channel, I came up with the following four approaches and their timings on a mock dataset. Here’s the code:

using DataFrames
using Query
using Tables
using BenchmarkTools

randparams(n) = join.([rand('a':'z', 100) for _ in 1:n])

function sim(params, data)
    total = 0.0
    for p in params, x in data
        total += Int(p)*x
    end
    return total
end

function datan(n_ids, n_data, n_df)
    ids = collect(1:n_ids)
    data_ids = collect(1:n_data)
    a = DataFrame(id=ids, p=randparams(n_ids))
    b = [
        DataFrame(
            :id => vcat([ids for _ in 1:n_data]...),
            :data_id => [i for i in data_ids for _ in 1:n_ids],
            Symbol("x$j") => rand(n_ids*n_data),
        )
        for j in 1:(n_df-1)
    ]
    return a, b
end

function simgroups_naivejoin(a, b)
    ab = join(a, b[1]; on=:id)
    for df in b[2:end]
        ab = join(ab, df; on=[:id, :data_id])
    end
    result = Dict()
    for g in groupby(ab, :id)
        params = g.p[1]
        result[g.id[1]] = 0.0
        for i in 1:length(b)
            data = g[!, Symbol("x$i")]
            result[g.id[1]] += sim(params, data)
        end
    end
    return result
end

function simgroups_groupbykeys(a, b)
    ag = groupby(a, :id)
    bg = [groupby(df, :id) for df in b]
    a_gk_by_pk = Dict(tuple(k...) => k for k in keys(ag))
    b_gk_by_pk = [Dict(tuple(k...) => k for k in keys(dfg)) for dfg in bg]
    result = Dict()
    for ka in keys(a_gk_by_pk)
        if all(haskey(gk_by_pk, ka) for gk_by_pk in b_gk_by_pk)
            a_gk = a_gk_by_pk[ka]
            params = ag[a_gk].p[1]
            result[ka[1]] = 0.0
            for (i, gk_by_pk, dfg) in zip(1:length(bg), b_gk_by_pk, bg)
                bgkey = gk_by_pk[ka]
                data = dfg[bgkey][!, Symbol("x$i")]
                result[ka[1]] += sim(params, data)
            end
        end
    end
    return result
end

function simgroups_groupjoinselect(a, b)
    ab = a |> @groupjoin(
        b[1], _.id, _.id,
        (x,y) -> (id=x.id, params=x.p, data=(y,)))
    for (i, df) in zip(2:length(b), b[2:end])
        ab = ab |> @groupjoin(
            df, _.id, _.id,
            (x,y) -> (id=x.id, params=x.params, data=(x.data..., y)))
    end
    result = Dict()
    for g in ab
        result[g.id] = 0.0
        for (i, gdata) in zip(1:length(b), g.data)
            x_col = Symbol("x$i")
            data = vec(Tables.matrix(Tables.select(gdata, x_col)))
            result[g.id] += sim(g.params, data)
        end
    end
    return result
end

function simgroups_indexview(a, b)
    result = Dict()
    mask = [Array{Bool}(undef, size(df)[1]) for df in b]
    for g in groupby(a, :id)
        id = g.id[1]
        params = g.p[1]
        result[id] = 0.0
        for i in 1:length(b)
            df = b[i]
            mask[i] .= df.id .== id
            data = view(df, mask[i], Symbol("x$i"))
            result[id] += sim(params, data)
        end
    end
    return result
end

function benchmark(n, m)
    @assert m >= 2
    a, b = datan(n, n, m)
    println("naive join")
    @btime simgroups_naivejoin($a, $b)
    println("groupby keys")
    @btime simgroups_groupbykeys($a, $b)
    println("groupjoin select")
    @btime simgroups_groupjoinselect($a, $b)
    println("index view")
    @btime simgroups_indexview($a, $b)
end

And here are the results:

julia> benchmark(1000, 5)
naive join
  4.281 s (6077351 allocations: 855.46 MiB)
groupby keys
  1.983 s (141394 allocations: 159.02 MiB)
groupjoin select
  1.323 s (171862 allocations: 255.23 MiB)
index view
  9.774 s (99897 allocations: 37.66 MiB)

The Query plus Tables case (groupjoinselect) also provides a pretty flexible approach for handling the data, so it’s pretty encouraging for other use cases too. Obviously, the results will vary a lot based on the number of columns and how they are used, but I thought others might find this interesting too.

In Python, I’ve mostly used approach one or two, but that’s largely because of performance considerations, since Pandas is usually the fastest option, and its indexing API is not bad once you’re used to it.
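For reference, approach one (the naive join) in pandas looks roughly like the sketch below. This is an illustrative analogue, not code from the actual simulations: the function name `simgroups_naivejoin_pd` is made up, the sim step is inlined, and it assumes the same column layout (`id`, `data_id`, `p`, `x1`…`xN`) as the Julia mock data.

```python
import pandas as pd

def simgroups_naivejoin_pd(a, b):
    """Rough pandas analogue of the naive-join approach: merge all
    DataFrames on the key columns, then group once and simulate."""
    ab = a.merge(b[0], on="id")
    for df in b[1:]:
        ab = ab.merge(df, on=["id", "data_id"])
    result = {}
    for gid, g in ab.groupby("id"):
        params = g["p"].iloc[0]
        total = 0.0
        for i in range(1, len(b) + 1):
            data = g[f"x{i}"].to_numpy()
            # inline stand-in for sim(): codepoint of each param char
            # times each data value, summed
            total += sum(ord(p) * x for p in params for x in data)
        result[gid] = total
    return result
```

As in the Julia version, the repeated merges materialize one wide DataFrame before the single groupby pass, which is where the extra allocations come from.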

This example is part of my process of learning Julia, so that I might use it on the job for something more substantial.


This post is still very relevant: Group-by performance benchmarks and recommendations


That’s a very interesting post. I bookmarked it for future reference as I work on some different applications. There are so many different scenarios for groupby and join.