Timings for different groupby approaches

I am running some simulations at work, and I wanted to experiment with different approaches for performing a groupby and then running the simulation. The idea is to run a simulation for each group (identified by an id column) using multiple datasets all keyed by the same id column. There are about 5 DataFrames, and one of them has only one row per id, so running 5 joins seemed wasteful.

Using some advice from the Slack #data channel, I came up with the following four approaches and their timings on a mock dataset. Here’s the code:

using DataFrames
using Query
using Tables
using BenchmarkTools

randparams(n) = join.([rand('a':'z', 100) for _ in 1:n])

function sim(params, data)
    total = 0.0
    for p in params, x in data
        total += Int(p)*x
    end
    return total
end

function datan(n_ids, n_data, n_df)
    ids = collect(1:n_ids)
    data_ids = collect(1:n_data)
    a = DataFrame(id=ids, p=randparams(n_ids))
    b = [
        DataFrame(
            :id => vcat([ids for _ in 1:n_data]...),
            :data_id => [i for i in data_ids for _ in 1:n_ids],  # each data_id repeated once per id
            Symbol("x$j") => rand(n_ids*n_data),
        )
        for j in 1:(n_df-1)
    ]
    a,b
end

function simgroups_naivejoin(a, b)
    ab = innerjoin(a, b[1]; on=:id)  # DataFrames 1.0 renamed `join` to `innerjoin`
    for df in b[2:end]
        ab = innerjoin(ab, df; on=[:id, :data_id])
    end
    result = Dict()
    for g in groupby(ab, :id)
        params = g.p[1]
        result[g.id[1]] = 0.0
        for i in 1:length(b)
            data = g[!, Symbol("x$i")]
            result[g.id[1]] += sim(params, data)
        end
    end
    return result
end

function simgroups_groupbykeys(a, b)
    ag = groupby(a, :id)
    bg = [groupby(df, :id) for df in b]
    a_gk_by_pk = Dict(tuple(k...) => k for k in keys(ag))
    b_gk_by_pk = [Dict(tuple(k...) => k for k in keys(dfg)) for dfg in bg]
    result = Dict()
    for ka in keys(a_gk_by_pk)
        if all(haskey(gk_by_pk, ka) for gk_by_pk in b_gk_by_pk)
            a_gk = a_gk_by_pk[ka]
            params = ag[a_gk].p[1]
            result[ka[1]] = 0.0
            for (i, gk_by_pk, dfg) in zip(1:length(bg), b_gk_by_pk, bg)
                bgkey = gk_by_pk[ka]
                data = dfg[bgkey][!, Symbol("x$i")]
                result[ka[1]] += sim(params, data)
            end
        end
    end
    return result
end

function simgroups_groupjoinselect(a, b)
    ab = a |> @groupjoin(
        b[1], _.id, _.id,
        (x,y) -> (id=x.id, params=x.p, data=(y,)))
    for df in b[2:end]
        ab = ab |> @groupjoin(
            df, _.id, _.id,
            (x,y) -> (id=x.id, params=x.params, data=(x.data..., y)))
    end
    result = Dict()
    for g in ab
        result[g.id] = 0.0
        for (i, gdata) in zip(1:length(b), g.data)
            x_col = Symbol("x$i")
            data = vec(Tables.matrix(Tables.select(gdata, x_col)))
            result[g.id] += sim(g.params, data)
        end
    end
    return result
end


function simgroups_indexview(a, b)
    result = Dict()
    mask = [Vector{Bool}(undef, size(df, 1)) for df in b]
    for g in groupby(a, :id)
        id = g.id[1]
        params = g.p[1]
        result[g.id[1]] = 0.0
        for i in 1:length(b)
            df = b[i]
            mask[i] .= df.id .== id
            data = view(df, mask[i], Symbol("x$i"))
            result[g.id[1]] += sim(params, data)
        end
    end
    return result
end

function benchmark(n, m)
    @assert m >= 2
    a, b = datan(n, n, m)
    println("naive join")
    @btime simgroups_naivejoin($a, $b)
    println("groupby keys")
    @btime simgroups_groupbykeys($a, $b)
    println("groupjoin select")
    @btime simgroups_groupjoinselect($a, $b)
    println("index view")
    @btime simgroups_indexview($a, $b)
    println("DONE")
end

And here are the results:

julia> benchmark(1000, 5)
naive join
  4.281 s (6077351 allocations: 855.46 MiB)
groupby keys
  1.983 s (141394 allocations: 159.02 MiB)
groupjoin select
  1.323 s (171862 allocations: 255.23 MiB)
index view
  9.774 s (99897 allocations: 37.66 MiB)
DONE

The Query plus Tables case (groupjoinselect) also provides a pretty flexible approach for handling the data, so it’s encouraging for other use cases too. Obviously, the results will vary a lot based on the number of columns and how they are used, but I thought others might find this interesting too.

In Python, I’ve mostly done approach one or two, but that’s largely because of performance considerations, since Pandas is usually the fastest option, and its indexing API is not bad once you’re used to it.
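For comparison, here is a rough pandas sketch of what I mean by approach one (the naive join): chain the merges on the key columns, then group once on id. The column names and tiny toy data below are made up to mirror the mock dataset, and sim is the same toy "simulation" as the Julia version.

```python
import pandas as pd

# toy stand-ins for the params frame `a` and the data frames b1, b2
a = pd.DataFrame({"id": [1, 2], "p": ["ab", "cd"]})
b1 = pd.DataFrame({"id": [1, 1, 2, 2], "data_id": [1, 2, 1, 2],
                   "x1": [0.1, 0.2, 0.3, 0.4]})
b2 = pd.DataFrame({"id": [1, 1, 2, 2], "data_id": [1, 2, 1, 2],
                   "x2": [0.5, 0.6, 0.7, 0.8]})

# chain the merges, then group once on id
ab = a.merge(b1, on="id").merge(b2, on=["id", "data_id"])

def sim(params, data):
    # same toy "simulation" as the Julia version: char code times value
    return sum(ord(p) * x for p in params for x in data)

result = {gid: sim(g["p"].iloc[0], g["x1"]) + sim(g["p"].iloc[0], g["x2"])
          for gid, g in ab.groupby("id")}
```

It reads nicely, but like the Julia naive-join version it materializes the fully joined frame before grouping.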

This example is part of my process of learning Julia, so I might use it on the job for something more substantial.


This post is still very relevant: Group-by performance benchmarks and recommendations - #9 by xiaodai


That’s a very interesting post. I bookmarked it for future reference as I work on some different applications. There are so many different scenarios for groupby and join.