Who does "better" than DataFrames?

In the grouping algorithm, an open-addressing hash table with linear probing is used to find groups. Look at row_group_slots! in src/groupeddataframe/utils.jl of DataFrames.
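For readers unfamiliar with the technique, here is a minimal sketch of grouping with an open-addressing hash table and linear probing — the general idea behind row_group_slots!, not the actual DataFrames implementation:

```julia
# Sketch only: assign each element of `keys` a group id using an
# open-addressing hash table with linear probing.
function group_ids(keys::AbstractVector)
    n = length(keys)
    sz = 1
    while sz < 2n; sz <<= 1; end      # power-of-two table, load factor <= 0.5
    mask = sz - 1
    slots = zeros(Int, sz)            # 0 = empty slot; otherwise a group id
    groupkeys = similar(keys, 0)      # representative key for each group
    ids = Vector{Int}(undef, n)
    for i in 1:n
        k = keys[i]
        h = ((hash(k) % Int) & mask) + 1
        while true
            g = slots[h]
            if g == 0                          # empty slot: start a new group
                push!(groupkeys, k)
                slots[h] = length(groupkeys)
                ids[i] = length(groupkeys)
                break
            elseif isequal(groupkeys[g], k)    # slot holds the same key
                ids[i] = g
                break
            end
            h = (h & mask) + 1                 # linear probing: try next slot
        end
    end
    ids, groupkeys
end
```

For example, group_ids(["a","b","a","c","b"]) returns ([1, 2, 1, 3, 2], ["a", "b", "c"]). The real code is considerably more involved (it works on multiple columns and reuses buffers), but the probing loop is the core of it.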

On my computer, even unique(s) on its own is slower than the whole combine statement (no threading), which is surprising. This is worth digging into, in order to make the rest of the ecosystem more optimized.
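The thread doesn't show how s and r were built; a hypothetical setup consistent with a key range of 1:1_000_000 might look like this:

```julia
using Random

# Hypothetical setup (assumed, not shown in the thread): many rows of
# integer group keys drawn from 1:1_000_000, plus random values in [0, 1).
Random.seed!(1)
s = rand(1:1_000_000, 10_000_000)
r = rand(length(s))

# With DataFrames loaded, the benchmarked frame would then be:
# df = DataFrame(s = s, r = r)
```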

UPDATE:

Actually, after a bit more digging, it turns out DataFrames uses a few more assumptions which allow it to make things faster. Essentially, it assumes the groups defined by the :s column are exactly the integers between its extrema, (1, 1_000_000), which makes any sorting and hashing unnecessary and allows it to construct the result immediately. It’s a useful optimization and isn’t cheating, but it doesn’t have the generality of any of the other methods attempted here.

The relevant code in DataFrames is DataFrames.refpool_and_array(df.s).

Here is a benchmark showing that such an optimization, applied in a custom-tailored way, achieves double the speed of DataFrames:

function testy(s, r)
    # Since s holds integer keys covering 1:1_000_000, the key itself
    # serves as an index: no hashing or sorting needed.
    res = fill(0.0, 1_000_000)   # 0.0 works as the identity here since r is non-negative
    @inbounds for i in eachindex(s)
        res[s[i]] = max(res[s[i]], r[i])
    end
    pairs(res)
end

and

julia> @btime combine(groupby(df, :s),:r=>maximum; threads=false);
  162.666 ms (334 allocations: 55.33 MiB)

julia> @btime testy($s,$r);
  80.950 ms (2 allocations: 7.63 MiB)