Who does "better" than DataFrames?

Agreed - quite impressive. Have you tried doing that row index collection? It’s not obvious to me why it would be so much better (cache locality?), but I’d love to see if that’s the missing piece.

Depending on the context, a lighter-weight table seems to outperform DataFrames when the cost of constructing the table is included:

Setup
const s=repeat(1:10^6, inner=4)
const t=repeat(1:4, 10^6)
const r=rand(4*10^6)

using DataFrames

const df = DataFrame(;s,t,r)

function df_trial(s,t,r)
    df = DataFrame(;s,t,r)
    combine(groupby(df, :s), :r=>maximum)
end


using TypedTables
const tbl = Table(s=s,t=t,r=r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

# This is the same as @johnmyleswhite, but uses TypedTables instead
function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    Table(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
        )
end
    
function tbl_trial(s,t,r)
    tbl = Table(s=s,t=t,r=r)
    custom_maximum(tbl)
end
    
# warm up (compile) each method before benchmarking
tbl_trial(s,t,r)
df_trial(s,t,r)
combine(groupby(df, :s), :r=>maximum)
custom_maximum(tbl)
# create and group a dataframe
julia> @benchmark df_trial($s,$t,$r)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  201.259 ms … 234.272 ms  ┊ GC (min … max): 4.96% … 9.57%
 Time  (median):     217.930 ms               ┊ GC (median):    8.59%
 Time  (mean ± σ):   215.847 ms ±  10.645 ms  ┊ GC (mean ± σ):  7.98% ± 2.94%  

  ▃         ▃                       ▃     █         ▃
  █▇▇▁▁▁▇▇▁▇█▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▇▁▇▁▁█▁▁▁▁▁█▁▇▁▇▁▁▁▁▁█▁▁▁▁▁▁▁▇▁▇ ▁
  201 ms           Histogram: frequency by time          234 ms <

 Memory estimate: 311.97 MiB, allocs estimate: 308.

# create and group a typedtable
julia> @benchmark tbl_trial($s,$t,$r)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  140.362 ms … 165.967 ms  ┊ GC (min … max): 0.00% … 5.27%
 Time  (median):     151.275 ms               ┊ GC (median):    3.38%
 Time  (mean ± σ):   151.464 ms ±   5.304 ms  ┊ GC (mean ± σ):  2.77% ± 1.79%  

                     █  ▃ ▃ █▃   ▃▃                           ▃
  ▇▁▁▁▁▇▁▁▁▁▇▇▁▁▁▁▇▇▇█▇▁█▇█▇██▇▇▁██▇▁▇▁▁▁▇▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  140 ms           Histogram: frequency by time          166 ms <

 Memory estimate: 80.43 MiB, allocs estimate: 58.

# just group an existing DataFrame
julia> @benchmark combine(groupby($df, :s), :r=>maximum)      
BenchmarkTools.Trial: 157 samples with 1 evaluation.
 Range (min … max):  27.072 ms … 41.158 ms  ┊ GC (min … max): 0.00% … 22.72%
 Time  (median):     30.126 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.019 ms ±  3.704 ms  ┊ GC (mean ± σ):  8.71% ±  9.68%

      ▃▂▄█▇▂ ▄▂                     ▂  ▂
  ▆▁▅███████▇██▇▁▃▆▃▃▆▃▅▃▃▃▆▃▃▁▃▁▅▃▅█▇▇█▆▅▁▅█▇▃▆▃█▁▁▁▁▁▁▃▃▁▁▃ ▃
  27.1 ms         Histogram: frequency by time        40.6 ms <

 Memory estimate: 55.33 MiB, allocs estimate: 327.

# Just group an existing Table
julia> @benchmark custom_maximum($tbl)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  142.760 ms … 172.040 ms  ┊ GC (min … max): 0.00% … 2.32%
 Time  (median):     150.780 ms               ┊ GC (median):    3.56%
 Time  (mean ± σ):   151.268 ms ±   5.719 ms  ┊ GC (mean ± σ):  2.94% ± 1.97%

                ▂     █
  ▅█▁▅▁▅▅▅▅▅▁▅▅▅█▅▅█▅▅█▁▁▁▁▅▅▁█▁▅▁▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  143 ms           Histogram: frequency by time          172 ms <

 Memory estimate: 80.43 MiB, allocs estimate: 58.

I’m curious whether anybody has tried InMemoryDatasets.jl. According to [ANN] A new lightning fast package for data manipulation in pure Julia, it is faster than DataFrames.

Not in this specific case:
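(The setup for ds isn’t shown in the transcript; presumably it holds the same columns, e.g. something like the following, with the combine/groupby calls qualified because both packages export them.)

using InMemoryDatasets
ds = InMemoryDatasets.Dataset(s = s, t = t, r = r)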
julia> @btime InMemoryDatasets.combine(InMemoryDatasets.groupby(ds, :s), :r=>maximum)
  75.507 ms (575 allocations: 131.65 MiB)
1000000×2 Dataset
     Row │ s         maximum_r 
         │ identity  identity
         │ Int64?    Float64?
─────────┼─────────────────────
       1 │        1   0.847286
       2 │        2   0.851498
       3 │        3   0.958781
       4 │        4   0.982241
       5 │        5   0.886476
       6 │        6   0.972176
       7 │        7   0.838771
       8 │        8   0.938864
       9 │        9   0.569116
      10 │       10   0.908266
      11 │       11   0.841863
      12 │       12   0.884953
      13 │       13   0.795271
      14 │       14   0.800355
      15 │       15   0.79398
      16 │       16   0.665463
      17 │       17   0.824255
      18 │       18   0.857776
    ⋮    │    ⋮          ⋮
  999984 │   999984   0.895037
  999985 │   999985   0.894406
  999986 │   999986   0.71278
  999987 │   999987   0.707465
  999988 │   999988   0.890672
  999989 │   999989   0.71128
  999990 │   999990   0.736819
  999991 │   999991   0.776283
  999992 │   999992   0.21387
  999993 │   999993   0.753587
  999994 │   999994   0.838413
  999995 │   999995   0.898894
  999996 │   999996   0.996568
  999997 │   999997   0.891596
  999998 │   999998   0.761242
  999999 │   999999   0.976949
 1000000 │  1000000   0.789219
            999965 rows omitted

julia> @btime DataFrames.combine(DataFrames.groupby(df, :s), :r=>maximum)
  32.127 ms (350 allocations: 55.33 MiB)
1000000×2 DataFrame
     Row │ s        r_maximum 
         │ Int64    Float64
─────────┼────────────────────
       1 │       1   0.847286
       2 │       2   0.851498
       3 │       3   0.958781
       4 │       4   0.982241
       5 │       5   0.886476
       6 │       6   0.972176
       7 │       7   0.838771
       8 │       8   0.938864
       9 │       9   0.569116
      10 │      10   0.908266
      11 │      11   0.841863
      12 │      12   0.884953
      13 │      13   0.795271
      14 │      14   0.800355
      15 │      15   0.79398
      16 │      16   0.665463
      17 │      17   0.824255
      18 │      18   0.857776
    ⋮    │    ⋮         ⋮
  999983 │  999983   0.585042
  999984 │  999984   0.895037
  999985 │  999985   0.894406
  999986 │  999986   0.71278
  999987 │  999987   0.707465
  999988 │  999988   0.890672
  999989 │  999989   0.71128
  999990 │  999990   0.736819
  999991 │  999991   0.776283
  999992 │  999992   0.21387
  999993 │  999993   0.753587
  999994 │  999994   0.838413
  999995 │  999995   0.898894
  999996 │  999996   0.996568
  999997 │  999997   0.891596
  999998 │  999998   0.761242
  999999 │  999999   0.976949
 1000000 │ 1000000   0.789219
           999964 rows omitted

but apart from the implementation aspects, what algorithm(s) are used?

The grouping algorithm uses an open-addressing hash table with linear probing to find groups. Look at row_group_slots! in the src/groupeddataframe/utils.jl file of DataFrames.
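For intuition, here is a minimal self-contained sketch of that idea (my own toy version, not DataFrames’ actual row_group_slots!): each key hashes to a slot, and on a collision we probe forward one slot at a time until we find either the same key or an empty slot.

function group_ids(keys::AbstractVector)
    nslots = 4 * length(keys)             # low load factor keeps probe chains short
    slotkey = Vector{Int}(undef, nslots)  # index of the key stored in each slot
    slotgrp = zeros(Int, nslots)          # group id per slot; 0 means empty
    ids = Vector{Int}(undef, length(keys))
    ngroups = 0
    for i in eachindex(keys)
        slot = Int(mod(hash(keys[i]), nslots)) + 1
        while true
            if slotgrp[slot] == 0                         # empty slot: start a new group
                ngroups += 1
                slotkey[slot] = i
                slotgrp[slot] = ngroups
                ids[i] = ngroups
                break
            elseif isequal(keys[slotkey[slot]], keys[i])  # occupied by the same key
                ids[i] = slotgrp[slot]
                break
            end
            slot = slot == nslots ? 1 : slot + 1          # linear probing with wrap-around
        end
    end
    return ids, ngroups
end

Once every row has a group id, per-group maxima can be accumulated into a plain ngroups-long vector, with no further hashing.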

On my computer, even unique(s) is slower than the whole combine statement (no threading). That is surprising, and worth digging into in order to make the rest of the ecosystem more optimized.

UPDATE:

Actually, after a bit more digging, it turns out DataFrames uses a few more assumptions which allow it to make things faster. Essentially, it detects that the groups defined by the :s column are the integers between its extrema, (1, 1_000_000), which makes any sorting and hashing unnecessary and allows the result to be constructed immediately. It’s a useful optimization and isn’t cheating, but it doesn’t have the generality of the other methods attempted here.

The relevant code in DataFrames is DataFrames.refpool_and_array(df.s).

Here is a benchmark showing that such an optimization, applied in a custom-tailored way, achieves double the speed of DataFrames:

function testy(s,r)
    # exploits the same assumptions: keys are exactly the integers 1:1_000_000,
    # and the values are nonnegative, so 0.0 works as the initial maximum
    res = fill(0.0, 1_000_000)
    @inbounds for i in eachindex(s)
        res[s[i]] = max(res[s[i]], r[i])
    end
    pairs(res)
end

and

julia> @btime combine(groupby(df, :s),:r=>maximum; threads=false);
  162.666 ms (334 allocations: 55.33 MiB)

julia> @btime testy($s,$r);
  80.950 ms (2 allocations: 7.63 MiB)

but does this “trick” work for non-numeric columns as well?

Nope. Easy enough to check.

using Random
using OrderedCollections
using DataFrames

r = rand(4*10^6)
st = shuffle(repeat([randstring(5) for _ in 1:10^6],inner=4))
df = DataFrame(; st, r)

function teststr(s,r)
    res = sizehint!(Dict{String, Float64}(),length(s)÷4)
    @inbounds for i in 1:length(s)
        mergewith!(max, res, LittleDict((s[i],),(r[i],)))
    end
    res
end

and (@time after a compilation run):

julia> combine(groupby(df, :st),:r=>maximum; threads=false);

julia> teststr(st,r);

julia> @time combine(groupby(df, :st),:r=>maximum; threads=false);
  1.555349 seconds (286 allocations: 148.879 MiB)

julia> @time teststr(st,r);
  1.633208 seconds (7 allocations: 34.001 MiB)

Finally, DataFrames manages not to be so darn fast :wink:


Is it really effective to use sizehint!() instead of just declaring an empty Dict?
But above all, I’m interested in the strategy this function implements.

Good question. I don’t think it really matters here, but it’s a good habit to provide this information to the code. Perhaps a future implementation of Dict will be more sensitive to it.

The strategy is pretty simple. There is a little trick with LittleDict and mergewith! which avoids computing the hash twice (!) and does the max bit.
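To make the trick concrete, here is a minimal toy illustration (mine, not the code above): each one-element LittleDict feeds a single pair into mergewith!, and the poster’s point is that the specialized mergewith! for Dict can locate the slot for each incoming pair once, instead of the separate get-then-setindex! lookups of the naive loop.

using OrderedCollections
res = Dict("a" => 1.0)
mergewith!(max, res, LittleDict(("a",), (2.0,)))  # folds the pair in with max
res["a"]  # 2.0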

Thank you!
You actually anticipated a further question (with the previous one I was referring to sizehint!).
At a very high level, your solution looks like this one, but is almost 2x faster.
You attribute this advantage to avoiding a repeated hash computation (which I have only the vaguest idea about).
Can you tell me where in mgr41 this calculation takes place?

function mgr41(s,r)
    d = Dict{Int, Float64}()
    for i in eachindex(s)
        ri = r[i]
        e = s[i]
        d[e] = max(get(d, e, ri), ri)
    end
    d
end

DataFrames devs spent so much time and effort optimizing performance in various “special” cases that these cases stop being “special” and encountering them becomes more typical (:

It would be really great if the wider Julia ecosystem (including Base!) could benefit from this effort. It’s a shame these optimizations are only available for a single type from a single package.
The more general interface, still simple but applicable to a wide range of collection types, is there — but the implementation is not always as performant as in DataFrames.

As far as I’m aware, the most performant groupby-style functions not tied to specific container types are in FlexiGroups (I am the author):

using StructArrays, FlexiGroups, DataPipes  # @p is the pipe macro from DataPipes

tbl = StructArray(
	s=repeat(1:10^6, inner=4),
	t=repeat(1:4, 10^6),
	r=rand(4*10^6),
)
@p let
	tbl
	groupview(_.s)
	map(maximum(_.r))
end

This reads intuitively and works reasonably fast — but it always uses the straightforward dict-based algorithm, making it a few times slower than DataFrames for numbers.


That line does hashing twice – once to run get, and then again to run d[e] = ..., which is setindex!.
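One way to pay for the hash only once without mergewith! is to box the value, so a single get! lookup returns something that can be updated in place. A sketch (mgr41_onehash is my hypothetical name; it trades one Ref allocation per distinct key for the second lookup):

function mgr41_onehash(s, r)
    d = Dict{Int, Base.RefValue{Float64}}()
    for i in eachindex(s)
        box = get!(() -> Ref(-Inf), d, s[i])  # one hash lookup per element
        box[] = max(box[], r[i])              # in-place update, no second lookup
    end
    d
end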


A problem with the mergewith! method is the need to generate a ton of (thankfully, nearly costless) LittleDicts. To avoid this, we can define an AbstractDict which isn’t actually a Dict, because it breaks the contract of unique keys. Perhaps there is a better way to do this, but defining another AbstractDict is one:

struct FakeDict{K,V} <: AbstractDict{K,V}
    k::Vector{K}
    v::Vector{V}
end
Base.keys(fd::FakeDict) = fd.k
Base.values(fd::FakeDict) = fd.v
Base.IteratorSize(::Type{<:FakeDict}) = Base.HasLength()
Base.IteratorEltype(::Type{<:FakeDict}) = Base.HasEltype()
Base.eltype(::Type{FakeDict{K,V}}) where {K,V} = Pair{K,V}
Base.length(fd::FakeDict) = length(fd.k)
Base.iterate(fd::FakeDict, s::Int=1) =
    @inbounds s > length(fd.k) ? nothing : (fd.k[s] => fd.v[s], s + 1)

Armed with this FakeDict, we can:

using Random
using BenchmarkTools
using DataFrames
s = shuffle(repeat([randstring(5) for _ in 1:10^6],inner=4));
r = rand(4*10^6);

fd = FakeDict(s,r);
df = DataFrame(;s,r);

# warm up the methods
combine(groupby(df, :s),:r=>maximum; threads=false);
mergewith!(max, sizehint!(Dict{String, Float64}(),10^6), fd);

and then measure:

julia> @time mergewith!(max, sizehint!(Dict{String, Float64}(),10^6), fd);
  1.696373 seconds (7 allocations: 34.001 MiB)

julia> @time combine(groupby(df, :s),:r=>maximum; threads=false);
  1.622105 seconds (286 allocations: 148.880 MiB)

if you wanted to make such an expression work, what should you implement?
setindex!, getindex, others…?

mergewith!(max, FakeDict{String, Float64}(), fd)

PS: perhaps the biggest problem would be the setindex!() function.
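For what it’s worth, a hypothetical sketch of the methods mergewith! would need on the destination (the generic fallback calls haskey, getindex, and setindex!). Without hashing, each of these is a linear scan, so the whole merge becomes quadratic, which is presumably why it isn’t worth doing:

FakeDict{K,V}() where {K,V} = FakeDict(K[], V[])
Base.haskey(fd::FakeDict, k) = !isnothing(findlast(isequal(k), fd.k))
function Base.getindex(fd::FakeDict, k)
    i = findlast(isequal(k), fd.k)
    isnothing(i) && throw(KeyError(k))
    fd.v[i]
end
function Base.setindex!(fd::FakeDict, v, k)
    i = findlast(isequal(k), fd.k)      # linear scan in place of a hash lookup
    if isnothing(i)
        push!(fd.k, k); push!(fd.v, v)  # append a new key
    else
        fd.v[i] = v                     # overwrite the latest occurrence
    end
    fd
end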

For any columnar table type though, it should be fast (non-copying) to wrap it in a DataFrame (DataFrame(table; copycols=false)), perform whatever manipulations, and then convert it to whatever other table type you want. DataFrames are pretty light, so it doesn’t seem like a big problem to use them as an interface for one piece of code, even if you want another table type for a downstream computation.
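As a minimal sketch of that round trip, reusing the TypedTables tbl from the first post (the intermediate names here are mine):

using DataFrames, TypedTables
df2 = DataFrame(tbl; copycols=false)            # wrap the columns, no copying
out = combine(groupby(df2, :s), :r => maximum)  # any DataFrames manipulation
Table(out)                                      # convert to the preferred table type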

A big question would be what this FakeDict should achieve in this setting, as without unique keys it is mostly just a couple of vectors in a struct. The idea in the code was to use it to supply mergewith! with the pairs s=>r, which are then put into a real Dict that maintains the unique-keys guarantee.

That only works for flat tables, right? The power of Julia is that arbitrary types work just as well as builtins:

some_func(x::MyStruct) = ...
data = [MyStruct(x=1, y=2), ...]
data |> group(some_func) |> map(first)

All structs seamlessly propagate through the whole data manipulation pipeline. They can also be nested.
Converting to a flat table type for intermediate computations probably means limiting oneself to just namedtuples.

In this particular case you can use the structure of the s index in df to beat DataFrames:

using DataFrames
using BenchmarkTools

s=repeat(1:10^6, inner=4)
t=repeat(1:4, 10^6)
r=rand(4*10^6)

df=DataFrame(;s,t,r)
@btime combine(groupby(df, :s),:r=>maximum)
@btime maximum(reshape(r, 4, :), dims=1)

@assert combine(groupby(df, :s),:r=>maximum).r_maximum == vec(maximum(reshape(r, 4, :), dims=1))

I get: 45.615 ms and 19.339 ms

Some query engines in relational databases can make these kinds of optimizations if the s column is the result of joining smaller tables that the database understands.

One thing that I discovered when running benchmarks on this is that DataFrames groupby/combine can beat Base.unique, so any optimizations under the hood could possibly benefit Base Julia:

@btime combine(groupby(df, :s), :s => first)
@btime unique(s)

@assert combine(groupby(df, :s), :s => first).s_first == unique(s)

Which gives: 23.328 ms and 84.853 ms


Not quite sure what you mean by “flat table” here, but for conversion to a DataFrame to be non-copying you do need a column-oriented table, and this is a row-oriented table, meaning the data is stored as a collection of rows.
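A small illustration of the difference (my toy example): extracting columns from a row table has to materialize fresh vectors, whereas a column table already holds them.

using Tables
rows = [(s = 1, r = 0.5), (s = 2, r = 0.7)]  # a Vector of NamedTuples is a row table
Tables.columntable(rows)                     # copies into (s = [1, 2], r = [0.5, 0.7])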

But I don’t really see why preserving types at all costs is more important (or more related to “the power of Julia”) than being able to compose functionality. In a tabular world, if everyone writes functions that accept any table and emit some table, then everything will compose fine (and can be perfectly type stable). It doesn’t seem very advantageous to require that every function also emit the exact same type that was input. In some cases, an algorithm might be much more efficient to implement with a columnar structure; so if you have a row table like that, and require every function to emit the exact same type of row table, then internally the function might need to convert to a columnar structure, perform the manipulations, and convert back. But what if the next function also works better on columns? Then it needs to do the same thing again. Why not just emit the intermediate columnar table, and let the next function decide whether it wants a row-oriented or column-oriented table, instead of always converting back to the input type?

edit: to make it more concrete, I’m proposing workflows like this:

using Tables, DataFrames, Test
# Some generic processing functions, maybe written in another package

# Accepts any table, returns a DataFrame
function first_from_each_group(table)
    df = DataFrame(table; copycols=false)
    return combine(first, groupby(df, :x))
end

# Accepts any table, processes each row. `f` should accept NamedTuples.
function iterate_over_rows(f, table)
    foreach(f, Tables.namedtupleiterator(table))
end

# My code to do a specific task
struct MyStruct
    x::Float64
    y::Matrix{Float32}
    z::Vector{Int}
end
randstruct() = MyStruct(rand([0.25, 0.5]), rand(Float32, 10, 10), rand(1:10, 10))

function my_code()
    # Create my table; maybe naturally row oriented bc I'm reading from a CSV or log file or
    # running a simulation producing one row at a time
    table = [randstruct(), randstruct(), randstruct()]

    # Do some processing using generic code that accepts any table & emits some table
    firsts = first_from_each_group(table)
    iterate_over_rows(firsts) do row
        # Do something that requires my specific type wrapper, maybe serialization, or a strictly-typed function
        println(MyStruct(row...))
    end
end
