All the ways to group-reduce sorted vectors. Ideas?

I have two vectors of the same length feature and target and I wish to do a group-reduce.

We also have that feature is SORTED and possibly contain missing.

I wish to do a group by on feature and sum(target). In SQL form I am doing

group by

In DataFrames.jl this can be expressed directly using groupby-combine but I can’t seem to find any algorithm that exploits the fact that feature is sorted. So I hand-rolled my own (code at bottom) and it’s about 5x faster, I also hand-rolled a foldl which has similar speeds.


I was trying to benchmark some other methods so I was looking into ThreadsX.jl and LazyGroupBy.jl but I can’t find anything that exploits the sorted-ness of feature. LazyGroupBy.jl can grouped by something and apply a function to each group but it can’t be used to group by feature and then operate on target.

I tried to get Transducers.jl working, but can’t find the right tool. I did a version with ScanEmit but it was obviously wrong.

I wonder if you can think of other approaches? In particular, I am interested in the fastest approaches. E.g. something with multi-threading.

Code ```julia

using DataFrames, Pipe, SortingLab
using SortingLab: sorttwo!

feature = Vector{Union{Float64, Missing}}(undef, 150_000)
feature .= rand(Float64, 150_000)
feature[rand(1:150_000, 10_000)] .= missing
target = rand(Bool, 150_000)

using SortingLab: sorttwo!


df = DataFrame(;feature, target)

Way 1

function dataframe_group_reduce(df)
@pipe df[:, [:feature, :target]] |>
sort!(, :feature) |>
, :feature) |>
combine(, :target => sum) |>
, :feature)

@time df_done = dataframe_group_reduce(df)

Way 2 using SortingLab.jl

function sorttwo_group_reduce(f1, t1)
last_f = f1[1]
cnt = 1
for f in f1
if !isequal(last_f, f)
cnt += 1
last_f = f

farr = Vector{eltype(f1)}(undef, cnt)
target_sum = Vector{Int}(undef, cnt)

farr[1] = f1[1]
target_sum[1] = 0

i = 1
last_f = f1[1]

for (f, t) in zip(f1, t1)
    if isequal(last_f, f)
        @inbounds target_sum[i] += t
        last_f = f
        i += 1
        @inbounds farr[i] = f
        @inbounds target_sum[i] = t

farr, target_sum


@time farr, target_sum = sorttwo_group_reduce(feature, target)

isequal(df_done.feature, farr)
isequal(df_done.target_sum, target_sum)

function foldl_groupreduce(feature, target)
zft = zip(feature, target)

f_fold, t_fold = foldl((u, x) -> begin
    last_f = last(u[1])
    target_sum = last(u[2])

    f = x[1]
    t = x[2]

    if isequal(f, last_f)
        u[2][end] += t
        push!(u[1], f)
        push!(u[2], t)
end, zft;
init = (feature[1:1], Int[0]))


f_fold, t_fold = foldl_groupreduce(feature, target)

isequal(df_done.feature, f_fold)
isequal(df_done.target_sum, t_fold)

using BenchmarkTools

benchmark_df = @benchmark dataframe_group_reduce($df)
benchmark_sortinglab = @benchmark sorttwo_group_reduce($feature, $target)
benchmark_foldl = @benchmark foldl_groupreduce($feature, $target)

using Plots

[“DataFrames.jl groupby”, “Hand-rolled”, “foldl”],
[mean(benchmark_df.times), mean(benchmark_sortinglab.times), mean(benchmark_foldl.times)];
seriestype = :bar,
title = “Reduce Group By sorted vector”,
label = “mean Timings”


1 Like

In case anyone mentions indextables.jl and juliadb.jl, the api doesn’t seem to work for me

using IndexedTables

t = table(df)

groupreduce(+, t, :feature) #throws error.

Yeah, unfortunately there is no out-of-the-box way for doing this at the moment. I know how to do this and have been wanting to implement this for a while but I coudln’t find time to actually do this… :sob: Basically, you’d need to write a special reducing fold function (or a transducer) for “partition-by” operation using a similar strategy as explained in the parallel word counting example.


BTW would Floops.jl help?

1 Like

FLoops.jl by itself does not help. This is because FLoops.jl is “just a sugar” in the sense there is nothing you can’t do without it. Also, group-reduce is somewhat tricky to expose via for loop syntax since it is essentially a nested loop. I think it’s possible and beneficial since there are other patterns like this (joins and mapreduce with dims).

Do I need a SortedPartionBy?

The problem is more that the split boundaries and the group boundaries do not necessarily match (rather than the strict sorted property). So, we need to track the unfinished “head”, “body” and “tail” states separately as in the Segment state of the parallel word splitting.

I guess my comment is rather cryptic. It’s probably easier if I just implement it and explain it with the code. If you want this feature, feel free to open an issue in Transducers.jl repository.

1 Like