# All the ways to group-reduce sorted vectors. Ideas?

I have two vectors of the same length `feature` and `target` and I wish to do a group-reduce.

We also have that `feature` is SORTED and possibly contain `missing`.

I wish to do a group by on `feature` and `sum(target)`. In SQL form I am doing

``````select
feature,
sum(target)
from
df
group by
feature
``````

In DataFrames.jl this can be expressed directly using `groupby-combine` but I can’t seem to find any algorithm that exploits the fact that `feature` is sorted. So I hand-rolled my own (code at bottom) and it’s about 5x faster, I also hand-rolled a `foldl` which has similar speeds.

I was trying to benchmark some other methods so I was looking into ThreadsX.jl and LazyGroupBy.jl but I can’t find anything that exploits the sorted-ness of `feature`. LazyGroupBy.jl can grouped by something and apply a function to each group but it can’t be used to group by `feature` and then operate on `target`.

I tried to get Transducers.jl working, but can’t find the right tool. I did a version with `ScanEmit` but it was obviously wrong.

I wonder if you can think of other approaches? In particular, I am interested in the fastest approaches. E.g. something with multi-threading.

Code ```julia

using DataFrames, Pipe, SortingLab
using SortingLab: sorttwo!

feature = Vector{Union{Float64, Missing}}(undef, 150_000)
feature .= rand(Float64, 150_000)
feature[rand(1:150_000, 10_000)] .= missing
sort!(feature)
target = rand(Bool, 150_000)

using SortingLab: sorttwo!

sort!(feature)

df = DataFrame(;feature, target)

# Way 1

function dataframe_group_reduce(df)
@pipe df[:, [:feature, :target]] |>
sort!(, :feature) |>
groupby(
, :feature) |>
combine(, :target => sum) |>
sort!(
, :feature)
end

@time df_done = dataframe_group_reduce(df)

# Way 2 using SortingLab.jl

function sorttwo_group_reduce(f1, t1)
last_f = f1[1]
cnt = 1
for f in f1
if !isequal(last_f, f)
cnt += 1
last_f = f
end
end

``````farr = Vector{eltype(f1)}(undef, cnt)
target_sum = Vector{Int}(undef, cnt)

farr[1] = f1[1]
target_sum[1] = 0

i = 1
last_f = f1[1]

for (f, t) in zip(f1, t1)
if isequal(last_f, f)
@inbounds target_sum[i] += t
else
last_f = f
i += 1
@inbounds farr[i] = f
@inbounds target_sum[i] = t
end
end

farr, target_sum
``````

end

@time farr, target_sum = sorttwo_group_reduce(feature, target)

isequal(df_done.feature, farr)
isequal(df_done.target_sum, target_sum)

function foldl_groupreduce(feature, target)
zft = zip(feature, target)

``````f_fold, t_fold = foldl((u, x) -> begin
last_f = last(u[1])
target_sum = last(u[2])

f = x[1]
t = x[2]

if isequal(f, last_f)
u[2][end] += t
else
push!(u[1], f)
push!(u[2], t)
end
u
end, zft;
init = (feature[1:1], Int[0]))
``````

end

f_fold, t_fold = foldl_groupreduce(feature, target)

isequal(df_done.feature, f_fold)
isequal(df_done.target_sum, t_fold)

using BenchmarkTools

benchmark_df = @benchmark dataframe_group_reduce(\$df)
benchmark_sortinglab = @benchmark sorttwo_group_reduce(\$feature, \$target)
benchmark_foldl = @benchmark foldl_groupreduce(\$feature, \$target)

using Plots

plot(
[“DataFrames.jl groupby”, “Hand-rolled”, “foldl”],
[mean(benchmark_df.times), mean(benchmark_sortinglab.times), mean(benchmark_foldl.times)];
seriestype = :bar,
title = “Reduce Group By sorted vector”,
label = “mean Timings”
)

savefig(“ok.png”)

``</details>``
1 Like

In case anyone mentions indextables.jl and juliadb.jl, the api doesn’t seem to work for me

``````using IndexedTables

t = table(df)

groupreduce(+, t, :feature) #throws error.
``````

Yeah, unfortunately there is no out-of-the-box way for doing this at the moment. I know how to do this and have been wanting to implement this for a while but I coudln’t find time to actually do this… Basically, you’d need to write a special reducing fold function (or a transducer) for “partition-by” operation using a similar strategy as explained in the parallel word counting example.

2 Likes

BTW would Floops.jl help?

1 Like

FLoops.jl by itself does not help. This is because FLoops.jl is “just a sugar” in the sense there is nothing you can’t do without it. Also, group-reduce is somewhat tricky to expose via `for` loop syntax since it is essentially a nested loop. I think it’s possible and beneficial since there are other patterns like this (joins and `mapreduce` with `dims`).

Do I need a `SortedPartionBy`?

The problem is more that the split boundaries and the group boundaries do not necessarily match (rather than the strict sorted property). So, we need to track the unfinished “head”, “body” and “tail” states separately as in the `Segment` state of the parallel word splitting.

I guess my comment is rather cryptic. It’s probably easier if I just implement it and explain it with the code. If you want this feature, feel free to open an issue in Transducers.jl repository.

1 Like