All the ways to group-reduce sorted vectors. Ideas?

xiaodai · August 20, 2020, 4:22am

I have two vectors of the same length feature and target and I wish to do a group-reduce.

We also have that feature is SORTED and possibly contain missing.

I wish to do a group by on feature and sum(target). In SQL form I am doing

select
  feature,
  sum(target)
from
  df
group by
  feature

In DataFrames.jl this can be expressed directly using groupby-combine but I can’t seem to find any algorithm that exploits the fact that feature is sorted. So I hand-rolled my own (code at bottom) and it’s about 5x faster, I also hand-rolled a foldl which has similar speeds.

I was trying to benchmark some other methods so I was looking into ThreadsX.jl and LazyGroupBy.jl but I can’t find anything that exploits the sorted-ness of feature. LazyGroupBy.jl can grouped by something and apply a function to each group but it can’t be used to group by feature and then operate on target.

I tried to get Transducers.jl working, but can’t find the right tool. I did a version with ScanEmit but it was obviously wrong.

I wonder if you can think of other approaches? In particular, I am interested in the fastest approaches. E.g. something with multi-threading.

Code

```julia

using DataFrames, Pipe, SortingLab
using SortingLab: sorttwo!

feature = Vector{Union{Float64, Missing}}(undef, 150_000)
feature .= rand(Float64, 150_000)
feature[rand(1:150_000, 10_000)] .= missing
sort!(feature)
target = rand(Bool, 150_000)

using SortingLab: sorttwo!

sort!(feature)

df = DataFrame(;feature, target)

Way 1

function dataframe_group_reduce(df)
@pipe df[:, [:feature, :target]] |>
sort!(, :feature) |>
groupby(, :feature) |>
combine(, :target => sum) |>
sort!(, :feature)
end

@time df_done = dataframe_group_reduce(df)

Way 2 using SortingLab.jl

function sorttwo_group_reduce(f1, t1)
last_f = f1[1]
cnt = 1
for f in f1
if !isequal(last_f, f)
cnt += 1
last_f = f
end
end

farr = Vector{eltype(f1)}(undef, cnt)
target_sum = Vector{Int}(undef, cnt)

farr[1] = f1[1]
target_sum[1] = 0

i = 1
last_f = f1[1]

for (f, t) in zip(f1, t1)
    if isequal(last_f, f)
        @inbounds target_sum[i] += t
    else
        last_f = f
        i += 1
        @inbounds farr[i] = f
        @inbounds target_sum[i] = t
    end
end

farr, target_sum

end

@time farr, target_sum = sorttwo_group_reduce(feature, target)

isequal(df_done.feature, farr)
isequal(df_done.target_sum, target_sum)

function foldl_groupreduce(feature, target)
zft = zip(feature, target)

f_fold, t_fold = foldl((u, x) -> begin
    last_f = last(u[1])
    target_sum = last(u[2])

    f = x[1]
    t = x[2]

    if isequal(f, last_f)
        u[2][end] += t
    else
        push!(u[1], f)
        push!(u[2], t)
    end
    u
end, zft;
init = (feature[1:1], Int[0]))

end

f_fold, t_fold = foldl_groupreduce(feature, target)

isequal(df_done.feature, f_fold)
isequal(df_done.target_sum, t_fold)

using BenchmarkTools

benchmark_df = @benchmark dataframe_group_reduce($df)
benchmark_sortinglab = @benchmark sorttwo_group_reduce($feature, $target)
benchmark_foldl = @benchmark foldl_groupreduce($feature, $target)

using Plots

plot(
[“DataFrames.jl groupby”, “Hand-rolled”, “foldl”],
[mean(benchmark_df.times), mean(benchmark_sortinglab.times), mean(benchmark_foldl.times)];
seriestype = :bar,
title = “Reduce Group By sorted vector”,
label = “mean Timings”
)

savefig(“ok.png”)

</details>

xiaodai · August 20, 2020, 4:30am

In case anyone mentions indextables.jl and juliadb.jl, the api doesn’t seem to work for me

using IndexedTables

t = table(df)

groupreduce(+, t, :feature) #throws error.

tkf · August 20, 2020, 4:43am

Yeah, unfortunately there is no out-of-the-box way for doing this at the moment. I know how to do this and have been wanting to implement this for a while but I coudln’t find time to actually do this… Basically, you’d need to write a special ~~reducing~~ fold function (or a transducer) for “partition-by” operation using a similar strategy as explained in the parallel word counting example.

xiaodai · August 20, 2020, 4:44am

BTW would Floops.jl help?

tkf · August 20, 2020, 4:48am

FLoops.jl by itself does not help. This is because FLoops.jl is “just a sugar” in the sense there is nothing you can’t do without it. Also, group-reduce is somewhat tricky to expose via for loop syntax since it is essentially a nested loop. I think it’s possible and beneficial since there are other patterns like this (joins and mapreduce with dims).

xiaodai · August 20, 2020, 6:36am

Do I need a SortedPartionBy?

tkf · August 20, 2020, 7:59am

The problem is more that the split boundaries and the group boundaries do not necessarily match (rather than the strict sorted property). So, we need to track the unfinished “head”, “body” and “tail” states separately as in the Segment state of the parallel word splitting.

I guess my comment is rather cryptic. It’s probably easier if I just implement it and explain it with the code. If you want this feature, feel free to open an issue in Transducers.jl repository.