I have two vectors of the same length feature
and target
and I wish to do a group-reduce.
We also have that feature
is SORTED and possibly contain missing
.
I wish to do a group by on feature
and sum(target)
. In SQL form I am doing
select
feature,
sum(target)
from
df
group by
feature
In DataFrames.jl this can be expressed directly using groupby-combine
but I can’t seem to find any algorithm that exploits the fact that feature
is sorted. So I hand-rolled my own (code at bottom) and it’s about 5x faster, I also hand-rolled a foldl
which has similar speeds.
I was trying to benchmark some other methods so I was looking into ThreadsX.jl and LazyGroupBy.jl but I can’t find anything that exploits the sorted-ness of feature
. LazyGroupBy.jl can grouped by something and apply a function to each group but it can’t be used to group by feature
and then operate on target
.
I tried to get Transducers.jl working, but can’t find the right tool. I did a version with ScanEmit
but it was obviously wrong.
I wonder if you can think of other approaches? In particular, I am interested in the fastest approaches. E.g. something with multi-threading.
Code
```juliausing DataFrames, Pipe, SortingLab
using SortingLab: sorttwo!
feature = Vector{Union{Float64, Missing}}(undef, 150_000)
feature .= rand(Float64, 150_000)
feature[rand(1:150_000, 10_000)] .= missing
sort!(feature)
target = rand(Bool, 150_000)
using SortingLab: sorttwo!
sort!(feature)
df = DataFrame(;feature, target)
Way 1
function dataframe_group_reduce(df)
@pipe df[:, [:feature, :target]] |>
sort!(, :feature) |>
groupby(, :feature) |>
combine(, :target => sum) |>
sort!(, :feature)
end
@time df_done = dataframe_group_reduce(df)
Way 2 using SortingLab.jl
function sorttwo_group_reduce(f1, t1)
last_f = f1[1]
cnt = 1
for f in f1
if !isequal(last_f, f)
cnt += 1
last_f = f
end
end
farr = Vector{eltype(f1)}(undef, cnt)
target_sum = Vector{Int}(undef, cnt)
farr[1] = f1[1]
target_sum[1] = 0
i = 1
last_f = f1[1]
for (f, t) in zip(f1, t1)
if isequal(last_f, f)
@inbounds target_sum[i] += t
else
last_f = f
i += 1
@inbounds farr[i] = f
@inbounds target_sum[i] = t
end
end
farr, target_sum
end
@time farr, target_sum = sorttwo_group_reduce(feature, target)
isequal(df_done.feature, farr)
isequal(df_done.target_sum, target_sum)
function foldl_groupreduce(feature, target)
zft = zip(feature, target)
f_fold, t_fold = foldl((u, x) -> begin
last_f = last(u[1])
target_sum = last(u[2])
f = x[1]
t = x[2]
if isequal(f, last_f)
u[2][end] += t
else
push!(u[1], f)
push!(u[2], t)
end
u
end, zft;
init = (feature[1:1], Int[0]))
end
f_fold, t_fold = foldl_groupreduce(feature, target)
isequal(df_done.feature, f_fold)
isequal(df_done.target_sum, t_fold)
using BenchmarkTools
benchmark_df = @benchmark dataframe_group_reduce($df)
benchmark_sortinglab = @benchmark sorttwo_group_reduce($feature, $target)
benchmark_foldl = @benchmark foldl_groupreduce($feature, $target)
using Plots
plot(
[“DataFrames.jl groupby”, “Hand-rolled”, “foldl”],
[mean(benchmark_df.times), mean(benchmark_sortinglab.times), mean(benchmark_foldl.times)];
seriestype = :bar,
title = “Reduce Group By sorted vector”,
label = “mean Timings”
)
savefig(“ok.png”)
</details>