A minor group-by benchmark - DataFrames.jl plenty fast

I once wrote FastGroupBy.jl to do fast group-by operations, but DataFrames.jl is now plenty fast by itself!

DataFrames.jl has gone through a lot of optimization; I wonder what exactly has been done.

https://h2oai.github.io/db-benchmark/ shows really promising results for DataFrames.jl’s performance.

For the 50 GB data size, DataFrames.jl still has some catching up to do with data.table.

Next stop: fast joins!

PS: As usual, Query.jl’s performance can be an issue with large datasets, which is why I still don’t use it.

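[Image: grp10.png, bar chart of group-by timings with 10 groups]

[Image: grp100000.png, bar chart of group-by timings with 100_000 groups]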

**Expand to see code**

```julia
using DataFrames, Statistics
using BenchmarkTools
using Pipe
using DataFramesMeta
using FastGroupBy
using Query
using Plots

# Time the same group-by mean with four packages and draw a bar chart.
function plot_bench(df; title = "")
    # `$df` interpolates the local argument into each benchmark; without it
    # BenchmarkTools would look up a global variable named `df` instead.

    # DataFrames.jl
    time2 = @belapsed @pipe $df |>
        groupby(_, :a) |>
        combine(_, meanb = :b => mean) # 83ms

    # DataFramesMeta.jl
    time3 = @belapsed @pipe $df |>
        @by(_, :a, meanb = mean(:b)) #158

    # FastGroupBy.jl
    time4 = @belapsed fastby(mean, $df, :a, :b)

    # Query.jl
    time1 = @belapsed $df |>
        @groupby(_.a) |>
        @map({meanb=mean(_.b)}) |>
        DataFrame # 266.353

    plot(
        ["Query.jl", "DataFrames.jl", "DataFramesMeta.jl", "FastGroupBy.jl"],
        [time1, time2, time3, time4];
        title = title,
        seriestype = :bar)
end

# 10 million rows, 100_000 groups
df = DataFrame(a=rand(1:100_000, 10_000_000), b=rand(10_000_000))
plot_bench(df; title = "Group By a (100_000 groups) mean(b)")
savefig("grp100000.png")

# 10 million rows, 10 groups
df10 = DataFrame(a=rand(1:10, 10_000_000), b=rand(10_000_000))
plot_bench(df10; title = "Group By a (10 groups) mean(b)")
savefig("grp10.png")
```


What is the difference between group-by in DataFrames.jl and DataFramesMeta.jl? What table data structures do Query.jl and FastGroupBy.jl use?

You can expand the code section to see some details.

All of them use the same input, a DataFrames.DataFrame.

DataFrames.jl uses groupby and combine, while DataFramesMeta.jl uses @by, which could be using an older algorithm.

I believe Query.jl uses a row-based algorithm which isn’t optimized for the fact that we are dealing with column vectors.
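To make the row-versus-column distinction concrete, here is a minimal sketch in plain Julia. It is not how Query.jl or DataFrames.jl are actually implemented; the function names `group_mean_columns` and `group_mean_rows` and the tiny sample data are made up for illustration only.

```julia
using Statistics

# Column-oriented sketch (hypothetical helper): walk two typed vectors once,
# keeping a running sum and count per group key.
function group_mean_columns(a::Vector{Int}, b::Vector{Float64})
    sums = Dict{Int,Float64}()
    counts = Dict{Int,Int}()
    for i in eachindex(a, b)
        k = a[i]
        sums[k] = get(sums, k, 0.0) + b[i]
        counts[k] = get(counts, k, 0) + 1
    end
    return Dict(k => sums[k] / counts[k] for k in keys(sums))
end

# Row-oriented sketch (hypothetical helper): iterate over rows one at a time,
# collect each group's values, then reduce every group with `mean`.
function group_mean_rows(rows)
    groups = Dict{Int,Vector{Float64}}()
    for row in rows
        push!(get!(() -> Float64[], groups, row.a), row.b)
    end
    return Dict(k => mean(v) for (k, v) in groups)
end

a = [1, 2, 1, 2, 3]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
rows = [(a = x, b = y) for (x, y) in zip(a, b)]  # same data as a vector of rows

by_columns = group_mean_columns(a, b)
by_rows = group_mean_rows(rows)
```

Both functions give the same group means. The row version has to materialise a row object for every record and stores every value before reducing, which is the kind of overhead I would expect to hurt on 10 million rows.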

Which code section do you mean? Do you mean the Details section at the bottom of https://h2oai.github.io/db-benchmark/?

I included the source code in this post.


Yes, the new manipulation functions (combine, select, etc.) are very speedy compared to previous versions!