I once wrote FastGroupBy.jl to speed up group-by operations, but DataFrames.jl is now plenty fast on its own!
DataFrames.jl has clearly had a lot of optimization work; I wonder what exactly was done.
The Database-like ops benchmark is showing really promising results for DataFrames.jl's performance.
On the 50 GB dataset, DataFrames.jl still has some catching up to do with data.table.
Next stop: fast joins!
PS: As usual, Query.jl's performance can be an issue with large datasets, which is why I still don't use it.
**Benchmark code**
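using DataFrames, Statistics
using BenchmarkTools
using Pipe
using DataFramesMeta
using FastGroupBy
using Query
using Plots

# Time the same group-by mean in four packages and plot the results (in seconds).
# `$df` interpolates the local argument; without it, BenchmarkTools looks up a global `df`.
function plot_bench(df; title = "")
    # DataFrames.jl
    time2 = @belapsed @pipe $df |>
        groupby(_, :a) |>
        combine(_, :b => mean => :meanb) # 83ms

    # DataFramesMeta.jl
    time3 = @belapsed @pipe $df |>
        @by(_, :a, :meanb = mean(:b)) #158

    # FastGroupBy.jl
    time4 = @belapsed fastby(mean, $df, :a, :b)

    # Query.jl (macros qualified in case DataFramesMeta.jl exports the same names)
    time1 = @belapsed $df |>
        Query.@groupby(_.a) |>
        Query.@map({meanb = mean(_.b)}) |>
        DataFrame # 266.353

    plot(
        ["Query.jl", "DataFrames.jl", "DataFramesMeta.jl", "FastGroupBy.jl"],
        [time1, time2, time3, time4];
        title = title,
        seriestype = :bar)
end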
df = DataFrame(a=rand(1:100_000, 10_000_000), b=rand(10_000_000))
plot_bench(df; title = "Group By a (100_000 groups) mean(b)")
savefig("grp100000.png")
df10 = DataFrame(a=rand(1:10, 10_000_000), b=rand(10_000_000))
plot_bench(df10; title = "Group By a (10 groups) mean(b)")
savefig("grp10.png")
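
As a quick sanity check of what each benchmark computes, here is a minimal sketch on a four-row table whose group means can be worked out by hand. The `small` table and its values are made up for illustration, and the call assumes the `source => function => target` form of `combine` from DataFrames.jl 1.x.

```julia
using DataFrames, Statistics

# Tiny hand-checkable table (hypothetical data):
# group 1 holds 1.0 and 3.0, group 2 holds 10.0 and 20.0.
small = DataFrame(a = [1, 2, 1, 2], b = [1.0, 10.0, 3.0, 20.0])

# Same aggregation as the DataFrames.jl benchmark, written without @pipe.
result = combine(groupby(small, :a), :b => mean => :meanb)

# Sort by key so the row order does not depend on grouping internals.
sort!(result, :a)
println(result)
```

The other three packages should return the same per-group means on this table, so it is a cheap way to confirm the timings compare like with like.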