# Who does "better" than DataFrames?

Given a list of observations (simulations, trials, results), find the "best" result for each simulation.
In the context of DataFrames, using the minilanguage, the problem is solved by the following:

``````
using DataFrames
s=repeat(1:10^6, inner=4)
t=repeat(1:4, 10^6)
r=rand(4*10^6)

df=DataFrame(;s,t,r)
combine(groupby(df, :s),:r=>maximum)
``````

Or use this to generate a test dataframe with variable group sizes:

``````
df=DataFrame((s=Int[],t=Int[],r=Float64[]))
foreach(r->push!(df,r),[(i, t,rand()) for i in 1:10^6 for t in 1:rand(3:8)])
``````

I have tried several different ways to obtain results comparable to DataFrames, and none of them comes close.
What solution could beat or match the performance of DataFrames?
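For what it's worth, in the generated test data equal keys are adjacent (`inner=4`), so one more candidate is a single-pass run scan that avoids hashing entirely. This is only a sketch and assumes sorted keys, so it is not a general `groupby` replacement:

```julia
# Single-pass maximum per group, assuming equal keys in `s` are adjacent
# (true for s = repeat(1:10^6, inner=4)). Returns parallel vectors of
# group keys and group maxima.
function contiguous_groupmax(s::AbstractVector, r::AbstractVector)
    keys_out = eltype(s)[]
    max_out = eltype(r)[]
    for i in eachindex(s, r)
        if isempty(keys_out) || s[i] != keys_out[end]
            push!(keys_out, s[i])   # a new run starts here
            push!(max_out, r[i])
        elseif r[i] > max_out[end]
            max_out[end] = r[i]     # update the current run's maximum
        end
    end
    keys_out, max_out
end
```

On data where groups are not contiguous it would need a preliminary sort.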

Below is a list of attempts, none of which is even remotely competitive:

``````
using DataFrames
s=repeat(1:10^6, inner=4)
t=rand(1:100, 4*10^6)
r=rand(4*10^6)

function mgr(df)
    grps = groupby(df, :s)
    m = Vector{Int}(undef, length(grps))
    for (i, g) in enumerate(grps)
        m[i] = parentindices(g)[1][argmax(g.r)]
    end
    df[m, :]
end

function mgr000(df)
    df[[parentindices(g)[1][argmax(g.r)] for g in groupby(df, :s)], :]
end

function mgr00(df)
    df.idx = 1:nrow(df)
    grps = groupby(df, :s)
    m = Vector{Int}(undef, length(grps))
    for (i, g) in enumerate(grps)
        m[i] = g.idx[argmax(g.r)]
    end
    df[m, :]
end

function mgr0(df)
    d = Dict{Int, Float64}()
    for r in Tables.namedtupleiterator(df)
        d[r.s] = max(get!(d, r.s, 0), r.r)
    end
    d
end

function mgr1(df)
    d = Dict{Int, Float64}()
    for r in eachrow(df)
        d[r.s] = max(get!(d, r.s, 0), r.r)
    end
    d
end

function mgr2(df)
    d = Dict{Int, Float64}()
    for r in 1:nrow(df)
        e = df.s[r]
        rr = df.r[r]
        d[e] = max(get!(d, e, rr), rr)
    end
    d
end

function mgr3(df)
    d = Dict{Int, Float64}()
    for (r, e) in enumerate(df.s)
        d[e] = max(get!(d, e, 0), df.r[r])
    end
    d
end

function mgr4(s, r)
    d = Dict{Int, Float64}()
    for i in eachindex(s)
        ri = r[i]
        e = s[i]
        d[e] = max(get!(d, e, ri), ri)
    end
    d
end
``````
``````

mdf = Matrix(df)

function mgrM(mdf)
    d = Dict{Float64, Float64}()
    for r in eachrow(mdf)
        d[r[1]] = max(get!(d, r[1], 0), r[3])
    end
    d
end

``````
``````
using SplitApplyCombine
vt = tuple.(s, r, t)
map(last, group(x -> vt[x][1], x -> vt[x], sortperm(vt)))
``````
``````
using StructArrays
function mgr5(s, r, t)
    dfsa = StructArray((; s, r, t))
    sdfsa = sort(dfsa)
    sdfsa[[findall(!=(0), diff(sdfsa.s)); lastindex(sdfsa.s)]]
end
``````
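Since the variants return different shapes (a `DataFrame`, a `Dict`, a `StructArray`), a small hypothetical helper to cross-check them against a brute-force reference can catch mistakes before benchmarking:

```julia
# Brute-force reference: key => maximum, one read-modify-write per row.
function refmax(s, r)
    d = Dict{eltype(s), Float64}()
    for i in eachindex(s, r)
        d[s[i]] = max(get(d, s[i], -Inf), r[i])
    end
    d
end

# Check a list of (key, value) pairs against the reference Dict.
agrees(pairs, d) = length(pairs) == length(d) &&
                   all(d[k] == v for (k, v) in pairs)
```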

Besides proposals that are faster than `combine(groupby(…)…)`, I'd also like some comments on the attempts above: which of them, while valid in principle, could get better results with a different implementation?

I don't know how to do better myself, and I'm afraid my attempts aren't enough.
Here are the results obtained with some of the functions above;
DataFrames is 4 to 25 times faster.

``````
julia> using BenchmarkTools

julia> @btime combine(groupby(df, :s),:r=>maximum);
32.473 ms (348 allocations: 55.33 MiB)

julia> function mgr2(df)
d=Dict{Int, Float64}()
for r in 1:nrow(df)
e=df.s[r]
rr=df.r[r]
d[e]=max(get!(d,e,rr), rr)
end
d
end
mgr2 (generic function with 1 method)

julia> @btime mgr2(df);
792.711 ms (23996989 allocations: 431.33 MiB)

julia> function mgr000(df)
df[[parentindices(g)[1][argmax(g.r)] for g in  groupby(df,:s)],:]
end
mgr000 (generic function with 1 method)

julia> @btime mgr000(df);
548.314 ms (11000028 allocations: 412.95 MiB)

julia> function mgr00(df)
df.idx=1:nrow(df)
grps=groupby(df,:s)
m=Vector{Int}(undef,length(grps))
for (i,g) in  enumerate(grps)
m[i]=g.idx[argmax(g.r)]
end
df[m,:]
end
mgr00 (generic function with 1 method)

julia> @btime mgr00(df);
698.183 ms (13999529 allocations: 573.16 MiB)

julia> function mgr4(s,r)
d=Dict{Int, Float64}()
for i in eachindex(s)
ri=r[i]
e=s[i]
d[e]=max(get!(d,e,ri), ri)
end
d
end
mgr4 (generic function with 1 method)

julia> @btime mgr4(s,r);
122.275 ms (54 allocations: 65.17 MiB)
``````

Here’s one simple approach that’s only 2x slower on my system.

``````
using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    DataFrame(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);
``````

How is it different from this function?

``````
function mgr4(s, r)
    d = Dict{Int, Float64}()
    for i in eachindex(s)
        ri = r[i]
        e = s[i]
        d[e] = max(get!(d, e, ri), ri)
    end
    d
end
``````

The core is the same, except that `mgr4` doesn't actually operate on a DataFrame and doesn't use `@inbounds`.

It also uses `get!` where mine uses `get`, and I don't understand the value of `get!` here. Why insert a default into the dictionary when you know it's getting overwritten immediately?
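The difference is easy to see in isolation: `get` is a read-only lookup, while `get!` writes the default into the dictionary on a miss, an insertion that the subsequent `d[e] = …` assignment makes redundant. A minimal illustration:

```julia
d = Dict{Int, Float64}()

get(d, 1, 0.0)     # read-only: a miss leaves the Dict unchanged
length(d)          # still 0 entries

get!(d, 1, 0.0)    # a miss inserts the default 0.0 under key 1
length(d)          # now 1 entry
```

In the `mgr4` loop this means every first encounter of a key pays for two hash-table writes instead of one.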


I don't know if I'm measuring correctly, but in my environment I get these results:

``````
julia> @btime custom_maximum(df);
149.093 ms (114 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
32.922 ms (348 allocations: 55.33 MiB)

julia> function mgr4(s,r)
d=Dict{Int, Float64}()
for i in eachindex(s)
ri=r[i]
e=s[i]
d[e]=max(get!(d,e,ri), ri)
end
d
end
mgr4 (generic function with 1 method)

julia> @btime mgr4(s,r);
140.000 ms (54 allocations: 65.17 MiB)
``````

Good catch! Even if it doesn't change the result much:

``````
julia> function mgr41(s,r)
d=Dict{Int, Float64}()
for i in eachindex(s)
ri=r[i]
e=s[i]
d[e]=max(get(d,e,ri), ri)
end
d
end
mgr41 (generic function with 1 method)

julia> @btime mgr41(s,r);
114.488 ms (54 allocations: 65.17 MiB)
``````

That’s very different from what I get. I’m on Julia 1.9.0-beta4. Is your environment different or is the code you’re using to measure different? I get this:

``````
julia> @btime custom_maximum(df);
86.065 ms (90 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
42.170 ms (328 allocations: 55.33 MiB)
``````
``````
(@data_science) pkg> st
Status `C:\Users\sprmn\.julia\environments\data_science\Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
⌃ [336ed68f] CSV v0.8.5
⌃ [a93c6f00] DataFrames v1.4.4
[864edb3b] DataStructures v0.18.13
[85a47980] Dictionaries v0.3.25
[31c24e10] Distributions v0.25.86
⌃ [6394faf6] FlexiMaps v0.1.8
⌃ [d9f16b24] Functors v0.4.3
[02fcd773] ImageTransformations v0.9.5
[916415d5] Images v0.25.2
[6deec6e2] IndexedTables v1.0.0
[de52edbc] Integrals v3.7.0
[c8e1da08] IterTools v1.4.0
[0f8b85d8] JSON3 v1.12.0
[91a5bcdd] Plots v1.38.8
[276daf66] SpecialFunctions v2.2.0
[f3b207a7] StatsPlots v0.15.4
[bd369af6] Tables v1.10.1
[9d95f2ec] TypedTables v1.4.1
[9a3f8284] Random
Info Packages marked with ⌃ have new versions available and may be upgradable.

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
Threads: 4 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````

Tried again with updated DataFrames and Julia 1.9.0-rc2, still see 2x difference rather than 4x:

``````
julia> @btime custom_maximum(df);
88.790 ms (88 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
42.273 ms (312 allocations: 55.33 MiB)
``````

Just to be sure, have you tried copying exactly the code I’m using:

``````
using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    DataFrame(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);
``````

from a fresh session

Summary
``````
julia> df=DataFrame(;s,t,r)
4000000×3 DataFrame
Row │ s        t      r
│ Int64    Int64  Float64
─────────┼───────────────────────────
1 │       1      1  0.354125
2 │       1      2  0.572635
3 │       1      3  0.888769
4 │       1      4  0.298215
5 │       2      1  0.254347

⋮    │    ⋮       ⋮        ⋮

3999997 │ 1000000      1  0.785584
3999998 │ 1000000      2  0.745871
3999999 │ 1000000      3  0.414316
4000000 │ 1000000      4  0.0531609
3999964 rows omitted

julia> function inner_loop!(group_maxima, s, r)
for i in eachindex(s)
@inbounds k = s[i]
@inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
end
end
inner_loop! (generic function with 1 method)

julia> function custom_maximum(df)
s, r = df.s, df.r
group_maxima = Dict{eltype(s), Float64}()
inner_loop!(group_maxima, s, r)
DataFrame(
s = collect(keys(group_maxima)),
r_maximum = collect(values(group_maxima)),
)
end
custom_maximum (generic function with 1 method)

julia> @btime custom_maximum(df);
ERROR: LoadError: UndefVarError: @btime not defined
in expression starting at c:\Users\sprmn\.julia\environments\data_science\dataframes1.jl:177

julia> using BenchmarkTools

julia> @btime custom_maximum(df);
197.581 ms (114 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
38.098 ms (348 allocations: 55.33 MiB)

julia> function mgr41(s,r)
d=Dict{Int, Float64}()
for i in eachindex(s)
ri=r[i]
e=s[i]
d[e]=max(get(d,e,ri), ri)
end
d
end
mgr41 (generic function with 1 method)

julia> @btime mgr41(s,r);
132.640 ms (54 allocations: 65.17 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
34.255 ms (348 allocations: 55.33 MiB)

julia> @btime custom_maximum(df);
149.525 ms (114 allocations: 95.69 MiB)

julia> @btime mgr41(s,r);
119.712 ms (54 allocations: 65.17 MiB)

``````

How does this implementation (starting to get overfit) do on your machine?

``````
using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(s, r)
    n_unique = 0
    unique_values = similar(s, 0)
    maxima = Float64[]
    group_to_id = Dict{eltype(s), Int}()
    for i in eachindex(s)
        @inbounds k, v = s[i], r[i]
        id = get(group_to_id, k, 0)
        if id == 0
            n_unique += 1
            push!(unique_values, k)
            push!(maxima, v)
            group_to_id[k] = n_unique
        else
            @inbounds maxima[id] = max(maxima[id], v)
        end
    end
    unique_values, maxima
end

function custom_maximum(df)
    s, r = df.s, df.r
    unique_values, maxima = inner_loop!(s, r)
    idx = sortperm(unique_values)
    DataFrame(
        s = unique_values[idx],
        r_maximum = maxima[idx],
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);
``````
Summary
``````
julia> df = DataFrame(;s,t,r)
4000000×3 DataFrame
Row │ s        t      r
│ Int64    Int64  Float64
─────────┼───────────────────────────
1 │       1      1  0.899072
2 │       1      2  0.744434
3 │       1      3  0.525763
4 │       1      4  0.830084
5 │       2      1  0.608453
6 │       2      2  0.190857
7 │       2      3  0.804784
8 │       2      4  0.028141
9 │       3      1  0.406663
10 │       3      2  0.497017
11 │       3      3  0.639575
12 │       3      4  0.389806
13 │       4      1  0.40229
14 │       4      2  0.500647
15 │       4      3  0.52252
16 │       4      4  0.799556
17 │       5      1  0.189878
18 │       5      2  0.0847746
⋮    │    ⋮       ⋮        ⋮
3999983 │  999996      3  0.734527
3999984 │  999996      4  0.252419
3999985 │  999997      1  0.606366
3999986 │  999997      2  0.298632
3999987 │  999997      3  0.694069
3999988 │  999997      4  0.456197
3999989 │  999998      1  0.901072
3999990 │  999998      2  0.375714
3999991 │  999998      3  0.335106
3999992 │  999998      4  0.827765
3999993 │  999999      1  0.87447
3999994 │  999999      2  0.514698
3999995 │  999999      3  0.802185
3999996 │  999999      4  0.318741
3999997 │ 1000000      1  0.921737
3999998 │ 1000000      2  0.14284
3999999 │ 1000000      3  0.145224
4000000 │ 1000000      4  0.2017
3999964 rows omitted

julia> function inner_loop!(s, r)
n_unique = 0
unique_values = similar(s, 0)
maxima = Float64[]
group_to_id = Dict{eltype(s), Int}()
for i in eachindex(s)
@inbounds k, v = s[i], r[i]
id = get(group_to_id, k, 0)
if id == 0
n_unique += 1
push!(unique_values, k)
push!(maxima, v)
group_to_id[k] = n_unique
else
@inbounds maxima[id] = max(maxima[id], v)
end
end
unique_values, maxima
end
inner_loop! (generic function with 1 method)

julia> function custom_maximum(df)
s, r = df.s, df.r
unique_values, maxima = inner_loop!(s, r)
idx = sortperm(unique_values)
DataFrame(
s = unique_values[idx],
r_maximum = maxima[idx],
)
end
custom_maximum (generic function with 1 method)

julia> combine(groupby(df, :s),:r=>maximum)
1000000×2 DataFrame
Row │ s        r_maximum
│ Int64    Float64
─────────┼────────────────────
1 │       1   0.899072
2 │       2   0.804784
3 │       3   0.639575
4 │       4   0.799556
5 │       5   0.707495
6 │       6   0.72876
7 │       7   0.702494
8 │       8   0.892053
9 │       9   0.6472
10 │      10   0.841624
11 │      11   0.684432
12 │      12   0.884835
13 │      13   0.801812
14 │      14   0.823458
15 │      15   0.881391
16 │      16   0.731436
17 │      17   0.760713
18 │      18   0.489024
⋮    │    ⋮         ⋮
999983 │  999983   0.624175
999984 │  999984   0.805169
999985 │  999985   0.731956
999986 │  999986   0.437265
999987 │  999987   0.757564
999988 │  999988   0.674089
999989 │  999989   0.81043
999990 │  999990   0.236133
999991 │  999991   0.892063
999992 │  999992   0.9613
999993 │  999993   0.80769
999994 │  999994   0.960819
999995 │  999995   0.99415
999996 │  999996   0.794578
999997 │  999997   0.694069
999998 │  999998   0.901072
999999 │  999999   0.87447
1000000 │ 1000000   0.921737
999964 rows omitted

julia> custom_maximum(df)
1000000×2 DataFrame
Row │ s        r_maximum
│ Int64    Float64
─────────┼────────────────────
1 │       1   0.899072
2 │       2   0.804784
3 │       3   0.639575
4 │       4   0.799556
5 │       5   0.707495
6 │       6   0.72876
7 │       7   0.702494
8 │       8   0.892053
9 │       9   0.6472
10 │      10   0.841624
11 │      11   0.684432
12 │      12   0.884835
13 │      13   0.801812
14 │      14   0.823458
15 │      15   0.881391
16 │      16   0.731436
17 │      17   0.760713
18 │      18   0.489024
⋮    │    ⋮         ⋮
999983 │  999983   0.624175
999984 │  999984   0.805169
999985 │  999985   0.731956
999986 │  999986   0.437265
999987 │  999987   0.757564
999988 │  999988   0.674089
999989 │  999989   0.81043
999990 │  999990   0.236133
999991 │  999991   0.892063
999992 │  999992   0.9613
999993 │  999993   0.80769
999994 │  999994   0.960819
999995 │  999995   0.99415
999996 │  999996   0.794578
999997 │  999997   0.694069
999998 │  999998   0.901072
999999 │  999999   0.87447
1000000 │ 1000000   0.921737
999964 rows omitted

julia> @btime custom_maximum(df);
154.487 ms (144 allocations: 122.88 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
32.756 ms (348 allocations: 55.33 MiB)
``````

Strange – I’ll have to try building 1.8 on my machine to see if that’s the issue. You get timings very far from the ones I see consistently across different code. Hard to know if that’s just the 1.8 vs 1.9 difference.

For what it’s worth, I get the following timings on my laptop (M1 MacBook Air) with Julia 1.8.2.

``````
@btime custom_maximum(df);
138.764 ms (120 allocations: 122.88 MiB)

@btime combine(groupby(df, :s),:r=>maximum);
62.364 ms (338 allocations: 55.33 MiB)
``````

Thanks! I built 1.8.5 on my M1 laptop and got similar results.

Still haven't figured out what you'd need to do to catch up with modern DataFrames code. Reading the implementations there, it's not obvious what core insight we're missing in this thread.

The performance of DataFrames is really impressive in this case. I’ve tried a couple of variations and DataFrames is still better out-of-the-box.
The underlying reason could be fewer memory accesses when computing the maximum: instead of updating a group maximum in memory after each row, `groupby` already holds the row indices of each group, which lets the CPU compute each group's maximum by going over that group separately.
Again props to @bkamins for really optimized code.
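One cheap way to act on that observation in this particular benchmark (a sketch, not what DataFrames actually does: it assumes the keys are integers in `1:nkeys`, so a plain vector can replace the Dict and remove hashing from the hot loop):

```julia
# Per-group maximum using a dense accumulator instead of a Dict.
# Assumes every key in `s` lies in 1:nkeys (true for s = repeat(1:10^6, inner=4)).
function densemax(s::AbstractVector{<:Integer}, r::AbstractVector{Float64}, nkeys::Integer)
    acc = fill(-Inf, nkeys)
    @inbounds for i in eachindex(s, r)
        k = s[i]
        if r[i] > acc[k]
            acc[k] = r[i]
        end
    end
    acc   # acc[k] is the maximum of r over rows with s == k
end
```

Usage here would be `densemax(df.s, df.r, 10^6)`.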


Agreed - quite impressive. Have you tried doing that row index collection? It's not obvious to me why it would be so much better (cache locality?), but I would love to see if that's the missing piece.
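Here is a rough sketch of that row-index collection, under the same integer-key assumption as before; whether it matches what DataFrames does internally is a guess. A counting-sort pass puts the row indices of each group into one contiguous block, and the maximum is then taken block by block, with the running value staying in a register:

```julia
# Two-pass grouped maximum: counting sort of row indices, then per-group scan.
# Assumes every key in `s` lies in 1:nkeys.
function grouped_max(s::Vector{Int}, r::Vector{Float64}, nkeys::Int)
    # Pass 1: histogram of group sizes, offset by one slot.
    offs = zeros(Int, nkeys + 1)
    for k in s
        offs[k + 1] += 1
    end
    cumsum!(offs, offs)                # offs[k] = number of rows with key < k
    # Pass 2: scatter row indices into group-contiguous order.
    order = Vector{Int}(undef, length(s))
    pos = copy(offs)
    for i in eachindex(s)
        pos[s[i]] += 1
        order[pos[s[i]]] = i
    end
    # Pass 3: maximum within each contiguous block of row indices.
    m = Vector{Float64}(undef, nkeys)
    for k in 1:nkeys
        best = -Inf
        for j in (offs[k] + 1):offs[k + 1]
            best = max(best, r[order[j]])
        end
        m[k] = best
    end
    m
end
```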