Who does "better" than DataFrames?

Given a list of observations (simulations, trials, results), find the "best" result for each simulation.
With DataFrames, using its minilanguage, the problem is solved by the following call.

using DataFrames
s=repeat(1:10^6, inner=4)
t=repeat(1:4, 10^6)
r=rand(4*10^6)

df=DataFrame(;s,t,r)
combine(groupby(df, :s),:r=>maximum)

Or use this to generate a test DataFrame with variable group sizes:

df=DataFrame((s=Int[],t=Int[],r=Float64[]))
foreach(r->push!(df,r),[(i, t,rand()) for i in 1:10^6 for t in 1:rand(3:8)])

I have tried, in several different ways, to obtain results comparable to those of DataFrames. None of my attempts comes close.
What solution could beat, or at least match, the performance of DataFrames?

Below is a list of attempts, none of which is even remotely competitive.

using DataFrames
s=repeat(1:10^6, inner=4)
t=rand(1:100, 4*10^6)
r=rand(4*10^6)

function mgr(df)
  grps=groupby(df,:s)
  m=Vector{Int}(undef,length(grps))
  for (i,g) in  enumerate(grps)
     m[i]=parentindices(g)[1][argmax(g.r)]
  end
  df[m,:]
end


function mgr000(df)
  df[[parentindices(g)[1][argmax(g.r)] for g in  groupby(df,:s)],:]
end

function mgr00(df)
  df.idx=1:nrow(df)
  grps=groupby(df,:s)
  m=Vector{Int}(undef,length(grps))
  for (i,g) in  enumerate(grps)
    m[i]=g.idx[argmax(g.r)]
  end
  df[m,:]
end


using Tables  # needed for Tables.namedtupleiterator
function mgr0(df)
  d=Dict{Int, Float64}()
  for r in Tables.namedtupleiterator(df)
      d[r.s]=max(get!(d,r.s,0), r.r)
  end
  d
end

function mgr1(df)
  d=Dict{Int, Float64}()
  for r in eachrow(df)
      d[r.s]=max(get!(d,r.s,0), r.r)
  end
  d
end

function mgr2(df)
  d=Dict{Int, Float64}()
  for r in 1:nrow(df)
    e=df.s[r]
    rr=df.r[r]
    d[e]=max(get!(d,e,rr), rr)
  end
  d
end


function mgr3(df)
  d=Dict{Int, Float64}()
  for (r,e) in enumerate(df.s)
    d[e]=max(get!(d,e,0), df.r[r])
  end
  d
end


function mgr4(s,r)
  d=Dict{Int, Float64}()
  for i in eachindex(s)
    ri=r[i]
    e=s[i]
    d[e]=max(get!(d,e,ri), ri)
  end
  d
end


mdf=Matrix(df)
function mgrM(mdf)
  d=Dict{Float64, Float64}()
  for r in eachrow(mdf)
      d[r[1]]=max(get!(d,r[1],0), r[3])
  end
  d
end


using SplitApplyCombine
vt=tuple.(s,r,t)
map(last,group(x->vt[x][1], x->vt[x], sortperm(vt)))
using StructArrays
function mgr5(s,r,t)
  dfsa=StructArray((;s,r,t))
  sdfsa=sort(dfsa)
  sdfsa[[findall(!=(0),diff(sdfsa.s));lastindex(sdfsa.s)]]
end

Besides proposals faster than combine(groupby(…)…), I’d also like some comments on the attempts above: which of them, while valid in principle, could get better results with a different implementation?

You might need to use threading, since DataFrames does: https://dataframes.juliadata.org/stable/lib/functions/#Multithreading-support

I don’t know how to do it and I’m afraid it’s not enough.
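For reference, here is one way the Dict loop could be threaded: split the row range into one chunk per thread, reduce each chunk into its own Dict, then merge the partial results with `mergewith(max, …)`. This is only a sketch (the name `mgr4_threaded` is made up, and Julia must be started with several threads, e.g. `julia -t 4`, for it to help):

```julia
# Hypothetical threaded variant of mgr4: one Dict per chunk, merged at the end.
function mgr4_threaded(s, r)
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(s), cld(length(s), nt)))
    dicts = Vector{Dict{Int,Float64}}(undef, length(chunks))
    Threads.@threads for c in eachindex(chunks)
        d = Dict{Int,Float64}()
        for i in chunks[c]
            ri = r[i]
            e = s[i]
            d[e] = max(get(d, e, ri), ri)
        end
        dicts[c] = d
    end
    # sequential merge of the per-thread partial results
    mergewith(max, dicts...)
end
```

Note that groups split across chunk boundaries are handled by the final merge, so the chunking does not need to respect group boundaries.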
Here are the results obtained with some of the functions above.
DataFrames is 4 to 25 times faster.


julia> using BenchmarkTools

julia> @btime combine(groupby(df, :s),:r=>maximum);
  32.473 ms (348 allocations: 55.33 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum; threads=false);    
  32.479 ms (342 allocations: 55.33 MiB)

julia> function mgr2(df)
         d=Dict{Int, Float64}()
         for r in 1:nrow(df)
           e=df.s[r]
           rr=df.r[r]
           d[e]=max(get!(d,e,rr), rr)
         end
         d
       end
mgr2 (generic function with 1 method)

julia> @btime mgr2(df);
  792.711 ms (23996989 allocations: 431.33 MiB)

julia> function mgr000(df)
         df[[parentindices(g)[1][argmax(g.r)] for g in  groupby(df,:s)],:]
       end
mgr000 (generic function with 1 method)

julia> @btime mgr000(df);
  548.314 ms (11000028 allocations: 412.95 MiB)

julia> function mgr00(df)
         df.idx=1:nrow(df)
         grps=groupby(df,:s)
         m=Vector{Int}(undef,length(grps))
         for (i,g) in  enumerate(grps)
           m[i]=g.idx[argmax(g.r)]
         end
         df[m,:]
       end
mgr00 (generic function with 1 method)

julia> @btime mgr00(df);
  698.183 ms (13999529 allocations: 573.16 MiB)

julia> function mgr4(s,r)
         d=Dict{Int, Float64}()
         for i in eachindex(s)
           ri=r[i]
           e=s[i]
           d[e]=max(get!(d,e,ri), ri)
         end
         d
       end
mgr4 (generic function with 1 method)

julia> @btime mgr4(s,r);
  122.275 ms (54 allocations: 65.17 MiB)

Here’s one simple approach that’s only 2x slower on my system.

using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    DataFrame(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);

How is it different from this function?


function mgr4(s,r)
  d=Dict{Int, Float64}()
  for i in eachindex(s)
    ri=r[i]
    e=s[i]
    d[e]=max(get!(d,e,ri), ri)
  end
  d
end

The core is the same, except that mgr4 doesn’t actually operate on a DataFrame and doesn’t use @inbounds.

It also doesn’t use get!, which I don’t understand the value of. Why do an update with information you know is getting overwritten immediately?
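For what it’s worth, the difference is easy to check at the REPL: `get` only reads, while `get!` also inserts the default when the key is missing (a quick sketch, independent of this thread’s data):

```julia
d = Dict{Int,Float64}()

get(d, 1, 0.0)     # returns the default 0.0...
length(d)          # ...but d is still empty: 0

get!(d, 1, 0.0)    # also returns 0.0...
length(d)          # ...and inserts the key: length(d) is now 1
```

In the hot loop, the entry inserted by `get!` is immediately overwritten by the assignment on the same line, so it is pure overhead.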


I don’t know if I’m measuring correctly, but in my environment I get these results:

julia> @btime custom_maximum(df);
  149.093 ms (114 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  32.922 ms (348 allocations: 55.33 MiB)

julia> function mgr4(s,r)
         d=Dict{Int, Float64}()
         for i in eachindex(s)
           ri=r[i]
           e=s[i]
           d[e]=max(get!(d,e,ri), ri)
         end
         d
       end
mgr4 (generic function with 1 method)

julia> @btime mgr4(s,r);
  140.000 ms (54 allocations: 65.17 MiB)

Good catch!
Even if it doesn’t change the result much:


julia> function mgr41(s,r)
         d=Dict{Int, Float64}()
         for i in eachindex(s)
           ri=r[i]
           e=s[i]
           d[e]=max(get(d,e,ri), ri)
         end
         d
       end
mgr41 (generic function with 1 method)

julia> @btime mgr41(s,r);
  114.488 ms (54 allocations: 65.17 MiB)

That’s very different from what I get. I’m on Julia 1.9.0-beta4. Is your environment different or is the code you’re using to measure different? I get this:

julia> @btime custom_maximum(df);
  86.065 ms (90 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  42.170 ms (328 allocations: 55.33 MiB)
(@data_science) pkg> st
Status `C:\Users\sprmn\.julia\environments\data_science\Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
⌃ [336ed68f] CSV v0.8.5
⌃ [a93c6f00] DataFrames v1.4.4
⌃ [02685ad9] DataPipes v0.3.5
  [864edb3b] DataStructures v0.18.13
  [85a47980] Dictionaries v0.3.25
  [31c24e10] Distributions v0.25.86
⌃ [6394faf6] FlexiMaps v0.1.8
⌃ [d9f16b24] Functors v0.4.3
  [02fcd773] ImageTransformations v0.9.5
  [916415d5] Images v0.25.2
  [6deec6e2] IndexedTables v1.0.0
  [de52edbc] Integrals v3.7.0
  [c8e1da08] IterTools v1.4.0
  [0f8b85d8] JSON3 v1.12.0
  [91a5bcdd] Plots v1.38.8
  [1fd47b50] QuadGK v2.8.2
  [276daf66] SpecialFunctions v2.2.0
  [f3b207a7] StatsPlots v0.15.4
  [bd369af6] Tables v1.10.1
  [9d95f2ec] TypedTables v1.4.1
  [9a3f8284] Random
Info Packages marked with ⌃ have new versions available and may be upgradable.

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Tried again with updated DataFrames and Julia 1.9.0-rc2, still see 2x difference rather than 4x:

julia> @btime custom_maximum(df);
  88.790 ms (88 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  42.273 ms (312 allocations: 55.33 MiB)

Just to be sure, have you tried copying exactly the code I’m using:

using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    DataFrame(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);

from a fresh session

Full session output:
julia> df=DataFrame(;s,t,r)
4000000×3 DataFrame
     Row │ s        t      r
         │ Int64    Int64  Float64
─────────┼───────────────────────────
       1 │       1      1  0.354125
       2 │       1      2  0.572635
       3 │       1      3  0.888769
       4 │       1      4  0.298215
       5 │       2      1  0.254347

    ⋮    │    ⋮       ⋮        ⋮
 
 3999997 │ 1000000      1  0.785584
 3999998 │ 1000000      2  0.745871
 3999999 │ 1000000      3  0.414316
 4000000 │ 1000000      4  0.0531609
                 3999964 rows omitted

julia> function inner_loop!(group_maxima, s, r)
           for i in eachindex(s)
               @inbounds k = s[i]
               @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
           end
       end
inner_loop! (generic function with 1 method)

julia> function custom_maximum(df)
           s, r = df.s, df.r
           group_maxima = Dict{eltype(s), Float64}()
           inner_loop!(group_maxima, s, r)
           DataFrame(
               s = collect(keys(group_maxima)),
               r_maximum = collect(values(group_maxima)),
           )
       end
custom_maximum (generic function with 1 method)

julia> @btime custom_maximum(df);
ERROR: LoadError: UndefVarError: @btime not defined
in expression starting at c:\Users\sprmn\.julia\environments\data_science\dataframes1.jl:177

julia> using BenchmarkTools

julia> @btime custom_maximum(df);
  197.581 ms (114 allocations: 95.69 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  38.098 ms (348 allocations: 55.33 MiB)

julia> function mgr41(s,r)
         d=Dict{Int, Float64}()
         for i in eachindex(s)
           ri=r[i]
           e=s[i]
           d[e]=max(get(d,e,ri), ri)
         end
         d
       end
mgr41 (generic function with 1 method)

julia> @btime mgr41(s,r);
  132.640 ms (54 allocations: 65.17 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  34.255 ms (348 allocations: 55.33 MiB)

julia> @btime custom_maximum(df);
  149.525 ms (114 allocations: 95.69 MiB)

julia> @btime mgr41(s,r);
  119.712 ms (54 allocations: 65.17 MiB)


How does this implementation (starting to get overfit) do on your machine?

using BenchmarkTools
using DataFrames

s = repeat(1:10^6, inner=4)
t = repeat(1:4, 10^6)
r = rand(4*10^6)

df = DataFrame(;s,t,r)

function inner_loop!(s, r)
    n_unique = 0
    unique_values = similar(s, 0)
    maxima = Float64[]
    group_to_id = Dict{eltype(s), Int}()
    for i in eachindex(s)
        @inbounds k, v = s[i], r[i]
        id = get(group_to_id, k, 0)
        if id == 0
            n_unique += 1
            push!(unique_values, k)
            push!(maxima, v)
            group_to_id[k] = n_unique
        else
            @inbounds maxima[id] = max(maxima[id], v)
        end
    end
    unique_values, maxima
end

function custom_maximum(df)
    s, r = df.s, df.r
    unique_values, maxima = inner_loop!(s, r)
    idx = sortperm(unique_values)
    DataFrame(
        s = unique_values[idx],
        r_maximum = maxima[idx],
    )
end

combine(groupby(df, :s),:r=>maximum)
custom_maximum(df)

@btime custom_maximum(df);
@btime combine(groupby(df, :s),:r=>maximum);

Full session output:
julia> df = DataFrame(;s,t,r)
4000000×3 DataFrame
     Row │ s        t      r
         │ Int64    Int64  Float64
─────────┼───────────────────────────
       1 │       1      1  0.899072
       2 │       1      2  0.744434
       3 │       1      3  0.525763
       4 │       1      4  0.830084
       5 │       2      1  0.608453
       6 │       2      2  0.190857
       7 │       2      3  0.804784
       8 │       2      4  0.028141
       9 │       3      1  0.406663
      10 │       3      2  0.497017
      11 │       3      3  0.639575
      12 │       3      4  0.389806
      13 │       4      1  0.40229
      14 │       4      2  0.500647
      15 │       4      3  0.52252
      16 │       4      4  0.799556
      17 │       5      1  0.189878
      18 │       5      2  0.0847746
    ⋮    │    ⋮       ⋮        ⋮
 3999983 │  999996      3  0.734527
 3999984 │  999996      4  0.252419
 3999985 │  999997      1  0.606366
 3999986 │  999997      2  0.298632
 3999987 │  999997      3  0.694069
 3999988 │  999997      4  0.456197
 3999989 │  999998      1  0.901072
 3999990 │  999998      2  0.375714
 3999991 │  999998      3  0.335106
 3999992 │  999998      4  0.827765
 3999993 │  999999      1  0.87447
 3999994 │  999999      2  0.514698
 3999995 │  999999      3  0.802185
 3999996 │  999999      4  0.318741
 3999997 │ 1000000      1  0.921737
 3999998 │ 1000000      2  0.14284
 3999999 │ 1000000      3  0.145224
 4000000 │ 1000000      4  0.2017
                 3999964 rows omitted

julia> function inner_loop!(s, r)
           n_unique = 0
           unique_values = similar(s, 0)
           maxima = Float64[]
           group_to_id = Dict{eltype(s), Int}()
           for i in eachindex(s)
               @inbounds k, v = s[i], r[i]
               id = get(group_to_id, k, 0)
               if id == 0
                   n_unique += 1
                   push!(unique_values, k)
                   push!(maxima, v)
                   group_to_id[k] = n_unique
               else
                   @inbounds maxima[id] = max(maxima[id], v)
               end
           end
           unique_values, maxima
       end
inner_loop! (generic function with 1 method)

julia> function custom_maximum(df)
           s, r = df.s, df.r
           unique_values, maxima = inner_loop!(s, r)
           idx = sortperm(unique_values)
           DataFrame(
               s = unique_values[idx],
               r_maximum = maxima[idx],
           )
       end
custom_maximum (generic function with 1 method)

julia> combine(groupby(df, :s),:r=>maximum)
1000000×2 DataFrame
     Row │ s        r_maximum 
         │ Int64    Float64
─────────┼────────────────────
       1 │       1   0.899072
       2 │       2   0.804784
       3 │       3   0.639575
       4 │       4   0.799556
       5 │       5   0.707495
       6 │       6   0.72876
       7 │       7   0.702494
       8 │       8   0.892053
       9 │       9   0.6472
      10 │      10   0.841624
      11 │      11   0.684432
      12 │      12   0.884835
      13 │      13   0.801812
      14 │      14   0.823458
      15 │      15   0.881391
      16 │      16   0.731436
      17 │      17   0.760713
      18 │      18   0.489024
    ⋮    │    ⋮         ⋮
  999983 │  999983   0.624175
  999984 │  999984   0.805169
  999985 │  999985   0.731956
  999986 │  999986   0.437265
  999987 │  999987   0.757564
  999988 │  999988   0.674089
  999989 │  999989   0.81043
  999990 │  999990   0.236133
  999991 │  999991   0.892063
  999992 │  999992   0.9613
  999993 │  999993   0.80769
  999994 │  999994   0.960819
  999995 │  999995   0.99415
  999996 │  999996   0.794578
  999997 │  999997   0.694069
  999998 │  999998   0.901072
  999999 │  999999   0.87447
 1000000 │ 1000000   0.921737
           999964 rows omitted

julia> custom_maximum(df)
1000000×2 DataFrame
     Row │ s        r_maximum 
         │ Int64    Float64
─────────┼────────────────────
       1 │       1   0.899072
       2 │       2   0.804784
       3 │       3   0.639575
       4 │       4   0.799556
       5 │       5   0.707495
       6 │       6   0.72876
       7 │       7   0.702494
       8 │       8   0.892053
       9 │       9   0.6472
      10 │      10   0.841624
      11 │      11   0.684432
      12 │      12   0.884835
      13 │      13   0.801812
      14 │      14   0.823458
      15 │      15   0.881391
      16 │      16   0.731436
      17 │      17   0.760713
      18 │      18   0.489024
    ⋮    │    ⋮         ⋮
  999983 │  999983   0.624175
  999984 │  999984   0.805169
  999985 │  999985   0.731956
  999986 │  999986   0.437265
  999987 │  999987   0.757564
  999988 │  999988   0.674089
  999989 │  999989   0.81043
  999990 │  999990   0.236133
  999991 │  999991   0.892063
  999992 │  999992   0.9613
  999993 │  999993   0.80769
  999994 │  999994   0.960819
  999995 │  999995   0.99415
  999996 │  999996   0.794578
  999997 │  999997   0.694069
  999998 │  999998   0.901072
  999999 │  999999   0.87447
 1000000 │ 1000000   0.921737
           999964 rows omitted

julia> @btime custom_maximum(df);
  154.487 ms (144 allocations: 122.88 MiB)

julia> @btime combine(groupby(df, :s),:r=>maximum);
  32.756 ms (348 allocations: 55.33 MiB)

Strange – I’ll have to try building 1.8 on my machine to see if that’s the issue. You get timings very far from the ones I see consistently across different code. Hard to know if that’s just the 1.8 vs 1.9 difference.

For what it’s worth, I get the following timings on my laptop (M1 MacBook Air) with Julia 1.8.2.

@btime custom_maximum(df);
138.764 ms (120 allocations: 122.88 MiB)

@btime combine(groupby(df, :s),:r=>maximum);
62.364 ms (338 allocations: 55.33 MiB)

Thanks! I built 1.8.5 on my M1 laptop and got similar results.

Still haven’t figured out what you’d need to do to catch up with modern DataFrames code. Reading the implementations there, it’s not obvious what core insight we’re missing in this thread.

The performance of DataFrames is really impressive in this case. I’ve tried a couple of variations and DataFrames is still better out of the box.
The underlying reason could be fewer memory accesses when computing the maximum: instead of updating a group’s maximum in memory after each row, `groupby` collects the row indices of each group, which allows computing the maximum by going over each group separately.
Again, props to @bkamins for really optimized code.
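One cheap way to test that theory in this particular benchmark: the keys in `:s` are the integers `1:10^6`, so the Dict can be replaced by a plain vector indexed by group key, removing hashing entirely. This is only a sketch under that assumption (positive integer keys with a maximum comparable to the number of groups), not what DataFrames actually does internally:

```julia
# Sketch: exploit that the group keys here are 1:10^6, so a dense vector
# can replace the Dict. Not applicable to arbitrary key types.
function dense_max(s, r)
    m = fill(-Inf, maximum(s))
    @inbounds for i in eachindex(s, r)
        k = s[i]
        if r[i] > m[k]
            m[k] = r[i]
        end
    end
    m   # m[k] is the maximum of r over rows with s == k
end
```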


Agreed - quite impressive. Have you tried doing that row index collection? It’s not obvious to me why it would be so much better (cache locality?), but I would love to see if that’s the missing piece.
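For anyone who wants to try it, here is a counting-sort style sketch of that row-index collection, under the same integer-key assumption as this thread’s benchmark (hypothetical `grouped_max`, not the DataFrames internals): count rows per group, scatter row indices so each group’s rows are contiguous, then reduce each group in one pass.

```julia
# Sketch: group row indices contiguously (counting sort on the key),
# then take the maximum of each group in a separate pass.
function grouped_max(s, r)
    ng = maximum(s)
    counts = zeros(Int, ng)
    for k in s
        counts[k] += 1
    end
    offsets = cumsum(counts)              # end position of each group's block
    idx = Vector{Int}(undef, length(s))   # row indices, grouped contiguously
    pos = copy(offsets)
    for i in reverse(eachindex(s))
        k = s[i]
        idx[pos[k]] = i
        pos[k] -= 1
    end
    m = Vector{Float64}(undef, ng)
    start = 1
    for k in 1:ng
        best = -Inf
        for j in start:offsets[k]
            best = max(best, r[idx[j]])
        end
        m[k] = best
        start = offsets[k] + 1
    end
    m   # m[k] is the maximum of r over rows with s == k
end
```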