How can I benchmark DataFrames.leftjoin!? (Is this some kind of bug?)

I can’t get the setup argument of @benchmark to work correctly with DataFrames. I’ll use the example in the docs of DataFrames.leftjoin! as an MWE:

julia> using DataFrames, BenchmarkTools

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
 Row │ ID     Name       Job
     │ Int64  String     String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing

julia> @benchmark leftjoin($name, $job, on = :ID)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.000 μs …   6.566 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     12.100 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.525 μs ± 113.040 μs  ┊ GC (mean ± σ):  13.30% ±  1.71%

        ▃ █ ▇ ▅ ▃
  ▁▁▂▂▆▄█▇█████▇█▅▅█▅▇▄▇▄▆▃▅▃▅▃▅▂▂▃▂▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  10 μs           Histogram: frequency by time         18.8 μs <

 Memory estimate: 13.00 KiB, allocs estimate: 201.

So far so good. But now let’s try leftjoin! instead. Since it modifies its first DataFrame argument, I’ll use the setup argument to reinitialize that argument every sample:

julia> @benchmark leftjoin!(newname, $job, on = :ID) setup=(newname=copy($name))
ERROR: ArgumentError: the following columns are present in both left and right data frames but not listed in `on`: Job. Pass makeunique=true to add a suffix automatically to columns names from the right data frame.
Stacktrace:
  [1] leftjoin!(df1::DataFrame, df2::DataFrame; on::Symbol, makeunique::Bool, source::Nothing, matchmissing::Symbol)
    @ DataFrames C:\Users\niclas\.julia\packages\DataFrames\LteEl\src\join\inplace.jl:118
  [2] var"##core#470"(job#469::DataFrame, newname::DataFrame)
    @ Main C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:489
  [3] var"##sample#471"(::Tuple{DataFrame}, __params::BenchmarkTools.Parameters)
    @ Main C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:497
  [4] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:161
  [5] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:152
  [6] #invokelatest#2
    @ .\essentials.jl:729 [inlined]
  [7] invokelatest
    @ .\essentials.jl:726 [inlined]
  [8] #lineartrial#46
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:35 [inlined]
  [9] lineartrial
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:35 [inlined]
 [10] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:251
 [11] tune! (repeats 2 times)
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:247 [inlined]
 [12] top-level scope
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:394

It seems the newname DataFrame has the :Job column even though it gets reinitialized to copy(name) (which lacks the :Job column) every benchmark sample. What’s going on here? Is this a bug or am I using it wrong?

I get the same error using @benchmark, but interestingly this works:

b = @benchmarkable leftjoin!(newname, $job, on=:ID) setup=(newname = copy($name))
run(b)
That makes me understand even less. :) But useful workaround, thanks!

You need to also add evals=1:

julia> @benchmark leftjoin!(newname, $job, on = :ID) setup=(newname=copy($name)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  4.625 μs …  41.542 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.917 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.083 μs ± 764.174 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▂▁▆
  ▂▄████▆█▄▃▄▃▃▂▂▂▂▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▂▁▂▁▂▁▁▁▁▁▁▂ ▃
  4.62 μs         Histogram: frequency by time        8.04 μs <

 Memory estimate: 7.26 KiB, allocs estimate: 103.

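This is needed because BenchmarkTools may by default run multiple evaluations per sample, while setup runs only once per sample. Without evals=1, the second evaluation in a sample receives the newname that the first evaluation already modified, so it already has the :Job column.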
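
Here is a small sketch of the same effect without DataFrames. It uses pop! on a one-element vector as a stand-in for leftjoin! (my own illustration, not something from the posts above); the names ok and bad are made up, and samples=100 is only there to keep the run short:

```julia
using BenchmarkTools

# pop! mutates its argument, much like leftjoin! does.
# setup builds a fresh one-element vector once per *sample*.
ok = @benchmarkable pop!(v) setup=(v = [1]) evals=1 samples=100
run(ok)   # one pop! per sample, always on a fresh vector

# With two evaluations per sample, the second pop! sees the vector
# that the first evaluation already emptied, so it throws.
bad = @benchmarkable pop!(v) setup=(v = [1]) evals=2 samples=100
try
    run(bad)
catch err
    println("second evaluation failed: ", err)
end
```

As far as I can tell, this is also why the @benchmarkable plus run(b) version works: run(b) on its own uses the default of one evaluation per sample, whereas @benchmark first tunes the number of evaluations unless evals is given explicitly, and the stack trace in the question does go through tune!.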

10000 samples with 1 evaluation

Wow, somehow I never noticed this part of the output. I see I have to go study the docs of BenchmarkTools to learn what the difference between sample and evaluation is. Thanks!