How can I benchmark DataFrames.leftjoin!? (Is this some kind of bug?)

NiclasMattsson · April 18, 2023, 4:49pm

I can’t get the setup argument of @benchmark to work correctly with DataFrames. I’ll use the example in the docs of DataFrames.leftjoin! as an MWE:

julia> using DataFrames, BenchmarkTools

julia> name = DataFrame(ID=[1, 2, 3], Name=["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID=[1, 2, 4], Job=["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
 Row │ ID     Name       Job
     │ Int64  String     String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing

julia> @benchmark leftjoin($name, $job, on = :ID)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.000 μs …   6.566 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     12.100 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.525 μs ± 113.040 μs  ┊ GC (mean ± σ):  13.30% ±  1.71%

        ▃ █ ▇ ▅ ▃
  ▁▁▂▂▆▄█▇█████▇█▅▅█▅▇▄▇▄▆▃▅▃▅▃▅▂▂▃▂▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  10 μs           Histogram: frequency by time         18.8 μs <

 Memory estimate: 13.00 KiB, allocs estimate: 201.

So far so good. But now let’s try leftjoin! instead. Since it modifies its first DataFrame argument, I’ll use the setup argument to reinitialize that argument every sample:

julia> @benchmark leftjoin!(newname, $job, on = :ID) setup=(newname=copy($name))
ERROR: ArgumentError: the following columns are present in both left and right data frames but not listed in `on`: Job. Pass makeunique=true to add a suffix automatically to columns names from the right data frame.
Stacktrace:
  [1] leftjoin!(df1::DataFrame, df2::DataFrame; on::Symbol, makeunique::Bool, source::Nothing, matchmissing::Symbol)
    @ DataFrames C:\Users\niclas\.julia\packages\DataFrames\LteEl\src\join\inplace.jl:118
  [2] var"##core#470"(job#469::DataFrame, newname::DataFrame)
    @ Main C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:489
  [3] var"##sample#471"(::Tuple{DataFrame}, __params::BenchmarkTools.Parameters)
    @ Main C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:497
  [4] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:161
  [5] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:152
  [6] #invokelatest#2
    @ .\essentials.jl:729 [inlined]
  [7] invokelatest
    @ .\essentials.jl:726 [inlined]
  [8] #lineartrial#46
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:35 [inlined]
  [9] lineartrial
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:35 [inlined]
 [10] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ BenchmarkTools C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:251
 [11] tune! (repeats 2 times)
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:247 [inlined]
 [12] top-level scope
    @ C:\Users\niclas\.julia\packages\BenchmarkTools\0owsb\src\execution.jl:394

It seems the newname DataFrame has the :Job column even though it gets reinitialized to copy(name) (which lacks the :Job column) every benchmark sample. What’s going on here? Is this a bug or am I using it wrong?

mihalybaci · April 18, 2023, 5:05pm

I get the same error using @benchmark, but interestingly, this works:

b = @benchmarkable leftjoin!(newname, $job, on=:ID) setup=(newname = copy($name))
run(b)

NiclasMattsson · April 18, 2023, 5:15pm

That makes me understand even less. But useful workaround, thanks!

ericphanson · April 18, 2023, 6:59pm

You need to also add evals=1:

julia> @benchmark leftjoin!(newname, $job, on = :ID) setup=(newname=copy($name)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  4.625 μs …  41.542 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.917 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.083 μs ± 764.174 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▂▁▆
  ▂▄████▆█▄▃▄▃▃▂▂▂▂▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▂▁▂▁▂▁▁▁▁▁▁▂ ▃
  4.62 μs         Histogram: frequency by time        8.04 μs <

 Memory estimate: 7.26 KiB, allocs estimate: 103.

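This is needed because BenchmarkTools may by default run multiple evaluations per sample, while setup runs only once per sample. Any evaluation after the first in a sample therefore sees a newname that already has the :Job column.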
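
To make the sample/evaluation distinction concrete, here is a minimal sketch that counts how often the setup expression and the benchmarked expression each run. The counter names setup_runs and core_runs are made up for illustration, and the exact totals depend on BenchmarkTools internals such as warmup, so no expected numbers are given:

```julia
using BenchmarkTools

# Hypothetical counters, only for illustration.
setup_runs = Ref(0)   # incremented each time the setup expression runs
core_runs  = Ref(0)   # incremented each time the benchmarked expression runs

# Ask for 3 evaluations per sample. Setup should then run once per sample,
# while the benchmarked expression runs several times per sample.
b = @benchmarkable ($core_runs[] += 1) setup=($setup_runs[] += 1) samples=5 evals=3
run(b)

println("setup ran ", setup_runs[], " times")
println("benchmarked expression ran ", core_runs[], " times")
```

The benchmarked expression should end up running more often than setup. With a mutating function like leftjoin!, those extra evaluations operate on an already-modified argument, which is why evals=1 is needed.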

NiclasMattsson · April 18, 2023, 7:16pm

"10000 samples with 1 evaluation"

Wow, somehow I never noticed this part of the output. I see I have to go study the docs of BenchmarkTools to learn what the difference between sample and evaluation is. Thanks!
