Eliminate overhead of "invoke" in benchmark

In my benchmark, I would like to compare the performance of a more specialized method to a less specialized method. Following Ralph Smith’s suggestion, I’ve tried using the invoke function for this.

The problem is that the call to invoke introduces an overhead that dominates, by an order of magnitude, the runtime of the function I actually want to benchmark.

My setup looks roughly like this:

T = Complex128; β = 1; N = 4
α, A, B, C = rand(T), rand(T, N, N), rand(T, N, N), rand(T, N, N)
t_super = (Number, AbstractMatrix, AbstractMatrix, Number, AbstractMatrix)
t_args = (typeof(α), typeof(A), typeof(B), typeof(β), typeof(C))
label = "T=$T, β=$β, N=$N"
SUITE["NDM"]["$label - super"] = @benchmarkable(invoke(commutator!, $t_super, $α, $A, $B, $β, Cb), setup=(Cb = copy($C)))
SUITE["NDM"]["$label - invoke"] = @benchmarkable(invoke(commutator!, $t_args, $α, $A, $B, $β, Cb), setup=(Cb = copy($C)))
SUITE["NDM"]["$label - direct"] = @benchmarkable(commutator!($α, $A, $B, $β, Cb), setup=(Cb = copy($C)))

where commutator! is the function I’d actually like to benchmark and t_super is the signature of the more general method I’d like to compare against.
The result of this benchmark is this:

ID                                                 time             GC time  memory           allocations
["NDM", "T=Complex{Float64}, β=1, N=4 - direct"]   627.216 ns (5%)
["NDM", "T=Complex{Float64}, β=1, N=4 - invoke"]   24.574 μs (5%)            208 bytes (1%)   3
["NDM", "T=Complex{Float64}, β=1, N=4 - super"]    21.289 μs (5%)            3.25 KiB (1%)    17

“direct” is the benchmark for the commutator! function without any overhead, “invoke” is the benchmark for the exact same method called through invoke, and “super” is the benchmark for the more general method, also called through invoke. The “direct” time is below the noise margin of the “invoke”/“super” times, which makes those results completely useless: I’m really only measuring invoke itself.

So, my question is whether there’s any way within BenchmarkTools.jl/PkgBenchmark.jl to eliminate the overhead of invoke, or any workaround. I see three possibilities:

  1. Something within BenchmarkTools akin to setup that just does the right thing (do @jrevels or @kristoffer.carlsson have any ideas?)

  2. Copying the code for the more general commutator! method under a new name, commutator_super!, and calling that in the benchmark. This would obviously work, but it’s very inelegant (code duplication), and I’d like to avoid it if at all possible

  3. Automate possibility (2) via metaprogramming. Something like commutator_super! = @swap_signature(commutator!, t_args, t_super) where the @swap_signature macro would have to be implemented to define a new function that uses the body of the commutator! method for the t_super signature, but with the function signature t_args. This seems pretty difficult. The farthest I’ve gotten with this is to get the AST of a function, e.g.

    function f(x::Int)
        if (x>0)
            return x
        else
            return -x
        end
    end
    
    meth = methods(f, (Int,)).ms[end]  # the Method object for f(::Int)
    ast = ccall(:jl_uncompress_ast, Any, (Any, Any), meth, meth.source).code
    

    This gives me an array of expressions (Expr objects). Is there any way to turn that back into a function? If so, that should actually allow me to implement the @swap_signature macro. (A rough sketch of the expression-splicing idea follows below.)
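
    For illustration, here is a hypothetical sketch of the splicing idea (body, f_args, and f_super are made-up names, and it sidesteps the lowered AST entirely): if the method body is available as a surface-level Expr, @eval can splice it under both signatures.

    # Hypothetical sketch: keep the body as an Expr instead of recovering
    # the lowered AST, then define it under two different signatures.
    body = quote
        if x > 0
            return x
        else
            return -x
        end
    end
    @eval f_args(x::Int) = $body       # specific signature
    @eval f_super(x::Integer) = $body  # general signature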

invoke should be able to be optimized away, e.g.:

julia> foo(x::Int) = x + 1
foo (generic function with 1 method)

julia> foo(x::Integer) = x + 2
foo (generic function with 2 methods)

julia> f_integer(x) = invoke(foo, Tuple{Integer}, x)
f_integer (generic function with 1 method)

julia> @code_llvm foo(2)

define i64 @julia_foo_63003(i64) #0 !dbg !5 {
top:
  %1 = add i64 %0, 1
  ret i64 %1
}

julia> @code_llvm f_integer(2)

define i64 @julia_f_integer_63004(i64) #0 !dbg !5 {
top:
  %1 = add i64 %0, 2
  ret i64 %1
}

Not sure why it is not happening in your case.

Could be a fluke, but maybe because @goerz is passing a tuple instance rather than a tuple type to invoke for the signature?

# using tuple instance (Integer,) instead of Tuple{Integer}
julia> f_integer(x) = invoke(foo, (Integer,), x)
f_integer (generic function with 1 method)

julia> @code_llvm f_integer(2)

; Function f_integer
; Location: REPL[5]:1
define nonnull %jl_value_t addrspace(10)* @julia_f_integer_57937(i64) {
top:
  %1 = alloca [3 x %jl_value_t addrspace(10)*], align 8
  %gcframe1 = alloca [4 x %jl_value_t addrspace(10)*], align 8
  %gcframe1.sub = getelementptr inbounds [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %gcframe1, i64 0, i64 0
  %.sub = getelementptr inbounds [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 0
  %2 = getelementptr inbounds [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %gcframe1, i64 0, i64 1
  %3 = bitcast %jl_value_t addrspace(10)** %2 to i8*
  call void @llvm.memset.p0i8.i32(i8* %3, i8 0, i32 24, i32 8, i1 false)
  %4 = call %jl_value_t*** inttoptr (i64 4342740608 to %jl_value_t*** ()*)() #2
  %5 = bitcast [4 x %jl_value_t addrspace(10)*]* %gcframe1 to i64*
  store i64 4, i64* %5, align 8
  %6 = bitcast %jl_value_t*** %4 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = getelementptr [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %gcframe1, i64 0, i64 1
  %9 = bitcast %jl_value_t addrspace(10)** %8 to i64*
  store i64 %7, i64* %9, align 8
  %10 = bitcast %jl_value_t*** %4 to %jl_value_t addrspace(10)***
  store %jl_value_t addrspace(10)** %gcframe1.sub, %jl_value_t addrspace(10)*** %10, align 8
  %11 = bitcast %jl_value_t*** %4 to i8*
  %12 = call noalias nonnull %jl_value_t addrspace(10)* @jl_gc_pool_alloc(i8* %11, i32 1376, i32 16)
  %13 = bitcast %jl_value_t addrspace(10)* %12 to %jl_value_t addrspace(10)* addrspace(10)*
  %14 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(10)* %13, i64 -1
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4387617216 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %14, align 8
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4386429408 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %13, align 8
  %15 = getelementptr [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %gcframe1, i64 0, i64 3
  store %jl_value_t addrspace(10)* %12, %jl_value_t addrspace(10)** %15, align 8
  %16 = call %jl_value_t addrspace(10)* @jl_box_int64(i64 signext %0)
  %17 = getelementptr [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %gcframe1, i64 0, i64 2
  store %jl_value_t addrspace(10)* %16, %jl_value_t addrspace(10)** %17, align 8
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4515053584 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %.sub, align 8
  %18 = getelementptr [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 1
  store %jl_value_t addrspace(10)* %12, %jl_value_t addrspace(10)** %18, align 8
  %19 = getelementptr [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 2
  store %jl_value_t addrspace(10)* %16, %jl_value_t addrspace(10)** %19, align 8
  %20 = call nonnull %jl_value_t addrspace(10)* @jl_f_invoke(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* null to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %.sub, i32 3)
  %21 = load i64, i64* %9, align 8
  store i64 %21, i64* %6, align 8
  ret %jl_value_t addrspace(10)* %20
}

EDIT: (note that this is just going to trigger an error/depwarn depending on version, but maybe the benchmark is swallowing it for some reason)

Interesting… I really couldn’t tell from the documentation how invoke is supposed to be called. It probably shouldn’t accept both a tuple-type (or whatever you’d call that) and a tuple of types.
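
For reference, the distinction in minimal form, using the foo from above (as noted earlier, the tuple-instance form triggers an error or a depwarn depending on the Julia version):

julia> invoke(foo, Tuple{Integer}, 2)  # tuple type: the supported form
4

julia> invoke(foo, (Integer,), 2)      # tuple of types: error/depwarn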

Sadly, though, even with using the tuple-type, it seems like the commutator! function isn’t being inlined:

julia> α = rand(Complex128)
julia> A = rand(Complex128, 4,4)
julia> B = rand(Complex128, 4,4)
julia> β = one(α)
julia> C = rand(Complex128, 4,4)
julia> commutator_super!(alpha, A, B, beta, C) = invoke(commutator!, Tuple{Number, AbstractMatrix, AbstractMatrix, Number, AbstractMatrix}, alpha, A, B, beta, C)

julia> @code_llvm commutator!(α,A,B,β,C)

define i8** @"julia_commutator!_98986"(%Complex.108* nocapture readonly dereferenceable(16), i8** dereferenceable(40), i8** dereferenceable(40), %Complex.108* nocapture readonly dereferenceable(16), i8** dereferenceable(40)) #0 !dbg !5 {
top:
  %5 = alloca %Complex.108, align 8
  %6 = alloca %Complex.108, align 8
  %7 = call i8** @"julia_gemm!_97939"(i32 78, i32 78, %Complex.108* nocapture nonnull readonly %0, i8** nonnull %1, i8** nonnull %2, %Complex.108* nocapture nonnull readonly %3, i8** nonnull %4)
  %8 = getelementptr inbounds %Complex.108, %Complex.108* %0, i64 0, i32 0
  %9 = load double, double* %8, align 8
  %10 = fsub double -0.000000e+00, %9
  %11 = getelementptr inbounds %Complex.108, %Complex.108* %6, i64 0, i32 0
  store double %10, double* %11, align 8
  %12 = getelementptr inbounds %Complex.108, %Complex.108* %0, i64 0, i32 1
  %13 = load double, double* %12, align 8
  %14 = fsub double -0.000000e+00, %13
  %15 = getelementptr inbounds %Complex.108, %Complex.108* %6, i64 0, i32 1
  store double %14, double* %15, align 8
  %16 = getelementptr inbounds %Complex.108, %Complex.108* %5, i64 0, i32 0
  store double 1.000000e+00, double* %16, align 8
  %17 = getelementptr inbounds %Complex.108, %Complex.108* %5, i64 0, i32 1
  store double 0.000000e+00, double* %17, align 8
  %18 = call i8** @"julia_gemm!_97939"(i32 78, i32 78, %Complex.108* nocapture nonnull readonly %6, i8** nonnull %2, i8** nonnull %1, %Complex.108* nocapture nonnull readonly %5, i8** nonnull %4)
  ret i8** %18
}

julia> @code_llvm commutator_super!(α,A,B,β,C)

top:
  call void @"julia_commutator!_98980"(%Complex.108* nocapture nonnull readonly %0, i8** nonnull %1, i8** nonnull %2, %Complex.108* nocapture nonnull readonly %3, i8** nonnull %4)
  ret void
}

Maybe because commutator! doesn’t have a return value? I’ve tried a few other variations of this, too, like a function ... end definition. Trying to use @inline doesn’t work:

julia> commutator_super!(alpha, A, B, beta, C) = @inline invoke(commutator!, Tuple{Number, AbstractMatrix, AbstractMatrix, Number, AbstractMatrix}, alpha, A, B, beta, C)
ERROR: invoke(commutator!, Tuple{Number, AbstractMatrix, AbstractMatrix, Number, AbstractMatrix}, alpha, A, B, beta, C) is not a function expression

Just to take a bit of a step back here: invoke doesn’t specialize the method differently from just calling it, so its performance isn’t any different (assuming it can still get optimized); only the method target selected by lookup might be different.

That totally makes sense, but somehow the compiler seems to add an inexplicable overhead. Not sure what’s going on. If anyone wants to look deeper into this, the steps to reproduce the benchmark on my actual code are

julia> Pkg.clone("git@github.com:goerz/Commutator.jl.git", "Commutator")
julia> Pkg.checkout("Commutator", "benchmarking")
julia> Pkg.add("PkgBenchmark")
julia> Pkg.checkout("PkgBenchmark")  # needs master branch
julia> result = PkgBenchmark.benchmarkpkg("Commutator")
julia> PkgBenchmark.export_markdown("benchmark.md", result)

This runs the benchmark suite in Commutator/benchmark/benchmarks.jl, which simply includes Commutator/test/runtests.jl (I’m defining my benchmarks alongside the tests).

Sorry for leading you into such a trap.

@inline is like a decorator rather than a local compiler directive: try putting it in the definition of the non-specific method:

@inline function commutator!(generic_args...)

That and jrevels’ Tuple{} construction worked for a simple case I tried.
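
A minimal sketch of that combination, with made-up names (bar and bar_super are not from the package):

@inline bar(x::Integer) = x + 2  # generic method, marked @inline at the definition
bar(x::Int) = x + 1              # specialized method
bar_super(x) = invoke(bar, Tuple{Integer}, x)
# @code_llvm bar_super(2) should now show the x + 2 body inlined directly,
# with no call to jl_f_invoke.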

Even without inlining, the only extra cost should be a function call, for which the overhead you report above seems scandalous. (I saw much less in my simple case.)

Ah, that works syntactically, but the same overhead is still there. And I agree, microseconds is much longer than I would expect for a simple function call. I might just have to sidestep the problem and run the benchmark with bigger matrices, where the time spent on the actual computation dominates (roughly as sketched below). That’ll make the benchmark suite pretty slow to run through, though.
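
For concreteness, a sketch of that workaround, reusing the setup from the top of the thread (the sizes 4, 64, 256 are placeholders, and t_super now uses the Tuple{} signature form):

T = Complex128; β = 1
α = rand(T)
t_super = Tuple{Number, AbstractMatrix, AbstractMatrix, Number, AbstractMatrix}
for N in (4, 64, 256)  # placeholder sizes; at large N, gemm! should dominate
    A, B, C = rand(T, N, N), rand(T, N, N), rand(T, N, N)
    label = "T=$T, β=$β, N=$N"
    SUITE["NDM"]["$label - super"] = @benchmarkable(
        invoke(commutator!, $t_super, $α, $A, $B, $β, Cb),
        setup=(Cb = copy($C)))
end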