Performance of typed keyword arguments

Hello, I have tried this code and I am really confused why the keyword version is so slow even when I provide a type for it.

using BenchmarkTools

f1(x) = exp(x)
f2(x::Number) = exp(x)
f3(x::Float64) = exp(x)

v1(;x=2.5) = exp(x)
v2(;x::Number=2.5) = exp(x)
v3(;x::Float64=2.5) = exp(x)

@btime f1(2.5)
@btime f2(2.5)
@btime f3(2.5)

@btime v1(x = 2.5)
@btime v2(x = 2.5)
@btime v3(x = 2.5)

This gives me the following on version 0.6.2:

  54.172 ns (0 allocations: 0 bytes)
  54.172 ns (0 allocations: 0 bytes)
  54.172 ns (0 allocations: 0 bytes)

  259.656 ns (2 allocations: 112 bytes)
  430.980 ns (2 allocations: 112 bytes)
  175.290 ns (1 allocation: 96 bytes)

and these are the results from the 0.7.0 alpha:

  2.793 ns (0 allocations: 0 bytes)
  2.793 ns (0 allocations: 0 bytes)
  2.793 ns (0 allocations: 0 bytes)

  52.416 ns (0 allocations: 0 bytes)
  52.416 ns (0 allocations: 0 bytes)
  52.416 ns (0 allocations: 0 bytes)

So the new NamedTuple-based keywords perform approximately as fast as the old normal keywords; however, the non-keyword version is still much faster…

Can someone explain this behavior to me? Does this mean I always have to use positional arguments instead of keywords in performance-critical code?

I think the 0.7 benchmarks for the f-functions are so fast because of the compiler's new constant-propagation feature. If you instead do:

julia> a = 2.5
2.5

julia> f1(x) = exp(x)
f1 (generic function with 1 method)

julia> @btime f1(2.5);
  1.686 ns (0 allocations: 0 bytes)

julia> @btime f1($a);
  11.654 ns (0 allocations: 0 bytes)

The 11.6 ns is in line with what I get on 0.6, and also the same as what I get for keywords on 0.7:

julia> v1(;x=2.5) = exp(x)
v1 (generic function with 1 method)

julia> @btime v1(x=$a);
  11.652 ns (0 allocations: 0 bytes)

So, keyword arguments are as fast as positional arguments on 0.7 (at least for this test). However, the constant propagation that makes f1(2.5) even faster does not seem to apply to keywords.

Ah, OK, that is a good explanation. It would be nice to know why the propagation does not work for keyword arguments, though.
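For what it's worth, my (hedged) mental model of where the keyword overhead comes from: on 0.7/1.0 a keyword call is lowered into a call of a hidden sorter function that receives the keywords packed into a NamedTuple, so there is one extra function layer between the call site and the body. A small sketch:

```julia
v1(; x = 2.5) = exp(x)

# A call like v1(x = 2.5) conceptually becomes
#     hidden_kwsorter((x = 2.5,), v1)
# i.e. the keywords travel in a NamedTuple through an extra layer.
# That layer is a plausible place for constant propagation to stop,
# even though the call itself compiles to efficient code.
nt = (x = 2.5,)                  # the keywords, as a NamedTuple
@assert v1(; nt...) == exp(2.5)  # splatting it reproduces the keyword call
```

That NamedTuple is also what you can see being iterated over in the `@code_warntype` output further down.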

Results from 0.7-alpha on Windows 10:

@btime f1(2.5)
@btime f2(2.5)
@btime f3(2.5)

@btime v1(x = 2.5)
@btime v2(x = 2.5)
@btime v3(x = 2.5)

  1.282 ns (0 allocations: 0 bytes)
  1.282 ns (0 allocations: 0 bytes)
  1.282 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)

It seems that even when we sidestep constant propagation (by interpolating a variable), there is still an overhead:

julia> VERSION
v"1.0.1-pre.0"

julia> x = 2.5
2.5

julia> @btime f1($x)
  8.781 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime f2($x)
  8.786 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime f3($x)
  8.781 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v1(x = $x)
  13.808 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v2(x = $x)
  13.951 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v3(x = $x)
  13.945 ns (0 allocations: 0 bytes)
12.182493960703473

Yep, there is a little overhead (Julia 1.0):

julia> const b = 2.5
2.5

julia> @code_warntype v3(x=b)
Body::Float64
 1 ─ %1  = (Base.getfield)(#temp#, :x)::Float64   β”‚β•» getindex
 β”‚   %2  = (Base.slt_int)(0, 1)::Bool             β”‚β”‚β•»β•·β•·β•· iterate
 └──       goto #3 if not %2                      │││┃│ iterate
 2 ─       goto #4                                ││││┃ iterate
 3 ─       invoke Base.getindex(()::Tuple{}, 1::Int64)
 └──       $(Expr(:unreachable))
 4 β”„       goto #5
 5 ─       goto #6                                β”‚β”‚β•» iterate
 6 ─       goto #7
 7 ─       nothing
 β”‚   %11 = invoke Main.exp(%1::Float64)::Float64  β”‚β•» #v3#5
 └──       return %11

julia> @code_warntype f3(b)
Body::Float64
1 1 ─ %1 = invoke Main.exp(_2::Float64)::Float64  β”‚
  └──      return %1                              β”‚

julia> @code_native v3(x=b)
	.text
; Function #v3 {
; Location: none
; Function #v3#5; {
; Location: none
	pushq	%rax
	vmovsd	(%rdi), %xmm0           # xmm0 = mem[0],zero
	movabsq	$"reinterpret;", %rax
	callq	*%rax
;}
	popq	%rax
	retq
	nopw	%cs:(%rax,%rax)
;}

julia> @code_native f3(b)
	.text
; Function f3 {
; Location: REPL[5]:1
	pushq	%rax
	movabsq	$"reinterpret;", %rax
	callq	*%rax
	popq	%rax
	retq
	nop
;}

I am refactoring an API, so I thought I would revisit this issue, since keyword arguments would be very convenient.

The following compares positional arguments, keyword arguments, and NamedTuples.

using BenchmarkTools
pos(x) = exp(x)
kw(; x) = exp(x)
nt(y) = exp(y.x)
g_pos(x) = pos(x)
g_kw(x) = kw(x = x)
g_nt(x) = nt((x = x, ))
x = 2.5
@btime g_pos($x)
@btime g_kw($x)
@btime g_nt($x)

with the output

julia> @btime g_pos($x)
  0.026 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime g_kw($x)
  8.892 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime g_nt($x)
  0.026 ns (0 allocations: 0 bytes)
12.182493960703473

julia> VERSION
v"1.2.0-DEV.17"

I am inclined to believe that something weird is going on with the benchmarking of the positional and NamedTuple versions, since a sub-nanosecond timing is too good to be true. Any suggestions on how to do this better?
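One trick I believe helps here (the usual BenchmarkTools idiom for defeating constant folding, sketched for your g_pos) is to interpolate a Ref and dereference it inside the benchmarked expression, so the value stays opaque to the compiler:

```julia
using BenchmarkTools

pos(x) = exp(x)
g_pos(x) = pos(x)

xr = Ref(2.5)
# $xr interpolates the Ref itself; the load xr[] happens inside the
# benchmark, so the compiler cannot fold the whole call away to a
# precomputed constant (the source of the 0.026 ns readings).
@btime g_pos($xr[])
```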

The following avoids the spurious benchmark and shows that all 3 versions are pretty much identical:

using BenchmarkTools

@inline op(A) = A * A

pos(A) = op(A)
kw(; A) = op(A)
nt(y) = op(y.A)
wrap_pos(A) = pos(A)
wrap_kw(A) = kw(A = A)
wrap_nt(A) = nt((A = A, ))

A = randn(5, 5)
@benchmark wrap_pos($A)
@benchmark wrap_kw($A)
@benchmark wrap_nt($A)

I usually do something like this in the hope of getting more accurate benchmarks (this is for your first example, without the wrapping):

julia> @btime for n=1:1000; g_pos($x); end
  7.328 ΞΌs (0 allocations: 0 bytes)

julia> @btime for n=1:1000; g_kw($x); end
  7.112 ΞΌs (0 allocations: 0 bytes)

julia> @btime for n=1:1000; g_nt($x); end
  7.108 ΞΌs (0 allocations: 0 bytes)
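Another hedge I sometimes use (a sketch, not guaranteed to fool every optimizer): accumulate the results inside the loop and return the sum, so a pure call cannot be deleted as dead code:

```julia
using BenchmarkTools

pos(x) = exp(x)
g_pos(x) = pos(x)

# Summing the results forces each call to actually execute;
# a loop whose body has no observable effect can otherwise be removed.
function run_many(f, x, n)
    s = 0.0
    for _ in 1:n
        s += f(x)
    end
    return s
end

x = 2.5
@btime run_many(g_pos, $x, 1000)
```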

Maybe the clobber/escape tools suggested in [RFC/WIP] Tools for measuring cycles and cpu_times and tricking out LLVM by vchuravy Β· Pull Request #92 Β· JuliaCI/BenchmarkTools.jl Β· GitHub would help for this kind of benchmark?

I don’t know enough about LLVM for this. I imagine one can prevent the compiler from doing anything insanely clever by including a more expensive inner computation, as I did above.

But perhaps a warning could be helpful. I opened an issue

https://github.com/JuliaCI/BenchmarkTools.jl/issues/130