Performance of typed keyword arguments

Hello, I have tried the code below and I am really confused about why the keyword version is so slow, even when I provide a type for it.

using BenchmarkTools

f1(x) = exp(x)
f2(x::Number) = exp(x)
f3(x::Float64) = exp(x)

v1(;x=2.5) = exp(x)
v2(;x::Number=2.5) = exp(x)
v3(;x::Float64=2.5) = exp(x)

@btime f1(2.5)
@btime f2(2.5)
@btime f3(2.5)

@btime v1(x = 2.5)
@btime v2(x = 2.5)
@btime v3(x = 2.5)

This gives me the following on version 0.6.2:

  54.172 ns (0 allocations: 0 bytes)
  54.172 ns (0 allocations: 0 bytes)
  54.172 ns (0 allocations: 0 bytes)

  259.656 ns (2 allocations: 112 bytes)
  430.980 ns (2 allocations: 112 bytes)
  175.290 ns (1 allocation: 96 bytes)

and these are the results from the 0.7.0 alpha:

  2.793 ns (0 allocations: 0 bytes)
  2.793 ns (0 allocations: 0 bytes)
  2.793 ns (0 allocations: 0 bytes)

  52.416 ns (0 allocations: 0 bytes)
  52.416 ns (0 allocations: 0 bytes)
  52.416 ns (0 allocations: 0 bytes)

So the new NamedTuple-based keyword arguments perform roughly as fast as the old non-keyword calls did on 0.6; however, the new non-keyword version is still much faster…

Can someone explain this behavior to me? Does this mean I always have to use positional arguments instead of keyword arguments in performance-critical code?

I think the 0.7 benchmarks for the f-functions are so fast because of the compiler's new constant-propagation feature. If you instead benchmark with an interpolated variable:

julia> a = 2.5
2.5

julia> f1(x) = exp(x)
f1 (generic function with 1 method)

julia> @btime f1(2.5);
  1.686 ns (0 allocations: 0 bytes)

julia> @btime f1($a);
  11.654 ns (0 allocations: 0 bytes)

The 11.6 ns is in line with what I get on 0.6, and it is also the same as what I get for keywords on 0.7:

julia> v1(;x=2.5) = exp(x)
v1 (generic function with 1 method)

julia> @btime v1(x=$a);
  11.652 ns (0 allocations: 0 bytes)

So keywords are as fast as positional arguments in 0.7 (at least for this test). However, the constant propagation that makes f1(2.5) even faster does not seem to kick in for keyword calls.
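
As a side note, here is a small sketch (exact timings will differ per machine) of another way to keep a literal from being constant-propagated into the benchmark: read the value out of a Ref, as recommended in the BenchmarkTools documentation.

using BenchmarkTools

f1(x) = exp(x)

# Dereferencing an interpolated Ref hides the value from the compiler, so 2.5
# is not treated as a compile-time constant inside the benchmark kernel.
@btime f1($(Ref(2.5))[])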

4 Likes

Ah, ok, that is a good explanation. It would be nice to know why constant propagation does not work for keyword arguments, though.
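
One way to see where the extra indirection comes from (a sketch; Core.kwfunc is an internal, 1.0-era API and may change) is to look at how a keyword call is lowered: the keywords are packed into a NamedTuple and handed to an automatically generated "keyword sorter" function, so the constant would have to propagate through that extra layer.

v1(; x = 2.5) = exp(x)

@code_lowered v1(x = 2.5)   # body of the generated keyword sorter: it unpacks x
                            # from a NamedTuple before calling the underlying function
Core.kwfunc(v1)             # the keyword-sorter function itself (internal, pre-1.9 API)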

1 Like

Results from 0.7-alpha on Windows 10:

@btime f1(2.5)
@btime f2(2.5)
@btime f3(2.5)

@btime v1(x = 2.5)
@btime v2(x = 2.5)
@btime v3(x = 2.5)

  1.282 ns (0 allocations: 0 bytes)
  1.282 ns (0 allocations: 0 bytes)
  1.282 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)
  9.237 ns (0 allocations: 0 bytes)

It seems that even when we rule out constant propagation (by interpolating a variable), there is still an overhead:

julia> VERSION
v"1.0.1-pre.0"

julia> x = 2.5
2.5

julia> @btime f1($x)
  8.781 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime f2($x)
  8.786 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime f3($x)
  8.781 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v1(x = $x)
  13.808 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v2(x = $x)
  13.951 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime v3(x = $x)
  13.945 ns (0 allocations: 0 bytes)
12.182493960703473

Yeap, there is a little overhead (Julia 1.0):

julia> const b = 2.5
2.5

julia> @code_warntype v3(x=b)
Body::Float64
 1 ─ %1  = (Base.getfield)(#temp#, :x)::Float64                                                                                                                                                                           β”‚β•»     getindex
 β”‚   %2  = (Base.slt_int)(0, 1)::Bool                                                                                                                                                                                     β”‚β”‚β•»β•·β•·β•·  iterate
 └──       goto #3 if not %2                                                                                                                                                                                              │││┃│    iterate
 2 ─       goto #4                                                                                                                                                                                                        ││││┃     iterate
 3 ─       invoke Base.getindex(()::Tuple{}, 1::Int64)                                                                                                                                                                    β”‚β”‚β”‚β”‚β”‚ 
 └──       $(Expr(:unreachable))                                                                                                                                                                                          β”‚β”‚β”‚β”‚β”‚ 
 4 β”„       goto #5                                                                                                                                                                                                        β”‚β”‚β”‚β”‚  
 5 ─       goto #6                                                                                                                                                                                                        β”‚β”‚β•»     iterate
 6 ─       goto #7                                                                                                                                                                                                        β”‚β”‚    
 7 ─       nothing                                                                                                                                                                                                        β”‚     
 β”‚   %11 = invoke Main.exp(%1::Float64)::Float64                                                                                                                                                                          β”‚β•»     #v3#5
 └──       return %11                                                                                                                                                                                                     β”‚     

julia> @code_warntype f3(b)
Body::Float64
1 1 ─ %1 = invoke Main.exp(_2::Float64)::Float64                                                                                                                                                                                          β”‚
  └──      return %1                                                                                                                                                                                                                      β”‚

julia> @code_native v3(x=b)
	.text
; Function #v3 {
; Location: none
; Function #v3#5; {
; Location: none
	pushq	%rax
	vmovsd	(%rdi), %xmm0           # xmm0 = mem[0],zero
	movabsq	$"reinterpret;", %rax
	callq	*%rax
;}
	popq	%rax
	retq
	nopw	%cs:(%rax,%rax)
;}

julia> @code_native f3(b)
	.text
; Function f3 {
; Location: REPL[5]:1
	pushq	%rax
	movabsq	$"reinterpret;", %rax
	callq	*%rax
	popq	%rax
	retq
	nop
;}
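
A related check (my own sketch, using a hypothetical wrapper and the v3 from above): if the keyword call sits inside another function, the keyword-sorting step can be inlined, and the wrapper's native code should end up looking essentially like the positional version.

wrapper(b) = v3(x = b)   # hypothetical wrapper; gives the compiler a chance to inline the kwsorter
@code_native wrapper(2.5)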

I am refactoring an API, so I thought I would revisit this issue, as keyword arguments would be very convenient.

The following compares positional arguments, keyword arguments, and NamedTuples.

using BenchmarkTools
pos(x) = exp(x)
kw(; x) = exp(x)
nt(y) = exp(y.x)
g_pos(x) = pos(x)
g_kw(x) = kw(x = x)
g_nt(x) = nt((x = x, ))
x = 2.5
@btime g_pos($x)
@btime g_kw($x)
@btime g_nt($x)

with the output

julia> @btime g_pos($x)
  0.026 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime g_kw($x)
  8.892 ns (0 allocations: 0 bytes)
12.182493960703473

julia> @btime g_nt($x)
  0.026 ns (0 allocations: 0 bytes)
12.182493960703473

julia> VERSION
v"1.2.0-DEV.17"

I am inclined to believe that something weird is going on with the benchmarking of the positional and NamedTuple arguments, since that sub-nanosecond timing is too good to believe. Any suggestions on how to do it better?
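
One option (a sketch, assuming the definitions above; the Ref indirection is the usual BenchmarkTools trick against constant folding) is to pass the value through a Ref, so the compiler cannot treat the interpolated value as a constant and optimize the whole call away:

xr = Ref(x)
@btime g_pos($xr[])
@btime g_kw($xr[])
@btime g_nt($xr[])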

2 Likes

The following avoids the spurious benchmark and shows that all 3 versions are pretty much identical:

using BenchmarkTools

@inline op(A) = A * A

pos(A) = op(A)
kw(; A) = op(A)
nt(y) = op(y.A)
wrap_pos(A) = pos(A)
wrap_kw(A) = kw(A = A)
wrap_nt(A) = nt((A = A, ))

A = randn(5, 5)
@benchmark wrap_pos($A)
@benchmark wrap_kw($A)
@benchmark wrap_nt($A)
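
To make the comparison concrete, one can also let BenchmarkTools judge the difference between two trials (a sketch, assuming the definitions above; :invariant means the difference is within noise):

b_pos = @benchmark wrap_pos($A)
b_kw  = @benchmark wrap_kw($A)
b_nt  = @benchmark wrap_nt($A)

# Compare median estimates; a TrialJudgement of :invariant means no
# significant difference between the two versions.
judge(median(b_kw), median(b_pos))
judge(median(b_nt), median(b_pos))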

1 Like

I usually do something like this in the hope of getting more accurate benchmarks (this is for your first example, without wrapping):

julia> @btime for n=1:1000; g_pos($x); end
  7.328 ΞΌs (0 allocations: 0 bytes)

julia> @btime for n=1:1000; g_kw($x); end
  7.112 ΞΌs (0 allocations: 0 bytes)

julia> @btime for n=1:1000; g_nt($x); end
  7.108 ΞΌs (0 allocations: 0 bytes)
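
Another knob (a sketch, assuming the g_* functions and x from above; the numbers are arbitrary) is to control the measurement directly through BenchmarkTools' evals and samples parameters instead of writing the loop by hand:

# Run each sample 1000 times and collect 100 samples, similar in spirit to
# the manual loop above.
@btime g_pos($x) evals=1000 samples=100
@btime g_kw($x) evals=1000 samples=100
@btime g_nt($x) evals=1000 samples=100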

2 Likes

Maybe the clobber/escape tools suggested in [RFC/WIP] Tools for measuring cycles and cpu_times and tricking out LLVM by vchuravy Β· Pull Request #92 Β· JuliaCI/BenchmarkTools.jl Β· GitHub would help for this kind of benchmark?

1 Like

I don’t know enough about LLVM for this. I imagine one can prevent the compiler from doing something insanely clever by including a more expensive inner calculation, like I did above.

But perhaps a warning could be helpful. I opened an issue

https://github.com/JuliaCI/BenchmarkTools.jl/issues/130

1 Like