Dispatch on Value allocating

I want to provide a function that dispatches on a handful of int values (3, 4, 5). That works fine, but when I wrap that function so that the caller doesn’t have to write Val(3) at the call site, there are allocations and things slow down ~6000×. Could someone let me know where I’m going wrong?

function _fastfunc(x, ::Val{N}) where {N}
    x & ~((1 << N) - 1)
end

function fastfunc(x, n)
    _fastfunc(x, Val(n))
end

results = Vector{UInt64}(undef, 1000000);

@btime for i=1:1000000
    $results[i] = _fastfunc(i, Val(5))
end

@btime for i=1:1000000
    $results[i] = fastfunc(i, 5)
end

Calling directly with Val works just fine:

  491.280 μs (0 allocations: 0 bytes)

But calling through the outer function results in approximately 2 allocations per call:

  3.364 s (1998978 allocations: 30.50 MiB)

(on 1.0.2)

You can’t really help this in general, as going from values to types will always involve dynamic dispatch. That said, at the user level, constant folding may help.

julia> using BenchmarkTools

julia> _fastfunc(x, ::Val{N}) where {N} = x & ~((1 << N) - 1)
_fastfunc (generic function with 1 method)

julia> fastfunc(x, n)  = _fastfunc(x, Val(n))
fastfunc (generic function with 1 method)

julia> i = 10

julia> @btime fastfunc($i, 5)
  0.027 ns (0 allocations: 0 bytes)

julia> @btime _fastfunc($i, Val(5))
  0.027 ns (0 allocations: 0 bytes)

julia> VERSION

FYI, using the current stable release of Julia is recommended (v1.1.1).

If you have just a few integer values of interest, e.g. (3, 4, 5), a simpler way:

_fastfunc3(x) = x & ((1 << 3) - 1)
_fastfunc4(x) = x & ((1 << 4) - 1)
_fastfunc5(x) = x & ((1 << 5) - 1)

function fastfunc(x, n)
    if n === 3
        _fastfunc3(x)
    elseif n === 4
        _fastfunc4(x)
    elseif n === 5
        _fastfunc5(x)
    else
        throw(ErrorException("one of (3,4,5) is expected"))
    end
end

Constant folding of the first argument isn’t an option. I only want to evaluate the subexpressions involving the second argument at compile time, like a C++ template with an int parameter N.

Yes there will be a dynamic dispatch from the outer to the inner function. Since the value being passed to the “nice syntax” outer function will always be a compile time constant, my hope was that the compiler might often inline that call, eliding the dispatch.

Even absent inlining though, dynamically following a function pointer shouldn’t allocate, that’s the part that I’m confused about.

This might just be the way to go for now.

Weirdly this runtime check seems faster than the dispatch.

# Call the correct function directly
julia> @btime for i=1:1000000
           $results[i] = _fastfunc5(i)
       end
  473.670 μs (0 allocations: 0 bytes)

# Call the switch function
julia> @btime for i=1:1000000
           $results[i] = fastfunc(i, 5)
       end
  479.488 μs (0 allocations: 0 bytes)

# Call the dispatched function with a compile time constant
julia> @btime for i=1:1000000
           $results[i] = _fastfunc(i, Val(5))
       end
  491.995 μs (0 allocations: 0 bytes)

Could anyone offer some insight as to why the latter version is slower? I would have thought the dispatch would have happened at compile time, so I wouldn’t have expected it to be so much slower.

Is the takeaway here not to use Val dispatch for this kind of compile time optimisation?

EDIT: This post contains an error: the last function has a bitwise not ~ that is missing from the first two, so it is not an apples-to-apples comparison.

In order to win over the dynamic dispatch overhead, your function needs to be large enough, and the specialization needs to give enough of a performance benefit.

That is not the only thing that happens.
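To make that concrete: when n is only known at runtime, Val(n) has to construct a type object at runtime, the call to the inner function goes through the dynamic dispatch machinery, and the result comes back boxed — which is where the allocations come from, not from the call itself. A minimal sketch using the definitions from the original post (the @allocated check assumes a non-constant global n; exact allocation counts vary across Julia versions):

```julia
_fastfunc(x, ::Val{N}) where {N} = x & ~((1 << N) - 1)
fastfunc(x, n) = _fastfunc(x, Val(n))

n = 5                       # non-const global: the compiler cannot fold Val(n)
fastfunc(10, n)             # warm-up call so compilation cost is excluded
@allocated fastfunc(10, n)  # typically nonzero: the dynamic call boxes its result
```

With a literal argument instead (fastfunc(10, 5)), constant propagation can resolve the Val at compile time and the allocations disappear, which matches the benchmark results above.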

Seems like I was abusing multiple dispatch then. I have three more specific questions:

1. Is a dispatch on Val(foo), where foo is a compile time constant, generally not compiled away?
2. If so, is there anything that would prevent this optimisation from being added down the line?
3. What is the Julian way of doing these kinds of micro-optimisations, where I’d usually abuse templates?

Yes, it generally will.

julia> f() = ntuple(x -> x + 5, Val(2))
f (generic function with 2 methods)

julia> @code_llvm f()

;  @ REPL[45]:1 within `f'
define void @julia_f_12684([2 x i64]* noalias nocapture sret) {
  %1 = bitcast [2 x i64]* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* bitcast ([2 x i64]* @0 to i8*), i64 16, i32 8, i1 false)
  ret void
}
julia> @code_native f()
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[49]:1 within `f'
        decl    %eax
        movl    $429409128, %eax        ## imm = 0x19984368
        addl    %eax, (%eax)
        addb    %al, (%eax)
        vmovups (%eax), %xmm0
        vmovups %xmm0, (%edi)
        decl    %eax
        movl    %edi, %eax
        nopw    %cs:(%eax,%eax)
; └

I have just realised that I was careless in my comparison: my _fastfunc5 omitted the bitwise not ~. Adding that back in, there is no difference in performance between this and the Val(5)-dispatched version.

Since the goal for now is performance I’ll prioritise that over my attempt to beautify the API.

Thanks so much everyone, and thanks too for your patience. I’ve certainly learned something!

On 1.1, explicitly annotating fastfunc (original post) with @inline results in identical performance to _fastfunc(i, Val(5)) for me.
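For reference, a sketch of that version — the @inline hint is the only change to the original definitions. It encourages the compiler to inline fastfunc at call sites where n is a literal, so Val(5) folds to a compile-time type and the inner dispatch is elided:

```julia
_fastfunc(x, ::Val{N}) where {N} = x & ~((1 << N) - 1)
@inline fastfunc(x, n) = _fastfunc(x, Val(n))

fastfunc(100, 5)  # with a literal n, the Val dispatch resolves at compile time
```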
