Dispatch on Value allocating

cshenton · July 14, 2019, 10:43am

I want to provide a function that will value dispatch on a handful of int values (3,4,5). That works fine, but when I wrap that function so that the caller doesn’t have to put in Val(3) when calling there are allocations and things slow down ~6000 times. Could someone let me know where I’m going wrong?

function _fastfunc(x, ::Val{N}) where {N}
    x & ~((1 << N) - 1)
end

function fastfunc(x, n)
    _fastfunc(x, Val(n))
end

results = Vector{UInt64}(undef, 1000000);

@btime for i=1:1000000
    $results[i] = _fastfunc(i, Val(5))
end

@btime for i=1:1000000
    $results[i] = fastfunc(i, 5)
end

Calling directly with Val works just fine:

  491.280 μs (0 allocations: 0 bytes)

But calling with an outer function results in approx 2 allocations per call.

  3.364 s (1998978 allocations: 30.50 MiB)

(on 1.0.2)

Tamas_Papp · July 14, 2019, 11:30am

You can’t really help this in general, as going from values to types will always involve dynamic dispatch. That said, at the user level, constant folding may help.

julia> using BenchmarkTools

julia> _fastfunc(x, ::Val{N}) where {N} = x & ~((1 << N) - 1)
_fastfunc (generic function with 1 method)

julia> fastfunc(x, n)  = _fastfunc(x, Val(n))
fastfunc (generic function with 1 method)

julia> i = 10
10

julia> @btime fastfunc($i, 5)
  0.027 ns (0 allocations: 0 bytes)
0

julia> @btime _fastfunc($i, Val(5))
  0.027 ns (0 allocations: 0 bytes)
0

julia> VERSION
v"1.2.0-rc2.0"

JeffreySarnoff · July 14, 2019, 11:58am

fyi using the current stable release of Julia is recommended (v1.1.1)

If you have just a few integer values of interest, e.g. (3, 4, 5), a simpler way:

_fastfunc3(x) = x & ((1 << 3) - 1)
_fastfunc4(x) = x & ((1 << 4) - 1)
_fastfunc5(x) = x & ((1 << 5) - 1)

function fastfunc(x, n)
  if n === 3
     _fastfunc3(x)
  elseif n === 4
     _fastfunc4(x)
  elseif n === 5
     _fastfunc5(x)
  else
     throw(ErrorException("one of (3,4,5) is expected"))
  end
end

cshenton · July 14, 2019, 6:30pm

Constant folding of the first argument isn’t an option. I only want to evaluate the subexpressions involving the second argument at compile time. Like a c++ template with an int parameter on N.

Yes there will be a dynamic dispatch from the outer to the inner function. Since the value being passed to the “nice syntax” outer function will always be a compile time constant, my hope was that the compiler might often inline that call, eliding the dispatch.

Even absent inlining though, dynamically following a function pointer shouldn’t allocate, that’s the part that I’m confused about.

cshenton · July 14, 2019, 6:39pm

This might just be the way to go for now.

Weirdly this runtime check seems faster than the dispatch.

# Call the correct function directly
@btime for i=1:1000000
           $results[i] = _fastfunc5(i)
       end
  473.670 μs (0 allocations: 0 bytes)

# Call the switch function
julia> @btime for i=1:1000000
           $results[i] = fastfunc(i, 5)
       end
  479.488 μs (0 allocations: 0 bytes)

# Call the dispatched function with a compile time constant
julia> @btime for i=1:1000000
           $results[i] = _fastfunc(i, Val(5))
       end
  491.995 μs (0 allocations: 0 bytes)

Could anyone offer some insight as to why the latter version is slower? I would have thought the dispatch would have happen at compile time, so wouldn’t have expected it to be so much slower.

Is the takeaway here not to use Val dispatch for this kind of compile time optimisation?

EDIT: This post contains an error, the last function has a bitwise not ~ missing from the first two, so it is not an apples to apples comparison.

kristoffer.carlsson · July 14, 2019, 6:39pm

In order to win over dynamic dispatch ovherad your function need to be large enough and the specialization give enough performance.

That is not the only thing that happens.

cshenton · July 14, 2019, 6:47pm

Seems like I was abusing multiple dispatch then. I have three more specific questions:

Is a dispatch on a Val(foo) where foo is a compile time constant generally not compiled away?
If so, is there anything that would prevent this optimisation from being added down the line?
What is the Julian way of doing these kind of micro-optimisations where I’d usually abuse templates?

kristoffer.carlsson · July 14, 2019, 6:48pm

Yes, it generally will.

julia> f() = ntuple(x -> x + 5, Val(2))
f (generic function with 2 methods)

julia> @code_llvm f()

;  @ REPL[45]:1 within `f'
define void @julia_f_12684([2 x i64]* noalias nocapture sret) {
top:
  %1 = bitcast [2 x i64]* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* bitcast ([2 x i64]* @0 to i8*), i64 16, i32 8, i1 false)
  ret void
}

julia> @code_native f()
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[49]:1 within `f'
        decl    %eax
        movl    $429409128, %eax        ## imm = 0x19984368
        addl    %eax, (%eax)
        addb    %al, (%eax)
        vmovups (%eax), %xmm0
        vmovups %xmm0, (%edi)
        decl    %eax
        movl    %edi, %eax
        retl
        nopw    %cs:(%eax,%eax)
; └

cshenton · July 14, 2019, 7:00pm

I have just realised that I was careless in my comparison, my _fastfunc5 omitted the bitwise not ~. Adding that back in there is no difference in performance between the this and the Val(5) dispatched version.

Since the goal for now is performance I’ll prioritise that over my attempt to beautify the API.

Thanks so much everyone, and thanks too for your patience. I’ve certainly learned something!

tkoolen · July 14, 2019, 9:58pm

On 1.1, explicitly annotating fastfunc (original post) with @inline results in identical performance to _fastfunc(i, Val(5)) for me.

Topic		Replies	Views
Experiments in speeding up runtime dispatch for integer value specialization Performance	28	1233	November 18, 2022
Branch in dispatch on values General Usage dispatch	2	444	April 12, 2020
Unstable type with dispatch by Val{:X}() with 5 functions Performance	9	851	October 12, 2020
Value type dispatch on constants Internals & Design	3	1420	December 16, 2016
Preventing dispatch on val not bits General Usage	5	589	October 23, 2019

Dispatch on Value allocating

Related topics