[ANN] ManualDispatch.jl

This trivial macro (or a similar one) was suggested by @tim.holy and @tkf. It is not registered.

Provides the macro @unionsplit to dispatch “manually” on types rather than
relying on runtime multiple dispatch.

If type information is hidden from the compiler, then the dispatch system has to select the
correct method at runtime, which can be very slow. In this case it can be much faster to use
conditional statements to check the type. Inside each branch of the conditional, the type
is known and the call can be devirtualized or inlined.
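The idea can be sketched in a few lines (a hypothetical function `f` and loop; names are illustrative, not from the package):

```julia
f(x::Int) = x + 1
f(x::Float64) = x + 0.5

# When `xs` is a Vector{Any}, the compiler cannot resolve `f(y)` statically.
# An explicit `isa` check narrows the type inside each branch, so the call
# there is statically resolved and can be inlined.
function sum_manual(xs)
    s = 0.0
    for y in xs
        if y isa Int
            s += f(y)      # known to be f(::Int)
        elseif y isa Float64
            s += f(y)      # known to be f(::Float64)
        else
            s += f(y)      # fallback: runtime dispatch
        end
    end
    s
end
```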

An implementation note: this macro just constructs a multi-branch conditional, i.e. a series of if/elseif statements. There is probably a more elegant way to construct something like this than the one I chose.

16 Likes

Thanks for sharing the package!

Regarding the syntax, how about something like @unionsplit(func(y::Union{Int, Float64})) so that it can be used for arbitrary argument position?

4 Likes

Yes, that would be much more flexible. It currently only manually dispatches on the first argument.

3 Likes

Thanks for creating this!

This is a usable manual version of the core of my JIT proposal, and it seems we worked on the two in complete parallel. The weekend of manual dispatch! :smiley:

I did not know of previous work in this area (at least not consciously) but I guessed that the idea is not new. Thanks for crediting the original thinkers too!

3 Likes

Thanks @jlapeyre for the idea and package!

Expanding on this idea, I’m not sure why we should restrict such a macro to a single function call. Couldn’t we allow arbitrary expressions?

I would have imagined use cases with more general expressions, with a syntax looking like

@unionsplit y∈(Int, Float64) begin
    s += f(y)
end

I really like Base.Cartesian macros for such metaprogramming tasks (in this particular case: Base.Cartesian.@nif). Here would be a possible implementation of a macro handling the syntax described above:

macro unionsplit(spec, expr)
    # TODO: replace these assertions with proper pattern matching on the expressions
    @assert spec isa Expr
    @assert spec.head == :call
    @assert spec.args[1] in (:in, :∈)
    var = esc(spec.args[2])

    @assert spec.args[3] isa Expr
    @assert spec.args[3].head == :tuple
    types = esc.(spec.args[3].args)
    N = 1 + length(types)

    expr = esc(expr)
    quote
        t = tuple($(types...))
        # the last branch is the fallback, taken when no type matches
        Base.Cartesian.@nif $N  i->($var isa t[i])  i->$expr  i->$expr
    end
end
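For reference, `Base.Cartesian.@nif N cond expr elseexpr` builds the nested conditional directly; the last branch is the `else` and is taken without testing its condition, which is why the macro above passes `i->$expr` twice. A tiny standalone illustration (`pick` is a made-up name):

```julia
using Base.Cartesian

# @nif 3 cond then else expands (roughly) to:
#   cond(1) ? then(1) : (cond(2) ? then(2) : else(3))
pick(x) = @nif 3 i -> (x == i) i -> 10i i -> -1

pick(1)  # 10
pick(2)  # 20
pick(9)  # -1 (fallback branch, condition never tested)
```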

Expansion example:

julia> (@macroexpand function ex_union_split(x)
           s = 0.
           for y in x
               @unionsplit y∈(Int, Float64)  s+=func(y)
           end
           s
       end) |> Base.remove_linenums!
:(function ex_union_split(x)
      s = 0.
      for y = x
          begin
              var"#8#t" = Main.tuple(Int, Float64)
              if y isa var"#8#t"[1]
                  s += func(y)
              else
                  if y isa var"#8#t"[2]
                      s += func(y)
                  else
                      s += func(y)
                  end
              end
          end
      end
      s
  end)
2 Likes

I thought the whole point of union splitting was to do this automatically, as explained in this blog post: https://julialang.org/blog/2018/08/union-splitting/. Why is this faster?

If I understand correctly, this package is useful when inference can’t tell anything about the type, i.e. the type is Any.

In the README example, would there be no advantage to the macro if the vector was typed with Union{Float64,Int} elements?

1 Like

Yes, that’s also my understanding. Also, automatic union splitting by the compiler only kicks in for small unions. With manual union splitting, you can get the same kind of benefits in the case of larger unions, for which the benefit/cost ratio of automatic compiler optimization was deemed too low.
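For illustration, here is a hand-written split over five types (illustrative function `h` and `split5`, not from the thread; the compiler’s automatic splitting limit is an internal inference parameter and has historically been small):

```julia
h(x::Int)     = 1
h(x::Float64) = 2
h(x::Bool)    = 3
h(x::Char)    = 4
h(x::String)  = 5

# An explicit five-way split: each tested branch sees a concrete type,
# so the call there is statically resolved regardless of any automatic
# union-splitting limit.
function split5(y)
    if y isa Int
        h(y)
    elseif y isa Float64
        h(y)
    elseif y isa Bool
        h(y)
    elseif y isa Char
        h(y)
    else
        h(y)   # fallback: runtime dispatch for any other type
    end
end
```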

using BenchmarkTools
using ManualDispatch

func(x::Int) = 1
func(x::Float64) = 1

function ex_union_split(x)
    for y in x
        @unionsplit((Int, Float64), func(y))
    end
end

function ex_runtime_dispatch(x)
    for y in x
        func(y)
    end
end

const x_int = Any[1 for i in 1:1000]
const x_float = Any[1.0 for i in 1:1000]
const x_mixed = Any[iseven(i) ? 1 : 1.0 for i in 1:1000]
const x_typed_mixed = Union{Int, Float64}[iseven(i) ? 1 : 1.0 for i in 1:1000]

print("Manual union split with array of Int")
@btime ex_union_split(x_int)

print("Runtime dispatch with array of Int")
@btime ex_runtime_dispatch(x_int)

print("Manual union split with array of Float64")
@btime ex_union_split(x_float)

print("Runtime dispatch with array of Float64")
@btime ex_runtime_dispatch(x_float)

print("Manual union split with mixed array")
@btime ex_union_split(x_mixed)

print("Runtime dispatch with mixed array")
@btime ex_runtime_dispatch(x_mixed)

print("Manual union split with typed mixed array")
@btime ex_union_split(x_typed_mixed)

print("Runtime dispatch with typed mixed array")
@btime ex_runtime_dispatch(x_typed_mixed)
Manual union split with array of Int  479.421 ns (0 allocations: 0 bytes)
Runtime dispatch with array of Int  7.309 μs (0 allocations: 0 bytes)
Manual union split with array of Float64  598.893 ns (0 allocations: 0 bytes)
Runtime dispatch with array of Float64  8.369 μs (0 allocations: 0 bytes)
Manual union split with mixed array  509.000 ns (0 allocations: 0 bytes)
Runtime dispatch with mixed array  7.990 μs (0 allocations: 0 bytes)
Manual union split with typed mixed array  1.202 ns (0 allocations: 0 bytes)
Runtime dispatch with typed mixed array  243.578 ns (0 allocations: 0 bytes)

It seems things are being constant-propagated away; 1 ns is too quick!

Absolutely! (In fact, I wrote half a sentence suggesting this, but then I thought it might be better to keep the comment short.)

I find that Base.Cartesian, in combination with @generated, is overused. Most (maybe all) use cases can be handled with functional programming. I think that would be much better, because it forces the programmer to come up with more composable and easier-to-debug abstractions.
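For instance, the union split itself can be written without any macro by recursing over a tuple of types and relying on constant propagation to unroll the recursion (a sketch; `manual_dispatch` and `kind` are hypothetical names):

```julia
# Try the types in order; each `isa` branch narrows `x` to a concrete type.
# When `types` is a compile-time constant, the recursion unrolls into the
# same if/elseif chain the macro would generate.
manual_dispatch(f, x, types::Tuple) =
    x isa first(types) ? f(x) : manual_dispatch(f, x, Base.tail(types))
manual_dispatch(f, x, ::Tuple{}) = f(x)  # fallback: runtime dispatch

kind(x::Int) = :int
kind(x::Float64) = :float

manual_dispatch(kind, 1,   (Int, Float64))  # :int
manual_dispatch(kind, 1.0, (Int, Float64))  # :float
```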

There is a similar discussion of “the metaprogramming temptation” by Steven G. Johnson at JuliaCon 2019:

Having said that, of course, sometimes it is better or necessary to use metaprogramming techniques to, e.g., work around compiler limitations. I think @unionsplit is a good example of a legitimate need for a macro.

1 Like

Interestingly, I got very different results with this sample: manual dispatch does not perform better, and on master it sometimes performs even worse than fully dynamic dispatch. (Not counting the dubious 1 ns result.)

Julia Version 1.5.2 (2020-09-23)
Official https://julialang.org/ release

julia> include("demo.jl")
Manual union split with array of Int  1.118 μs (0 allocations: 0 bytes)
Runtime dispatch with array of Int  1.117 μs (0 allocations: 0 bytes)
Manual union split with array of Float64  1.118 μs (0 allocations: 0 bytes)
Runtime dispatch with array of Float64  1.117 μs (0 allocations: 0 bytes)
Manual union split with mixed array  1.118 μs (0 allocations: 0 bytes)
Runtime dispatch with mixed array  1.117 μs (0 allocations: 0 bytes)
Manual union split with typed mixed array  1.123 ns (0 allocations: 0 bytes)
Runtime dispatch with typed mixed array  232.700 ns (0 allocations: 0 bytes)


Julia Version 1.6.0-DEV.1063 (2020-09-26)
Commit 93bbe0833b (1 day old master)

julia> include("demo.jl")
[ Info: Precompiling ManualDispatch [666be00d-f264-4d02-8d64-dd271fd33b9e]
Manual union split with array of Int  673.535 ns (0 allocations: 0 bytes)
Runtime dispatch with array of Int  459.122 ns (0 allocations: 0 bytes)
Manual union split with array of Float64  672.891 ns (0 allocations: 0 bytes)
Runtime dispatch with array of Float64  563.735 ns (0 allocations: 0 bytes)
Manual union split with mixed array  1.117 μs (0 allocations: 0 bytes)
Runtime dispatch with mixed array  895.978 ns (0 allocations: 0 bytes)
Manual union split with typed mixed array  1.123 ns (0 allocations: 0 bytes)
Runtime dispatch with typed mixed array  228.789 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.6.0-DEV.1063
Commit 93bbe0833b (2020-09-26 14:27 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

It seems that for these simple cases, dynamic dispatch is best.

Actually, that was the reason why I used parametric types in my JIT demo. ManualDispatch.jl works like a charm on that:

using BenchmarkTools
using ManualDispatch

const NUM_TYPES = 10
const QUEUE_LENGTH = 100

abstract type A{T} end
struct C1{T} <: A{T} end
struct C2{T} <: A{T} end

const c1_count = Ref(0)
const c2_count = Ref(0)
reset() = (c1_count[] = 0; c2_count[] = 0)

count_subtypes(a::A) = nothing
count_subtypes(c1::C1) = (c1_count[] = c1_count[] + 1; nothing)
count_subtypes(c2::C2) = (c2_count[] = c2_count[] + 1; nothing)

function createq(alpha = 0.7, ql = QUEUE_LENGTH, nt = NUM_TYPES)
    @assert ql >= nt
    return [rand() < alpha ? C1{Val(i % nt)}() : C2{Val(i % nt)}() for i = 1:ql]
end


function ex_union_split1(x)
    for y in x
        @unionsplit((C1, C2), count_subtypes(y))
    end
end

function ex_union_split2(x)
    for y in x
        @unionsplit((C1{Val(0)}, C1{Val(1)}, C2{Val(0)}, C2{Val(1)}), count_subtypes(y))
    end
end

function ex_runtime_dispatch(x)
    for y in x
        count_subtypes(y)
    end
end

print("Manual 'union split' $(typeof(createq())) with (C1, C2)")
@btime ex_union_split1(x) setup=(reset(); x=createq())
print("Manual 'union split' $(typeof(createq())) with (C1{Val(0)}, C1{Val(1)}, C2{Val(0)}, C2{Val(1)})")
@btime ex_union_split2(x) setup=(reset(); x=createq())
print("Runtime dispatch")
@btime ex_runtime_dispatch(x) setup=(reset(); x=createq())
Julia Version 1.6.0-DEV.1063 (2020-09-26)
Commit 93bbe0833b (1 day old master)

julia> include("parametricdemo.jl")
Manual 'union split' Vector{A} with (C1, C2)  87.778 ns (0 allocations: 0 bytes)
Manual 'union split' Vector{A} with (C1{Val(0)}, C1{Val(1)}, C2{Val(0)}, C2{Val(1)})  5.699 μs (0 allocations: 0 bytes)
Runtime dispatch  7.034 μs (0 allocations: 0 bytes)

The example (C1{Val(0)}, C1{Val(1)}, C2{Val(0)}, C2{Val(1)}) also tries to show that what is happening here is potentially more than a union split.

Because abstract types can be extended with new subtypes, while unions cannot.
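A small illustration of that difference (hypothetical types): a container typed with an abstract type accepts subtypes defined later, while a `Union`-typed container is closed.

```julia
abstract type Shape end
struct Circle <: Shape end
struct Square <: Shape end

abstract_v = Shape[Circle(), Square()]
union_v    = Union{Circle, Square}[Circle(), Square()]

# A subtype defined after the containers already exist:
struct Triangle <: Shape end

push!(abstract_v, Triangle())   # fine: Triangle <: Shape
# push!(union_v, Triangle())    # errors: Triangle is outside the closed union
```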

I can verify your results on Julia 1.5 and today’s master. The results I quoted were from master one week ago (Version 1.6.0-DEV.988, fafc0a4ee3). So this is somewhat complicated.

LLVM was upgraded from 9 to 10 in the meantime.