[ANN] Virtual.jl: optimizing dynamic calls away through "virtual" multiple dispatch

P.S.: I have recently been planning to implement a zero-latency scripting language in Julia, so I need this feature. The package can be regarded as a generalization of ValSplit.jl, supporting fast runtime polymorphism for multiple arguments.

This package (Virtual.jl) provides two macros, @virtual and @override, with which we can avoid slow dynamic calls and gain orders of magnitude in performance!

abstract type Animal end
struct Cat <: Animal end
struct Dog <: Animal end

@virtual f(x::Animal, y::Int) = y
@override f(x::Cat, y::Int) = 1 + y
@override f(x::Dog, y::Int) = 2 + y

julia> code_typed(f, (Animal, Int))
1-element Vector{Any}:
 CodeInfo(
1 ─ %1  = (x isa Cat)::Bool
└──       goto #4 if not %1
2 ─       goto #4 if not true
3 ─ %4  = Base.add_int(1, y)::Int64
└──       goto #8
4 β”„ %6  = (x isa Dog)::Bool
└──       goto #7 if not %6
5 ─       goto #7 if not true
6 ─ %9  = Base.add_int(2, y)::Int64
└──       goto #8
7 β”„       goto #8
8 β”„ %12 = Ο† (#3 => %4, #6 => %9, #7 => y)::Int64
└──       return %12
) => Int64

The following example comes with a benchmark against a hand-written if-else split and dynamic multiple dispatch, which shows that Virtual.jl performs best:

using Virtual, BenchmarkTools
abstract type Animal end
struct Dog <: Animal end
struct Tiger <: Animal end
struct Duck <: Animal end

@virtual fast_func(x::Animal, y::Int) = error("No default method for score!")
@override fast_func(x::Dog, y::Int) = 2 + y
@override fast_func(x::Tiger, y::Int) = 3 + y
@override fast_func(x::Duck, y::Int) = 4 + y

dyn_func(x::Animal, y::Int) = error("No default method for score!")
dyn_func(x::Dog, y::Int) = 2 + y
dyn_func(x::Tiger, y::Int) = 3 + y
dyn_func(x::Duck, y::Int) = 4 + y

manual_func(x::Animal, y::Int) =
    if x isa Dog
        2 + y
    elseif x isa Tiger
        3 + y
    elseif x isa Duck
        4 + y
    else
        error("No default method for score!")
    end

const samples = Animal[Dog(), Duck(), Tiger()]
animals = Animal[samples[rand(1:3)] for i = 1:100]

function sum_score(score_func, xs::AbstractVector{Animal})
    s = 0
    for x in xs
        s += score_func(x, 3)
    end
    return s
end

@info "fast_func via Virtual.jl"
display(@benchmark(sum_score(fast_func, animals)))
@info "manual split by hand"
display(@benchmark(sum_score(manual_func, animals)))
@info "dyn_func by dynamic multiple dispatch"
display(@benchmark(sum_score(dyn_func, animals)))

Results:

[ Info: fast_func via Virtual.jl
BenchmarkTools.Trial: 10000 samples with 883 evaluations.
 Range (min … max):  118.913 ns … 982.673 ns  β”Š GC (min … max): 0.00% … 84.44%
 Time  (median):     139.751 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   145.934 ns Β±  28.422 ns  β”Š GC (mean Β± Οƒ):  0.16% Β±  1.46%

  β–„ β–„β–ƒ β–‚β–β–ˆβ–…β–ƒβ–ƒβ–‚β–ƒ
  β–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–…β–†β–„β–‡β–…β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  119 ns           Histogram: frequency by time          257 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
[ Info: manual split by hand
BenchmarkTools.Trial: 10000 samples with 818 evaluations.
 Range (min … max):  146.088 ns …  1.342 ΞΌs  β”Š GC (min … max): 0.00% … 87.01%
 Time  (median):     162.958 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   168.780 ns Β± 29.335 ns  β”Š GC (mean Β± Οƒ):  0.14% Β±  1.23%

  β–ˆ ▅▅▁▂       β–„
  β–ˆβ–…β–ˆβ–ˆβ–ˆβ–ˆβ–„β–ˆβ–ƒβ–ˆβ–ƒβ–ˆβ–ƒβ–ˆβ–†β–…β–…β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  146 ns          Histogram: frequency by time          283 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
[ Info: dyn_func by dynamic multiple dispatch
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.043 ΞΌs …  19.000 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     4.543 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.677 ΞΌs Β± 624.903 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–ˆ β–†
  β–‚β–‚β–ƒβ–ƒβ–‡β–ƒβ–ˆβ–†β–ˆβ–ˆβ–†β–†β–ƒβ–…β–‚β–†β–ƒβ–ƒβ–„β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  4.04 ΞΌs         Histogram: frequency by time        7.39 ΞΌs <

 Memory estimate: 256 bytes, allocs estimate: 16.

This package also has limitations; for instance, variadic parameters and keyword parameters are not supported. A complete list is here: Virtual.jl limitations.

Addressing such limitations is as hard as correctly re-implementing Julia’s type lattice; in other words, too hard…

Enjoy! :partying_face: :partying_face: :partying_face:

31 Likes

This may be a bit tangential to the package (and congrats on releasing it!), but it makes me think: why does this not happen by default in regular Julia, i.e. without the need for these macros? If dispatch on abstract containers can be made faster, what are the reasons not to do it? Disclaimer: I am still learning how Julia is built, so there may be something rather fundamental I am missing.

4 Likes

Thank you for putting this package together. I think it could be very useful.

I was wondering why the example no longer works when I change the type annotation from Int to Integer. Do the types need to be concrete?

1 Like

I tried changing all Int to Integer, and it works.

Maybe you are defining overridden methods whose signatures are more general than the default method’s? See the limitations:

@virtual f(x::Number, y::Number) = 0

# wrong: Tuple{Float64, Any} <: Tuple{Number, Number} === false
@override f(x::Float64, y) = x * 3

# correct: Tuple{Float64, Number} <: Tuple{Number, Number} === true
@override f(x::Float64, y::Number) = x * 3
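
The subtype claims in the comments above can be checked directly in plain Julia (no Virtual.jl needed), since tuple types are covariant in their parameters:

```julia
# Tuple types are covariant, so an @override signature must be a subtype
# of the @virtual signature, parameter by parameter.
Tuple{Float64, Any} <: Tuple{Number, Number}     # false: Any is not <: Number
Tuple{Float64, Number} <: Tuple{Number, Number}  # true: Float64 <: Number and Number <: Number
```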
2 Likes

I think one of the reasons for not making Virtual.jl and ValSplit.jl a language builtin is that it would cause too much recompilation: the dispatch function would need to be recompiled whenever a new method is defined.

This leads to a longer precompilation time.

Dynamic calls are considered slow, but in practice they are not slow at all when the work inside the function is heavy. Dynamic calls also delay compilation, triggering it only when necessary.
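
To illustrate that point, here is a small sketch in plain Julia (the `Shape`/`heavy` names are hypothetical, not from Virtual.jl): iterating an abstract container still makes dynamic calls, but the per-call dispatch overhead is dwarfed by the work inside each method.

```julia
abstract type Shape end
struct Circle <: Shape
    r::Float64
end
struct Square <: Shape
    s::Float64
end

# Each method runs a long loop, so the one-time cost of resolving
# the dynamic call is a tiny fraction of the total runtime.
heavy(c::Circle) = sum(sin(c.r * i) for i in 1:10_000)
heavy(s::Square) = sum(cos(s.s * i) for i in 1:10_000)

shapes = Shape[Circle(1.0), Square(2.0)]  # abstract element type: calls are dynamic
total = sum(heavy(x) for x in shapes)     # dominated by the loops, not by dispatch
```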

3 Likes

Interesting. Maybe I am misunderstanding something. I was under the impression that I could change Int to Integer in your example. Here is the code I used on Julia 1.7.3:

using Virtual, BenchmarkTools
abstract type Animal end
struct Dog <: Animal end
struct Tiger <: Animal end
struct Duck <: Animal end

@virtual fast_func(x::Animal, y::Integer) = error("No default method for score!")
@override fast_func(x::Dog, y::Integer) = 2 + y
@override fast_func(x::Tiger, y::Integer) = 3 + y
@override fast_func(x::Duck, y::Integer) = 4 + y

dyn_func(x::Animal, y::Integer) = error("No default method for score!")
dyn_func(x::Dog, y::Integer) = 2 + y
dyn_func(x::Tiger, y::Integer) = 3 + y
dyn_func(x::Duck, y::Integer) = 4 + y

manual_func(x::Animal, y::Integer) =
    if x isa Dog
        2 + y
    elseif x isa Tiger
        3 + y
    elseif x isa Duck
        4 + y
    else
        error("No default method for score!")
    end

const samples = Animal[Dog(), Duck(), Tiger()]
animals = Animal[samples[rand(1:3)] for i = 1:100]

function sum_score(score_func, xs::AbstractVector{Animal})
    s = 0
    for x in xs
        s += score_func(x, 3)
    end
    return s
end

@info "fast_func via Virtual.jl"
display(@benchmark(sum_score(fast_func, animals)))
@info "manual split by hand"
display(@benchmark(sum_score(manual_func, animals)))
@info "dyn_func by dynamic multiple dispatch"
display(@benchmark(sum_score(dyn_func, animals)))

Here are the results:

[ Info: fast_func via Virtual.jl

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.762 ΞΌs …  16.781 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.826 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.960 ΞΌs Β± 521.806 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ˆβ–β–ˆβ–…β–ƒβ–‚                                                     β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–†β–‡β–‡β–‡β–†β–…β–…β–†β–…β–„β–„β–‡β–†β–†β–‡β–‡β–†β–‡β–‡β–†β–†β–†β–„β–„β–†β–…β–…β–…β–β–ƒβ–β–„β–ƒβ–„β–ƒβ–„β–„β–…β–†β–†β–…β–†β–…β–†β–†β–† β–ˆ
  3.76 ΞΌs      Histogram: log(frequency) by time      6.56 ΞΌs <

 Memory estimate: 256 bytes, allocs estimate: 16.
[ Info: manual split by hand
BenchmarkTools.Trial: 10000 samples with 893 evaluations.
 Range (min … max):  126.036 ns …  1.944 ΞΌs  β”Š GC (min … max): 0.00% … 90.37%
 Time  (median):     126.924 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   134.716 ns Β± 32.419 ns  β”Š GC (mean Β± Οƒ):  0.34% Β±  1.57%

  β–ˆ ▅▅▁ ▁▁                                                     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–†β–†β–‡β–†β–†β–†β–†β–†β–†β–†β–†β–†β–…β–†β–†β–…β–…β–…β–†β–…β–…β–…β–…β–„β–†β–…β–… β–ˆ
  126 ns        Histogram: log(frequency) by time       216 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
[ Info: dyn_func by dynamic multiple dispatch
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.008 ΞΌs …   9.808 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.174 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.346 ΞΌs Β± 611.198 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†β–ˆβ–…β–ƒβ–‡β–…β–ƒβ–ƒβ–                            ▁▁ ▁▁                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–…β–†β–…β–‡β–‡β–‡β–…β–†β–…β–…β–…β–…β–†β–‡β–†β–†β–‡β–‡β–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–…β–…β–ƒβ–…β–†β–†β–…β–†β–† β–ˆ
  3.01 ΞΌs      Histogram: log(frequency) by time      5.67 ΞΌs <

 Memory estimate: 256 bytes, allocs estimate: 16.
1 Like

Why is it faster than manual splitting? The output of code_typed looks very similar to the manual version.

Could you say something about how the package was made? I’ve never seen Julia code like that before!

Is the deserialization bit for latency reasons? That technique kind of scares me because I’m not sure how to review the code being loaded.

This is because we always use the signature from @virtual to box values (to ensure the generated dispatch function is statically called), while manual splitting can specialize at the callsite. If you supply the second argument of score_func from an element of a Vector{Number}, all three functions will be equally slow.
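
A minimal sketch of that callsite point in plain Julia (the `manual` function here is hypothetical): a hand-written split only wins when the compiler sees a concrete argument type at the callsite.

```julia
# A hand-written type split, analogous to manual_func above.
manual(x) = x isa Int     ? x + 1 :
            x isa Float64 ? x + 2.0 :
            error("unsupported")

xs_concrete = [1, 2, 3]          # Vector{Int}: the isa tests fold away at compile time
xs_abstract = Number[1, 2.0, 3]  # Vector{Number}: every call checks types at runtime

sum(manual, xs_concrete)  # fully specialized: 2 + 3 + 4
sum(manual, xs_abstract)  # same answers, but each element is boxed and checked
```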

2 Likes

Using Int instead of Integer for @override methods will produce fast code, while keeping @virtual f’s parameters abstract is okay.

Besides, if + were implemented with Virtual.jl, our fast_func could still be fast.

1 Like

Just check reflection.jl; it is the source code.

As for the serialization stuff, I have recently been experimenting with “dev-only dependencies” in Julia, because a library of mine, MLStyle, is runtime-free (usually not used at all after precompilation).

4 Likes

I’d like to expand a little.

Your code using Integer:

@virtual fast_func(x::Animal, y::Integer) = error("No default method for score!")
@override fast_func(x::Dog, y::Integer) = 2 + y
@override fast_func(x::Tiger, y::Integer) = 3 + y
@override fast_func(x::Duck, y::Integer) = 4 + y

This works roughly like

# struct Virtual.GenWrapper
#    value
# end
function fast_func(_x::Virtual.GenWrapper, _y::Virtual.GenWrapper)
   x = _x.value
   y = _y.value
    if x isa Dog && y isa Integer
        2 + y
    elseif x isa Tiger && y isa Integer
        3 + y
    elseif x isa Duck && y isa Integer
        4 + y
    else
        error("No default method for score!")
    end
end

As can be seen (EDIT): @override’s type annotations determine how well the generated isa checks can narrow argument types.

Modifying your code to the following produces fast code as well:

@virtual fast_func(x::Animal, y::Integer) = error("No default method for score!")
@override fast_func(x::Dog, y::Int) = 2 + y
@override fast_func(x::Tiger, y::Int) = 3 + y
@override fast_func(x::Duck, y::Int) = 4 + y

# the generated fast_func is

function fast_func(_x::Virtual.GenWrapper, _y::Virtual.GenWrapper)
   x = _x.value
   y = _y.value
    if x isa Dog && y isa Int
        2 + y
    elseif x isa Tiger && y isa Int
        3 + y
    elseif x isa Duck && y isa Int
        4 + y
    else
        error("No default method for score!")
    end
end

Now perhaps you can see where the performance gap comes from.
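
The gap can also be seen with inference alone, in plain Julia without Virtual.jl (the `f_concrete`/`f_abstract` names below are hypothetical): after `y isa Int` the compiler knows y’s exact type, so `2 + y` infers to a concrete integer result; after `y isa Integer` the type stays abstract and the `+` call cannot be resolved statically.

```julia
# Same branch shape as the generated dispatchers above, differing only
# in whether the isa check narrows to a concrete or an abstract type.
f_concrete(y) = y isa Int     ? 2 + y : 0
f_abstract(y) = y isa Integer ? 2 + y : 0

Base.return_types(f_concrete, (Any,))  # [Int64] on 64-bit: fully inferred
Base.return_types(f_abstract, (Any,))  # widens, since `+` on Integer stays dynamic
```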

2 Likes

Thanks for your explanation. I am still a bit confused about the cases in which Virtual.jl will work. Based on the documentation, I thought the following would work because Integer <: Number. However, it did not seem to work.

@virtual fast_func(x::Animal, y::Number) = error("No default method for score!")
@override fast_func(x::Dog, y::Integer) = 2 + y
@override fast_func(x::Tiger, y::Integer) = 3 + y
@override fast_func(x::Duck, y::Integer) = 4 + y

1 Like

Some more discussion on this happened here: Union splitting vs C++ - #14 by lmiq

1 Like

It works for me, although the generated code for this case is slow:

(EDIT: make the code copy-paste runnable)

using Virtual, BenchmarkTools
abstract type Animal end
struct Dog <: Animal end
struct Tiger <: Animal end
struct Duck <: Animal end


@virtual fast_func(x::Animal, y::Number) = error("No default method for score!")
@override fast_func(x::Dog, y::Integer) = 2 + y
@override fast_func(x::Tiger, y::Integer) = 3 + y
@override fast_func(x::Duck, y::Integer) = 4 + y

fast_func(Dog(), 1)
# 3

fast_func(Tiger(), 2)
# 5

fast_func(Duck(), 3)
# 7

test_results(animals::Vector{Animal}) =
           [fast_func(x, 0) for x in animals]
# test_results (generic function with 1 method)

test_results([Dog(), Tiger(), Duck()])
# 3-element Vector{Int64}:
# 2
# 3
# 4

code_typed result:
julia> code_typed(fast_func, (Any, Any))
1-element Vector{Any}:
 CodeInfo(
1 ── %1  = Virtual.GenWrapper::Type{Virtual.GenWrapper}
β”‚    %2  = %new(%1, x)::Virtual.GenWrapper
β”‚    %3  = Virtual.GenWrapper::Type{Virtual.GenWrapper}
β”‚    %4  = %new(%3, y)::Virtual.GenWrapper
β”‚    %5  = Main.fast_func::typeof(fast_func)
β”‚    %6  = Core.tuple(%5, %2, %4)::Tuple{typeof(fast_func), Virtual.GenWrapper, Virtual.GenWrapper}
β”‚    %7  = (x isa Tiger)::Bool
└───       goto #4 if not %7
2 ── %9  = (y isa Integer)::Bool
└───       goto #4 if not %9
3 ── %11 = Ο€ (x, Tiger)
β”‚    %12 = Ο€ (y, Integer)
β”‚    %13 = invoke var"##368"(%11::Tiger, %12::Integer)::Any
└───       goto #14
4 ┄─ %15 = (x isa Duck)::Bool
└───       goto #7 if not %15
5 ── %17 = (y isa Integer)::Bool
└───       goto #7 if not %17
6 ── %19 = Ο€ (x, Duck)
β”‚    %20 = Ο€ (y, Integer)
β”‚    %21 = invoke var"##369"(%19::Duck, %20::Integer)::Any
└───       goto #14
7 ┄─ %23 = (x isa Dog)::Bool
└───       goto #10 if not %23
8 ── %25 = (y isa Integer)::Bool
└───       goto #10 if not %25
9 ── %27 = Ο€ (x, Dog)
β”‚    %28 = Ο€ (y, Integer)
β”‚    %29 = invoke var"##367"(%27::Dog, %28::Integer)::Any
└───       goto #14
10 β”„ %31 = (x isa Animal)::Bool
└───       goto #13 if not %31
11 ─ %33 = (y isa Number)::Bool
└───       goto #13 if not %33
12 ─       (var"##366")(x, y)::Union{}
└───       unreachable
13 β”„ %37 = invoke Base.print_to_string(%6::Tuple{typeof(fast_func), Virtual.GenWrapper, Virtual.GenWrapper})::String
β”‚    %38 = invoke Base.string("method not found for "::String, %37::String)::String
β”‚          invoke error(%38::String)::Union{}
└───       unreachable
14 β”„ %41 = Ο† (#3 => %13, #6 => %21, #9 => %29)::Any
└───       return %41
) => Any
1 Like