Best approach for runtime dispatching inside a hot loop (heterogeneous tree structure)

This discussion may be of interest.

I hadn’t played around with union dispatching in 0.7 until now. On an eight-day-old master:

julia> using BenchmarkTools, Random

julia> x = randn(20);

julia> u = Vector{Union{Float64, Float32}}(x);

julia> @benchmark exp($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.207 ns (0.00% GC)
  median time:      4.228 ns (0.00% GC)
  mean time:        4.286 ns (0.00% GC)
  maximum time:     19.357 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp($u[1])
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     24.061 ns (0.00% GC)
  median time:      30.399 ns (0.00% GC)
  mean time:        36.888 ns (15.74% GC)
  maximum time:     38.634 μs (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     996


julia> f(x::Float64) = 2x
f (generic function with 1 method)

julia> f(x::Float32) = 2+x
f (generic function with 2 methods)

julia> @benchmark f($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.212 ns (0.00% GC)
  median time:      1.213 ns (0.00% GC)
  mean time:        1.221 ns (0.00% GC)
  maximum time:     14.737 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark f($u[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.077 ns (0.00% GC)
  median time:      4.107 ns (0.00% GC)
  mean time:        4.107 ns (0.00% GC)
  maximum time:     19.216 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Versus 0.6.2:

julia> using BenchmarkTools, Random

julia> x = randn(20);

julia> u = Vector{Union{Float64, Float32}}(x);

julia> @benchmark exp($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.582 ns (0.00% GC)
  median time:      7.753 ns (0.00% GC)
  mean time:        7.932 ns (0.00% GC)
  maximum time:     20.750 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($u[1])
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.851 ns (0.00% GC)
  median time:      26.194 ns (0.00% GC)
  mean time:        27.828 ns (2.21% GC)
  maximum time:     965.944 ns (93.62% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> f(x::Float64) = 2x
f (generic function with 1 method)

julia> f(x::Float32) = 2+x
f (generic function with 2 methods)

julia> @benchmark f($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.442 ns (0.00% GC)
  median time:      1.463 ns (0.00% GC)
  mean time:        1.466 ns (0.00% GC)
  maximum time:     16.591 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark f($u[1])
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.675 ns (0.00% GC)
  median time:      12.719 ns (0.00% GC)
  mean time:        13.708 ns (4.71% GC)
  maximum time:     999.245 ns (95.45% GC)
  --------------
  samples:          10000
  evals/sample:     998

For comparison, the cost of an if statement is close to 1 ns. With explicit if statements, you also don’t have to worry about squashing the type instability that dynamic dispatch can introduce.
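As a minimal sketch of that idea (reusing the two `f` methods from the transcripts above; `f_branch` is a made-up name): branching on the element type with a plain `isa` check means the compiler knows the concrete type inside each branch, so each call to `f` is statically dispatched and nothing is boxed.

```julia
f(x::Float64) = 2x
f(x::Float32) = 2 + x

# Inside each branch, x has a known concrete type, so f is
# statically dispatched rather than looked up at runtime.
function f_branch(x::Union{Float64, Float32})
    if x isa Float64
        f(x)  # x::Float64 here
    else
        f(x)  # x::Float32 here
    end
end

u = Vector{Union{Float64, Float32}}(randn(20))
total = sum(f_branch, u)
```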

If there’s some pattern that lets you use Base.Cartesian.@nif, the control flow can still be fairly concise.
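A sketch of how that might look (again assuming the two `f` methods from above; the `Ts` tuple and the `apply_f` name are made up for illustration): `@nif N cond expr` expands into nested if/elseif branches, checking the condition for d = 1 through N-1 and falling back to the last branch, so each branch again sees a concrete type.

```julia
using Base.Cartesian: @nif

f(x::Float64) = 2x
f(x::Float32) = 2 + x

# Assumed ordering; put the most common type first so its branch is hit early.
const Ts = (Float64, Float32)

# @nif 2 d->(x isa Ts[d]) d->f(x) expands to:
#   x isa Float64 ? f(x) : f(x)
# with x narrowed to a concrete type in each branch.
apply_f(x) = @nif 2 d->(x isa Ts[d]) d->f(x)
```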