I ended up implementing one of my pipe dreams, and hey, it works!
Catwalk.jl Intro
Catwalk.jl can speed up long-running Julia processes by minimizing the
overhead of dynamic dispatch. It is a JIT compiler that continuously
re-optimizes dispatch code based on data collected at runtime.
It profiles user-specified call sites, estimating the distribution of
dynamically dispatched types during runtime, and generates fast
static routes for the most frequent ones on the fly.
The statistical profiler has very low overhead and can be configured
to handle situations where the distribution of dispatched types
changes relatively fast.
To minimize compilation overhead, recompilation only occurs when the
distribution changed enough and the tunable cost model predicts
significant speedup compared to the best version that was previously
compiled.
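To make the idea concrete: the "fast static routes" described above behave roughly like a hand-written type check in front of the call site. Here is a minimal sketch of that underlying technique, not Catwalk's actual generated code; the function names are illustrative:

```julia
# Hand-written "dispatch splitting": route the most frequent concrete
# types through statically dispatched branches, and fall back to
# dynamic dispatch for everything else. Catwalk generates (and
# re-generates) code of this shape automatically, driven by the type
# distribution its profiler observes at runtime.
process(x) = x + 1  # some operation with methods for several types

function routed_process(x)
    if x isa Int            # frequent type #1: static dispatch
        return process(x::Int)
    elseif x isa Float64    # frequent type #2: static dispatch
        return process(x::Float64)
    else                    # rare types: dynamic dispatch fallback
        return process(x)
    end
end
```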
When to use this package
The dynamic dispatch in Julia is very fast in itself, so speeding it up is not an easy task.
Catwalk.jl focuses on use cases when it is not feasible to list the dynamically dispatched concrete types in the source code of the call site.
Catwalk.jl assumes the following:
- The process is long running: several seconds, but possibly minutes are needed to break even after the initial compilation overhead.
- Few dynamically dispatched call sites contribute significantly to the running time (dynamic dispatch in a hot loop).
- You can modify the source code around the interesting call sites (add a macro call), and calculation is organized into batches.
Alternative packages
- JuliaFolds packages in general try to do a weaker version of this, as discussed in: Tail-call optimization and function-barrier -based accumulation in loops.
- ManualDispatch.jl can serve you better in less dynamic cases, when it is feasible to list the dynamically dispatched types in the source code.
- In even simpler cases, using unions instead of a type hierarchy may allow the Julia compiler to "split the union". See for example the list performance improvement by Union-typed tail in DataStructures.jl.
- FunctionWrappers.jl will give you type stability for a fixed (?) cost. Its use case is different, but if you are wrestling with type instabilities, take a look at it first.
- FunctionWranglers.jl allows fast, inlined execution of functions provided in an array - for that use case it is a better choice than Catwalk.jl.
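To illustrate the union-splitting alternative mentioned above, here is a minimal sketch (the variable names are made up for this example): when a container is typed with a small Union instead of Any, the compiler can branch on the concrete types itself, with no extra package needed:

```julia
# Union-typed container: the compiler knows each element is one of two
# concrete types, so it can "split the union" into static branches.
vals_union = Union{Int, Float64}[1, 2.0, 3]

# Same data in an Any-typed container: every element access leads to
# dynamic dispatch at the call site.
vals_any = Any[1, 2.0, 3]

double_sum(v) = sum(x -> x * 2, v)
```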
Usage
Let's say you have a long-running calculation, organized into batches:
const NUM_BATCHES = 1000

function runbatches()
    for batchidx = 1:NUM_BATCHES
        hotloop()
        # Log progress, etc.
    end
end
The hot loop calls the type-unstable function get_some_x() and passes its result to a relatively cheap calculation, calc_with_x().
const NUM_ITERS_PER_BATCH = 1_000_000

function hotloop()
    for i = 1:NUM_ITERS_PER_BATCH
        x = get_some_x(i)
        calc_with_x(x)
    end
end
const xs = Any[1, 2.0, ComplexF64(3.0, 3.0)]
get_some_x(i) = xs[i % length(xs) + 1]

const result = Ref(ComplexF64(0.0, 0.0))
function calc_with_x(x)
    result[] += x
end
As get_some_x is not type-stable, calc_with_x must be dynamically dispatched, which slows down the calculation. Sometimes it is not feasible to type-stabilize get_some_x. Catwalk.jl is here for those cases.
You mark hotloop*, the outer function, with the @jit macro and provide the name of the dynamically dispatched function and the argument to operate on (the API will hopefully improve in the future). You also have to add an extra argument named jitctx to the jit-ed function:
using Catwalk

@jit calc_with_x x function hotloop_jit(jitctx)
    for i = 1:NUM_ITERS_PER_BATCH
        x = get_some_x(i)
        calc_with_x(x)
    end
end
The Catwalk optimizer will provide you the jitctx context, which you have to pass to the jit-ed function manually. Also, every batch needs a bit of housekeeping to drive the Catwalk optimizer:
function runbatches_jit()
    jit = Catwalk.JIT() ## Also works inside a function (no eval used)
    for batch = 1:NUM_BATCHES
        Catwalk.step!(jit)
        hotloop_jit(Catwalk.ctx(jit))
    end
end
Yes, it is a bit complicated to integrate your code with Catwalk, but it may be worth the effort:
result[] = ComplexF64(0, 0)
@time runbatches_jit()
# 4.608471 seconds (4.60 M allocations: 218.950 MiB, 0.56% gc time, 21.68% compilation time)
jit_result = result[]
result[] = ComplexF64(0, 0)
@time runbatches()
# 23.387341 seconds (1000.00 M allocations: 29.802 GiB, 7.71% gc time)
And the results are the same:
jit_result == result[] || error("JIT must be a no-op!")
Please note that the speedup depends on the portion of the runtime spent in dynamic dispatch, which is most likely smaller in your case than in this contrived example.
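As a rule of thumb: if a fraction p of total runtime is dynamic-dispatch overhead and nearly all of it is eliminated, the best achievable speedup is roughly 1/(1 - p), by Amdahl's law. A quick back-of-the-envelope check (the numbers below are illustrative, not measured):

```julia
# Amdahl-style estimate of the achievable speedup when a fraction `p`
# of the runtime is dispatch overhead that gets (almost) eliminated.
dispatch_speedup(p) = 1 / (1 - p)

dispatch_speedup(0.8)   # ~5x when 80% of the time is dispatch overhead
dispatch_speedup(0.1)   # ~1.1x when only 10% is: hardly worth it
```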
Source of this demo: usage.jl
What's inside: How it works? · Catwalk.jl
Fully tunable: Configuration & tuning · Catwalk.jl
* EDIT: clarification: The name hotloop for this function is misleading. There is a hot loop somewhere in the code, but it is possible that the function marked with the @jit macro is only called from the loop body, meaning that the macro is not aware of the loop. Possibly more than one jit-ed function is called from the same loop (example).