I ended up implementing one of my pipe dreams, and hey, it works!
Catwalk.jl can speed up long-running Julia processes by minimizing the
overhead of dynamic dispatch. It is a JIT compiler that continuously
re-optimizes dispatch code based on data collected at runtime.
It profiles user-specified call sites, estimating the distribution of
dynamically dispatched types during runtime, and generates fast
static routes for the most frequent ones on the fly.
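To make the idea of a "static route" concrete, here is a minimal hand-written sketch. It is not the code Catwalk.jl generates. The function `work` and the choice of `Int` and `Float64` as the frequent types are assumptions made for illustration only. The type checks let the compiler dispatch statically inside each branch, and everything else falls back to ordinary dynamic dispatch.

```julia
# Hypothetical stand-in for a cheap calculation with one specialization per argument type.
work(x) = 2.0 * x

# Baseline: the element type is `Any`, so every `work(x)` call is dispatched at runtime.
function sum_dynamic(xs::Vector{Any})
    acc = 0.0
    for x in xs
        acc += work(x)::Float64
    end
    return acc
end

# Hand-written static routes for the two types assumed to be the most frequent.
function sum_routed(xs::Vector{Any})
    acc = 0.0
    for x in xs
        if x isa Int
            acc += work(x)            # x is known to be Int in this branch
        elseif x isa Float64
            acc += work(x)            # x is known to be Float64 in this branch
        else
            acc += work(x)::Float64   # fallback: ordinary dynamic dispatch
        end
    end
    return acc
end
```

Both functions compute the same sum. Only the way the calls to `work` are resolved differs.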
The statistical profiler has very low overhead and can be configured
to handle situations where the distribution of dispatched types
changes relatively fast.
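As a rough illustration of what estimating a type distribution by sampling can look like, consider the sketch below. It is not Catwalk.jl's profiler. `TypeProfile`, `record!`, `frequencies` and `distance` are hypothetical names, and the sampling period and the distance measure (total variation) are assumptions.

```julia
# Hypothetical sampling profiler: records the type of every `period`-th value only.
mutable struct TypeProfile
    counts::Dict{DataType,Int}
    period::Int
    calls::Int
end
TypeProfile(period::Int) = TypeProfile(Dict{DataType,Int}(), period, 0)

function record!(p::TypeProfile, @nospecialize(x))
    p.calls += 1
    if p.calls % p.period == 0
        T = typeof(x)
        p.counts[T] = get(p.counts, T, 0) + 1
    end
    return p
end

# Relative frequencies of the sampled types.
function frequencies(p::TypeProfile)
    total = sum(values(p.counts))
    return Dict(T => c / total for (T, c) in p.counts)
end

# Total variation distance between two frequency tables:
# 0 means identical distributions, 1 means no types in common.
function distance(a::Dict, b::Dict)
    types = union(keys(a), keys(b))
    return sum(abs(get(a, T, 0.0) - get(b, T, 0.0)) for T in types; init=0.0) / 2
end
```

A larger `period` lowers the profiling overhead, while a smaller one reacts faster when the distribution shifts. An optimizer could compare the `distance` between the current and the previous profile against a threshold before deciding that recompilation is worth considering.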
To minimize compilation overhead, recompilation only occurs when the
distribution has changed enough and the tunable cost model predicts a
significant speedup compared to the best version that was previously compiled.
Dynamic dispatch in Julia is already very fast, so speeding it up is not an easy task.
Catwalk.jl focuses on use cases when it is not feasible to list the dynamically dispatched concrete types in the source code of the call site.
Catwalk.jl assumes the following:
- The process is long running: several seconds, but possibly minutes are needed to break even after the initial compilation overhead.
- Few dynamically dispatched call sites contribute significantly to the running time (dynamic dispatch in a hot loop).
- You can modify the source code around the interesting call sites (add a macro call), and calculation is organized into batches.
Alternatives:
- JuliaFolds packages in general try to do a weaker version of this, as discussed in: Tail-call optimization and function-barrier -based accumulation in loops.
- ManualDispatch.jl can serve you better in less dynamic cases, when it is feasible to list the dynamically dispatched types in the source code.
- In even simpler cases using unions instead of a type hierarchy may allow the Julia compiler to “split the union”. See for example List performance improvement by Union-typed tail in DataStructures.jl.
- FunctionWrappers.jl will give you type stability for a fixed (?) cost. Its use case is different, but if you are wrestling with type instabilities, take a look at it first.
- FunctionWranglers.jl allows fast, inlined execution of functions provided in an array - for that use case it is a better choice than Catwalk.jl.
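The union-splitting alternative mentioned in the list can be sketched in a few lines. The function `total` and both vectors are made up for illustration. With a small `Union` element type the compiler can usually generate a branch per member type instead of a dynamic dispatch, whereas with `Any` it cannot.

```julia
# The same generic loop, called with two differently typed containers.
function total(v)
    acc = 0.0
    for x in v
        acc += x
    end
    return acc
end

# Small union element type: eligible for union splitting.
small = Union{Int,Float64}[1, 2.0, 3]

# `Any` element type: `acc += x` has to be dispatched at runtime.
anyv = Any[1, 2.0, 3]

total(small)
total(anyv)
```

Both calls return the same value. Comparing them with a benchmarking tool on larger vectors is one way to see whether union splitting is enough for your case before reaching for Catwalk.jl.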
Let’s say you have a long-running calculation, organized into batches:
    const NUM_BATCHES = 1000

    function runbatches()
        for batchidx = 1:NUM_BATCHES
            hotloop()
            # Log progress, etc.
        end
    end
The hot loop calls the type-unstable function get_some_x() and passes its result to a relatively cheap calculation, calc_with_x():
    const NUM_ITERS_PER_BATCH = 1_000_000

    function hotloop()
        for i = 1:NUM_ITERS_PER_BATCH
            x = get_some_x(i)
            calc_with_x(x)
        end
    end

    # Elements of different concrete types, stored in an untyped container
    const xs = Any[1, 2.0, ComplexF64(3.0, 3.0)]
    get_some_x(i) = xs[i % length(xs) + 1]

    # The accumulator is a Ref, so it has to be read and written with `[]`
    const result = Ref(ComplexF64(0.0, 0.0))

    function calc_with_x(x)
        result[] += x
    end
Because get_some_x is not type-stable, calc_with_x must be dynamically dispatched, which slows down the calculation.
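If you want to confirm that a function like get_some_x is type-unstable, you can ask the compiler what return type it infers. The sketch below repeats the definitions from the example so that it runs on its own. `Base.return_types` is a reflection helper rather than a stable public API, so treat its use here as a convenience, not a recommendation.

```julia
# Same definitions as in the example above, repeated so the snippet runs on its own.
const xs = Any[1, 2.0, ComplexF64(3.0, 3.0)]
get_some_x(i) = xs[i % length(xs) + 1]

# Inferred return type for an Int argument. A non-concrete result means that
# callers cannot know the type of the returned value at compile time.
inferred = only(Base.return_types(get_some_x, (Int,)))
isconcretetype(inferred)   # false for this type-unstable function

# The runtime types, however, are concrete and vary from call to call.
typeof(get_some_x(0)), typeof(get_some_x(1)), typeof(get_some_x(2))
```

In the REPL, `@code_warntype get_some_x(1)` shows the same information in more detail.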
Sometimes it is not feasible to type-stabilize get_some_x. Catwalk.jl is here for those cases.
You mark hotloop*, the outer function, with the @jit macro and provide the name of the dynamically dispatched function and the argument to operate on (the API will hopefully improve in the future). You also have to add an extra argument named jitctx to the jit-ed function:
    using Catwalk

    @jit calc_with_x x function hotloop_jit(jitctx)
        for i = 1:NUM_ITERS_PER_BATCH
            x = get_some_x(i)
            calc_with_x(x)
        end
    end
The Catwalk optimizer will provide you with the jitctx context, which you have to pass to the jit-ed function manually. Also, every batch needs a bit of housekeeping to drive the Catwalk optimizer:
    function runbatches_jit()
        jit = Catwalk.JIT() ## Also works inside a function (no eval used)
        for batch = 1:NUM_BATCHES
            Catwalk.step!(jit)
            hotloop_jit(Catwalk.ctx(jit))
        end
    end
Yes, it is a bit complicated to integrate your code with Catwalk, but it may be worth the effort:
    result[] = ComplexF64(0, 0)
    @time runbatches_jit()
    # 4.608471 seconds (4.60 M allocations: 218.950 MiB, 0.56% gc time, 21.68% compilation time)
    jit_result = result[]

    result[] = ComplexF64(0, 0)
    @time runbatches()
    # 23.387341 seconds (1000.00 M allocations: 29.802 GiB, 7.71% gc time)
And the results are the same:
    jit_result == result[] || error("JIT must be a no-op!")
Please note that the speedup depends on the portion of the runtime spent in dynamic dispatch, which is most likely smaller in your case than in this contrived example.
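A back-of-the-envelope way to reason about this is an Amdahl-style estimate. This is a generic formula added for illustration and not Catwalk.jl's cost model. The name `overall_speedup` and the example fractions are assumptions.

```julia
# Rough Amdahl-style estimate (an assumption for illustration, not Catwalk's cost model).
# p: fraction of the runtime spent in dynamic dispatch at the optimized call sites (0..1)
# s: speedup achieved on that fraction alone
overall_speedup(p, s) = 1 / ((1 - p) + p / s)

overall_speedup(0.9, 5.0)  # dispatch-dominated workload, like the contrived example
overall_speedup(0.1, 5.0)  # dispatch is only a small part of the work
```

When `p` is small, even a large `s` barely moves the total, so it pays to profile your own workload before integrating.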
* EDIT: clarification: The name hotloop for this function is misleading. There is a hot loop somewhere in the code, but it is possible that the function marked with the @jit macro is only called from the loop body, meaning that the macro is not aware of the loop. It is also possible that more than one jit-ed function is called from the same loop (example).