I’m working on reducing TTFX for my project using SnoopCompile.jl.
I have eliminated all relevant invalidations according to precompile_blockers.
@snoopi_deep reports the following for my workload:
julia> tinf
InferenceTimingNode: 0.899820/0.965381 on Core.Compiler.Timings.ROOT() with 22 direct children
The resulting inference flamegraph is almost entirely blank space, which I understand is attributable to code generation.
Despite this progress, there is still an unexplained 10-second pause in my workload.
I’m doing something like the following:
# Redefine a very small module which imports my package
@time include("small_module.jl")
@info "done with module loading"
@time SmallModule.run_workload()
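For reference, small_module.jl is roughly the following (a minimal sketch; everything except SmallModule, run_workload, and the using MyPackage line is a placeholder):

# small_module.jl (sketch)
module SmallModule

using MyPackage  # the package whose TTFX I'm trying to reduce

function run_workload()
    # placeholder: exercise MyPackage's main entry point on a small input
    MyPackage.main("input")
end

end # module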
This results in the following output:
4.013175 seconds (21.86 M allocations: 1.273 GiB, 5.50% gc time, 7.90% compilation time: 3% of which was recompilation)
[ Info: done with module loading
<<<<< Ten second pause >>>>>
0.919104 seconds (359.83 k allocations: 254.819 MiB, 6.49% gc time, 23.76% compilation time)
The four seconds are due to using MyPackage inside SmallModule.
The 0.9 seconds roughly tracks my estimate of how long the workload really takes, based on logging.
Questions
Is the 10-second pause already accounted for in the blank space in @snoopi_deep’s flamegraph, but excluded from @time?
If not, could it still be attributable to codegen?
How can I go about debugging what Julia is doing during this ten second pause?
SnoopCompile script for reference
using SnoopCompileCore
# Collect invalidation data while (re)defining the small module
@time invalidations = @snoopr begin
    include("small_module.jl")
end
using SnoopCompile
# Collect inference-timing data for the workload
@time tinf = @snoopi_deep begin
    SmallModule.run_workload()
end
trees = invalidation_trees(invalidations)
blockers = precompile_blockers(trees, tinf)
It’s a bit hard to guess exactly what’s going on here since most of the code is missing. Here are some relevant questions:
are you using Julia 1.9? Precompilation is “fixed” on 1.9 but incomplete on previous Julia releases.
is nearly all your “real” code in one or more packages that you load with using MyPackage? (I think you said as much, but it’s worth being sure.)
do all your packages have a precompilation workload? For most packages you want to use SnoopPrecompile, but if your code is really well-inferred then it’s not necessary (see the sketch after this list).
in your script, what happens if you run the @snoopi_deep step before loading SnoopCompile? SnoopCompile can (and does) invalidate code and force recompilation; that’s why SnoopCompileCore exists: it lets you collect the data before you load the “dangerous” code that performs the analysis. You should run both @snoopr and @snoopi_deep before loading SnoopCompile (a reordered sketch follows below). Just be sure to suppress printing of tinf until SnoopCompile is loaded, since without its show methods the output can be very long. (SnoopCompile defines the relevant show methods; they are not in SnoopCompileCore because defining new methods for Base functions is the best way to invalidate poorly-inferred code.)
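On the precompilation-workload point, the pattern looks roughly like this (a sketch; the setup Dict and the process call are placeholders for whatever your real workload exercises):

module MyPackage

using SnoopPrecompile

# ... the rest of the package ...

@precompile_setup begin
    # placeholder setup data, constructed outside the precompiled calls
    d = Dict{String, Any}("a" => 1)
    @precompile_all_calls begin
        # placeholder call: run the code paths your real workload will hit
        process(d)
    end
end

end # module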
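And here is the reordered version of your script for the last point (a sketch, assuming your original file and function names):

using SnoopCompileCore  # provides @snoopr and @snoopi_deep without the analysis code

invalidations = @snoopr begin
    include("small_module.jl")
end

tinf = @snoopi_deep begin
    SmallModule.run_workload()
end;  # semicolon avoids printing tinf at the REPL before the compact show method exists

using SnoopCompile  # safe to load now that the data have been collected

trees = invalidation_trees(invalidations)
blockers = precompile_blockers(trees, tinf)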
julia> tinf
InferenceTimingNode: 0.965624/1.002108 on Core.Compiler.Timings.ROOT() with 21 direct children
All of those inference roots are for “setup” code, stuff like Base.merge!(Dict{String, Any}, Dict{String, Any}), not the main part of my workload. On top of this, the outputs of staleinstances and precompile_blockers are both empty.
Looking at the invalidation trees again, there are some things that I should probably be concerned about. However, I can’t see how these show up in the inference flamegraph. My mental model for how invalidations affect compilation time is:
Invalidation -> Inference -> Code generation
As I understand it, any invalidation which results in significant compilation time should show up in the @snoopi_deep flamegraph. Is that mistaken?
In the meantime I will keep plugging away on those invalidations and see if they fix the problem, but I would love to understand this better.
Update: I removed a dependency on Polynomials.jl, some of the methods of which were getting invalidated. This seems to have fixed the situation where the bulk of the compilation of my workload was not getting measured by @snoopi_deep; now the inference graph looks much worse, but the real time spent has slightly decreased. Now there are some meaningful inference roots to continue tackling.
As a data point, I was using Polynomials inside the “compile-time” part of some @generated functions.
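To illustrate the pattern (a hedged sketch, not the actual code; the function name and coefficients are invented): the generator calls into Polynomials while building the returned expression, so Polynomials runs only at generation (“compile”) time.

using Polynomials

@generated function eval_fixed_poly(x, ::Val{N}) where {N}
    # These two lines run when the generator runs, i.e. during compilation
    p = Polynomial(ones(N))
    c = Polynomials.coeffs(p)
    # The generated code itself contains only plain numbers
    return :(evalpoly(x, $(Tuple(c))))
end

Called as, e.g., eval_fixed_poly(2.0, Val(4)).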
I think your understanding is correct. If all of your remaining compilation is just due to setup code, then you can create one or more personal StartupX.jl packages that also precompile that setup code (a sketch follows below).
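A minimal sketch of such a Startup package, assuming the setup code is the kind of Dict merging mentioned above (the MyPackage.setup call is a placeholder), following the same SnoopPrecompile pattern as earlier:

module Startup

using MyPackage, SnoopPrecompile

@precompile_all_calls begin
    # placeholder "setup" work, mirroring the inference roots seen in tinf
    merge!(Dict{String, Any}(), Dict{String, Any}("k" => 1))
    MyPackage.setup()  # placeholder for the real setup entry point
end

end # module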
Ah, interesting. Can you paste that portion of the code here? Indeed it’s possible that for @generated functions we have an incomplete story (I’m unsure). We don’t track backedges for the generator, for example.