A while ago I ported the dynamical core of our C++ CFD code to Julia, and I now see huge speedups going from 1.8 to 1.9 (170 s to 21 s to the first model step). Inspired by many discussions on this forum, I tried to precompile all functions triggered up to the first model step, but I see no further speedup. In the main file, I added the following code just before the `__init__()` function:
```julia
## Precompilation
include("precompile_settings.jl")

# map([Float32, Float64]) do float_type
for float_type in [Float32, Float64]
    n_domains = 1
    m = Model("precompile", n_domains, create_precompile_settings(), float_type)
    save_model(m)
    load_model!(m)
    in_progress = prepare_model!(m)
    in_progress = step_model!(m)
end
```
This code runs only during the precompilation phase and should compile all functions required for the two float types, yet the time to first model step is the same with and without it. Hence, it does not work. I do not see what I am doing wrong, and the problem is hard to isolate in a minimal working example. The actual code can be found here:
I followed the instructions at your suggested link and tried the following, but the resulting load time is identical.
```julia
## Precompilation
@precompile_setup begin
    include("precompile_settings.jl")

    @precompile_all_calls begin
        for float_type in [Float32, Float64]
            n_domains = 1
            m = Model("precompile", n_domains, create_precompile_settings(), float_type)
            save_model(m)
            load_model!(m)
            in_progress = prepare_model!(m)
            in_progress = step_model!(m)
        end
    end
end
```
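As a sanity check before digging deeper (independent of this package; the function below is a hypothetical stand-in, not part of the actual code), on Julia 1.8 and later `@time` reports the percentage of time spent compiling, so you can see directly whether a first call is still paying compilation cost:

```julia
# Hypothetical stand-in for step_model!: any function compiled on first use.
expensive_step(x) = sum(abs2, x)

x = rand(1000)
@time expensive_step(x)   # first call: the report includes "% compilation time"
@time expensive_step(x)   # second call: the compilation cost is gone
```

If the precompile workload were being cached, the first real call would show little or no compilation time.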
The final result is below. I do not see any methods from my own package here, which suggests that the problem is elsewhere (or that I did something wrong).
```julia
julia> staletrees = precompile_blockers(trees, tinf)
2-element Vector{SnoopCompile.StaleTree}:
 inserting convert(::Type{T}, N::Union{Static.StaticBool{N}, Static.StaticFloat64{N}, Static.StaticInt{N}} where N) where T<:Number @ Static ~/.julia/packages/Static/Ldb7F/src/Static.jl:408 invalidated:
   backedges: 1: MethodInstance for convert(::Type{<:Real}, ::Real) at depth 0 with 4 children blocked InferenceTimingNode: 0.001004/0.003701 on Logging.default_metafmt(::Base.CoreLogging.LogLevel, ::Any, ::Any, ::Any, ::Any, ::Any) with 2 direct children
 inserting num_threads() @ CPUSummary ~/.julia/packages/CPUSummary/jSvVJ/src/CPUSummary.jl:72 invalidated:
   mt_backedges: 1: MethodInstance for PolyesterWeave.worker_bits() at depth 1 with 39 children blocked 7.771725978999998 inclusive time for 16 nodes
```
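For context, `trees` and `tinf` above presumably come from the standard SnoopCompile invalidation workflow, which looks roughly like this (the package name and model setup are placeholders, not the actual code):

```julia
using SnoopCompileCore

# Record invalidations caused by loading the package (hypothetical name).
invs = @snoopr using MyCFDCode

# Record inference timing for the actual workload.
tinf = @snoopi_deep begin
    m = Model("run", 1, create_precompile_settings(), Float64)
    prepare_model!(m)
    step_model!(m)
end

# Load the analysis code only after recording, to avoid polluting the data.
using SnoopCompile
trees = invalidation_trees(invs)
staletrees = precompile_blockers(trees, tinf)
```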
No, see the explanation in Invalidations findings (from a GMT case) - #32 by tim.holy. If you use `ascend` (see the SnoopCompile docs), you can see how methods used in your workload ultimately trace back to `num_threads`. You can also see that this invalidation is very expensive: more than 7 s of compilation time. (The other one is negligible.)
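To make "invalidation" concrete, here is a minimal stdlib-only illustration of the mechanism described above (toy functions, unrelated to the packages in this thread): defining a new, more specific method after dependent code has been compiled discards that compiled code, which must then be recompiled on the next call.

```julia
f(x) = g(x)          # f's compiled code depends on g's current methods
g(x::Number) = 1

@assert f(1) == 1    # compiles f and g for Int

g(x::Int) = 2        # more specific method: invalidates f's compiled code

@assert f(1) == 2    # f is recompiled against the new method table
```

Cached precompiled code is lost the same way: a package that inserts a method like `num_threads()` or a new `convert` method can invalidate compiled code in every downstream package, and that code is then recompiled at run time.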
I think I saw @Elrod say somewhere that the `CPUSummary.num_threads` method might be eliminated?
I haven’t eliminated it everywhere yet.
I’ll try to do that tonight.
It should be gone from Polyester thanks to @Krastanov, and I removed it from @turbo earlier, but it’s still in @tturbo and Octavian.jl.
I’ll have to double-check TriangularSolve.jl.
The cumulative minor runtime benefit that `num_threads` provided across all users and uses since its inception was probably less than 7 seconds.
I also spent more than 7 seconds implementing it, or writing this comment about it. Not my best idea.