Precompilation of complex code does not make time to first model step faster

I ported the dynamical core of our C++ CFD code to Julia a while ago, and now see huge speedups going from Julia 1.8 to 1.9 (170 s to 21 s to first model step). Inspired by many discussions on this forum, I tried to precompile all functions triggered on the way to the first model step, but I see no speedup from it. In the main file, just before the __init__() function, I added the following code:

## Precompilation
include("precompile_settings.jl")

# Run the full workload once per float type so that every method along the
# path to the first model step gets compiled.
for float_type in [Float32, Float64]
    n_domains = 1
    m = Model("precompile", n_domains, create_precompile_settings(), float_type)
    save_model(m)
    load_model!(m)
    in_progress = prepare_model!(m)
    in_progress = step_model!(m)
end

This code runs only during the precompilation phase and should compile all required functions for the two float types, yet the time to first model step is the same with and without it, so the precompilation does not work as intended. I do not see what I am doing wrong, and the problem is hard to isolate in a minimal working example. The actual code can be found here:
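As an aside for anyone copying this pattern: top-level code in a package module otherwise runs on every using, so a common idiom is to guard the workload so that it only executes while the precompile image is being generated. A minimal sketch, reusing the calls from above:

## Precompilation (guarded)
if ccall(:jl_generating_output, Cint, ()) == 1
    include("precompile_settings.jl")
    for float_type in [Float32, Float64]
        # Same workload as above; this branch runs only during precompilation.
        m = Model("precompile", 1, create_precompile_settings(), float_type)
        save_model(m)
        load_model!(m)
        prepare_model!(m)
        step_model!(m)
    end
end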

Could you try using SnoopPrecompile (docs: SnoopPrecompile · SnoopCompile)?


I have followed the instructions at the link you suggested and tried the following, but the resulting load time is identical.

## Precompilation
using SnoopPrecompile

@precompile_setup begin
    include("precompile_settings.jl")

    @precompile_all_calls begin
        # Same workload as before, now wrapped in SnoopPrecompile's macros.
        for float_type in [Float32, Float64]
            n_domains = 1
            m = Model("precompile", n_domains, create_precompile_settings(), float_type)
            save_model(m)
            load_model!(m)
            in_progress = prepare_model!(m)
            in_progress = step_model!(m)
        end
    end
end
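For anyone reproducing the comparison, the time to first model step can be measured in a fresh julia --project session roughly like this (a sketch, not a rigorous benchmark; MyCFDModel is a placeholder for the actual package name):

@time using MyCFDModel                    # package load time
include("precompile_settings.jl")
@time begin                               # time to first model step
    m = Model("timing", 1, create_precompile_settings(), Float64)
    save_model(m)
    load_model!(m)
    prepare_model!(m)
    step_model!(m)
end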

Have you tried the analysis in Julia v1.9.0-beta2 is fast - #17 by tim.holy?
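For reference, the recipe from that post boils down to roughly the following sketch; MyCFDModel is again a placeholder for the actual package, and the workload mirrors the one above:

using SnoopCompileCore
invalidations = @snoopr using MyCFDModel      # record invalidations caused by loading
tinf = @snoopi_deep begin                     # profile type inference for the workload
    include("precompile_settings.jl")
    m = Model("snoop", 1, create_precompile_settings(), Float64)
    save_model(m)
    load_model!(m)
    prepare_model!(m)
    step_model!(m)
end
using SnoopCompile
trees = invalidation_trees(invalidations)     # group invalidations by the method that caused them
staletrees = precompile_blockers(trees, tinf) # keep only those that block precompilation of the workload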


The final result is below. I do not see any methods from my own package here, which suggests that the problem lies elsewhere (or that I did something wrong).

julia> staletrees = precompile_blockers(trees, tinf)
2-element Vector{SnoopCompile.StaleTree}:
 inserting convert(::Type{T}, N::Union{Static.StaticBool{N}, Static.StaticFloat64{N}, Static.StaticInt{N}} where N) where T<:Number @ Static ~/.julia/packages/Static/Ldb7F/src/Static.jl:408 invalidated:
   backedges: 1: MethodInstance for convert(::Type{<:Real}, ::Real) at depth 0 with 4 children blocked InferenceTimingNode: 0.001004/0.003701 on Logging.default_metafmt(::Base.CoreLogging.LogLevel, ::Any, ::Any, ::Any, ::Any, ::Any) with 2 direct children

 inserting num_threads() @ CPUSummary ~/.julia/packages/CPUSummary/jSvVJ/src/CPUSummary.jl:72 invalidated:
   mt_backedges: 1: MethodInstance for PolyesterWeave.worker_bits() at depth 1 with 39 children blocked 7.771725978999998 inclusive time for 16 nodes

No, see the explanation in Invalidations findings (from a GMT case) - #32 by tim.holy. If you use ascend (see the SnoopCompile docs), you can trace how methods used in your workload ultimately lead back to num_threads. You can also see that this invalidation is very expensive: more than 7 s of compilation time. (The other one is negligible.)
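To make the ascend step concrete, the interactive session looks roughly like this (the indexing into the report below is illustrative, not exact API; ascend comes from Cthulhu):

using SnoopCompile, Cthulhu
tree = trees[end]              # e.g. the invalidations caused by num_threads
root = tree.backedges[end]     # an invalidated MethodInstance and its callers
ascend(root)                   # interactively walk the chain of callers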

I think I saw @Elrod mention somewhere that the CPUSummary.num_threads method might be eliminated?

:grimacing:
I haven’t eliminated it everywhere yet.
I’ll try to do that tonight.

It should be gone from Polyester thanks to @Krastanov, and I removed it from @turbo earlier, but it's still in @tturbo and Octavian.jl.
I'll have to double-check TriangularSolve.jl.

The cumulative minor runtime benefit that num_threads provided across all users and uses since its inception was probably less than 7 seconds.
I also spent more than 7 seconds implementing it, and more than 7 seconds writing this comment about it. Not my best idea.


Minor remark: PolyesterWeave 0.1.13 still needs to be registered before the fix becomes publicly available.
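Until then, one way to check what is resolved in the environment and, if needed, track the development version (branch name assumed to be master):

using Pkg
Pkg.status("PolyesterWeave")                  # show the currently resolved version
Pkg.add(name="PolyesterWeave", rev="master")  # track the unregistered fix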

This is absolutely awesome. Thanks to your fix, following the analysis suggested by @tim.holy, the compilation time has completely vanished!
