Precompilation of complex code does not make time to first model step faster

I ported the dynamical core of our C++ CFD code to Julia a while ago, and now see huge speedups going from Julia 1.8 to 1.9 (170 s to 21 s to first model step). Inspired by many discussions on this forum, I tried to precompile all functions triggered on the way to the first model step, but I see no speedup from it. In the main file, just before the __init__() function, I added the following code:

## Precompilation
include("precompile_settings.jl")

# Run the full workload once per float type so that every method along the
# path to the first model step gets compiled.
for float_type in [Float32, Float64]
    n_domains = 1
    m = Model("precompile", n_domains, create_precompile_settings(), float_type)
    save_model(m)
    load_model!(m)
    in_progress = prepare_model!(m)
    in_progress = step_model!(m)
end

This code runs only during the precompilation phase and should compile all required functions for the two float types, yet the time to first model step is the same with and without it, so the precompilation does not work as intended. I do not see what I am doing wrong, and the problem is hard to isolate in a minimal working example. The actual code can be found here:
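As an aside for anyone copying this pattern: top-level code in a package module otherwise runs on every using, so a common idiom is to guard the workload so that it only executes while the precompile image is being generated. A minimal sketch, reusing the calls from above:

## Precompilation (guarded)
if ccall(:jl_generating_output, Cint, ()) == 1
    include("precompile_settings.jl")
    for float_type in [Float32, Float64]
        # Same workload as above; this branch runs only during precompilation.
        m = Model("precompile", 1, create_precompile_settings(), float_type)
        save_model(m)
        load_model!(m)
        prepare_model!(m)
        step_model!(m)
    end
end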

Could you try using SnoopPrecompile (docs: SnoopPrecompile · SnoopCompile)?


I have followed the instructions at the link you suggested and tried the following, but the resulting load time is identical.

## Precompilation
using SnoopPrecompile

@precompile_setup begin
    include("precompile_settings.jl")

    @precompile_all_calls begin
        # Same workload as before, now wrapped in SnoopPrecompile's macros.
        for float_type in [Float32, Float64]
            n_domains = 1
            m = Model("precompile", n_domains, create_precompile_settings(), float_type)
            save_model(m)
            load_model!(m)
            in_progress = prepare_model!(m)
            in_progress = step_model!(m)
        end
    end
end
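For anyone reproducing the comparison, the time to first model step can be measured in a fresh julia --project session roughly like this (a sketch, not a rigorous benchmark; MyCFDModel is a placeholder for the actual package name):

@time using MyCFDModel                    # package load time
include("precompile_settings.jl")
@time begin                               # time to first model step
    m = Model("timing", 1, create_precompile_settings(), Float64)
    save_model(m)
    load_model!(m)
    prepare_model!(m)
    step_model!(m)
end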

Have you tried the analysis in Julia v1.9.0-beta2 is fast - #17 by tim.holy?
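For reference, the recipe from that post boils down to roughly the following sketch; MyCFDModel is again a placeholder for the actual package, and the workload mirrors the one above:

using SnoopCompileCore
invalidations = @snoopr using MyCFDModel      # record invalidations caused by loading
tinf = @snoopi_deep begin                     # profile type inference for the workload
    include("precompile_settings.jl")
    m = Model("snoop", 1, create_precompile_settings(), Float64)
    save_model(m)
    load_model!(m)
    prepare_model!(m)
    step_model!(m)
end
using SnoopCompile
trees = invalidation_trees(invalidations)     # group invalidations by the method that caused them
staletrees = precompile_blockers(trees, tinf) # keep only those that block precompilation of the workload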


The final result is below. I do not see any methods from my own package here, which suggests that the problem lies elsewhere (or that I did something wrong).

julia> staletrees = precompile_blockers(trees, tinf)
2-element Vector{SnoopCompile.StaleTree}:
 inserting convert(::Type{T}, N::Union{Static.StaticBool{N}, Static.StaticFloat64{N}, Static.StaticInt{N}} where N) where T<:Number @ Static ~/.julia/packages/Static/Ldb7F/src/Static.jl:408 invalidated:
   backedges: 1: MethodInstance for convert(::Type{<:Real}, ::Real) at depth 0 with 4 children blocked InferenceTimingNode: 0.001004/0.003701 on Logging.default_metafmt(::Base.CoreLogging.LogLevel, ::Any, ::Any, ::Any, ::Any, ::Any) with 2 direct children

 inserting num_threads() @ CPUSummary ~/.julia/packages/CPUSummary/jSvVJ/src/CPUSummary.jl:72 invalidated:
   mt_backedges: 1: MethodInstance for PolyesterWeave.worker_bits() at depth 1 with 39 children blocked 7.771725978999998 inclusive time for 16 nodes

No, see the explanation in Invalidations findings (from a GMT case) - #32 by tim.holy. If you use ascend (see the SnoopCompile docs), you can trace how methods used in your workload ultimately lead back to num_threads. You can also see that this invalidation is very expensive: more than 7 s of compilation time. (The other one is negligible.)
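To make the ascend step concrete, the interactive session looks roughly like this (the indexing into the report below is illustrative, not exact API; ascend comes from Cthulhu):

using SnoopCompile, Cthulhu
tree = trees[end]              # e.g. the invalidations caused by num_threads
root = tree.backedges[end]     # an invalidated MethodInstance and its callers
ascend(root)                   # interactively walk the chain of callers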

I think I saw @Elrod mention somewhere that the CPUSummary.num_threads method might be eliminated?

:grimacing:
I haven’t eliminated it everywhere yet.
I’ll try to do that tonight.

It should be gone from Polyester thanks to @Krastanov, and I removed it from @turbo earlier, but it's still in @tturbo and Octavian.jl.
I'll have to double-check TriangularSolve.jl.

The cumulative minor runtime benefit that num_threads provided across all users and uses since its inception was probably less than 7 seconds.
I also spent more than 7 seconds implementing it, and more than 7 seconds writing this comment about it. Not my best idea.


Minor remark: PolyesterWeave 0.1.13 still needs to be registered before the fix becomes publicly available.
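Until then, one way to check what is resolved in the environment and, if needed, track the development version (branch name assumed to be master):

using Pkg
Pkg.status("PolyesterWeave")                  # show the currently resolved version
Pkg.add(name="PolyesterWeave", rev="master")  # track the unregistered fix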

This is absolutely awesome. Thanks to your fix, following the analysis suggested by @tim.holy, the compilation time has completely vanished!
