Help reducing compilation and inference time

This is something I want to do but haven’t had time to work on: have a nightly CI job that builds a sysimage from the current state of the Manifest.toml and tags the resulting image with the hash of the manifest file. Each subsequent CI run then checks the Manifest file against the base image tag; if it matches, it can skip installing and recompiling dependencies.
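A rough sketch of what that nightly job could run, assuming PackageCompiler.jl (the file and image names below are placeholders):

using PackageCompiler, SHA

# Tag derived from the exact dependency state; CI compares it against the tag
# of the existing base image to decide whether a rebuild is needed.
manifest_tag = bytes2hex(sha256(read("Manifest.toml")))

# With no package list given, PackageCompiler puts the active project's
# dependencies into the sysimage.
create_sysimage(; sysimage_path = "sysimage_$(manifest_tag).so")

CI runs would then start Julia with --sysimage pointing at the image whose tag matches the current Manifest, and fall back to a plain environment when no tag matches.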

I don’t know what your technical limitations are around precompilation. I also don’t know what you mean by a compiled directory causing issues for other teams, so I can’t propose a good solution there.

But from my read of this thread, it sounds like you’re near the upper bound of how fast Julia can compile your code base from scratch, so this becomes a DevOps problem.

The approach I described in my comment above might be applicable here. If you could build a sysimage with all your static dependencies that runs as many library methods as possible overnight, and then, have your CI/CD pipeline pull that in, you wouldn’t have to pay so much compilation time.
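The overnight workload can just be an ordinary script that exercises as many code paths as possible and is passed to create_sysimage via its precompile_execution_file keyword; a hypothetical example (the package and input names are placeholders):

# precompile_execution.jl: every method compiled while this script runs ends up
# cached in the sysimage.
using MySimPackage

main("small_input", "small_input2")   # a cheap run that still touches the hot paths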

It makes things a bit more complicated because you need to ensure that your CI can fetch the base image that matches the current state of your dependencies, but that’s a tractable problem.


I suspect this is measuring the backend (mostly LLVM) of the compiler, which does not show up in a lot of profiles.

A slight clarification. I believe the left cluster is actually measuring the compilation of your code, whereas the right one is measuring the runtime. That’s why you see functions related to type inference and whatnot on the left. The profile is likely compiler front + middle end (left cluster) → compiler backend (middle valley) → runtime (right cluster).

It’s notable that running with -O0 removes so much of the middle area. Have you checked whether TTFX is faster? If it is, then it’s worth looking into which functions are giving LLVM a hard time. I can offer general tips such as not applying @inline to every function and avoiding large tuples, but it would be a good idea to hear from people who have experience troubleshooting this kind of thing.
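One way to check, if you haven’t already: start a fresh session with and without -O0 and time the first call, e.g.

# fresh session, started with e.g. `julia -O0 --project`
@time main("input", "another_input")
# Recent Julia versions report the percentage of time spent on compilation in
# the @time output, which gives a rough TTFX figure per optimization level.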


Yes. Here are some crude results.
Without an opt flag: 4m4.478s
With -O0: 1m34.976s

That’s precisely what I want to achieve but I don’t know how or where to start from.

We use @inline in our code, but not extensively, and usually only on small functions. I will remove the @inline macros and see if it makes a difference, but I have my doubts.

We do have a lot of large tuples and nested tuples because we are interfacing with C code. However, I thought this issue was fixed in Julia 1.9; I posted about it a couple of years ago here. Julia 1.9 did indeed reduce our compilation time significantly, but it still remains high, as you can tell from the times reported at the beginning of this reply.


While I cannot generate flamegraphs from SnoopCompile, here are some results that could be helpful in this investigation:

julia> tinf = @snoopi_deep main("input", "another_input")
julia> tinf
InferenceTimingNode: 169.692228/230.404140 on Core.Compiler.Timings.ROOT() with 196 direct children
julia> sort(tinf.children; lt=(x, y) -> x.mi_timing.inclusive_time < y.mi_timing.inclusive_time, rev=true)[1:3]
InferenceTimingNode: 0.007097/57.119113 on main(::String, ::String) with 17 direct children
InferenceTimingNode: 0.002234/0.553512 on some_function1 with 2 direct children
InferenceTimingNode: 0.001673/0.253361 on some_function2 with 1 direct children

What I am gathering from this information is that main is the main bottleneck while other functions’ compilation times are relatively negligible.
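A flattened per-method breakdown (which does not need a graphical backend) might narrow down where inside main that time goes; something along these lines, assuming a recent SnoopCompile and the same session as the @snoopi_deep call above:

flat = flatten(tinf)                            # one InferenceTiming per method instance
first(sort(flat; by=exclusive, rev=true), 10)   # instances with the most inference time of their own
# accumulate_by_source(flat) groups the same numbers by method rather than by instance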

Additional SnoopCompile results:

julia> tinf = @snoopi_deep main("input", "another_input")
julia> itrigs = inference_triggers(tinf)
julia> x = sort(itrigs; lt=(x, y) -> x.node.mi_timing.inclusive_time < y.node.mi_timing.inclusive_time, rev=true)
julia> x[1].node.mi_timing.inclusive_time
0.5535115639999997

The worst runtime inference call in terms of timing is not that bad.

To my knowledge, this is still a problem, as evidenced by the regular posts from people experiencing slowdowns with large StaticArrays.

It is suspicious that main only accounts for 57/169 seconds of inference time according to SnoopCompile. Perhaps worth digging into. I was hoping some folks would comment after sharing this on Slack, but no such luck. If you keep hearing nothing, you may want to try posting there yourself. There’s a dedicated #ttfx channel alongside the usual help ones.

For profiling compilation I think you’ll need to use Tracy with a custom build of Julia: External Profiler Support · The Julia Language

Though I’ve never used it myself so I can’t help much there. You’ve probably already tried this, but the only other thing I can think of would be a process of elimination, i.e. comment out parts of main()/run_sim() until you find a bottleneck. I’m guessing you’re either hitting some pathological case or death by a thousand cuts :sweat_smile:

I can try Tracy even though it looks like a headache to get it to work and sift through all the results :cry:

I have tried this a couple of times and the compilation time savings were minimal, but it might be worth trying out different permutations.

I can try removing most of the code that uses large tuples and see if it makes a difference. It will take some time.
Another reason why I doubt the large-tuple theory (I could be wrong) is that, if you look at the cartoon code in my OP, almost all of the large tuples are created and processed in process_a, process_b and process_c. The process_* functions do not take a long time to compile. So how can the large-tuple compilation times be shifted to the main function?

I appreciate you posting this issue on Slack. Thank you. I don’t have a Slack account but I can create one and post there myself.

Just to confirm, how did you time the compilation time of process_*?

Earlier you said that when you ran run_sim directly, the compilation slowdown “shifted” to run_sim. That sounds to me like main is not the source of the compilation slowdown, but rather one of the internal functions.

Depending on the types making up the tuples, you could potentially simplify inference by wrapping them in a struct with static field types (or far fewer free type parameters) and passing that to the foreign call instead. As long as the struct has the same memory layout as the tuple, you might see a speedup and still be able to call the function.
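A minimal sketch of the idea, with made-up field types and a hypothetical library name:

# Before: a long heterogeneous tuple is a type with one free parameter per
# element that inference has to carry around, e.g.
#   Tuple{UInt8, UInt8, UInt16, NTuple{16, UInt8}, Cfloat, Ptr{UInt8}}

# After: a single concrete struct with the same memory layout.
struct Payload
    a::UInt8
    b::UInt8
    c::UInt16
    d::NTuple{16, UInt8}
    e::Cfloat
    p::Ptr{UInt8}
end

# Assuming the C side expects this exact layout, the foreign call can take the
# struct by value instead of the tuple (libfoo and process are placeholders):
# @ccall libfoo.process(x::Payload)::Cvoid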

I wrote a macro that roughly does the following:

# expansion produced by the macro around each call; some_global_vec is a
# global Vector{Float64} defined elsewhere
t1 = time()
res_a = process_a(param_a, data)
t2 = time()
push!(some_global_vec, t2 - t1)

Then I report the first element of some_global_vec as the compilation time, i.e. some_global_vec[1]. Technically, this time is compilation + runtime, but I purposefully choose simulation parameters that do not have a high runtime, so the runtime is negligible compared to the compilation time.
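For concreteness, a hypothetical version of such a macro (not our exact code) could look like this:

const some_global_vec = Float64[]

macro timed_call(expr)
    quote
        local t1 = time()
        local res = $(esc(expr))
        push!(some_global_vec, time() - t1)   # wall time of this call
        res
    end
end

# usage: res_a = @timed_call process_a(param_a, data)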

That is an interesting proposition. Can you provide a simple example to elaborate on your point?

I can give you some examples of the structs that we are dealing with:

struct Struct2
    a::UInt8
    b::UInt8
    c::UInt8
end

struct Struct4
    a::Int16
    b::Int16
end

struct Struct3
    a::NTuple{4, Struct4}
end

struct Struct7
    a::UInt8
    b::UInt8
    c::NTuple{4, UInt16}
end

struct Struct5
    a::UInt8
    b::NTuple{16, UInt8}
end

struct Struct6
    a::UInt64
    b::NTuple{2, Cfloat}
    c::Cfloat
    d::Int16
    e::Int8
    f::Bool
end

struct Struct1
    a::UInt8
    b::UInt8
    c::UInt16
    d::UInt8
    e::UInt8
    f::UInt16
    # ...
    # Some other fields
    # ...
    cc::NTuple{8, Struct7}
    dd::Struct5
    g::Ptr{UInt8}
    h::Ptr{UInt8}
    i::Struct6
    # ...
    # Some other fields
    # ...
    aa::Struct2
    bb::Bool
    ee::NTuple{4, Struct3}
end

We deal with more complicated structs but this should give you a rough idea.