Help reducing compilation and inference time

We have a simulator at our company that is fully written in Julia, and the codebase is quite large. We have been suffering from very high latency for some time now. I took the initiative to tackle the latency issue, but after working on it for a few months I have hit a roadblock and need some help, because the results I am seeing do not make sense to me. The time to first execution (TTFX) for our code is 4 to 6 minutes. After that, depending on the simulation parameters, the simulation itself can take less than a second to run.

I will explain our code using pseudo-Julia code, since the real code is proprietary. Here it goes!

function main(input_param_file)
    params = read_params(input_param_file)

    result_a, result_b, result_c = run_sim(params)

    print_res_summary_a(result_a)
    print_res_summary_b(result_b)
    print_res_summary_c(result_c)
end

function run_sim(params)
    # Extract params
    param_a = get_param_a(params)
    param_b = get_param_b(params)
    param_c = get_param_c(params)

    # Generate data based on params
    data = gen_data(param_a, param_b, param_c)

    # Loop
    while some_condition
        # main processing functions
        res_a = process_a(param_a, data)
        res_b = process_b(param_b, data)
        res_c = process_c(param_c, data)

        # Write results to files
        dump_results(res_a, res_b, res_c)
    end

    return res_a, res_b, res_c
end

I made sure that main, run_sim, and the functions they call are type-stable. I timed the first call of almost all of the functions above, using a reference param file. Here are the results:

function        time (seconds)
main            229.8
run_sim         8.5
process_a       0.58
process_b       9e-06
process_c       5e-06
dump_results    0.0029
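A minimal sketch of the kind of first-call timing I mean, with a toy stand-in function (not our code), run in a fresh Julia process so the first call includes all inference and compilation:

```julia
# Toy stand-in for one of the functions above. In a fresh Julia process,
# the first call pays for inference + compilation; the second does not.
f(x) = sum(abs2, x)

x = rand(1000)
t_first  = @elapsed f(x)   # compilation + runtime
t_second = @elapsed f(x)   # runtime only
println("first call: $t_first s, second call: $t_second s")
```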

As you can see, all of the functions have decent compilation/inference times with the exception of main, and I don’t know why.

Precompilation is an option, but it does not solve the fundamental issue we are having. Also, it will be our last resort.

I used SnoopCompile to get a sense of the amount and impact of runtime inference. While there are runtime inferences (~150), as far as I can tell they don’t seem to affect the latency much.
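For reference, this is the kind of SnoopCompile run I mean (the macro name depends on the SnoopCompile version: @snoopi_deep in v2.x, @snoop_inference in newer releases; the input file name is made up):

```julia
using SnoopCompileCore
tinf = @snoopi_deep main("reference_params.toml")  # hypothetical input file

using SnoopCompile
itrigs = inference_triggers(tinf)  # the ~150 runtime-inference triggers
tmins  = flatten(tinf)             # per-frame inference timings
```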

Any help will be deeply appreciated. I will try to provide more information if needed given the proprietary nature of our code.


Could you maybe provide a minimal reproducible example?

I wish I could. If I do, I will probably get fired within the next few hours :)


I understand, I’m just not sure what we can do here. You’re probably already aware of relevant tools: beyond SnoopCompile.jl, Cthulhu.jl might help to diagnose faulty inference?

I made sure that main, run_sim and the functions it calls are type-stable.

How did you ensure it? Just by checking the output inference, or with more sophisticated checks like JET.jl?


I used Cthulhu and SnoopCompile extensively to analyze our code. I fixed some of the inference problems but the large TTFX of main does not improve.

Can you tell me what I should look for in the SnoopCompile results?


Have you profiled the first call to main with @profview to make sure that the time is indeed spent compiling/inferring and not, say, reading the file?
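If @profview (from ProfileView.jl or the VS Code extension) isn’t available, the stdlib Profile module gives the same information. A toy sketch with a made-up workload:

```julia
using Profile

work() = sum(sqrt(i) for i in 1:10^6)  # hypothetical stand-in for main

Profile.clear()
@profile work()          # first call: the profile includes compilation
Profile.print(maxdepth=10)  # time under `typeinf`-related frames is
                            # inference, not your own code running
```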

Can you tell me what I should look for in the SnoopCompile results?

Unfortunately, I have next to zero experience with that one.

How did you ensure it? Just by checking the output inference, or with more sophisticated checks like JET.jl?

I used Cthulhu to identify type-unstable functions and fixed them by making sure the output types are deterministic given a set of input types.
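For completeness, the stdlib equivalent of that check, on a toy function (not our code):

```julia
using InteractiveUtils  # for @code_warntype
using Test              # for @inferred

unstable(x) = x > 0 ? x : 0        # Float64 input can return Int or Float64
stable(x)   = x > 0 ? x : zero(x)  # return type always matches typeof(x)

@code_warntype unstable(1.0)  # Body::Union{Float64, Int64} flags the instability
@inferred stable(1.0)         # throws if the return type is not concretely inferred
```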

Is it possible you are seeing a manifestation of this issue?

https://discourse.julialang.org/t/investigating-large-latency-on-a-constrained-windows-environment/114482

You describe your problem quite differently so probably not but, just in case…

This looks more like a third-party package issue. We don’t use heavy third-party packages in our code.

In the same Julia instance, or restarting between each timing? If you don’t restart, then run_sim will get compiled by your call to main so it won’t have any compilation overhead.

And how long do these take on subsequent calls, so we know how much is compilation vs how much is runtime?


Two questions:

  • Is params a named tuple with lots of fields?
  • Do you happen to use the splat operator (...) extensively?

In the same Julia instance, or restarting between each timing?

Restarting each time.

And how long do these take on subsequent calls, so we know how much is compilation vs how much is runtime?

Less than a minute. It is negligible compared to the first call.


Is params a named tuple with lots of fields?

params is a dictionary and gets mapped to structs in the get_params_* functions

Do you happen to use the splat operator (...) extensively?

We do use the splat operator but not extensively.
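A sketch of the Dict-to-struct mapping I mean, with hypothetical field names, and why it acts as a function barrier:

```julia
# Values in a Dict{Symbol,Any} are not inferrable, so code computing
# directly on them is type-unstable. Mapping to a concrete struct early
# restores inferability for all downstream code (a "function barrier").
struct ParamsA
    tol::Float64
    iters::Int
end

# Hypothetical getter: the ::T assertions give the compiler concrete types.
get_param_a(d::Dict{Symbol,Any}) = ParamsA(d[:tol]::Float64, d[:iters]::Int)

raw = Dict{Symbol,Any}(:tol => 1e-6, :iters => 100)
pa  = get_param_a(raw)
```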

What is the timing of compiling read_params? It looks like your run_sim function is fast to compile and run, so it has to be something else in main.

Presumably these getters are not performance-critical; did you try annotating their signatures with @nospecialize?
E.g.

function get_param_a(@nospecialize(params))
   ...
end

This would prevent excessive code generation, if that’s the bottleneck (though I’m not sure it is).

That’s what I thought at first, but I did an experiment where I removed main and called run_sim directly. The latency shifted to run_sim, i.e. run_sim took roughly 229.8 seconds on its first call. Same as main.

params is of type Dict{Symbol,Any}. Will @nospecialize make a difference in this case?

Hmm, nah, probably not. Sorry.


Is your code in a package? Packages are precompiled to machine code (as of Julia 1.9), and that’s the only way I know to get that, with the exception of putting your code into the sysimage (not too complex, with its own pros and cons), which is another option. Or compiling to an app with PackageCompiler.jl, which I believe is implemented that way.

That is quite long… I understand you don’t want to wait (out of curiosity, is it a large fraction of the total simulation time, and how long is that in total?). You can also use DaemonMode.jl, if that applies to you, to avoid paying the TTFX for scripts.

Python packages are also precompiled (there only to bytecode, and the latest beta version has a JIT). I would like Julia to precompile modules, not just packages, and even plain scripts, but that does not happen (even with Python, I think). I.e. you shouldn’t have to make and register a full package to get precompilation (or go through other manual steps). You CAN have a local package, i.e. in a local registry rather than the public General registry, but I’ve not tried to set one up; it also seems like a hoop to jump through.
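A minimal sketch of what package precompilation buys you, using Base.precompile (PrecompileTools.jl’s @compile_workload is the more robust route; module and function names are made up):

```julia
# In a package, top-level code runs at precompile time, so signatures
# compiled there are cached in the package image (native code on 1.9+).
module MySim

run_sim(xs::Vector{Float64}) = sum(abs2, xs)

# Ask for this concrete signature to be compiled ahead of time.
precompile(run_sim, (Vector{Float64},))

end # module

MySim.run_sim([1.0, 2.0, 3.0])  # already compiled when loaded from a package
```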


How is this timing different from the timing you reported for run_sim in the OP?

This sounds like the problem is actually internal to run_sim. At that point, you need to time things like process_a, process_b, and process_c from scratch, similar to how you described timing run_sim without calling it from main.
