Help reducing compilation and inference time

The way to read this graph: the width of a bar corresponds to the relative time taken in that call. The total width of the graph is the total time, so wider bars mean more time spent in a call. The depth (or here, height) of a stack of bars corresponds to the number of nested calls.
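For reference, a minimal sketch of how such a flame graph is typically produced (your_workload() is just a placeholder for whatever is being profiled):

using Profile, ProfileView

@profile your_workload()   # collect stack samples while the workload runs
ProfileView.view()         # render the collected samples as a flame graph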

So I think the question here is: what are the two ‘deserts’ to the left and to the right of the two peaks, where seemingly nothing happens? It seems like >80% of the time is spent there.

4 Likes

what are the two small ‘deserts’ to the left and to the right of the two peaks where seemingly nothing happens?

Great question. The left-hand-side stack consists of the following calls:

./task.jl:974, poptask [inlined]
./task.jl:983, wait [inlined]
./condition.jl:130, #wait#621 [inlined]
./condition.jl:125, wait [inlined]
./lock.jl:229, lock [inlined]
./condition.jl:78, lock [inlined]
./threadingconstructs.jl:373, #137 [inlined]

While the right-hand-side has the following calls:

./task.jl:974, poptask [inlined]
./task.jl:983, wait [inlined]
./task.jl:672, task_done_hook [inlined]

My guess is that they are calls from the profiler? And maybe Julia is doing something under the hood that is not captured in this flame graph.

That’s interesting.

To clarify earlier statements.

By compilation, do you mean just the precompilation that happens when you start a fresh REPL and type using ..., or does this measure compilation of all pkgs in the environment (presumably in an empty depot), e.g. via using Pkg; Pkg.precompile()?

Also: I think you did not yet say which Julia version you are using. So what’s the output of versioninfo?

Can you also generate a flamegraph when running with the -O0 option? Would be nice for a comparison.

Completely forgot about that 🙂

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = auto

I do the following (roughly):

time julia --project -e 'using OurPackage; main("some_input_file")'

Sure. It will take me some time so I will post it in another reply…

1 Like

Another guess: Did you monitor RAM usage during compilation? Could it be that RAM is saturated and compilation slows down due to excessive swapping?

One thing you should do is update to 1.10; compilation time should be reduced a bit with that, I think.

Nope. I updated our code to 1.10 and the latency got worse, so we had to revert to 1.9.

RAM usage is stable at around 5 GB during compilation, out of 32 GB.

1 Like

You cannot even download anything to initialize .julia/compiled?

Without any caching available in your CI, the only solution is to minimize compilation. You can play with optimizations on a module level with the following macro.

help?> Base.Experimental.@optlevel
  Experimental.@optlevel n::Int

  Set the optimization level (equivalent to the -O command line argument)
  for code in the current module. Submodules inherit the setting of their
  parent module.

  Supported values are 0, 1, 2, and 3.

  The effective optimization level is the minimum of that specified on the
  command line and in per-module settings. If a --min-optlevel value is
  set on the command line, that is enforced as a lower bound.
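
For example (a minimal sketch; the module name is made up), you can lower the optimization level for a whole module whose code is rarely performance-critical:

module RarelyHotCode

# Everything in this module (and its submodules) is compiled at -O1,
# trading some run-time performance for faster compilation.
Base.Experimental.@optlevel 1

slow_to_compile_but_cold(x) = x + 1   # placeholder function

end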

The key here is to minimize the types and methods involved. Move type parameters to fields when possible. Dispatch using if statements instead of using multiple dispatch.
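
A minimal sketch of both ideas (the shape types are made up): a tag field plus an if branch replaces one compiled specialization per type with a single compiled method:

# Before: one area specialization is compiled per shape type.
struct Circle; r::Float64; end
struct Square; s::Float64; end
area(c::Circle) = π * c.r^2
area(s::Square) = s.s^2

# After: one concrete type with a tag field and a single method that
# branches at run time, so only one specialization gets compiled.
struct Shape
    kind::Symbol    # :circle or :square
    size::Float64
end
area(sh::Shape) = sh.kind === :circle ? π * sh.size^2 : sh.size^2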

Check your dependencies. Use @time_imports to see what dependencies are taking a while to load.
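
For instance, in a fresh session (Julia ≥ 1.8), with OurPackage standing in for your package:

julia> @time_imports using OurPackage   # one line per dependency, with its load time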

Check for method invalidation and recompilation. Use a small number of concrete types rather than abstract types. Use SnoopCompile[Core] and @snoopr to see if you are forcing anything to recompile.
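
Roughly, the invalidation check looks like this (OurPackage is a placeholder; see the SnoopCompile docs for the details):

using SnoopCompileCore
invalidations = @snoopr using OurPackage   # record invalidations caused by loading the package
using SnoopCompile                          # load the analysis tools afterwards
trees = invalidation_trees(invalidations)   # group invalidations by the method that triggered them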

You may even want to use the --trace-compile command line option as a sanity check to see what is compiling.
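
For example (the output file name is arbitrary):

julia --project --trace-compile=compiled.jl -e 'using OurPackage; main("some_input_file")'

Afterwards, compiled.jl contains one precompile(...) statement per method signature that was compiled while the command ran.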

Historically, we did have some compiled functions in .julia/compiled. However, it was causing major issues for other teams, so we deliberately remove the directory’s contents.

That’s a cool idea, but I am afraid this will increase the overall CI/CD job time, because the post-compilation runtime will increase.

I hate to burst your bubble, but I have tried all of these suggestions over the past year and nothing worked, hence this post.

This is the flame graph running julia (I don’t know why the graph shifted from my previous post):


This is the flame graph running julia -O0:

Again, for both graphs, the left cluster is purely Julia “compiler” calls whereas the right cluster is where our code or main gets called.

I am trying to understand that empty space.

I might be on the wrong track here, IDK, but it seems to me that this might be a pkgimage issue.
At least that would be a simple theory that explains most things.

Assume your code fails to generate (or load) pkgimages, and instead has to compile all your dependencies at every start.
That would explain why nothing shows up in most of the graph, because precompilation happens in separately spawned processes, one for each pkg. It would also explain the task-related calls in the flame graph (wait, lock), and possibly why the first graph shifted, because the order in which pkgs that do not share dependencies get compiled might vary slightly. (An alternative explanation might be that the right empty space in the very first plot was pkg compilation for the code that displays the graph; IIRC ProfileView uses Gtk3.jl or so.)


Again to clarify, do you in any way tinker with ~/.julia/compiled?

And what is on your LOAD_PATH and DEPOT_PATH?
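
One way to check, from the same project environment your CI uses:

julia --project -e 'println("LOAD_PATH  = ", LOAD_PATH); println("DEPOT_PATH = ", DEPOT_PATH)'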

Assuming it’s pkgimage-related, you could start Julia with JULIA_DEBUG=loading julia -e ... and then study the generated logs. If there are a lot of rejection messages, then that’s a smoking gun.

2 Likes

Could you also try to install OmniPackage.jl (in a separate env) and report how long installation and using OmniPackage took (roughly)?

That’s a pkg used to benchmark the performance of pkgimages.
With such a test we could gauge whether there is an issue with the Julia installation or system setup, or whether it’s an issue with your pkg.
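
If it helps, a rough way to time both steps in a throwaway environment would be something like:

using Pkg
Pkg.activate(temp=true)        # temporary, separate environment
@time Pkg.add("OmniPackage")   # installation + precompilation time
@time using OmniPackage        # first-load time (should hit the pkgimages)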

1 Like

If this is possible and we could resolve those issues, I think there is room to make progress. Could you elaborate?

Not necessarily. You can avoid compiler passes on code that is either rarely used or that you know is relatively simple. In other words, there are circumstances where less compilation can result in the same generated code.

On earlier (and possibly also recent) versions of Julia, starting Julia with multiple threads and profiling will result in large empty spaces. IIRC the empty space is usually [width of thread 1 profile] x [N - # of threads]. Hard to say if that’s the culprit here, but an easy way to check would be to ensure Threads.nthreads() == 1 before profiling.

Aside from checking the number of threads, another idea is to visualize the inference flame graph using the SnoopCompile docs page “Snooping on inference: @snoopi_deep”. This may help fill in more of that empty space, because it captures part of what the compiler is doing.
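
Roughly following that docs page, with main("some_input_file") standing in for the actual workload, it would look something like:

using SnoopCompileCore, OurPackage
tinf = @snoopi_deep main("some_input_file")   # capture type-inference timing for the workload
using SnoopCompile, ProfileView               # analysis and visualization tools
fg = flamegraph(tinf)                         # convert the inference data into a flame graph
ProfileView.view(fg)                          # inference time should fill part of that empty space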

1 Like

In the results I presented so far, no. Don’t worry about this detail. It is not relevant for this discussion.

I tried it. No out-of-the-ordinary debug messages; mostly messages about packages or caches being loaded.

I will generate new flame graphs with -t 1

Unfortunately, when I try to collect inference results using SnoopCompile, I get a stack overflow. And if I increase the stack size, my machine runs out of memory and I cannot produce results.

I can try removing parts of the code to reduce the size of the data structure produced by @snoopi_deep.

julia -t 1:


julia -O0 -t 1:

A few observations:

  1. What are the valleys in the middle?
  2. What code is Julia aggressively optimizing? That’s the million-dollar question I am trying to answer.
  3. In both cases, our code (the right-hand-side cluster) takes a relatively short time to compile and execute.
2 Likes

Indeed, thanks for the hint.
I wrongly assumed that each thread gets its own separate tab, as it does in newer versions (not sure when that was implemented).

If a significant part of your dependencies rarely changes and could be tightly pinned, you could consider building a system image with those dependencies but without your own code.

Actually that’s something I should try out myself at work.
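
A minimal sketch with PackageCompiler (the package names are placeholders for the pinned dependencies):

using PackageCompiler

# Bake the stable, rarely-changing dependencies into a system image,
# leaving the frequently changing package itself out of it.
create_sysimage([:SomeHeavyDep1, :SomeHeavyDep2];
                sysimage_path = "deps_sysimage.so")

# CI jobs would then start Julia with:
#   julia -J deps_sysimage.so --project -e '...'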