The way you read this graph: the width of a bar corresponds to the relative time spent in that call (the total width of the graph is the total time, so wider bars mean more time). The depth (here, height) of a stack of bars corresponds to the number of nested calls.
So I think the question here is what are the two ‘deserts’ to the left and to the right of the two peaks where seemingly nothing happens? It seems like >80% of the time is spent there.
By compilation, do you mean just the precompilation that happens when you start a fresh REPL and type using ..., or does this measure compilation of all pkgs in the environment (presumably in an empty depot), i.e. using Pkg; Pkg.precompile()?
Also: I don't think you have said yet which Julia version you are using. What's the output of versioninfo()?
Can you also generate a flamegraph when running with the -O0 option? Would be nice for a comparison.
You cannot even download anything to initialize .julia/compiled?
Without any caching available in your CI, the only solution is to minimize compilation. You can tune the optimization level per module with the following macro (a usage sketch follows the help output below).
help?> Base.Experimental.@optlevel
Experimental.@optlevel n::Int
Set the optimization level (equivalent to the -O command line argument)
for code in the current module. Submodules inherit the setting of their
parent module.
Supported values are 0, 1, 2, and 3.
The effective optimization level is the minimum of that specified on the
command line and in per-module settings. If a --min-optlevel value is
set on the command line, that is enforced as a lower bound.
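For instance, a minimal sketch of how this could look in a package whose code you know is simple and not performance-critical (the module and function names here are made up):

# report.jl -- hypothetical module that only does simple string glue
module BuildReport

# Compile this module's code at -O0: it is never hot, so we trade some runtime
# speed for less time spent in the optimizer during (pre)compilation.
Base.Experimental.@optlevel 0

summarize(results::Vector{String}) = join(results, "\n")

end # module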
The key here is to minimize the types and methods involved. Move type parameters to fields when possible. Dispatch using if statements instead of using multiple dispatch.
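Roughly what that could look like (these types are hypothetical, just to illustrate the pattern):

# One parametric type means one compiled specialization per element type...
struct ParamSeries{T<:Real}
    data::Vector{T}
end

# ...whereas storing the element type as a field keeps a single method instance.
struct PlainSeries
    eltype::DataType
    data::Vector{Any}
end

# Branch with `if` instead of defining one method per concrete type.
function total(s::PlainSeries)
    if s.eltype === Int
        return sum(Int, s.data)
    elseif s.eltype === Float64
        return sum(Float64, s.data)
    else
        return sum(s.data)
    end
end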
Check your dependencies. Use @time_imports to see what dependencies are taking a while to load.
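For example, in the REPL (available via InteractiveUtils on Julia 1.8+; the package name is just a placeholder):

julia> @time_imports using CSV   # prints the load time of CSV and each of its dependencies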
Check for method invalidation and recompilation. Use a small number of concrete types rather than abstract types. Use SnoopCompile[Core] and @snoopr to see if you are forcing anything to recompile.
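A sketch of the invalidation check with SnoopCompileCore/SnoopCompile (MyPkg is a placeholder for your package):

using SnoopCompileCore
invalidations = @snoopr using MyPkg       # record method invalidations triggered by loading
using SnoopCompile                        # load the analysis code only after recording
trees = invalidation_trees(invalidations) # group invalidations by the method that caused them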
You may even want to use the --trace-compile command line option as a sanity check to see what is compiling.
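For instance (the script name is made up):

# Every method instance compiled at runtime is logged as a precompile(...) statement:
$ julia --trace-compile=precompiles.jl run_tests.jl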
Historically, we did have some compiled functions in .julia/compiled. However, it was causing major issues for other teams, so we deliberately removed the directory's content.
That’s a cool idea, but I am afraid it will increase the overall CI/CD job time, because the runtime after compilation will increase.
I hate to burst your bubble, but I have tried all of these suggestions over the past year and nothing worked, hence this post.
I might be on the wrong track here, IDK, but it seems to me that this might be a pkgimage issue.
At least that would be a simple theory that explains most things.
Assume your code fails to generate (or load) pkgimages, and instead has to compile all your dependencies at every start.
It would explain why nothing shows up in most of the graph, because precompilation happens in separately spawned processes, one for each pkg. It would also explain the task-related calls in the flamegraph (wait, lock), and possibly why the first graph shifted: the order in which pkgs that do not share dependencies get compiled can vary slightly. (An alternative explanation is that the empty space on the right of the very first plot was pkg compilation for the code that displays the graph; IIRC ProfileView uses Gtk3.jl or so.)
Again to clarify, do you in any way tinker with ~/.julia/compiled?
And what is on your LOAD_PATH and DEPOT_PATH?
Assuming it's pkgimage related, you could run JULIA_DEBUG=loading julia -e ... and then study the generated logs. If there are a lot of rejection messages, that's a smoking gun.
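Something along these lines (MyPkg stands in for whatever you actually load; debug logging goes to stderr):

$ JULIA_DEBUG=loading julia -e 'using MyPkg' 2> loading.log
$ grep -i "reject" loading.log   # messages about rejected cache files suggest pkgimages are not being reused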
Could you also try to install OmniPackage.jl (in a separate env) and report roughly how long installation and using OmniPackage took?
That’s a pkg used to benchmark the performance of pkgimages.
With such a test we could gauge whether the issue lies with the Julia installation, the system setup, or your pkg.
If this is possible and we could resolve those issues, I think there is room to make progress. Could you elaborate?
Not necessarily. You can avoid compiler passes on code that is either rarely used or that you know is relatively simple. In other words, there are circumstances where less compilation could result in the same generated code.
On earlier (and possibly also recent) versions of Julia, starting Julia with multiple threads and profiling will result in large empty spaces. IIRC the empty space is usually [width of thread 1 profile] x [N - # of threads]. Hard to say if that’s the culprit here, but an easy way to check would be to ensure Threads.nthreads() == 1 before profiling.
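A quick sanity check before collecting the profile:

julia> Threads.nthreads() == 1   # if false, restart julia without -t/--threads before profiling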
Aside from checking the number of threads, another idea is to visualize the inference flamegraph using @snoopi_deep (see the “Snooping on inference: @snoopi_deep” section of the SnoopCompile docs). This may help fill in more of that empty space because it captures part of what the compiler is doing.
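Roughly (the included script name is a placeholder for your workload):

using SnoopCompileCore
tinf = @snoopi_deep include("workload.jl")   # record type-inference timing for the workload
using SnoopCompile, ProfileView
fg = flamegraph(tinf)                        # convert the inference profile into a flame graph
ProfileView.view(fg)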
Unfortunately, when I try to collect inference results using SnoopCompile, I get a stack overflow. And if I increase the stack size, my machine runs out of memory and I cannot produce results.
I can try removing parts of the code to reduce the size of the data structure produced by @snoopi_deep.
Indeed, thanks for the hint.
I wrongly assumed that each thread gets its own tab, as it does in newer versions (not sure when that was implemented).
If a significant part of your dependencies rarely change and could be tightly pinned, you could consider making a system image with those dependencies but without your own code.
Actually that’s something I should try out myself at work.
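A minimal sketch with PackageCompiler.jl, assuming the rarely-changing dependencies are, say, DataFrames and CSV (placeholders for whatever you would actually pin):

using PackageCompiler
# Bake only the pinned dependencies into a sysimage; your own code stays out of it,
# so it can keep changing without forcing a rebuild of the image.
create_sysimage([:DataFrames, :CSV]; sysimage_path="deps.so")

CI jobs would then start Julia with julia --sysimage=deps.so and rebuild the image only when the pinned dependencies change.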