How to help reduce package load latency?

In an effort to reduce package load latency for a slow-loading package, I successfully added some precompilation (via @compile_workload) and it had the desired effect.
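For reference, the block I added looks roughly like the following sketch (the package name and workload function here are made-up stand-ins, not the actual ParameterEstimation code):

module MyPackage

using PrecompileTools

some_entry_point(x) = sum(abs2, x)   # stand-in for the real API

@setup_workload begin
    data = rand(10)                  # setup runs at precompile time but is not cached
    @compile_workload begin
        some_entry_point(data)       # everything compiled here is cached with the package
    end
end

end # module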

However, some dependencies are very slow as well, and I would like to learn how to help with that. The output of @time_imports using ParameterEstimation includes the following

  12852.5 ms  Polymake 24.84% compilation time (89% recompilation)
   5354.2 ms  GAP 17.17% compilation time (83% recompilation)
   5574.2 ms  Hecke 41.18% compilation time (79% recompilation)
   4183.0 ms  Oscar 62.59% compilation time (40% recompilation)
   9205.1 ms  ParameterEstimation

I started to try to learn about SnoopPrecompile, but it says it is deprecated. So, a few questions:

  1. Is the hope that all “compilation time” and “recompilation time” could in theory be eliminated by precompilation?
  2. Is there a relatively easy way for someone who is not the primary package developer to add this precompilation?
  3. Is this possible even if one is not at all familiar with the package itself and has never directly used it?
  4. Is there a recommended workflow that reflects recent changes in 1.9 and maybe 1.10? I found Analyzing sources of compiler latency in Julia: method invalidations but this predates caching of native code.

Thanks all.

2 Likes

BTW, it was replaced by PrecompileTools.jl

12 Likes

Precompilation directly solves the compilation time issue.

Recompilation is a more complicated matter. It means that code that was compiled needed to be compiled again because new definitions invalidated what was compiled before. This is indicative of a more serious issue.

You can always fork these packages and submit a pull request if you figure out a resolution.

The traditional path here would be to use PackageCompiler.jl to create a new system image.
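A rough sketch of what that looks like (the sysimage path and the precompile script are placeholders you would adapt to your own workload):

using PackageCompiler

# Bake the slow packages, plus whatever the placeholder script
# "precompile_workload.jl" exercises, into a custom system image.
create_sysimage(["Oscar", "ParameterEstimation"];
                sysimage_path = "parameterestimation.so",
                precompile_execution_file = "precompile_workload.jl")

# Afterwards start Julia with:  julia -J parameterestimation.so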

That blog post is still relevant.

Moreover, I would take a look at SnoopCompile.jl and the @snoopr macro:
https://timholy.github.io/SnoopCompile.jl/dev/snoopr/

You may also want to take a look at how SciML addressed precompilation issues.

1 Like

Hi, thank you so much for your informative response.
I have jumped to the following conclusions

  1. There probably exists a way to make the OSCAR suite of Julia packages load significantly faster
  2. It might be very complicated (as per the SciML blog post), and require deep investigation and significant changes across a package.

I might revisit this topic in a few Julia versions if it is still an issue.

2 Likes

You can absolutely improve latency of your own package, and probably also of your dependencies.

Currently, managing latency requires intermediate Julia expertise: it’s easier to do now than just a few years ago (so if you’ve read material like the SciML blog post or the SnoopCompile docs, don’t worry - it’s easier now). However, it’s not yet quite as easy as we’d want it to be.

As of 2023, to reduce latency, I recommend the following steps:

  1. Make sure your package doesn’t commit type piracy. You can use Aqua.jl to detect some (most?) cases of type piracy automatically.
  2. Make sure your code is inferrable (i.e. all variables are type stable in your functions). You can use JET’s report_opt on a representative workload to check this (just use your precompilation workload). A sketch combining steps 1 and 2 follows this list.
  3. Create a PrecompileTools workload. Note that if you haven’t done 1. and 2., you’ll get less benefit from this workload.
  4. You pay for the latency of your dependencies, so these need to also do 1. and 2. (though they don’t need to do 3., if your package does). Alternatively, try to reduce the number of dependencies.
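A minimal sketch of steps 1 and 2 (here MyPackage and representative_workload are placeholders for your own package and a typical call):

using MyPackage, Aqua, JET

# Step 1: Aqua's checks include a type-piracy test (plus stale deps, ambiguities, ...)
Aqua.test_all(MyPackage)

# Step 2: check that a representative workload is fully inferred;
# ideally this reports no problems
@report_opt MyPackage.representative_workload()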

There are essentially three components to latency

  • Time taken to load code. There isn’t much you can do about this, but it’s significantly faster in Julia 1.10 than in Julia 1.9
  • Time taken to compile code when it’s first seen. This is what precompilation aims to limit (or mostly eliminate, in most cases).
  • Recompilation, which happens when code is invalidated. Invalidation is usually a sign of type piracy or type instability - if you get rid of these two, you should remove nearly all recompilation.

You can read more about the current state of latency and how to manage it in a blog post of mine.

21 Likes

Regarding OSCAR specifically: we, the OSCAR team (including me specifically, as one of the leads), have already invested quite some time looking into the overhead for loading it, and how to reduce it. Unfortunately I am not sure any of us visit the Julia Discourse regularly – I certainly don’t, and I only stumbled over this thread by pure chance. Feel free to also talk to us about it on our Slack or in our GitHub Discussions…

That said, I can already tell you that PrecompileTools or anything like that won’t help much here, or else we’d already be doing it (well, it may help some, and it is on our agenda to use it more extensively, but experiments suggest it’s not a panacea). It may eventually help, but there are many roadblocks.

For example, we suffer massively from method invalidations, e.g. caused by CxxWrap.jl (and surely also some of our own making). We’ve improved a bunch of things, including sending out patches to a bunch of our own dependencies; but it feels like an uphill battle, as we see here: ParameterEstimation has a bazillion dependencies, and some of these seem to exacerbate the problem as we’ll see below…

The whole thing unfortunately is complicated, and complicated to debug (any substantial help here is certainly welcome), and there are some factors involved that can’t be easily improved, if at all. For example, GAP.jl (the one I am most familiar with, as one of the authors and lead dev of the GAP computer algebra system it wraps) loads and parses a ton of source files in the GAP language – that means disk performance matters a lot. From your numbers I am guessing you are either on an HDD or on a rather slow SSD – that’s why it loads much faster for me. To be clear: I am not suggesting here that you should “just get a faster computer”; rather, I want to try to paint the broad picture of what causes what.

To qualify this, here are the timings I get for @time_imports using ParameterEstimation, which uses Oscar 0.11.3 (an outdated version, but it’s the one pulled in by ParameterEstimation, so I used it for all timings to have a fair comparison), on an M1 MacBook Pro with Julia 1.9.3:

   1192.4 ms  CxxWrap 3.49% compilation time (37% recompilation)
   4576.8 ms  Polymake 21.68% compilation time (89% recompilation)
   2031.9 ms  GAP 14.59% compilation time (76% recompilation)
   2821.9 ms  Hecke 48.95% compilation time (86% recompilation)
   1736.3 ms  Oscar 59.09% compilation time (44% recompilation)
      2.2 ms  ParameterEstimation

So that’s quite a lot faster than what you report, although it’s still not great.

But let me now contrast this with @time_imports using Oscar (using the exact same Oscar version)

    380.3 ms  CxxWrap 6.01% compilation time
   2334.8 ms  Polymake 3.34% compilation time (21% recompilation)
   1471.7 ms  GAP 19.51% compilation time (81% recompilation)
   1886.4 ms  Hecke 47.97% compilation time (80% recompilation)
   1171.1 ms  Oscar 51.77% compilation time

So things are quite a bit faster, and there is a LOT less recompilation. That’s not a fluke, I can easily reproduce it (of course with some fluctuation, but the rough ballpark stays).

Let’s just focus on @time_imports using GAP and we get

   1180.0 ms  GAP 4.91% compilation time

and easily >95% of that time is spent in the GAP kernel parsing and executing GAP code, so there is nothing here really to improve from the Julia side (maybe from the GAP side, but that’s way out of scope here).

As you can see, loading GAP varies from 1180.0 to 1471.7 to 2031.9 milliseconds in the three examples!

So what happens? I don’t claim to have the full answer for this, but I think at least part of this is method invalidation. One clue is the recompilation time. We can also reproduce this by doing e.g. @time_imports using CxxWrap, GAP:

    360.4 ms  CxxWrap 6.98% compilation time
   1326.4 ms  GAP 15.61% compilation time (72% recompilation)

I’ve submitted PRs to CxxWrap in the past to improve this (and it got better), but there is still more to be done, and I am afraid I can’t do it (I think it may require changing the CxxWrap API to turn a bunch of “automatic” / implicit conversions in it into explicit conversions, but to a degree I am guessing here). Anyway, two years ago I opened an issue about it, Understanding and reducing invalidation caused by CxxWrap · Issue #278 · JuliaInterop/CxxWrap.jl · GitHub, if anyone would like to help… That would at least help anyone just doing using Oscar…

But in your context of course there is much more slowdown, presumably due to the many, many additional packages requiring even more recompilation. But I have not studied this in detail, so there may well be other factors involved that I am not aware of myself.

8 Likes

For complex packages, I am wondering if it would make sense to produce a version where some of the complicated dependencies are vendored in – that is, included as modified submodules rather than as separate packages. This creates new complications, but they could be manageable in terminal applications.

2 Likes

Hi, thank you so much for chiming in.
FYI I am using an SSD. On a fresh install of Julia and using the production versions of the packages, I am seeing similar timings to yours.

I guess the part that is confusing to me is that precompilation doesn’t help. In some sense, if you load packages in the same order during precompilation as you do in a production workflow, shouldn’t all the invalidations and recompilations have already happened and been cached (at least, all the inference results and native code)?

I might speculate that Julia aggressively invalidates code in a .so (like CxxWrap uses) to be safe if the .so got recompiled? But I was under the impression that invalidation should only come from inlined code, and that calling a dynamically linked library should never allow that.

In any case, I am in some sense offering to help. I was hoping there was an iterative process, something like
1) Find some invalidations, sorted by how bad they are
2) Fix them one by one, by adding a function wrapper or type annotations (see the sketch below for what I mean)
3) Rinse and repeat
which would slowly improve these load times. But it sounds like for Oscar in particular something deeper is happening.
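To clarify what I mean by a function wrapper in step 2, here is a generic illustration (not actual ParameterEstimation code):

# Type-unstable: `data` comes out of an untyped container, so everything
# downstream is inferred as Any and is easy to invalidate.
function rss(config::Dict{String, Any})
    data = config["data"]            # ::Any as far as the compiler knows
    s = 0.0
    for x in data
        s += x * x                   # dynamic dispatch on every iteration
    end
    return s
end

# With a function barrier, the dynamic dispatch happens once, and the kernel
# is compiled (and cached) for the concrete type of `data`.
rss_barrier(config::Dict{String, Any}) = _rss_kernel(config["data"])
_rss_kernel(data) = sum(x -> x * x, data)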

No, that’s not how this works. A package is always precompiled in the context of its own dependencies, and nothing more. Thus when you load package A first and it installs new methods for existing functions, that can invalidate code in package B, no matter what package B does. In general there is no way for B to “defend” against this (emphasis on “in general” – there are a few things you can do in certain cases, but in the end it is package A that must change something to prevent this – or in fact possibly some other package C, as things can be more complicated yet).

There is nothing magical about Oscar (and its related packages) here, other than that it is very big compared to the average Julia package, and hence has many more opportunities for invalidations to strike.

@fingolfin This is generally incorrect. In practice, most code that is invalidated is type-unstable code. So, it is usually the code that is invalidated that should be fixed, not the code that triggers the invalidation. “Blame the victim”, so to speak. As a package author, you can massively reduce the number of invalidations of your package by making it type stable. Not every case of invalidations, but by far most.
Again, this is in general. I don’t know if this applies to Oscar.jl in particular.

@orebas There is indeed such a workflow, using SnoopCompile. Here is a video showing how it’s done: https://www.youtube.com/watch?v=7VbXbI6OmYo . If you prefer written material, read the SnoopCompile docs, especially the tutorial (Tutorial on the foundations · SnoopCompile) and the guide on fixing invalidations (Snooping on and fixing invalidations: @snoopr · SnoopCompile).
You can also directly heal invalidations (Home · PrecompileTools.jl), though this is not a very elegant solution compared to preventing them from happening.

5 Likes

OK. I have watched the video, and done some reading, and I am giving this a try. As a caveat, I am using a development version of ParameterEstimation.jl.
I ran

using SnoopCompile
trees = invalidation_trees(@snoopr using ParameterEstimation)
methinvs = trees[end]
root = methinvs.backedges[end]
ascend(root)

and I am really quite lost. The “trees” object takes a bit of time to print. There seem to be many, many invalidations related to eltype(). Here are the last few lines of output:

              1872: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, typeof(MutableArithmetics.Test.sparse_linear_test)}, Vararg{Pair}}}) (2 children)
              1873: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, typeof(MutableArithmetics.Test.sparse_linear_test)}, Vararg{Pair}}}) (2 children)
              1874: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String}, Pair{String}, Pair{String, HomotopyContinuation.ModelKit.System}, Pair{String}, Pair{String, Dict{Any, Any}}, Pair{String, Dict{Any, Any}}, Pair{String}, Pair{String}, Pair{String, Dict{Nemo.QQMPolyRingElem, Nemo.QQFieldElem}}, Pair{String}, Pair{String}, Pair{String}, Pair{String, Dict{Nemo.QQMPolyRingElem, Int64}}, Pair{String}}}) (3 children)
              1875: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String}, Pair{String}, Pair{String, HomotopyContinuation.ModelKit.System}, Pair{String}, Pair{String, Dict{Any, Any}}, Pair{String, Dict{Any, Any}}, Pair{String}, Pair{String}, Pair{String, Dict{Nemo.QQMPolyRingElem, Nemo.QQFieldElem}}, Pair{String}, Pair{String}, Pair{String}, Pair{String, Dict{Nemo.QQMPolyRingElem, Int64}}, Pair{String}}}) (3 children)
              1876: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, Vector{Any}}, Pair{String, Vector{Any}}, Pair{String}}}) (3 children)
              1877: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, Vector{Any}}, Pair{String, Vector{Any}}, Pair{String}}}) (3 children)
              1878: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}}) (2 children)
              1879: superseding eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207 with MethodInstance for eltype(::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}}) (2 children)
   2 mt_cache

and here’s an example of what happens when I press enter on something in the tree:

Choose a call for analysis (q to quit):
 >   eltype(::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}})
       eltype(::Tuple{Pair{String, String}, Vararg{Pair}})
         Dict(::Tuple{Pair{String, String}, Vararg{Pair}})
eltype(t::Type{<:Tuple}) @ Base ~/julia-1.9.3/share/julia/base/tuple.jl:207
207 eltype(t::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}}::Type{<:Tuple})::Any = _compute_eltype(t::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}})
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [w]arn, [h]ide type-stable statements, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native, [j]ump to source always.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
 • _compute_eltype(t::Type{<:Tuple{Pair{String, String}, Vararg{Pair}}})

I just tried it out. Whew, what a massive package! 314 dependencies!
I think, to improve compilation time, one easy step would be to cut down the number of dependencies. For example, I see Pkg, BenchmarkTools, Test and JuliaFormatter in the dependency tree. These are used when developing, and shouldn’t need to be loaded by the user. I’m sure that if these packages are present, there are several other unnecessary dependencies that can safely be removed.

I also see Requires. AFAIU this package has been made obsolete in Julia 1.9 by package extensions, and Requires, unlike package extensions, interferes with precompilation. Removing this dependency would also make a big difference.

Also, I see very long package loading times.

Anyway, when you run ascend, in your example, you can see that the invalidated code comes from the call Dict(::Tuple{Pair{String, String}, Vararg{Pair}}). Unfortunately, it doesn’t say where that call is! I’m not sure why that is. There ought to be another line in the output where the Dict is being called, so you can track the source of the type instability down.
BUT - you can see that the call is type unstable - Vararg{Pair} is not a concrete type. So, tracking down where these type unstable Dict calls happen and making them type stable would remove this invalidation.
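To illustrate the pattern with a made-up example (not code from ParameterEstimation):

# In generic code where the later pairs' value types are not concretely known,
# a call like this is inferred with a signature such as
# Tuple{Pair{String, String}, Vararg{Pair}} and has to run Base's eltype
# machinery - the very MethodInstances being invalidated above.
make_dict(pairs::Pair...) = Dict(pairs...)

# Spelling out the key/value types at the call site avoids that computation
# and gives a concrete return type.
make_dict_stable(pairs::Pair...) = Dict{String, Any}(pairs...)

make_dict_stable("name" => "model1", "order" => 3, "coeffs" => [1.0, 2.0])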

I’m not an expert at fixing invalidations, but I have worked a bit with it. And every time I have done so, it all comes back to this point: The code you call is not type stable.

Loading ParameterEstimation causes 66k invalidations, which is an obscene amount. For comparison, this is 15x as many invalidations as loading DataFrames caused in Julia 1.5, back when invalidations were rampant. It’s invalidating more MethodInstances than there are in Julia’s system image in total! So - the good news is that there are a lot of gains to be had.
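(For reference, a count like that can be obtained roughly along these lines, using SnoopCompile’s uinvalidated helper:)

using SnoopCompile

invs = @snoopr using ParameterEstimation   # raw invalidation log
length(uinvalidated(invs))                 # number of unique invalidated MethodInstances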

I must admit, looking at this package and its dependency tree makes me despair a little bit. Tonnes of unused dependencies that are not even loaded by the packages that require them. Test dependencies in the package itself. Rampant type instability. What makes me despair is not that this stuff is in packages - that’s to be expected; not everyone has dug into the details of Julia packaging best practices. Rather, it’s that it’s too easy to inadvertently tank the latency - instead of these best practices being enforced by tooling, it’s up to each individual package maintainer to manually learn about and adhere to these practices (type stability, don’t add unused dependencies, and add precompile workloads). I fear that this does not scale to the ecosystem level.

13 Likes

Then it seems we have had quite different experiences :slight_smile:

While I agree that often type-unstable code is involved, I’ve encountered plenty of cases where the invalidated code by itself was completely innocent, albeit perhaps still suboptimal. To quote Tim Holy’s blog post, “most invalidations come from poorly-inferred code” – but that’s not the same as being type unstable.

But even with “perfect code” you can be a victim of invalidation, e.g. because a package that is loaded later commits type piracy. Example: here are two PRs that tweak CxxWrap to cause somewhat fewer invalidations in basically every package out there if it is loaded after CxxWrap – but they are not enough, it still does it. (I don’t mean to pick on CxxWrap here, by the way; it does its job, and “fixing” this will require someone who knows a lot about this to rewrite things in a breaking way, while ensuring that everything users of CxxWrap need can still be done, and then help them migrate to the new version… That’s a major task, and not one I’d expect a volunteer like Bart, the primary author of CxxWrap, to do “on the side”.)

As to SnoopCompile, JET, Cthulhu etc. – I’ve used them quite a bit, but they are no panacea. Over time I have submitted plenty of PRs to packages, and even to Julia itself, to fix things, and of course also to “our” packages. But it’s a big wall of things… And suggestions like “just run TESTXYZ on every function you suspect has an issue” are also not helpful if you have a package with thousands of functions… There is little tooling to help with that, it seems.

By the way, I certainly won’t claim Oscar is bug free (that’d be ridiculous), and I am sure there are plenty of invalidations that are entirely our “fault”, but I also feel that, to a degree, blaming package authors for a systemic issue of the language (which is the flip side of a systemic advantage of Julia, of course) is not helping. And I am genuinely concerned about this whole business: it is very easy to write Julia code and Julia packages, but it seems very hard to do the same while avoiding pitfalls like type instability, invalidations, etc. Tools like JET, SnoopCompile and Cthulhu are miracles of engineering, but at the same time they are needed to solve problems that in other languages don’t even exist, and they are difficult to pick up and learn.

8 Likes

Thank you for going down this path! I now have something I know how to do: remove unused dependencies. This is low-hanging fruit and I will report back on how much impact it has. I haven’t checked whether there exists a tool for this (I use include-what-you-use in C++; maybe Julia has something similar).

It sounds like the proposed workflow for even finding invalidations has stumped two expert users of Julia. Perhaps this is an issue with one of the tools (is it Cthulhu? snoopr? I’m not sure). In any case, I don’t think I can proceed in that direction at the moment. I’m probably making it worse, as I don’t quite understand how to write “type-stable” code.

Thanks again for taking a look.

1 Like

I think Aqua.jl checks for unused dependencies (among other things that might be helpful).
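If I remember the API correctly, something like this flags them:

using Aqua, ParameterEstimation

# Flags [deps] entries in Project.toml that the package never actually loads
Aqua.test_stale_deps(ParameterEstimation)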

So I have used PkgDependency.jl (the call is PkgDependency.tree("ParameterEstimation"))

and noticed that many of the most prominent Julia packages have some of the dependencies you mention.

In particular, by “activating” ParameterEstimation.jl, and then running
PkgDependency.tree("Requires", reverse=true)
you can get (I believe?) a view of the packages within ParameterEstimation.jl that cause it to depend on Requires.jl.
Some of the culprits include LinearSolve, Symbolics, FiniteDiff, ModelingToolkit, TaylorSeries, and many many others. (Requires shows up 16 times.)

JuliaFormatter seems to come in from ModelingToolkit.

I was able to remove BenchmarkTools as a dependency. (It’s a start!)

Pkg.jl seems to show up about 42 times. Many (all?) of the packages with _jll in the title depend on Pkg.

I don’t feel good about trying to get a package to not use LinearSolve or ModelingToolkit, indeed I will treat that as a non-starter.

I suppose I might expect Requires.jl to make its way out of the well-maintained packages in a few months; if it would help the community, I’m happy to try to document where that occurs via… GitHub issues? Polite emails to package maintainers? Here on Discourse?

See also Pkg.why, which also tells you why packages are required.
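For example, with the ParameterEstimation environment active (Pkg.why is available from Julia 1.9):

using Pkg

# Print every dependency chain through which Requires ends up in the environment;
# the Pkg REPL equivalent is `pkg> why Requires`.
Pkg.why("Requires")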

The Requires dependencies aren’t going away soon, since they are needed to support Julia pre-1.9. The JuliaFormatter issue should be fixed. MTK is currently a combination of several different things, and it has a lot of heavy dependencies to do things like show equations prettily.

If Requires is only used for 1.9- support (almost always the case), there is no need to have it installed on 1.9+. As Pkg docs recommend, Requires should be added to both deps and weakdeps sections of Project.toml so that it is not installed on 1.9+.
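On the code side, the usual companion pattern (sketched here with a hypothetical MyPkg, a Plots trigger package, and an ext/MyPkgPlotsExt.jl extension file) only loads Requires on Julia versions without package extensions:

module MyPkg

@static if !isdefined(Base, :get_extension)
    using Requires     # only ever loaded on Julia < 1.9
end

@static if !isdefined(Base, :get_extension)
    function __init__()
        # The UUID must match the trigger package's entry in Project.toml.
        @require Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80" include(
            joinpath(@__DIR__, "..", "ext", "MyPkgPlotsExt.jl"))
    end
end

end # module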

Not quite a necropost hopefully - your excellent blog post is 6 months old and presumably targeted at 1.9. Given further developments in 1.10 and master, do you see any recommendations changing? Would it be useful to have this blog post (or rather its “The present” section) as a wiki post on here so that the community can maintain and expand it?

I updated the recommendations a few weeks ago, but just to change emphasis. They have not changed in 1.10 or 1.11.
I think they will probably not change until one of the two following steps happens:

  • Julia can compile static executables
  • Type instability no longer causes invalidations
4 Likes