Julia precompilation limits: are there really any?

Hey,

I tried to read every detail on this topic, but I still can’t tell what the limiting factor is in reaching “no-compilation” speed. (By no-compilation speed I mean the speed you get the second time you run the appropriate function.)

  1. I read the documentation about the precompilation limits, but I didn’t understand: what are the limits of this precompilation, all in all?

  1.1. I am not sure I see how I can use @snoopi_deep to add its precompile directives to an installed package. Is it possible to generate a precompile.jl and create a sysimage from it? (So far I couldn’t make it work, I guess because of the lack of an ownership chain, so I have no idea how to do it yet.)

  2. How does create_sysimage come into the game, all in all? Does it create a precompiled method list from the package’s precompile directives?

  3. As far as I understand, the only thing Julia can’t do is to “cache generated code”. Is that close to saving the state of the program, or is it just the LLVM code with the appropriate linking?

1 Like

I’m not an expert on the topic, so others more knowledgeable about the compiler can definitely chime in here. But from my understanding:

  • Currently there will always be some compilation cost on the first call. Precompile directives only save type-inferred code, not native code, so the first time a precompiled method is called the compiler still needs to do some work. However, a major part of compiling time is spent in type inference, so precompilation will still significantly cut package loading time in most cases. The reason why native code is not cached is primarily because of method invalidations, which would just force re-compilation and make the point of caching machine code moot. Now that there are tools to diagnose and fix invalidations early on (thanks to the ever-amazing @tim.holy), we might see cache improvement ideas soon!
  • The documentation of SnoopCompile on precompilation limits is referring to the complexity of precompilation. You can’t just slap some precompile directives on your package and expect it to magically solve all your life problems without careful inspection into what precompilation is actually doing in your case. Two examples that text points out: one, if your function requires dynamic dispatch at some point in its body then methods from that point on might have to be resolved during runtime and won’t be cached; and two, even if you pin down the specific types for those methods to precompile, then they still won’t be saved if you don’t own them and your function doesn’t depend on them explicitly in inferred code. For a tutorial on the subject, this recent blog post is an amazing read.
  • @snoopi_deep is a tool to gain some insights into the type inference process. By analyzing where the compiler spends the most time and why, you’ll be able to improve inference, fix invalidations, and prevent unnecessary specializations early on. This will help Julia cache your package much more efficiently and speed up loading time. SnoopCompile provides other tools to generate precompile directives from @snoopi_deep results to help with the process, but again be mindful that this might not significantly improve latency if you don’t take care of the issues mentioned so far.
  • PackageCompiler’s create_sysimage merely saves the current state of your Julia session. Methods are already precompiled according to the package’s precompile directives; the value of PackageCompiler is that it saves all the subsequent compilation work your workflow depends on.
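
To make that last point concrete, a typical create_sysimage invocation looks roughly like this. This is only a sketch: the package name and file paths are placeholders, and the keyword names should be checked against the PackageCompiler documentation for your version.

```julia
using PackageCompiler

# Build a sysimage that bakes in Example's precompiled methods plus
# everything that gets compiled while executing the warm-up script.
create_sysimage(["Example"];
                sysimage_path = "ExampleSysimage.so",
                precompile_execution_file = "warmup.jl")
```

You would then start Julia with `julia --sysimage ExampleSysimage.so` to reuse that cached work.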
12 Likes

Hey,
Thank you for the detailed description.

So to conclude, if I understand well, these are the steps from code → runnable LLVM code:

  1. code line
  2. type inference
  3. finding the appropriate method based on the inferred argument types:
  • searching the cached generated-code space
  • searching the precompiled-code space
  4. if both of those fail: static compilation/precompilation of the function
  5. dynamic compilation (if static code generation did not succeed, if I am right?)
  6. adding the generated LLVM code to the cached-code space?

On precompile.jl: can’t we create some kind of ownership chain to use the results of @snoopi_deep?
Could create_sysimage save the state of the Julia session with some kind of ownership over these functions? Or is it possible to run the actual code and generate the sysimage after that? It sounds so close: we should have an ownership chain that would compile those precompile directives into the sysimage… I just don’t see how to do it yet. It feels like we have everything we need to bundle those precompilations with our sysimage if we want.

The compilation process actually continues all the way down to native machine code. On the first call:

  • @code_lowered: Code is first lowered from Julia AST into Julia SSA-form IR.
  • @code_typed: Type inference figures out (or gives up on if your code is unstable) the types of everything inside the function. This helps Julia deduce the return type and decide what specific methods to call when your function body calls other functions.
  • @code_llvm: Julia prepares the LLVM IR for your method.
  • @code_native: LLVM compiles code down to native machine code, and this is cached within your Julia session so that code is fast on the second run.
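
Each of these stages can be inspected interactively with the macros above; a minimal sketch, using a made-up trivial function:

```julia
# A trivial function to walk through the compilation pipeline
add_one(x) = x + 1

@code_lowered add_one(2)   # Julia SSA-form IR (stage 1)
@code_typed add_one(2)     # type-inferred IR (stage 2)
@code_llvm add_one(2)      # LLVM IR (stage 3)
@code_native add_one(2)    # native machine code (stage 4)
```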

Precompilation covers the first two stages, so that’s why it doesn’t completely eliminate latency on the first call either. When a precompiled method is invoked for the first time, the compiler will just pick up the work from typed code and continue down to machine code. Theoretically we could cache machine code as well and achieve zero latency, but there are subtle technical points to consider.
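
A precompile directive exercises exactly those first two stages; a minimal sketch:

```julia
# precompile runs lowering + type inference for the given signature
# and caches the result; it does NOT generate native code.
g(x) = 2x
precompile(g, (Float64,))   # returns true if inference succeeded
```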

One, type stability is still vital for precompilation to work well. Consider a hypothetical function:

function foo(x)
    temp = Any[x]     # element type Any: inference loses track of x's type
    bar(temp[1])      # the call target must be resolved by dynamic dispatch
end

Here type inference cannot figure out what is passed into bar, so the compiled function will still have to wait until runtime and fall back on dynamic dispatch. If the package author decides to precompile foo(::Int), Julia won’t precompile bar(::Int), because it doesn’t know dynamic dispatch will eventually resolve to that specific method at runtime. On the first call of foo(2), Julia will still have to compile bar(::Int); if bar is sufficiently complicated and time-consuming, compiling it will dominate the compilation cost and precompilation won’t help. The fix is simply to write code as type-stable as possible, or, if dynamic dispatch is necessary, to precompile bar(::Int) separately so that when the runtime type is known, Julia doesn’t have to start from scratch.
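
The suggested fix can be sketched like this (bar’s definition is made up for illustration):

```julia
bar(x::Int) = x * 2          # hypothetical callee

function foo(x)
    temp = Any[x]            # inference loses the element type here
    bar(temp[1])             # resolved by dynamic dispatch at runtime
end

precompile(foo, (Int,))      # caches foo(::Int) only
precompile(bar, (Int,))      # also precompile the callee, so the runtime
                             # dispatch finds already-inferred code
```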

Two, Julia’s flexibility also represents an issue. Since method tables are mutable global states, loading modules and adding methods can invalidate and force recompilation of other methods. In such a case, all of the precompilation work will just be thrown out and the first call to an invalidated method would just trigger the entire compilation process again. While package developers can now diagnose and patch invalidations, it doesn’t change the fact that even if we finally cache native code during precompilation, the performance benefits are only there if the programmer writes their code carefully.

Finally, there’s also the interesting technical question of where precompilation should be saved when package composability comes into the picture. Currently each package has its own *.ji file to save precompiled code, but if the function bar in the example above belongs to another package, where should we save our precompiled work? If we cache it in our package’s file, what happens when another package comes around and also caches the same method: how do we resolve the conflicts when we load them together? If bar is instead cached in the package it comes from, then loading that package will always load your specific bar method, even if it’s only useful when used with your module. Perhaps the precompilation cache should work on a per-environment basis, but then a lot of shared precompilation work wouldn’t be reused across environments.

To be clear, none of these problems make precompilation improvement infeasible. Type stability is emphasized a lot in the Manual, and tools are available to help prevent invalidations. The technicalities of precompilation caching could be handled if there were enough developer hours to invest in it. But someone needs to take the initiative, and the core devs already have a lot on their plate.

PackageCompiler is already able to save your session into a sysimage so you don’t have to pay the cost of first-call compilation, and it can cache method compilation from precompile files and the package’s test suite. But for the process to become worry-free and automatic, triggered just by entering ] precompile into the REPL, we’ll have to wait and see how our community rises to the task.

6 Likes

Very nice and crystal-clear answer, thanks!

For invalidations, I would say it would be nice to have a “blacklist” of packages from the registry, or even of other packages. I know invalidations can’t be reliably blacklisted, but some kind of warning on invalidations would help. I also created my own counter for this some days ago: Invalidation checker [Code Snippet]

For now I am not sure there is a real limit. I understand each of the points you mentioned. To write performant code, we can assume the person will look after making it type stable. The unstable version of the code is for scripting, which is totally fine for testing.
My idea is to create a package that is, as you mentioned, like a precompilation environment: one that just keeps all the precompile directives generated from @snoopi_deep.

Ok… I just realised I can’t push every precompile directive into a package, because precompile directives aren’t able to continue the ownership chain. The function has to be called from the package… damn! :smiley: I just wasted like 10 hours…
So:

module Y
# Trying to give module Y "ownership" of DemoPkg.f's precompilation:
prec() = Base.precompile(Tuple{typeof(DemoPkg.f),Core.Int64})
precompile(prec, ())
end

Module Y can’t chain the ownership of DemoPkg.f.

Hey, I think I made it work: Artificial ownership chaining for precompilation directives. This is essentially the precompilation environment you were talking about. I will try to continue in that thread, as its title is more appropriate for this task. (Also sorry for interrupting this discussion, but I wanted to note that the precompilation limitation isn’t necessarily a limitation after all.)

1 Like

Ok, optimising only the package reduced the inference time from 1.4–1.5 s to 0.076671348 s, checked with @snoopi_deep… That 0.07 s comes from another package that could also be automated, but I wouldn’t optimise below 0.01 s.

I think if this is used properly, we can give the code some freedom to change where necessary, while also fixing most of the code that will hardly ever change during the development workflow, which may be a good start.

I will try to create a package that creates these artificial ownership chains automatically from the @snoopi_deep results.

Some observations.

The method works and in our case it reduced some startup time from 78s to 33s. It is definitely a big deal for us.

I hope one day I can be back with some good working automated solution. At the moment I have a basic code generation method that works from @snoopi_deep.

4 Likes

Is there a strong argument for how the invalidation mechanism improves performance?
I read an article that just glances over the need for invalidation by presenting an example of how a type-unstable function can be further optimized after new methods are introduced.
But this is a weak argument to me, because it doesn’t explain why type-stable methods ever need to be invalidated.
If invalidation is never necessary for correctness, there should simply be a global switch to turn it off, for all methods and all cases.

Invalidation is only necessary for correctness.

Type stability has nothing to do with invalidation. Type-stable code that calls f(x) will still be invalidated if f(x::Int) is later added and the inferred argument type is Int.

2 Likes

Wouldn’t f(x) be left alone, with only the dynamic dispatches changing to f(x::Int)?

It can change static dispatches as well, if the arguments are inferred as Int.

It would not require f(x) itself to be recompiled. I’m reading a lot of people saying the slowdown is due to recompilation. Does this happen when code is heavily inlined?
I imagine you could disable inlining the first time a function’s callers have to be recompiled; then you’d be able to cache a lot of generated code. Packages that cause inlining to be disabled could generate reports, and you could have code caches keyed on which inlinings have been disabled.

It would require any code that calls f(x) on an Int to be recompiled.

@noinline, though you’re not talking about disabling inlining but about devirtualization of dispatches. You can leave the function un-inlined while still devirtualizing the dispatch, making the choice of method static but retaining the function call (to reduce code size). In fact, this is done very often.
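
A small sketch of that distinction (function names are made up):

```julia
# The callee is kept as a real call via @noinline, but because the
# argument type is known at compile time, the compiler can still pick
# the method statically: devirtualized dispatch without inlining.
@noinline callee(x::Int) = x + 1
caller(x::Int) = callee(x)   # static call to callee(::Int), not inlined
```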

1 Like

Simple example of necessary invalidation:

julia> f(x) = 1
f (generic function with 1 method)

julia> f(1) # compiles f(::Int) MethodInstance with f(x)
1

julia> f(x::Int) = 2 # finds and invalidates MethodInstances like f(::Int)
f (generic function with 2 methods)

julia> f(1) # compiles f(::Int) MethodInstance with f(x::Int)
2

If invalidation didn’t happen, the second f(1) would have used the compiled f(::Int) MethodInstance based on f(x), and the f(x::Int) method would be completely ignored. So invalidations are needed for dynamism. This seems wasteful, but you only compile upon a method’s first call, so if you define all your methods before any calls, you don’t need to invalidate anything. That’s why it’s best practice to do all the module definitions and imports that let modules extend each other’s functions before you start compiling and running the meat of the program.
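
The “define everything first” point can be seen with a fresh function: if both methods exist before the first call, there is nothing compiled yet to invalidate:

```julia
h(x) = 1
h(x::Int) = 2   # added before any call: no MethodInstance exists yet

h(1)     # compiles directly against h(::Int); no invalidation needed
h(1.0)   # compiles against the generic h(x)
```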

There’s a lot about Julia that isn’t dynamic (i.e., can’t be redefined at runtime), like struct definitions and nested method definitions, because that’s important for optimizations. Being able to add and edit methods of functions between function calls, without things breaking, is pretty sweet.

3 Likes

Chris mentioned devirtualized dispatch. I assume this means the calls to f(x) = 1 in your case are just inlined 1s. So every user of f(x) needs to be recompiled when you redefine f to return 2. However, if heavily used functions could avoid being inlined, then dispatches to f(::Int) could be overridden without triggering recompilation, and code caching would be worthwhile.

f(x) wasn’t redefined to return 2; f(1.0) would still be 1. f was just given a second, more specific method, so it has both f(x) and f(x::Int) methods.

I’m not sure what you mean by avoiding recompilation. Replacing f(x)'s version of f(::Int) with f(x::Int)'s version is the recompilation, and there’s no way to avoid that. If you’re talking about not recompiling some function g that uses f(::Int), you still have to recompile g to switch to the new version of f(::Int), whether it’s inlined (change 1 to 2) or not (change MethodInstance address). The only way g is totally separate is if it dynamically dispatches f(::Int) (checks for it every time it runs), and that’s very slow behavior. We usually try hard to avoid that, and we wouldn’t want it to be the default just to avoid recompilation.

3 Likes

This.

No, you’re still missing the difference between devirtualization and inlining. You can still remove the runtime dispatch mechanism even without inlining, and this is probably the most common behavior. You can remove this through usage of @nospecialize though, which is different from @noinline

1 Like

Dynamic dispatch is done by jumping through a vtable; static dispatch is done by patching the jump address. This could be hard to do in Julia, but fundamentally there is no reason to recompile when you change a function.

This isn’t true, since it ignores type inference. Changing which method you call can change the type of the returned result, so devirtualizing again requires (at a minimum) re-running type inference.