Precompile: Why?

It is not every time, is when it is not possible to prove that a new or updated method on any new package does not invalidate previous “versions” of the precompiled code. Reducing these “invalidations” is one important way of trying to improve with this delay.

5 Likes

The main issue is that multiple dispatch leaves you with many possible methods to compile and adding a new method anywhere in the call stack can change what needs to be compiled. Anytime you add a new method or package it might have to recompile many other packages (this is what is meant by invaldiations).

Okay, but what if I could call supercompile and it compiled a library for all possible type combos in scope? And then I would never have to recompile this library unless I updated a version of a package. Even if supercompile took 3 days but it meant my development cycle would not involve precompile for another month, I would do it.

And if I added a package to my development environment, I would like not to have to run the whole hypothetical 3-day process again, but to be able to incrementally link another piece to the library.

Is this a world I can dream about? :star_struck:

1 Like

I think PackageCompiler.jl basically allows you to do that

It only recompiles a package X if (a) the Julia version changes or (b) the version of X or any of its (recursive) dependencies changes.

This is exactly what it already does (including dependencies).

(Even in C++ with make, a recompile will be triggered if any of the header files of the dependencies change.)

5 Likes

This is a very good question with a somewhat involved answer. Bear with me.

This is exactly how Julia’s precompilation works. What is referred to as precompilation happens when packages are updated (or upon first execution when the combination of versions you are using has changed for some other reason—this is uncommon but can happen). However, precompilation doesn’t save jitted machine code… Except on Julia master this branch, where @tim.holy recently implemented saving jitted machine code to precompile files (.ji files as they’re called). However, I’m pretty sure Julia’s precompilation is not what you’re thinking of because…

For compiled languages like C++, you generally use external libraries that you download in a format that is already compiled.

This describes what is called “separate compilation”, which is when separate source files can be compiled independently and then “linked” together to form an executable. The linking step is somewhat costly but not nearly as expensive as the individual compilation work. And yes, separate compilation is a feature of most statically compiled languages like Java, C, C++… but not of others like C++.

Wait, how can separate compilation be both a feature and not a feature of C++? Because C++ has various different styles of usage. If you avoid template metaprogramming then C++ supports fully separate compilation. But there are many C++ programmers who make constant and heavy uses of the Standard Template Library (STL) and other template libraries based on it. And this is basically a completely different language—in many ways a more expressive one. But also one which doesn’t really support separate compilation. This is why Google’s C++ style guide forbids templates in their code: templates would make compiling their monorepo take forever because it breaks separate compilation.

Julia is very similar to C++ with heavy use of templates: it specializes generic code very aggressively on the actual runtime types that get passed to that code. You don’t have to write explicit templates for this to happen—it’s just done automatically. If you define MyType with a custom isless method and then call sort on Vector{MyType}, Julia generates new machine code for sort that is specialized for Vector{MyType} and which probably inlines the implementation of isless into the sort body. This kind of specialization is crucial for optimal performance. And this is exactly what STL’s std::sort does too. On the other hand, this is not how libc’s qsort works. Instead qsort achieves genericness by working with void * pointers and arbitrary array strides and using a function pointer callback for comparisons. As a consequence, qsort is slooow, whereas std::sort is fast. All of Julia is automatically like std::sort. (Many static languages which support generic programming and separate compilation, on the other hand, are automatically like qsort instead—OCaml, Haskel, etc.) This is one of the major reasons why Julia is so fast—great speed is one of the major advantages of this approach. But one of the major disadvantages is that it makes separate compilation very challenging.

That’s an idea, but the “even if supercompile took 3 days” part isn’t hypothetical, it’s actually an underestimate of how long it would take to compile every possible specialization. The number of possible specializations of f(a::A, b::B) is N(A) * N(B) where N(T) denotes the number of concrete implementations of T. The number of specializations of f(a::A, b::B, c::C) is N(A) * N(B) * N(C). And so on—it’s exponential in the number of arguments and multilinear in the number of concrete implementations of every argument type. Worse still, it is trivial to write a parametric type which has an infinite number of concrete implementations, meaning that the number of possible specializations of f would be infinite. So when I said underestimate above I meant that supercompilation would never stop.

The same thing is true in C++ with templates, since it’s easy to write a generic type that has an infinite number of potential concrete instantiations—that’s kind of the whole point of generic types! So what do they do? They compile the whole program so that they know what actual instantiations of templates are used: templates are fully expanded even before the compiler proper starts compiling. Julia doesn’t require you to do that, of course, and instead generates only the specializations that are actually used via JIT. Moreover features like eval and include mean that one can express Julia programs for which it’s inherently impossible to say what types will even be instantiated and what specialized methods will be called. What if you want to do what C++ does? Well, that’s exactly what PackageCompiler and StaticCompiler do. You still need to keep the JIT around just in case (because of eval), but you can try hard to generate all the code you think you’ll need in advance.

An obvious followup observation is that even for C++ with templates there are often completely separate libraries that can be used without needing to recompile everything. Can’t Julia do something like that? Indeed, that is possible and @jeff.bezanson has gradually been working towards it. We’ll be able to generate completely separate libraries written in Julia at some point in the somewhat near future. With further work it will even be possible for those separate libraries to not depend on LLVM or on the garbage collector (either because it can be proven that they don’t need them or by just letting them fail if they do try to use either).

To wrap it up with a summary:

  • Julia precompilation already works like that :slightly_smiling_face:
  • But it doesn’t include machine code :slightly_frowning_face:
  • Except on Julia master it does soon will :slightly_smiling_face:
  • But that still isn’t “separate compilation” :slightly_frowning_face:
  • Which is available in most static languages :slightly_smiling_face:
  • But not in C++ with templates :slightly_frowning_face:
  • Yet even there some separate compilation is possible :slightly_smiling_face:
  • And that also is coming to Julia :grinning:
135 Likes

Man, it is awesome that this is coming!

Looking up the relevant commits, this is Support external linkage in "sysimages" by timholy · Pull Request #44527 · JuliaLang/julia · GitHub (and related commits like https://github.com/JuliaLang/julia/pull/43990)?

Seems to be this: separate-compilation.md · GitHub?

1 Like

There’s one loophole though. The precompilation doesn’t compile fully, as far as I understand (for stable Julia releases and 1.8, I see 1.9-DEV changes things). It compiles to .ji files, which I believe may be LLVM bitcode (thus portable), not like JVM bytecode, but close enough(?). So every time you load (pre)compiled code, it needs to generate say x86 machine code (non-portable, but ok, since only done in memory, never hits the disk).

I’m not sure how slow the rest of the compilation is (e.g. always O(n) operation?), it seems often very quick. I’m guessing when I perceive slowness for some package, the reasons are (mostly) elsewhere.

With PackageCompiler.jl this is however different, and you get non-portable system image, I believe, fully compiled.

That’s great. You are talking about .ji files, i.e. for packages. It’s not obvious to me why your main script isn’t compiled to .ji, except where to put it, it would clutter up your directory where your .jl file is (and even there might be no .jl file, you can invoke short programs on the command line, maybe even pipe a program in?).

I’m not sure this should be the default, but could it be an option, with standard Julia only? It seems Python neither does this (only compiles packages to its bytecode), so maybe there are good reasons against this.

But I don’t think template necessarily prohibit separate compilation. If template is in written a header file, then we can still compile templates separately by including that header file. Of course, that will incur the cost of repeatedly instantiations of templates. Most of these instantiations are completely the same so the much of work are redundant.

The point is that this is not separate compilation — you no longer compile an external library once, but instead you recompile it (or at least, all of the templated code) in every downstream dependency that instantiates the template.

4 Likes

I don’t think that is completely true. The PR you think about allowed saving more inferred code but there is still no machine code saved. You might be thinking about Support external linkage in "sysimages" by timholy · Pull Request #44527 · JuliaLang/julia · GitHub which is ongoing work to, in the end, reach that goal.

5 Likes

Ah, thanks for correcting that! I thought saving machine code had already landed but you’re quite right, it hasn’t. But it’s coming :smiley:

3 Likes

Yes, those are the correct links and as @kristoffer.carlsson pointed out, I was jumping the gun a bit about machine code being saved in .ji files—that’s what the first Tim Holy PR implements and it hasn’t been merged yet. The separate compilation document is Jeff’s plan for implementing some degree of separate compilation (it will never be “one source file one object file” like it is in C).

1 Like

@Stefan that’s really exciting that separate compilation for libraries is coming to Julia. If it is what I’m thinking it is it could be game changing for Julia users. However, I suspect that maybe I’m confusing different parts of code preparation with the precompile? One of the things that really slows me down, is how the first time I run a function it is much slower than subsequent runs. It can take 5, 10 seconds sometimes on first run but 10ths of a second on subsequent runs. Is that the generation of bitcode versus running bitcode? What is happening then? I thought that was some kind of compilation.

The delay you see is actual compilation. From your source code to assembly and then calling it. There are some ideas to reduce that too, one of them being doing something like V8/TurboFan for javascript or HotSpot for Java , where it sees if your function is hot (called many times). So that it can either just interpret it (low latency/slow execution) or compile it a bit, or generate very optimized code. And hopefully it can change that on the fly so the delay isn’t perceived.

This is a fantastic post, thanks! Small tangent:

I just want to point out that recent algorithmic improvements in mold (by the author of lld) have made linking ridiculously parallel and way more memory efficient to the point where a 16 core machine can link chrome or clang in about 3s. I have a fresh-off-the-press PR for a jll recipe for mold if anyone is interested.

7 Likes

Yes, that is compilation happening. There are many stages of compilation, which I described here (I was using a now-old version of Julia, but even if the output wouldn’t look exactly the same now, it’s still generally accurate). When you encounter a compilation lag it can be compilation starting at any of those stages for various portions of the code you wrote, not necessarily all starting at the same stage.

The first time you use a particular combination of Julia package versions, if they support what we call “precompilation”, they are precompiled and the results of precompilation are saved as ~/.julia/v1.x/compiled/$package/*.ji files. These files save the results of various stages of compilation so that they can be reused again later if you load the same combination of Julia package versions. This precompilation used to only happen when you actually used the packages, but people found it annoying to have a potentially long wait when they were ready to do something, so since Julia 1.5 (IIRC) the package manager has triggered precompilation as soon as the dependencies of the active project are updated. This potentially leads to wasted work if someone does one package operation triggering precompilation and then another operation that causes that precompiled output to no longer be applicable. But it turns out that psychologically people are fine with maybe more time being spent when they’re installing packages as long as when they type using X they don’t have to wait.

Precompilation currently only caches the results of the lowering and type inference phases of compilation. It saves what is called IR (intermediate representation), which is basically a serialized version of the internal representation that Julia’s compiler uses for Julia code. What Tim’s PR does is allow saving native machine code as well, which means that for code that’s in the .ji files, it can just be loaded into memory and used immediately without any compilation steps.

Once it lands, that will significantly improve the delay when you do using X for most packages, but it still won’t completely eliminate it in all cases. Some packages don’t support precompilation because they do something that breaks it (although this is rare these days since we made precompilation opt-out rather than opt-in when Julia 1.0 was released). There are two other reasons that precompilation even with machine code caching won’t be a cure all:

  1. Precompilation only saves code that has actually been compiled after the package has been loaded. There is a facility for having a curated list of method signatures to precompile (you call the precompile function on them). However, it’s possible that this doesn’t cover the code you actually want to run, in which case you’ll have to wait for what you want to run to be compiled.

  2. There are inherently interactions between packages that cannot be precompiled because they are novel combinations of generic code and types to apply that generic code to. It is, of course, possible to handle this the way qsort and OCaml and Haskell do and generate generic but slow code that will work for all types, but if I know one thing about Julia users, it’s that they’ll start complaining about that very quickly if it happens to be something they care about the speed of. But it could be possible to use generic compiled code until fast specialized code is available, but that’s a pretty fancy feature that would be a long way off in the future.

This last point gets into what is known as “tiered JIT” where code is run in progressively faster forms as it becomes clearer and clearer that it is worth optimizing it further. But this is already involved enough so I won’t get into that.

20 Likes

What about going from interpreter → Specialized code, instead of having a more generic code intermediate step? Would that be easier?

1 Like

Yes, that would be easier and will be done first. But it’s an interesting observation that you can potentially do separate compilation of a slow generic version (but compiled) that would be much faster than interpretation but would still be slow compared to a fully specialized version. There’s a lot of code that would be too slow in the interpreter but doesn’t really need full specialization.

6 Likes