Can you share an example of a multithreaded compiler?
I know nothing about compilers; I thought I remembered that being suggested, but perhaps I misunderstood.
The ORCJIT backend of LLVM can be multithreaded, and this is a fairly recent change.
ORCJIT allows speculative compilation, so it can be set up to try to compile some calls before they happen (on a separate thread).
Even in these cases, a lot can be cached. The majority of Plots can be cached. As long as you don’t inline all of the calls into the one that accepts an anonymous function, the majority of stuff shouldn’t have to recompile. This is one thing that Plots.jl does get right: only the first time to plot is slow, and if you then switch to a different plot call, the recompiles are fine because the majority of the functions are just weird dictionary handling and stuff like that.
Just to be clear we are using “compile time” to mean the whole operation from hitting enter, which I believe consists of the following stages:
- Reducing to LLVM IR. This includes type inference.
- LLVM compiling the IR to machine code.
- Running the code.
IIRC the actual LLVM compile time (the LLVM-to-machine-code stage) is a small fraction of the total, with most “compile” time being type inference. Speaking with zero experience/knowledge on the matter, it does seem like there is an opportunity for multithreading here.
Could you elaborate on this? Just can’t imagine such a scenario, where the backend has no idea even about the structure of the potential queries so it would have to JIT-compile its reaction.
I’d go for 1 (static comp.) + 4 (interpreter). When developing (prototyping), use the interpreter (JIT-compilation “in the background” would be awesome, but not mandatory). When runtime speed is crucial, enable the JIT for the final touches of development and optimisations. Then statically compile everything that’s known in advance, and leave dynamic one-off stuff at runtime to the interpreter (optionally with JIT-compilation in the background).
I think compilation can never reach the interactivity and the responsiveness for one-off calls of an interpreter.
For Plots it’s mostly Julia type inference time, but in most cases it’s the other way around. But yes, I assume type inference should be able to do similar things? But that does mean that incorporating the ORCJIT changes won’t fix the Plots.jl issue.
In other words, a tracing JIT, which has worked wonders for LuaJIT and NodeJS. I find this approach really appealing because it’s a concept that’s well tested in other languages and (I assume) in CS research. It also lends itself well to offloading compilation of hot code to another thread, since you can just keep interpreting (and thus making progress) while you wait for compilation to complete, as well as kicking off multiple threads if multiple functions can be compiled separately.
Yeah, sounds nice (I’m not that deep into interpreter types etc.). Can those existing tracing JITs AOT-compile a “trace” and then use the precompiled versions of methods in another run/session when they encounter them? Shouldn’t be a big deal if the basic tracing is already in place…?
Yes, well essentially what it could do is compile methods (functions specialized on specific types, or unspecialized if marked @nospecialize). If you look at the LLVM code of a function, you’ll notice that, if a function it calls is too large to be inlined, there is just a primitive for a function call. So if the function being called is already compiled because it’s one of these parts, then that chunk doesn’t need to be compiled. Thus, even if you are doing something quite dynamic, a lot of code can be compiled ahead of time if the types can be known beforehand and if the function calls are not inlined. Note that if they are inlined, they are not using this primitive and are instead essentially copy-pasted into the caller code, causing that whole chunk to have to be compiled within the caller’s compilation.
Therefore, methods that are not inlined, whether unspecialized or specialized on commonly used types, are what will play a big role in cutting down compile times when mixing with AOT. @dlfivefifty mentioned that anonymous functions can be an issue, and this is because each anonymous function has its own type, so essentially every call has new types. But if you can find subroutines in there that are large enough or called infrequently enough that you don’t care whether they are inlined (it doesn’t affect performance), and they only depend on things like number and array types, those can be pulled out in order to drop compile times with an AOT tool.
If you do this, even an interpreter would be able to call the precompiled methods anyway, so you could probably make a great semi-compiled system just by making an interpreter that checks a compilation cache before interpreting.
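To make the "check a compilation cache before interpreting" idea concrete, here is a minimal, language-agnostic sketch in Python. It is not Julia and not how Julia's interpreter works. The names CachingInterpreter, define, register_compiled and call are made up for illustration, and the "compiled" entries are ordinary Python callables standing in for AOT-compiled methods keyed by function name and argument types.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical cache key: (function name, tuple of concrete argument types).
Signature = Tuple[str, Tuple[type, ...]]


class CachingInterpreter:
    """Toy dispatcher: use a precompiled method if one exists, else interpret."""

    def __init__(self) -> None:
        # Stand-in for an AOT compilation cache.
        self.compiled: Dict[Signature, Callable[..., Any]] = {}
        # Stand-in for source code that the interpreter can always run.
        self.sources: Dict[str, Callable[..., Any]] = {}
        # Records which path each call took, for inspection.
        self.log: List[Tuple[str, Signature]] = []

    def define(self, name: str, body: Callable[..., Any]) -> None:
        self.sources[name] = body

    def register_compiled(
        self, name: str, argtypes: Tuple[type, ...], fn: Callable[..., Any]
    ) -> None:
        self.compiled[(name, argtypes)] = fn

    def call(self, name: str, *args: Any) -> Any:
        key: Signature = (name, tuple(type(a) for a in args))
        fast = self.compiled.get(key)
        if fast is not None:
            # Cache hit: no compilation and no interpretation needed.
            self.log.append(("compiled", key))
            return fast(*args)
        # Cache miss: fall back to the slow but always-available path.
        self.log.append(("interpreted", key))
        return self.sources[name](*args)


interp = CachingInterpreter()
interp.define("axpy", lambda a, x, y: a * x + y)
# Pretend this specialization was shipped precompiled for Float64-like inputs.
interp.register_compiled("axpy", (float, float, float), lambda a, x, y: a * x + y)

interp.call("axpy", 2.0, 3.0, 1.0)  # takes the "compiled" path
interp.call("axpy", 2, 3, 1)        # no int specialization, so it is interpreted
```

The point of the sketch is only the lookup order: the dynamic outer call stays cheap to start, while anything whose argument types were known ahead of time is served from the cache.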
Our application domain is medical research, which has some rather extreme diversity in what one may wish from a data source. With DataKnots.jl, queries can be rather sophisticated programs, as combinations of well-known and vetted building blocks. These queries can be specified in Julia natively, using a macro syntax, or using a JSON serialization constructed by visual query builders. So that queries perform their work quickly, we need to have them compiled; hence, DataKnots is a library, not an interpreter. Even if we did know the general kinds of queries that would be submitted, even a small rearrangement of operations will cause a new compilation. This compilation happens because we’re using transparent structures, that is, we’re not wrapping them in a fixed data type. This way each query operator can be specialized at compile time rather than dispatching based upon type at run-time. This is acceptable for us, in fact, it’s a feature – Julia’s runtime compilation capabilities make queries run quickly (by avoiding lots of dynamic dispatch).
I’m only mentioning it on this list just because we provide an extreme case where pre-compilation probably won’t help much, and hence compilation speed is very much important to our run-time performance. For heavily used user interface screens, one could use a fixed endpoint (perhaps using GraphQL) with query parameters which are essentially the same query but with different inputs. We could use precompilation in this case, and for the backend of a system that requires very quick response time, it’s probably our best option. But. It’s not an interesting option. Our research community is often exploring data and as a result, queries are different request by request. That Julia could handle both cases is quite great. I hope this helps.
Again, you can definitely set up your code so that you isolate the part that actually depends on sending the data into the user-defined kernel, and have all of the associated setup code for querying the database etc. done in a way that can be precompiled. So if it is set up nicely then it should AOT compile most of the call, but it would take some foresight on the developer end to make sure that happens.
Or an easier way to handle that is to have the highest level call just wrap the user kernel in a FunctionWrapper using FunctionWrappers.jl, and then call an inner kernel which doesn’t inline. That inner kernel will be able to be AOT compiled, and then Julia only needs to JIT the three-line or so function that wraps it all into a function wrapper to call the AOT compiled function (or in theory it can just interpret this).
I like the ideas (towards which I was also thinking in this post) of having an interpreter that checks for compiled versions of methods before interpreting them. It could then, if desired, JIT-compile yet unknown methods after their first interpreted execution, in another thread or when idle anyway.
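A rough sketch of that flow, again in Python as a stand-in rather than anything Julia actually does: the first call is interpreted immediately while a background thread "compiles" the method, and later calls pick up the compiled version. TieredRuntime, compile_fn and wait_idle are hypothetical names, and compile_fn is just a placeholder for a real JIT.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple

Key = Tuple[str, Tuple[type, ...]]


class TieredRuntime:
    """Interpret first, compile in the background, use compiled code later."""

    def __init__(self, compile_fn: Callable[[Callable[..., Any]], Callable[..., Any]]):
        self._compile_fn = compile_fn          # hypothetical stand-in for a JIT
        self._compiled: Dict[Key, Callable[..., Any]] = {}
        self._pending: Dict[Key, Future] = {}  # compilations already requested
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=1)

    def call(self, name: str, source: Callable[..., Any], *args: Any):
        key: Key = (name, tuple(type(a) for a in args))
        with self._lock:
            fast = self._compiled.get(key)
            if fast is None and key not in self._pending:
                # First sighting: schedule compilation, but do not wait for it.
                self._pending[key] = self._pool.submit(self._compile, key, source)
        if fast is not None:
            return "compiled", fast(*args)
        # Keep making progress by interpreting while the compiler works.
        return "interpreted", source(*args)

    def _compile(self, key: Key, source: Callable[..., Any]) -> None:
        fast = self._compile_fn(source)
        with self._lock:
            self._compiled[key] = fast

    def wait_idle(self) -> None:
        """Block until all compilations requested so far have finished."""
        with self._lock:
            futures = list(self._pending.values())
        for future in futures:
            future.result()


def square(x):
    return x * x


rt = TieredRuntime(compile_fn=lambda src: src)  # identity "compiler" for the demo
first = rt.call("square", square, 4)   # answered right away by the interpreter
rt.wait_idle()                         # in real use this would be idle time
second = rt.call("square", square, 4)  # now served from the compiled cache
```

Whether the compiled cache is filled ahead of time (AOT) or in the background like this, the caller-facing lookup is the same, which is what makes the two ideas compose.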
Wouldn’t this also solve @cce’s trade-off? Like this, already the first usage of a new query would be fast, and subsequent ones have precompiled speed…
@ChrisRackauckas, it could be interesting to explore “how to write composable AOT-compiled building blocks” and how they can be composed at runtime without JIT. I presume this already works combining PackageCompiler (or similar) and
Isn’t there any way to distribute compiled binaries? It’s the norm to distribute the binary for compiled languages like C++ and Rust, so why does Julia have to distribute source code?
Where do you get the impression that Julia distributes source code (only)? https://julialang.org/downloads/
Or do you mean the fact that stdlib is written in Julia?
Julia’s AOT compilation model requires that for generic code, the AST of each method is available. Delivering self-contained executables is work in progress, but binary forms of packages do not make sense for general use.
I see. I only know the very basics of the most basic knowledge about compilers, but I wonder whether it would be possible and helpful for solving TTFP (time to first plot) to compile the most commonly called signatures of certain functions when building packages and distribute this compiled code (either LLVM IR or assembly code for individual architectures) alongside the source code? This way the compiler doesn’t have to compile all the way down during the first invocation.
Just an update for everyone in this thread. After the first time to plot analysis in
established that it really is all inference time (and hence solvable from Julia and not an LLVM issue), a new research software engineer has started in the Julia Lab to work on this problem. We have decided it’s probably best not to try solving this specifically for Plots.jl, but more generally. We had a meeting with a bunch of compiler folk and it resulted in these issues:
So I’m not going to promise anything quick or soon (I’m not even a compiler guy! Just doing what I can!), but those are some concrete ideas for things that would reduce compiler latency for all users, and hopefully handle Plots.jl. If you’re interested, I’d follow those and cheer on any developments. This will take some time, but hopefully it can be fixed the correct way and all packages will benefit from it.
And as always, even though there’s a concrete plan, everything is always limited by resources (time, tests, money, etc.). If you do want to help out, there are likely ways to pitch in, so don’t be afraid to ask. For example, maybe one thing that could help is to put together a script/repo that runs all of the known compile-time explosion examples to make it easier for the compiler engineers to run and test changes.
Maybe looking at how other languages implement tiered compilation can be useful too, e.g.