Why is Rust compilation faster than Julia precompilation?

I’ve been learning Rust a bit for tinkering with CLI tools (if curious… GitHub - MilesCranmer/rip2: A safe and ergonomic alternative to rm).

One thing I noticed is that the compilation time is super fast – even faster than Julia precompilation, let alone TTFX. I’m wondering what sorts of things are bottlenecking the current precompilation speeds in Julia at the moment?

Rust: One example: the Rust crate clap has 10 dependencies. When I do cargo clean and then cargo build --release, it takes 3.83 seconds. Then, when I make a whitespace change to the source code, it only takes 0.36 s to recompile into a binary (with -O3 optimization).

This fast compilation makes the static analysis tools really useful as they can do type inference almost instantly. Basically, imagine if we had instant @code_warntype running on every function simultaneously, highlighting all inferred types.

Julia: To contrast, the project Comonicon.jl (awesome, btw) is the equivalent of clap and has 8 dependencies (and fewer features). On v1.11.0-beta1, precompilation of only those 8 dependencies + Comonicon takes 6 seconds. When I make a whitespace change, it takes 2 seconds to precompile again.

This does not include TTFX time which would be on top of this (the result of cargo build is a binary).

For testing:

Rust: When I make a change to clap and run cargo test, it takes 1.41 s to run the entire test suite. Really nice for quick development.

Julia: Whereas, for example, julia --project=. -e 'using Pkg; Pkg.test()' of Comonicon takes 16 seconds.


Do we know the reason Julia precompilation times are lengthy? It has crept up a bit over the last few releases – obviously for very good reason; it has been awesome to have such fast TTFX. Just wondering about the next steps and what can be done in the Julia internals to improve this.

23 Likes

Something that prompted me to ask this was trying out evcxr, which provides a Jupyter kernel for Rust.

Try it out if curious. It is kind of crazy. Each time you run a cell, it is actually compiling and then executing the code. I thought this would be painfully slow to use, but it’s… not that bad. Interactive Julia is obviously much faster, but it was wild to me that compile times are so fast in Rust that you can actually have a decent REPL. So it got me thinking about what could be keeping precompilation+TTFX slow in Julia.

15 Likes

Btw, Rust benchmarks and profiles the compilation of clap, among other crates, on every commit. You can see the results here: rustc performance data – it takes 2.6 s from scratch and 0.2 s if there’s been no change. And here’s the profile on the latest commit: rustc performance data.

One reason Julia might be slow: for the compilation that occurs in Julia, I think a fair amount of time is spent allocating, which could be cheaper for various reasons in Rust.
Also, maybe they parallelise more? I have no idea if they do; I was just thinking about it because of Faster compilation with the parallel front-end in nightly | Rust Blog.
Also, Rust has recently been sped up a fair bit by improving PGO and adding BOLT, as described in Speeding up the Rust compiler without changing its code | Kobzol’s blog. I’m doing some work to bring these benefits to Julia.
https://nnethercote.github.io/ also has quite a few posts listing performance improvements for compilation.

12 Likes

For those interested in Rust’s benchmarks page, Kobzol wrote two blog posts explaining it.

1 Like

Julia precompilation involves running the functions; in particular, Comonicon runs the complete test suite to precompile. Maybe that’s why.

If Rust performs an incremental compilation after a change, that’s probably closer to what you get using Revise.

9 Likes

Rust uses demand-driven compilation and is expected to use salsa-rs in the future.

1 Like

Note that Julia’s precompilation caching is parallelized at the package level with no cached intermediates. With Rust you have intermediate object files (*.o) that can be reused per compilation unit - just like any statically compiled language.

For a single package, compare with the following.

cargo clean
cargo build -j 1

Large Rust projects do tend to suffer from compilation latency. Rust projects tend not to build shared libraries, but rather monolithic binaries.

For this reason, there are attempts to replace the LLVM compiler backend with the Cranelift backend (which originated in the WebAssembly/Wasmtime project).

1 Like

I don’t know how Apple did it, but Swift, another compiled programming language that uses LLVM, has a REPL with a debugger.

1 Like

Definitely not replace :slight_smile:

There was some brainstorming about a Cranelift backend for Julia a while back, but it hasn’t gone anywhere: Cranelift: a faster alternative to LLVM. It would be really cool to see; Cranelift is mature enough nowadays that you don’t hit many bugs using it with rustc (missing features yes, but not things like wrong codegen), and it really does have noticeably faster compilation times. I recommend trying it with evcxr.

For what it is worth, the project that provides Rust’s Jupyter kernel also provides a REPL for Rust: GitHub - evcxr/evcxr. Would be interesting to see a benchmark between the two.

---

Currently rustc does not parallelize the frontend. The backend is parallelized by splitting a single crate into as many orthogonal codegen units as possible before handing it to LLVM (codegen-units compiler flag, default maximum is 255), which can then build in parallel. These individual codegen units can then be reused or recompiled as needed to support incremental compilation. There is work to parallelize the frontend too, currently available on nightly, which will add another nice speed boost when it becomes stable. Crates compile in parallel where allowed by the dependency graph.
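For reference, the codegen-unit split described above is tunable through standard Cargo profile settings (a sketch; the exact numbers are just examples, not recommendations):

```toml
# Cargo.toml: trade compile-time parallelism against runtime optimization.
[profile.release]
codegen-units = 16   # more units = more parallel LLVM work, less cross-unit inlining

[profile.dev]
incremental = true   # reuse unchanged codegen units between builds (the default)
```

Setting `codegen-units = 1` typically gives the best runtime performance at the cost of the slowest, least parallel build.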

I don’t know much about the Julia side here but I have to imagine it isn’t splitting codegen units as optimally as it could be, which is tougher to do with the interpreted model. Same for frontend parallelization, I’m not entirely sure how that would work.

Rust does pretty well with the infra side of things too. As @Zentrik mentioned above, a performance suite gets run on every single commit (perfbot has dedicated machines, a run takes about 2 hours) and it is easy to run it on an individual pull to see the deltas (search the PRs for @rust-timer queue). So the barrier to entry is pretty low for somebody wanting to make performance contributions, which helps get interest - a few % wins here or there really add up. Julia may have something similar, I just have no idea.

Julia has been doing a great job getting startup times down in the past couple years, it will be exciting to see how good it gets.

8 Likes

There is a post on Rust’s blog about this. The speed increase is very promising.

When the parallel front-end is run in single-threaded mode, compilation times are typically 0% to 2% slower than with the serial front-end. […] When the parallel front-end is run in multi-threaded mode with -Z threads=8, our measurements on real-world code show that compile times can be reduced by up to 50%, though the effects vary widely and depend on the characteristics of the code and its build configuration.
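For anyone wanting to try it, the blog post describes opting in on a nightly toolchain along these lines (the flag is unstable, so its spelling may change before stabilization):

```shell
# Nightly-only: enable the parallel front-end with 8 threads.
RUSTFLAGS="-Z threads=8" cargo +nightly build
```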

Um… this is completely false. Comonicon.jl doesn’t even depend on PrecompileTools. I’m talking about precompiling Comonicon.jl itself – that takes 8 seconds.

1 Like

I thought they were generating the precompilation directives here: Comonicon.jl/compile.jl at main · comonicon/Comonicon.jl · GitHub

But maybe not. Should look closer at what is being compiled.

No worries. But indeed, that file is not executed during precompilation. And also it looks like the static precompile statements it generated aren’t even being used at the moment. (Might be helpful to edit that part of your original comment so nobody thinks this is why the times are so different.)

One still would have to look at the precompilation directives of all dependencies, right? If there are no precompilation directives, then precompilation should be fast, but TTFX should be slow.

I think the answer may instead simply be that Julia compilation has not been as optimized as Rust’s. Which is fine – recognition is the first step towards improvement. As others have shared in this thread, Rust has put a ton of resources into speeding up the compilation process. So there’s a lot we can learn about how to get the same gains.

5 Likes

I think it is more complicated than that. Precompilation in Julia is not equivalent to the compilation of Rust code. Each dependency might have precompilation directives to obtain cached code for methods that might never be used by the installed downstream package. On the opposite side, packages might not have precompilation directives at all, and then most of the compilation moves to first-time execution.

I’m not saying that Julia compilation is faster or that there isn’t a problem specifically there, just that the comparison is harder. Which doesn’t mean that we wouldn’t like having it faster, particularly when, as you show, there is a similar package in another language compiling faster.

2 Likes

I guess this point itself seems like a good thing to improve on?

2 Likes

I’m a little confused about what is being timed here. I’m getting 2 seconds on a pretty old computer.

julia> using Comonicon, BenchmarkTools

julia> @btime Base.compilecache(Base.PkgId(Comonicon))
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
[ Info: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
  2.074 s (3687 allocations: 358.37 KiB)
("/home/mkitti/.julia/compiled/v1.10/Comonicon/ylrB3_Jb1xZ.ji", "/home/mkitti/.julia/compiled/v1.10/Comonicon/ylrB3_Jb1xZ.so")

(comonicon) pkg> st
Status `~/blah/comonicon/Project.toml`
  [863f3e99] Comonicon v1.0.7

You can observe what is going on under the hood via JULIA_DEBUG=loading and --trace-compile=stderr.

$ rm -r ~/.julia/compiled/v1.10/Comonicon/
$ JULIA_DEBUG=loading julia --project --trace-compile=stderr -e "using Comonicon"
precompile(Tuple{Base.var"##s128#247", Vararg{Any, 5}})
precompile(Tuple{typeof(Base._nt_names), Type{NamedTuple{(:wait,), Tuple{Bool}}}})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:stale_age,), Tuple{Int64}}, typeof(FileWatching.Pidfile.trymkpidlock), Function, Vararg{Any}})
precompile(Tuple{FileWatching.Pidfile.var"##trymkpidlock#11", Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:stale_age,), Tuple{Int64}}}, typeof(FileWatching.Pidfile.trymkpidlock), Function, Vararg{Any}})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:stale_age, :wait), Tuple{Int64, Bool}}, typeof(FileWatching.Pidfile.mkpidlock), Function, String})
precompile(Tuple{FileWatching.Pidfile.var"##mkpidlock#7", Base.Pairs{Symbol, Integer, Tuple{Symbol, Symbol}, NamedTuple{(:stale_age, :wait), Tuple{Int64, Bool}}}, typeof(FileWatching.Pidfile.mkpidlock), Base.var"#968#969"{Base.PkgId}, String, Int32})
precompile(Tuple{typeof(Base.CoreLogging.shouldlog), Logging.ConsoleLogger, Base.CoreLogging.LogLevel, Module, Symbol, Symbol})
precompile(Tuple{typeof(Base.get), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Symbol, Nothing})
precompile(Tuple{typeof(Base.CoreLogging.handle_message), Logging.ConsoleLogger, Base.CoreLogging.LogLevel, Vararg{Any, 6}})
precompile(Tuple{typeof(Base.isopen), Base.GenericIOBuffer{Array{UInt8, 1}}})
precompile(Tuple{typeof(Logging.default_metafmt), Base.CoreLogging.LogLevel, Vararg{Any, 5}})
precompile(Tuple{typeof(Base.string), Module})
precompile(Tuple{Type{Base.IOContext{IO_t} where IO_t<:IO}, Base.GenericIOBuffer{Array{UInt8, 1}}, Base.TTY})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:bold, :color), Tuple{Bool, Symbol}}, typeof(Base.printstyled), Base.IOContext{Base.GenericIOBuffer{Array{UInt8, 1}}}, String})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:bold, :color), Tuple{Bool, Symbol}}, typeof(Base.printstyled), Base.IOContext{Base.GenericIOBuffer{Array{UInt8, 1}}}, String, Vararg{String}})
precompile(Tuple{Base.var"##printstyled#995", Bool, Bool, Bool, Bool, Bool, Bool, Symbol, typeof(Base.printstyled), Base.IOContext{Base.GenericIOBuffer{Array{UInt8, 1}}}, String, Vararg{Any}})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:bold, :italic, :underline, :blink, :reverse, :hidden), NTuple{6, Bool}}, typeof(Base.with_output_color), Function, Symbol, Base.IOContext{Base.GenericIOBuffer{Array{UInt8, 1}}}, String, Vararg{Any}})
precompile(Tuple{typeof(Base.write), Base.TTY, Array{UInt8, 1}})
┌ Debug: Precompiling Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
└ @ Base loading.jl:2353
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:cpu_target,), Tuple{Nothing}}, typeof(Base.julia_cmd)})
precompile(Tuple{typeof(Core.kwcall), NamedTuple{(:stderr, :stdout), Tuple{Base.TTY, Base.TTY}}, typeof(Base.pipeline), Base.Cmd})
precompile(Tuple{typeof(Base.open), Base.CmdRedirect, String, Base.TTY})
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/ExproniconLite/2CPrV_Jb1xZ.so for ExproniconLite [55351af7-c7e9-48d6-89ff-24e801d99491]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/OrderedCollections/LtT3J_sUOrk.so for OrderedCollections [bac558e1-5e72-5ebc-8fee-abe8a469f55d]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Configurations/2z6N1_Jb1xZ.so for Configurations [5218b696-f38b-4ac9-8b61-a12ec717816d]
└ @ Base loading.jl:1057
┌ Debug: Skipping mtime check for file /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/LazyArtifacts/src/LazyArtifacts.jl used by /home/mkitti/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/compiled/v1.10/LazyArtifacts/MRP8l_jWrrO.ji, since it is a stdlib
└ @ Base loading.jl:3129
┌ Debug: Loading object cache file /home/mkitti/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/compiled/v1.10/LazyArtifacts/MRP8l_jWrrO.so for LazyArtifacts [4af54fe1-eca0-43a8-85a7-787d91b784e3]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Scratch/ICI1U_SjAHO.so for Scratch [6c6a2e73-6563-6170-7368-637461726353]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/RelocatableFolders/Yg3O9_7JKAU.so for RelocatableFolders [05181044-ff0b-4ac5-8273-598c1e38db00]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Glob/3FzEV_7JKAU.so for Glob [c27321d9-0574-5035-807b-f59d2c89b15c]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/PackageCompiler/MMV8C_Jb1xZ.so for PackageCompiler [9b87118b-4619-50d2-8e1e-99f35a4d4d9d]
└ @ Base loading.jl:1057
precompile(Tuple{typeof(Base.spawn_opts_inherit), Base.DevNull, Base.TTY, Base.TTY})
precompile(Tuple{typeof(Base._tryrequire_from_serialized), Base.PkgId, String, String})
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/ExproniconLite/2CPrV_Jb1xZ.so for ExproniconLite [55351af7-c7e9-48d6-89ff-24e801d99491]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/OrderedCollections/LtT3J_sUOrk.so for OrderedCollections [bac558e1-5e72-5ebc-8fee-abe8a469f55d]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Configurations/2z6N1_Jb1xZ.so for Configurations [5218b696-f38b-4ac9-8b61-a12ec717816d]
└ @ Base loading.jl:1057
┌ Debug: Skipping mtime check for file /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/LazyArtifacts/src/LazyArtifacts.jl used by /home/mkitti/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/compiled/v1.10/LazyArtifacts/MRP8l_jWrrO.ji, since it is a stdlib
└ @ Base loading.jl:3129
┌ Debug: Loading object cache file /home/mkitti/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/compiled/v1.10/LazyArtifacts/MRP8l_jWrrO.so for LazyArtifacts [4af54fe1-eca0-43a8-85a7-787d91b784e3]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Scratch/ICI1U_SjAHO.so for Scratch [6c6a2e73-6563-6170-7368-637461726353]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/RelocatableFolders/Yg3O9_7JKAU.so for RelocatableFolders [05181044-ff0b-4ac5-8273-598c1e38db00]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Glob/3FzEV_7JKAU.so for Glob [c27321d9-0574-5035-807b-f59d2c89b15c]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/PackageCompiler/MMV8C_Jb1xZ.so for PackageCompiler [9b87118b-4619-50d2-8e1e-99f35a4d4d9d]
└ @ Base loading.jl:1057
┌ Debug: Loading object cache file /home/mkitti/.julia/compiled/v1.10/Comonicon/ylrB3_Jb1xZ.so for Comonicon [863f3e99-da2a-4334-8734-de3dacbe5542]
└ @ Base loading.jl:1057

Actually, @lmiq is correct here even without PrecompileTools.jl. To precompile pkgimages, Julia spawns another Julia process in output imaging mode.

The discussion also suggests a misconception about what PrecompileTools.jl does. By default, all code executed at the top level is used for precompilation. What PrecompileTools.jl does is allow you to execute certain pieces of code only during precompilation. PrecompileTools.jl basically boils down to the following if statement.

if ccall(:jl_generating_output, Cint, ()) == 1
    # exercise workload to only perform when generating a pkgimg
end

In output mode, Julia is monitoring what is being executed, which is actually not much in this case, and saving the type inference and native compilation to the specified files in ~/.julia/compiled/v#.#.

The command line options which control this process can be seen under julia --help-hidden:

$ julia --help-hidden

    julia [switches] -- [programfile] [args...]

Switches (a '*' marks the default value, if applicable):

 --compile={yes*|no|all|min}
                          Enable or disable JIT compiler, or request exhaustive or minimal compilation

 --output-o <name>        Generate an object file (including system image data)
 --output-ji <name>       Generate a system image data file (.ji)
 --strip-metadata         Remove docstrings and source location info from system image
 --strip-ir               Remove IR (intermediate representation) of compiled functions

 --output-unopt-bc <name> Generate unoptimized LLVM bitcode (.bc)
 --output-bc <name>       Generate LLVM bitcode (.bc)
 --output-asm <name>      Generate an assembly file (.s)
 --output-incremental={yes|no*}
                          Generate an incremental output file (rather than complete)
 --trace-compile={stderr,name}
                          Print precompile statements for methods compiled during execution or save to a path
 --image-codegen          Force generate code in imaging mode
 --permalloc-pkgimg={yes|no*} Copy the data section of package images into memory

3 Likes

Maybe you missed it but this was the original comment:

Which isn’t correct

2 Likes

I was timing the precompilation of Comonicon.jl as well as all of its dependencies.

If we were to only look at compilation of the package itself, then Rust compilation of clap would be at about 0.29 seconds, rather than 3.83.

1 Like