I’m pleased to announce the availability of SnoopPrecompile, a new package that can help reduce latency, particularly on Julia 1.8 and higher. The aim of the package is to make precompilation easy. As described in greater detail in the documentation, you can use it like this:
module MyPackage

using SnoopPrecompile

# define methods, types, etc

@precompile_all_calls begin
    # In here put "toy workloads" that exercise the code you want to precompile
    some_workload_you_want_to_make_faster(args...)
end

end # MyPackage
@precompile_all_calls does a small amount of work for you to help ensure that (particularly on Julia 1.8 and higher) all the needed support calls get precompiled. This style of precompilation is appropriate if some_workload_you_want_to_make_faster can be run safely on its arguments without bad side effects; delete_my_whole_harddrive() would not be a recommended candidate for this style of precompilation.
Improving Julia’s precompilation is an ongoing adventure; 1.8rc3 has some known issues but still should be better, for most users, than any previous Julia version. We expect additional progress in the area of precompilation in future versions of Julia. For those who want more detail, there is a JuliaCon talk on the topic on Thursday morning.
Specifically, @precompile_all_calls does three things:
1. runs the block only when precompiling (it incorporates if ccall(:jl_generating_output, Cint, ()) == 1), so that if you’re running with --compiled-modules=no you don’t waste time
2. disables the interpreter when running the block (to ensure everything gets compiled)
3. intercepts runtime dispatch to force precompilation of calls to methods defined in other packages
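The first list item above, gating on whether Julia is currently generating a precompile file, can be sketched in isolation. This only illustrates the quoted check and is not the package's actual implementation; `is_precompiling` and `maybe_run_workload` are hypothetical names.

```julia
# Hypothetical helper: true only while Julia is writing a precompile (output) file.
# The ccall is the same check quoted in the list above.
is_precompiling() = ccall(:jl_generating_output, Cint, ()) == 1

# Hypothetical wrapper: run `workload` only during precompilation, skip it otherwise.
function maybe_run_workload(workload)
    is_precompiling() || return false   # e.g. an ordinary interactive session
    workload()
    return true
end

# In an ordinary session nothing is being precompiled,
# so the workload is skipped and never executed.
ran = maybe_run_workload(() -> println("running toy workload"))
println("workload ran: ", ran)
```

Placing the check first means the toy workload costs nothing whenever no precompile file is being written.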
Item 3 is the really new thing. To explain in detail, any call made by runtime dispatch will break the chain of backedges. If those backedges don’t link back to the currently-precompiling package, the type-inferred code won’t be cached in the precompile file. Thus if the callee is a method in your package, no sweat (it’s already in your package so will be cached), but if the (runtime-dispatched) callee is in Base or elsewhere it won’t get precompiled by default on 1.8. @precompile_all_calls fixes this by snooping on inference and recording all new entrances to inference; once the @precompile_all_calls block exits it just iterates through the list of all newly-inferred MethodInstances and generates a manual precompile(f, (argtypes...)) command (which on 1.8 does force caching even if the method is external).
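To make that last step concrete, here is a minimal sketch of what a manual precompile directive looks like when written by hand. The function `double_then_sum` is a made-up stand-in for a callee; how @precompile_all_calls collects its list of MethodInstances is not shown here.

```julia
# Hypothetical callee standing in for a method reached by runtime dispatch.
double_then_sum(xs) = sum(x -> 2x, xs)

# Ask Julia to compile the specialization for Vector{Float64} without calling it.
# `precompile` returns `true` when a matching method was found and compiled.
ok = precompile(double_then_sum, (Vector{Float64},))
println("precompile succeeded: ", ok)

# The function still behaves normally when it is eventually called.
println(double_then_sum([1.0, 2.0, 3.0]))
```

The directive takes the function and a tuple of argument types, so it never runs the workload itself.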
None of this is a concern for inferrable dispatch, because on Julia 1.8 all inferrable dispatch gets cached regardless of module ownership. This is really just to scoop up those runtime-dispatched dependencies.
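As a rough illustration of the difference between inferrable and runtime dispatch, the sketch below asks inference for the return type of the same function under two argument types. `first_plus_one` is a hypothetical function, and `Base.return_types` is an unexported reflection utility used here only for demonstration.

```julia
# Hypothetical function: the element type of `v` decides whether the `+` call is inferrable.
first_plus_one(v) = v[1] + 1

# With Vector{Int} the compiler knows v[1] is an Int, so the `+` call is resolved statically.
rt_int = only(Base.return_types(first_plus_one, (Vector{Int},)))

# With Vector{Any} the element type is unknown, so the `+` method must be chosen at run time.
rt_any = only(Base.return_types(first_plus_one, (Vector{Any},)))

println("Vector{Int} -> ", rt_int)
println("Vector{Any} -> ", rt_any)
```

The second case is the kind of call whose callee would be picked up by the snooping described above.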
This really eliminates the need to ever statically generate those precompile directives, unless your workload has undesirable side-effects.
Its biggest issue was that any instability before LV (LoopVectorization) would cause all precompilation of LV to be removed, which is about 17 seconds of extra run time. So making that a lot more predictably precompiled is a huge change, even if LV doesn’t precompile fully yet.
I’m gonna leave on vacation tomorrow and won’t be able to follow this closely. But if I uncomment
(plus line#6 and the Project.toml)
then I get a segmentation fault (Windows)
julia> cd("C:/v"); @time using GMT
[ Info: Precompiling GMT [5752ebe1-31b9-557e-87aa-f909b540aa54]
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffb0229d596 -- splitpath at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\joaqu\.julia\dev\GMT\src\GMT.jl:316
splitpath at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)
Allocations: 83598528 (Pool: 83552476; Big: 46052); GC: 77
ERROR: Failed to precompile GMT [5752ebe1-31b9-557e-87aa-f909b540aa54] to C:\Users\joaqu\.julia\compiled\v1.7\GMT\jl_D31D.tmp.
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:33
[2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
@ Base .\loading.jl:1466
[3] compilecache(pkg::Base.PkgId, path::String)
@ Base .\loading.jl:1410
[4] _require(pkg::Base.PkgId)
@ Base .\loading.jl:1120
[5] require(uuidkey::Base.PkgId)
@ Base .\loading.jl:1013
[6] require(into::Module, mod::Symbol)
@ Base .\loading.jl:997
[7] top-level scope
@ timing.jl:220
julia> using TriangularSolve, LinearAlgebra, MKL;
julia> BLAS.set_num_threads(1)
julia> BLAS.get_config().loaded_libs
1-element Vector{LinearAlgebra.BLAS.LBTLibraryInfo}:
LBTLibraryInfo(libmkl_rt.so, ilp64)
julia> N = 100;
julia> A = rand(N,N); B = rand(N,N); C = similar(A);
julia> @time TriangularSolve.rdiv!(C, A, UpperTriangular(B));
2.799829 seconds (1.21 M allocations: 56.681 MiB, 98.19% compilation time: 97% of which was recompilation)
98.19% compile time, of which 97% was recompilation.
Checking @snoopr using TriangularSolve, there are a lot of invalidations.
Many of them still involve Static.jl.
My own focus will be on continuing to replace LoopVectorization in a way that allows us to avoid needing to add a ton of types that trigger further invalidations.
That said, many of these are easily fixable and thus could be worth spending time on, e.g.:
inserting ifelse(u::Bool, v1::VectorizationBase.Double, v2::VectorizationBase.Double)
@ VectorizationBase ~/.julia/packages/VectorizationBase/nsLCg/src/special/double.jl:108 invalidated:
backedges: 1: superseding ifelse(condition::Bool, x, y)
@ Base essentials.jl:565 with MethodInstance for ifelse(::Bool, ::Any, ::Any) (4 children)
I should definitely stop using CPUSummary.num_threads() (and use Threads.nthreads() instead).
That’s a perceptive comment. I hadn’t thought about putting the precompiles in the __init__ but it seems like it should work. In such cases @precompile_setup will have no impact but @precompile_all_calls should work as intended. (It’s a bit subtle because __init__ will be compiled before it runs, but since it’s a method in your package all the backedges should be fine for anything that’s not runtime-dispatched, and once the snooping is on it should pick up the runtime-dispatched.)
I’ve been messing around with this and getting some pretty awesome results. However, one place it doesn’t seem to do much is for precompiling Zygote gradients. It doesn’t seem to improve TTFG (time to first gradient) by much at all. For one random thing I was just trying, it goes from 13s → 11s, whereas for some of my other non-gradient code I’m getting like 15s → 2s. Is this a known limitation or is there anything to do about this? (This is with 1.8 btw)
This is most likely because Zygote gradients use so many closures, and closures are a different type per session so you cannot precompile them. If those closures were turned into callable structs everywhere in ChainRules.jl, then we’d probably be able to precompile a lot more (@chriselrod we should add this as another reason in SciMLStyle to not use closures). I don’t know how the ChainRules.jl devs would feel about that kind of style change though: it would be drastic, but it could also improve error messages.
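For readers unfamiliar with the suggestion, here is a small sketch of the closure versus callable-struct style. `make_scaler_closure` and `Scale` are hypothetical names, and nothing here is taken from ChainRules.jl or Zygote.

```julia
# A closure: its type is generated by the compiler and has an automatically chosen name.
make_scaler_closure(factor) = x -> factor * x

# Hypothetical callable-struct equivalent: the type has a fixed, explicit name.
struct Scale{T}
    factor::T
end
(s::Scale)(x) = s.factor * x

closure = make_scaler_closure(2.0)
callable = Scale(2.0)

println(typeof(closure))    # a compiler-generated type name
println(typeof(callable))   # the explicitly named struct type

# Both behave the same when called.
println(closure(3.0) == callable(3.0))
```

The named struct is also what would show up in stack traces, which is presumably the error-message benefit mentioned above.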
I’ve found a few cases where inference failed on v1.6 and I had to ntuple(i-> foo(x, i), Val{N}()) instead of ntuple(Base.Fix1(foo, x), Val{N}()) to overcome this.
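The two forms mentioned are interchangeable in what they compute; the sketch below uses a made-up `foo` and a fixed `N = 3` to show them side by side. It does not reproduce the v1.6 inference failure.

```julia
foo(x, i) = x * i   # hypothetical two-argument function

x = 10

# Closure form: captures `x` in an anonymous function.
with_closure = ntuple(i -> foo(x, i), Val{3}())

# Base.Fix1 form: a callable struct that fixes the first argument of `foo` to `x`.
with_fix1 = ntuple(Base.Fix1(foo, x), Val{3}())

println(with_closure)
println(with_fix1)
```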