ANN: New package SnoopPrecompile

tim.holy · July 25, 2022, 9:40pm

EDIT: SnoopPrecompile has been replaced by PrecompileTools.

I’m pleased to announce the availability of SnoopPrecompile, a new package that can help reduce latency particularly on Julia 1.8 and higher. The concept of the package is to make precompilation easy. As described in greater detail in the documentation, you can use it like this:

module MyPackage

using SnoopPrecompile

# define methods, types, etc

@precompile_all_calls begin
    # In here put "toy workloads" that exercise the code you want to precompile
    some_workload_you_want_to_make_faster(args...)
end

end  # MyPackage

@precompile_all_calls does a small amount of work for you to help ensure that (particularly on Julia 1.8 and higher) all the needed support calls get precompiled. This style of precompilation is appropriate if some_workload_you_want_to_make_faster can be run safely on its arguments without bad side effects; delete_my_whole_harddrive() would not be a recommended candidate for this style of precompilation.

Improving Julia’s precompilation is an ongoing adventure; 1.8rc3 has some known issues but still should be better, for most users, than any previous Julia version. We expect additional progress in the area of precompilation in future versions of Julia. For those who want more detail, there is a JuliaCon talk on the topic on Thursday morning.

ChrisRackauckas · July 26, 2022, 12:53am

What’s the difference between this and just putting a let block of example code into a package that is run on using time?

Alec_Loudenback · July 26, 2022, 1:16am

What happens on 1.7 if you include the code snippets above? It just doesn’t cache(?) the precompilation items whereas 1.8 has that capability?

tim.holy · July 29, 2022, 12:06pm

It does 3 things:

1. runs the block only when precompiling (it incorporates if ccall(:jl_generating_output, Cint, ()) == 1), so that you don’t waste time if you’re running with --compiled-modules=no
2. disables the interpreter while running the block (to ensure everything gets compiled)
3. intercepts runtime dispatch to force precompilation of calls to methods defined in other packages
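
As a rough illustration of the first item only, here is a minimal sketch of guarding a workload with the jl_generating_output check quoted above. The names is_generating_output and run_workload_if_precompiling are hypothetical helpers invented for this example; this is not how the package itself is written.

```julia
# Hypothetical helper (name is ours): true only while Julia is generating
# output such as a package precompile file.
is_generating_output() = ccall(:jl_generating_output, Cint, ()) == 1

# Hypothetical wrapper: run `workload` only during precompilation, so an
# ordinary session (or one started with --compiled-modules=no) skips it.
function run_workload_if_precompiling(workload)
    if is_generating_output()
        workload()
        return true
    end
    return false
end

# The toy workload is a cheap, side-effect-free call.
ran = run_workload_if_precompiling(() -> sum(rand(10)))
```

In an interactive session the workload should simply be skipped; inside a package it would run while the precompile file is being built.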

Item 3 is the really new thing. To explain in detail, any call made by runtime dispatch will break the chain of backedges. If those backedges don’t link back to the currently-precompiling package, the type-inferred code won’t be cached in the precompile file. Thus if the callee is a method in your package, no sweat (it’s already in your package so will be cached), but if the (runtime-dispatched) callee is in Base or elsewhere it won’t get precompiled by default on 1.8. @precompile_all_calls fixes this by snooping on inference and recording all new entrances to inference; once the @precompile_all_calls block exits it just iterates through the list of all newly-inferred MethodInstances and generates a manual precompile(f, (argtypes...)) command (which on 1.8 does force caching even if the method is external).

None of this is a concern for inferrable dispatch, because on Julia 1.8 all inferrable dispatch gets cached regardless of module ownership. This is really just to scoop up those runtime-dispatched dependencies.
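
To make the runtime-dispatch case a bit more concrete, here is a small self-contained sketch. The functions double and sum_doubled are made up for illustration; the only API used is Base's precompile(f, argtypes), the same kind of manual directive described above. Whether the result ends up in a package's precompile file depends on the Julia version and on where the code runs, so treat this purely as an illustration of the call pattern.

```julia
# Made-up example functions, not taken from any package.
double(x) = 2x

# `items` has element type Any, so the concrete type of `item` is unknown to
# inference and each call to `double` is resolved by runtime dispatch.
function sum_doubled(items::Vector{Any})
    total = 0.0
    for item in items
        total += double(item)
    end
    return total
end

# A manual precompile directive for one concrete signature.
# `precompile` returns a Bool indicating whether compilation succeeded.
ok = precompile(double, (Int,))

result = sum_doubled(Any[1, 2.5])
```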

This really eliminates the need to ever statically generate those precompile directives, unless your workload has undesirable side-effects.

tim.holy · July 29, 2022, 12:07pm

It shouldn’t do anything bad on earlier versions, but the main benefits will arrive only on 1.8 and higher.

ChrisRackauckas · July 29, 2022, 12:36pm

tim.holy:

Item 3 is the really new thing. To explain in detail, any call made by runtime dispatch will break the chain of backedges. If those backedges don’t link back to the currently-precompiling package, the type-inferred code won’t be cached in the precompile file. Thus if the callee is a method in your package, no sweat (it’s already in your package so will be cached), but if the (runtime-dispatched) callee is in Base or elsewhere it won’t get precompiled by default on 1.8. @precompile_all_calls fixes this by snooping on inference and recording all new entrances to inference; once the @precompile_all_calls block exits it just iterates through the list of all newly-inferred MethodInstances and generates a manual precompile(f, (argtypes...)) command (which on 1.8 does force caching even if the method is external).

That fixes the issues with LoopVectorization then!

Thanks, will use it right away

tim.holy · July 29, 2022, 1:12pm

LoopVectorization has some invalidation challenges, but I’m about to start looking into that in detail.

ChrisRackauckas · July 29, 2022, 1:17pm

Its biggest issue was that any instability before LV would cause all precompilation of LV to be removed, which is about 17 seconds of extra run time. So making that a lot more predictably precompiled is a huge change, even if LV doesn’t precompile fully yet.

joa-quim · July 29, 2022, 2:31pm

I’m gonna leave on vacation tomorrow and won’t be able to follow this closely. But if I uncomment

github.com

GenericMappingTools/GMT.jl/blob/master/src/GMT.jl#L316

		println("\nDetected a previously working GMT.jl version but something has broken meanwhile.\n" *
		"(like updating your GMT instalation). Run this command in REPL and restart Julia.\n\n\t\tGMT.force_precompile()\n")
		return
	end

	clear_sessions(3600)		# Delete stray sessions dirs older than 1 hour
	G_API[1] = GMT_Create_Session("GMT", 2, GMT_SESSION_BITFLAGS)
	(GMTver >= v"6.2.0") && theme_modern()			# Set the MODERN theme and some more gmtlib_setparameter() calls
	haskey(ENV, "JULIA_GMT_IMGFORMAT") && (FMT[1] = ENV["JULIA_GMT_IMGFORMAT"])
	f = joinpath(readlines(`$(joinpath("$(GMT_bindir)", "gmt")) --show-userdir`)[1], "theme_jl.txt")
	(isfile(f)) && (theme(readline(f));	ThemeIsOn[1] = false)	# False because we don't want it reset in showfig()
	(GMTver < v"6.2.0") && extra_sets()		# some calls to gmtlib_setparameter() (theme_modern already called this)
end

#@precompile_all_calls begin
	#G_API[1] = GMT_Create_Session("GMT", 2, GMT_SESSION_BITFLAGS)
	#plot(rand(5,2))
	#makecpt(T=(0,10))
	#grdimage(rand(Float32,32,32))
#end

(plus line#6 and the Project.toml)

then I get a segmentation fault (Windows)

julia> cd("C:/v"); @time using GMT
[ Info: Precompiling GMT [5752ebe1-31b9-557e-87aa-f909b540aa54]

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffb0229d596 -- splitpath at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\joaqu\.julia\dev\GMT\src\GMT.jl:316
splitpath at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)
Allocations: 83598528 (Pool: 83552476; Big: 46052); GC: 77
ERROR: Failed to precompile GMT [5752ebe1-31b9-557e-87aa-f909b540aa54] to C:\Users\joaqu\.julia\compiled\v1.7\GMT\jl_D31D.tmp.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
   @ Base .\loading.jl:1466
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base .\loading.jl:1410
 [4] _require(pkg::Base.PkgId)
   @ Base .\loading.jl:1120
 [5] require(uuidkey::Base.PkgId)
   @ Base .\loading.jl:1013
 [6] require(into::Module, mod::Symbol)
   @ Base .\loading.jl:997
 [7] top-level scope
   @ timing.jl:220

Elrod · July 29, 2022, 9:06pm

julia> using TriangularSolve, LinearAlgebra, MKL;

julia> BLAS.set_num_threads(1)

julia> BLAS.get_config().loaded_libs
1-element Vector{LinearAlgebra.BLAS.LBTLibraryInfo}:
 LBTLibraryInfo(libmkl_rt.so, ilp64)

julia> N = 100;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> @time TriangularSolve.rdiv!(C, A, UpperTriangular(B));
  2.799829 seconds (1.21 M allocations: 56.681 MiB, 98.19% compilation time: 97% of which was recompilation)

98.19% compile time, of which 97% was recompilation.

Checking @snoopr using TriangularSolve, there are a lot of invalidations.
Many of them still involve Static.jl.

My own focus will be on continuing to replace LoopVectorization in a way that allows us to avoid needing to add a ton of types that trigger further invalidations.

Although many of these are easily fixable and thus could be worth spending time on, e.g.:

 inserting ifelse(u::Bool, v1::VectorizationBase.Double, v2::VectorizationBase.Double)
     @ VectorizationBase ~/.julia/packages/VectorizationBase/nsLCg/src/special/double.jl:108 invalidated:
   backedges: 1: superseding ifelse(condition::Bool, x, y)
     @ Base essentials.jl:565 with MethodInstance for ifelse(::Bool, ::Any, ::Any) (4 children)

I should definitely stop using CPUSummary.num_threads() (and use Threads.nthreads() instead).

ChrisRackauckas · July 29, 2022, 10:16pm

Ouch, could we please not specialize on that one in Base?

Oscar_Smith · July 29, 2022, 11:24pm

That’s the main method of ifelse. What else would it be?

Zach_Christensen · July 29, 2022, 11:38pm

There are some cases where inference fails to determine that a Bool is passed to ifelse.

tim.holy · July 30, 2022, 1:11pm

Dunno, is that reproducible? The segfault is in a Windows system library, so it seems very unlikely to be due to this package.

mkitti · July 30, 2022, 1:48pm

This code looks like it needs to go in __init__. Right now it’s at the top level and is probably being executed before __init__.

tim.holy · August 1, 2022, 3:14pm

That’s a perceptive comment. I hadn’t thought about putting the precompiles in the __init__ but it seems like it should work. In such cases @precompile_setup will have no impact but @precompile_all_calls should work as intended. (It’s a bit subtle because __init__ will be compiled before it runs, but since it’s a method in your package all the backedges should be fine for anything that’s not runtime-dispatched, and once the snooping is on it should pick up the runtime-dispatched.)

marius311 · August 3, 2022, 7:38pm

I’ve been messing around with this and getting some pretty awesome results. However, one place it doesn’t seem to do much is for precompiling Zygote gradients. It doesn’t seem to improve TTFG (time to first gradient) by much at all. For some random thing I was just trying, e.g., it goes from 13s → 11s, whereas for some of my other non-gradient code I’m getting like 15s → 2s. Is this a known limitation or is there anything to do about this? (This is with 1.8 btw)

ChrisRackauckas · August 3, 2022, 10:44pm

This is most likely because Zygote gradients use so many closures, and closures are a different type per session so you cannot precompile them. If those closures were turned into callable structs everywhere in ChainRules.jl, then we’d probably be able to precompile a lot more (@chriselrod we should add this as another reason in SciMLStyle to not use closures). I don’t know how the ChainRules.jl devs would feel about that kind of style change though: it would be drastic, but it could also improve error messages.
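
For readers unfamiliar with the style change being proposed, here is a minimal sketch of the same operation written once as a closure and once as a callable struct. The names make_scaler and Scaler are invented for this example and do not come from ChainRules.jl or Zygote; the sketch only shows the two styles side by side and makes no claim about how much latency either one saves.

```julia
# Closure style: the returned function has a compiler-generated type.
make_scaler(a) = x -> a * x

# Callable-struct style: the type has an explicit, user-chosen name.
struct Scaler{T}
    a::T
end
(s::Scaler)(x) = s.a * x

closure_scaler = make_scaler(2.0)
struct_scaler = Scaler(2.0)

# Both behave identically when called.
closure_result = closure_scaler(3.0)
struct_result = struct_scaler(3.0)

# The struct's type can be written down directly, e.g. in a signature or a
# precompile directive; the closure's generated type name cannot be spelled
# as conveniently.
struct_type = typeof(struct_scaler)
```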

ChrisRackauckas · August 3, 2022, 11:28pm

Brought up here: Remove closures for callable types · Issue #657 · JuliaDiff/ChainRules.jl · GitHub. We’ll see if we go through with it.

Zach_Christensen · August 3, 2022, 11:35pm

I’ve found a few cases where inference failed on v1.6 and I had to write ntuple(i-> foo(x, i), Val{N}()) instead of ntuple(Base.Fix1(foo, x), Val{N}()) to overcome this.
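
The two spellings mentioned here can be compared in a short sketch. The function foo and the values below are placeholders chosen for illustration; the sketch only shows that both forms compute the same tuple and does not reproduce the v1.6 inference failure.

```julia
# Placeholder two-argument function, standing in for the `foo` above.
foo(x, i) = x + i

x = 10

# Anonymous-function (closure) form.
via_closure = ntuple(i -> foo(x, i), Val{3}())

# Base.Fix1 form: a callable struct that fixes the first argument of `foo`.
via_fix1 = ntuple(Base.Fix1(foo, x), Val{3}())
```

Both calls produce (11, 12, 13); the difference reported above is in how well inference handled each form, not in the result.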
