# Understanding precompilation and its limitations (reducing latency)

Many people have wanted to know more about how to reduce latency and the internal details of precompilation. I’ve just added a variant of the discussion below to SnoopCompile’s docs (https://timholy.github.io/SnoopCompile.jl/dev/snoopi/#Understanding-precompilation-and-its-limitations-1), but since it seemed to address an unmet need I decided to cross-post it here and invite discussion. Note that this concerns `precompile` directives in your package rather than usage of `PackageCompiler`, but it may help explain why `precompile` directives sometimes “work” and sometimes do not and provide strategies for more successful precompilation.

Suppose your package includes the following method:

``````"""
idx = index_midsum(a)

Return the index of the first item more than "halfway to the cumulative sum,"
meaning the smallest integer so that `sum(a[begin:idx]) >= sum(a)/2`.
"""
function index_midsum(a::AbstractVector)
ca = cumsum(vcat(0, a))   # cumulative sum of items in a, starting from 0
s = ca[end]               # the sum of all elements
return findfirst(x->x >= s/2, ca) - 1  # compensate for inserting 0
end
``````

Now, suppose that you’d like to reduce latency in using this method, and you know that an important use case is when `a` is a `Vector{Int}`. Therefore, you might precompile it:

``````julia> precompile(index_midsum, (Vector{Int},))
true
``````

This will cause Julia to infer this method for the given argument types. If you add such statements to your package, it potentially saves your users from having to wait for it to be inferred each time they use your package.

But if you execute these lines in the REPL, and then check how well it worked, you might see something like the following:

``````julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,3,4,100])
3-element Vector{Tuple{Float64, Core.MethodInstance}}:
(0.00048613548278808594, MethodInstance for cat_similar(::Int64, ::Type, ::Tuple{Int64}))
(0.010090827941894531, MethodInstance for (::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vararg{Any, N} where N))
(0.016659975051879883, MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N))
``````

Even though we’d already said `precompile(index_midsum, (Vector{Int},))` in this session, somehow we needed more inference of various concatenation methods. Why does this happen? A detailed investigation (e.g., using Cthulhu or `@code_warntype`) would reveal that `vcat(0, a)` is not inferrable “all the way down,” and hence the `precompile` directive couldn’t predict everything that was going to be needed.

No problem, you say: let’s just precompile those methods too. The most expensive is the last one. You might not know where `__cat` is defined, but you can find out with

``````julia> mi = tinf[end][2]    # get the MethodInstance
MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)

julia> mi.def               # get the Method
__cat(A, shape::Tuple{Vararg{Int64, M}}, catdims, X...) where M in Base at abstractarray.jl:1599

julia> mi.def.module        # which module was this method defined in?
Base
``````

Armed with this knowledge, let’s start a fresh session (so that nothing is precompiled yet), and in addition to defining `index_midsum` and precompiling it, we add

``````julia> precompile(Base.__cat, (Vector{Int64}, Tuple{Int64}, Tuple{Bool}, Int, Vararg{Any, N} where N))
true
``````

Now if you try that `tinf = @snoopi index_midsum([1,2,3,4,100])` line, you’ll see that the `__cat` call is omitted, suggesting success.

However, if you put all this into your package with such `precompile` in it and then check with `@snoopi` again, you may be in for a rude surprise: the `__cat` precompile directive doesn’t “work.” That turns out to be because your package doesn’t “own” that `__cat` method—the module is `Base` rather than `YourPackage`—and therefore Julia doesn’t know where to store its precompiled form. (Successfully precompiled code is cached in the `*.ji` files in your `~/.julia/compiled` directory.)

How to fix this? Fundamentally, the problem is that `vcat` call: if we can write it in a way so that inference succeeds, then all these problems go away. It turns out that `vcat` is fully inferrable if all the arguments have the same type, so just changing `vcat(0, a)` to `vcat([zero(eltype(a))], a)` fixes the problem. (Alternatively, you could make a copy and then use `pushfirst!`.) In a fresh Julia session:

``````function index_midsum(a::AbstractVector)
ca = cumsum(vcat([zero(eltype(a))], a))   # cumulative sum of items in a, starting from 0
s = ca[end]               # the sum of all elements
return findfirst(x->x >= s/2, ca) - 1  # compensate for inserting 0
end

julia> precompile(index_midsum, (Vector{Int},))
true

julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,3,4,100])
Tuple{Float64, Core.MethodInstance}[]
``````

Tada! No additional inference was needed, ensuring that your users will not suffer any latency due to type-inference of this particular method/argument combination.

In other cases, manual inspection of the results from `@snoopi` may lead you in a different direction: you may discover that a huge number of specializations are being created for a method that doesn’t need them. Typical examples are methods that take types or functions as inputs: for example, there is no reason to recompile `methods(f)` for each separate `f`. In such cases, by far your best option is to add `@nospecialize` annotations to one or more of the arguments of that method. Such changes can have dramatic impact on the latency of your package.

The ability to make interventions like these–which can both reduce latency and improve runtime speed–is a major reason to consider `@snoopi` primarily as an analysis tool rather than just a utility to blindly generate lists of precompile directives.

EDIT: if you’re working to improve your package’s precompiles, I encourage you to do this work with Julia nightly (or recent `master` source build). Precompilation should work better if you’re not invalidating methods you depend on.

39 Likes

Thank you for this nice explanation. I still have a couple of questions:

• How do you identify functions that need to be looked at in the first place? I have tried snoopcompile a while ago and didn’t find it so easy to navigate the long lists of methods to find critical pieces. Ideally I’d want a sorted list of functions with their respective latency contributions so I can work from the top down with maximum efficiency.
• You said in the example that fundamentally, vcat was the problem. How did you determine that? None of the methods in the list is vcat.

I think in this case it’s simply because some `cat` methods were invalidated and the only place in the function in question that’s dealing with `cat` was the call to `vcat`.

That’s exactly what `@snoopi` gives you: a list of `(time, methodinstance)` tuples, sorted by `time`. You should start at the end of the list so that you’re working on the most costly inference cases. I typically use `@snoopi tmin=0.01` to ignore any inference cases that take less than 10ms (you can of course choose whatever threshold you want).

vcat was the problem. How did you determine that?

The `cat` in the names was a hint. But if you don’t know, then just do this:

``````using Cthulhu
@descend index_midsum([1,2,3,4,100])
``````

If you don’t know Cthulhu (everyone should know @vchuravy’s Cthulhu, it’s one of my all-time favorite packages ), hit `o` to turn off optimization and `w` to turn on red “warn” coloring. Then start browsing the calls until you find type instability, or in this case the calls that showed up in the `@snoopi` results. I’ve posted a video on fixing invalidations, but starting just moments after 7:00 it’s basically a Cthulhu tutorial.

11 Likes

Ok yes, the list has the times for each individual call. However, the culprit is not one of those, but a different function above them. So, it might make more sense to sum the times of each component to see if some parent method is the actual problem? Like a flame graph but not for compute time, but compiler latency?

By the way, it’s also worth emphasizing that it’s much easier to figure these things out if you `@snoopi` on single calls rather than, say, your package’s entire test suite. What I typically do is use it on `runtests.jl` to get a sense for the worse offender(s) but then restart my session and tackle them one-by-one. You end up with much shorter lists specifically needed to implement the call you’ve executed.

the culprit is not one of those, but a different function above them

Perhaps you’ll like this better, then:

``````ulia> function index_midsum(a::AbstractVector)
ca = cumsum(vcat(0, a))   # cumulative sum of items in a, starting from 0
s = ca[end]               # the sum of all elements
return findfirst(x->x > s/2, ca)
end
index_midsum (generic function with 1 method)

julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,4,100])
4-element Vector{Tuple{Float64, Core.MethodInstance}}:
(0.0005469322204589844, MethodInstance for cat_similar(::Int64, ::Type, ::Tuple{Int64}))
(0.010178089141845703, MethodInstance for (::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vararg{Any, N} where N))
(0.045697927474975586, MethodInstance for index_midsum(::Vector{Int64}))
(0.06638789176940918, MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N))

julia> mi = tinf[end][2]
MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)

julia> using Cthulhu

julia> ascend(mi)
Choose a call for analysis (q to quit):
>   __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)
_cat_t(::Val{1}, ::Type{Int64}, ::Int64, ::Vector{Int64})
#cat_t#121(::Val{1}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vector{Int64})
(::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vector{Int64})
_cat(::Val{1}, ::Int64, ::Vector{Int64})
#cat#122(::Val{1}, ::typeof(cat), ::Int64, ::Vector{Int64})
(::Base.var"#cat##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(cat), ::Int64, ::Vector{Int64})
vcat(::Int64, ::Vector{Int64})
index_midsum(::Vector{Int64})
``````

One problem, though, is that a call made by runtime dispatch won’t have backedges, so the call chain will terminate.

So, it might make more sense to sum the times of each component to see if some parent method is the actual problem? Like a flame graph but not for compute time, but compiler latency?

Hah! Thanks to @NHDaly, we’re about to have that, perhaps even in the next day or so: https://github.com/timholy/SnoopCompile.jl/pull/139, which is based on https://github.com/JuliaLang/julia/pull/37749.

1 Like

Thanks for all your efforts to helps us tackle this despairing problem (spent hours in it and not managed to shave a single fraction of a second).

It happens that I also have those annoying cats

``````precompile(grdimage, (GMT.GMTgrid{Float32,2},))

julia> G = gmt("grdmath -R-15/15/-15/15 -I0.5 X Y HYPOT DUP 2 MUL PI MUL 8 DIV COS EXCH NEG 10 DIV EXP MUL =");

julia> tinf = @snoopi tmin=0.01 grdimage(G)
6-element Array{Tuple{Float64,Core.MethodInstance},1}:
(0.01399993896484375, MethodInstance for __cat(::Array{Float64,1}, ::Tuple{Int64}, ::Tuple{Bool}, ::Array{Float64,1}, ::Vararg{Any,N} where N))
(0.01699995994567871, MethodInstance for __cat(::Array{Any,2}, ::Tuple{Int64,Int64}, ::Tuple{Bool,Bool}, ::String, ::Vararg{Any,N} where N))
(0.026000022888183594, MethodInstance for common_shade(::Dict{Symbol,Any}, ::String, ::GMT.GMTgrid{Float32,2}, ::GMT.GMTcpt, ::Nothing, ::Nothing, ::String))
(0.06800007820129395, MethodInstance for GMTJL_Set_Object(::Ptr{Nothing}, ::GMT.GMT_RESOURCE, ::GMT.GMTcpt, ::Int64))
(0.09299993515014648, MethodInstance for common_get_R_cpt(::Dict{Symbol,Any}, ::String, ::String, ::String, ::Int64, ::GMT.GMTgrid{Float32,2}, ::Nothing, ::Nothing, ::String))
(0.39100003242492676, MethodInstance for finish_PS_module(::Dict{Symbol,Any}, ::String, ::String, ::Bool, ::Bool, ::Bool, ::GMT.GMTgrid{Float32,2}, ::Vararg{Any,N} where N))
``````

but I don’t have the slightest idea where those cats are coming from. Neither the grdimage program nor the utilities file call `hcat` or `vcat` directly (and don’t remember to make `[mat1 mat2]` constructions.

Also, when tried to use `Cthulhu`, it errored with

``````julia> ascend(mi)
ERROR: RadioMenu must have at least two options
Stacktrace:
[1] error(::String) at .\error.jl:33
[4] ascend(::Core.MethodInstance) at C:\Users\joaqu\.julia\packages\Cthulhu\mQqk0\src\Cthulhu.jl:336
[5] top-level scope at REPL[6]:1
``````

Sorry it’s proving frustrating. I get

You don’t seem to have GMT installed and I don’t currently install it automatically,

and installing it looks like more work than I’m willing to do, so I’m afraid I’ll have to limit myself to generic remarks.

First, the `RadioMenu must have at least two options` suggests you might be using an old version of Julia? You should do this analysis on nightly/`master`—that’s the future, and that’s where all the groundwork on eliminating invalidations (which allows precompilation to work better) has been done. That said, I suspect you’ll find that `mi.backedges` is undefined, indicating that it was called by runtime dispatch, and `ascend` won’t give you much useful information. If you want to find the caller(s), there are at least four options:

That said, it looks like you have bigger fish to fry than `__cat`. Anything in the ballpark of 100ms and above, that’s a pretty juicy target to precompile. The easiest one might be `common_get_R_cpt` simply on the basis of it not being a `Varargs` method. Presumably whatever is calling it has something non-inferrable about it, and here `@descend` (or `ascend`) should be your friend. You could also directly add that as a precompile statement, unless it’s a matter of it not “taking.”

1 Like

Yes, I understand that but GMT is not an easy package to … package. Trivial to install on Windows and presumably not very hard on Linux as well. Your distro probably has it

Regarding the `RadioMenu` after `] up`, although I didn’t see any message about updating `Cthulhu`, that error message disappeared.

I’m using Julia v1.5 because it’s faster that 1.6dev (14 days old). Although the dev shows less allocations, measuring effective time to FP, shows me that GMT is slower in dev. Not much but it is. But I’ll try with it.

I know the __cats are lower hanging but I just wanted to see a little bit of improvement.

I’ll try a bit more but waiting it’s also good option. `Cthulhu` it’s quite puzzling. The REPL just goes crazy.

Thanks for your efforts

You saw the link to the video above, right?

Right

1 Like

`@sprintf` seems to contribute in large to the inference issues. See first post of.
If I replace the `@sprintf` call here by a call to `sprintf` in the C lib, it shaves me 1 sec in Time to First Plot.

And curiously, the problem with `@sprintf` occurs only in v1.6. In 1.5 there is no reference to it

``````julia> using SnoopCompile

julia> precompile(plot, (Array{Float64,2},));

julia> @snoopi tmin=0.01 plot([0.0 0; 1.1 1])
5-element Array{Tuple{Float64,Core.MethodInstance},1}:
(0.010999917984008789, MethodInstance for ==(::Array{String,1}, ::Array{Any,1}))
(0.014999866485595703, MethodInstance for check_caller(::Dict{Symbol,Any}, ::String, ::String, ::String, ::String, ::Nothing, ::Bool))
(0.015000104904174805, MethodInstance for hvcat(::Tuple{Int64,Int64}, ::Float64, ::Vararg{Number,N} where N))
(0.025999784469604492, MethodInstance for add_opt(::Dict{Symbol,Any}, ::String, ::Char, ::Array{Symbol,2}, ::Symbol, ::Array{Union{Nothing, Array{Float64,2}},1}, ::NamedTuple{(:outline, :fill),Tuple{String,String}}))
(0.026999950408935547, MethodInstance for copyto!(::Array{Array{Symbol,N} where N,1}, ::Tuple{Array{Symbol,1},Array{Symbol,2},Array{Symbol,1},Array{Symbol,1},Array{Symbol,2}}))
``````
1 Like

There’s a new blog post underway that I pinged you on that should explain a lot of the fundamentals. Add a blog post on precompilation by timholy · Pull Request #1111 · JuliaLang/www.julialang.org · GitHub, with a preview at Tutorial on precompilation.

And pretty comprehensive docs on the upcoming release of SnoopCompile, including a “testbed” package for demonstrating many of the latency-reduction tricks. docs! by timholy · Pull Request #192 · timholy/SnoopCompile.jl · GitHub and a preview at SnoopCompile.jl · SnoopCompile.

9 Likes