Understanding precompilation and its limitations (reducing latency)

tim.holy · October 13, 2020, 11:40am

Many people have wanted to know more about how to reduce latency and the internal details of precompilation. I’ve just added a variant of the discussion below to SnoopCompile’s docs (Snooping on inference: @snoopi · SnoopCompile), but since it seemed to address an unmet need I decided to cross-post it here and invite discussion. Note that this concerns precompile directives in your package rather than usage of PackageCompiler, but it may help explain why precompile directives sometimes “work” and sometimes do not and provide strategies for more successful precompilation.

Suppose your package includes the following method:

"""
    idx = index_midsum(a)

Return the index of the first item more than "halfway to the cumulative sum,"
meaning the smallest integer so that `sum(a[begin:idx]) >= sum(a)/2`.
"""
function index_midsum(a::AbstractVector)
    ca = cumsum(vcat(0, a))   # cumulative sum of items in a, starting from 0
    s = ca[end]               # the sum of all elements
    return findfirst(x->x >= s/2, ca) - 1  # compensate for inserting 0
end

Now, suppose that you’d like to reduce latency in using this method, and you know that an important use case is when a is a Vector{Int}. Therefore, you might precompile it:

julia> precompile(index_midsum, (Vector{Int},))
true

This will cause Julia to infer this method for the given argument types. If you add such statements to your package, it potentially saves your users from having to wait for it to be inferred each time they use your package.

But if you execute these lines in the REPL, and then check how well it worked, you might see something like the following:

julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,3,4,100])
3-element Vector{Tuple{Float64, Core.MethodInstance}}:
 (0.00048613548278808594, MethodInstance for cat_similar(::Int64, ::Type, ::Tuple{Int64}))
 (0.010090827941894531, MethodInstance for (::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vararg{Any, N} where N))
 (0.016659975051879883, MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N))

Even though we’d already said precompile(index_midsum, (Vector{Int},)) in this session, somehow we needed more inference of various concatenation methods. Why does this happen? A detailed investigation (e.g., using Cthulhu or @code_warntype) would reveal that vcat(0, a) is not inferrable “all the way down,” and hence the precompile directive couldn’t predict everything that was going to be needed.

No problem, you say: let’s just precompile those methods too. The most expensive is the last one. You might not know where __cat is defined, but you can find out with

julia> mi = tinf[end][2]    # get the MethodInstance
MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)

julia> mi.def               # get the Method
__cat(A, shape::Tuple{Vararg{Int64, M}}, catdims, X...) where M in Base at abstractarray.jl:1599

julia> mi.def.module        # which module was this method defined in?
Base

Armed with this knowledge, let’s start a fresh session (so that nothing is precompiled yet), and in addition to defining index_midsum and precompiling it, we add

julia> precompile(Base.__cat, (Vector{Int64}, Tuple{Int64}, Tuple{Bool}, Int, Vararg{Any, N} where N))
true

Now if you try that tinf = @snoopi index_midsum([1,2,3,4,100]) line, you’ll see that the __cat call is omitted, suggesting success.

However, if you put all this into your package with such precompile in it and then check with @snoopi again, you may be in for a rude surprise: the __cat precompile directive doesn’t “work.” That turns out to be because your package doesn’t “own” that __cat method—the module is Base rather than YourPackage—and therefore Julia doesn’t know where to store its precompiled form. (Successfully precompiled code is cached in the *.ji files in your ~/.julia/compiled directory.)

How to fix this? Fundamentally, the problem is that vcat call: if we can write it in a way so that inference succeeds, then all these problems go away. It turns out that vcat is fully inferrable if all the arguments have the same type, so just changing vcat(0, a) to vcat([zero(eltype(a))], a) fixes the problem. (Alternatively, you could make a copy and then use pushfirst!.) In a fresh Julia session:

function index_midsum(a::AbstractVector)
    ca = cumsum(vcat([zero(eltype(a))], a))   # cumulative sum of items in a, starting from 0
    s = ca[end]               # the sum of all elements
    return findfirst(x->x >= s/2, ca) - 1  # compensate for inserting 0
end

julia> precompile(index_midsum, (Vector{Int},))
true

julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,3,4,100])
Tuple{Float64, Core.MethodInstance}[]

Tada! No additional inference was needed, ensuring that your users will not suffer any latency due to type-inference of this particular method/argument combination.

In other cases, manual inspection of the results from @snoopi may lead you in a different direction: you may discover that a huge number of specializations are being created for a method that doesn’t need them. Typical examples are methods that take types or functions as inputs: for example, there is no reason to recompile methods(f) for each separate f. In such cases, by far your best option is to add @nospecialize annotations to one or more of the arguments of that method. Such changes can have dramatic impact on the latency of your package.

The ability to make interventions like these–which can both reduce latency and improve runtime speed–is a major reason to consider @snoopi primarily as an analysis tool rather than just a utility to blindly generate lists of precompile directives.

EDIT: if you’re working to improve your package’s precompiles, I encourage you to do this work with Julia nightly (or recent master source build). Precompilation should work better if you’re not invalidating methods you depend on.

jules · October 13, 2020, 12:14pm

Thank you for this nice explanation. I still have a couple of questions:

How do you identify functions that need to be looked at in the first place? I have tried snoopcompile a while ago and didn’t find it so easy to navigate the long lists of methods to find critical pieces. Ideally I’d want a sorted list of functions with their respective latency contributions so I can work from the top down with maximum efficiency.
You said in the example that fundamentally, vcat was the problem. How did you determine that? None of the methods in the list is vcat.

Sukera · October 13, 2020, 12:18pm

I think in this case it’s simply because some cat methods were invalidated and the only place in the function in question that’s dealing with cat was the call to vcat.

tim.holy · October 13, 2020, 12:21pm

That’s exactly what @snoopi gives you: a list of (time, methodinstance) tuples, sorted by time. You should start at the end of the list so that you’re working on the most costly inference cases. I typically use @snoopi tmin=0.01 to ignore any inference cases that take less than 10ms (you can of course choose whatever threshold you want).

vcat was the problem. How did you determine that?

The cat in the names was a hint. But if you don’t know, then just do this:

using Cthulhu
@descend index_midsum([1,2,3,4,100])

If you don’t know Cthulhu (everyone should know @vchuravy’s Cthulhu, it’s one of my all-time favorite packages ), hit o to turn off optimization and w to turn on red “warn” coloring. Then start browsing the calls until you find type instability, or in this case the calls that showed up in the @snoopi results. I’ve posted a video on fixing invalidations, but starting just moments after 7:00 it’s basically a Cthulhu tutorial.

jules · October 13, 2020, 12:27pm

Ok yes, the list has the times for each individual call. However, the culprit is not one of those, but a different function above them. So, it might make more sense to sum the times of each component to see if some parent method is the actual problem? Like a flame graph but not for compute time, but compiler latency?

tim.holy · October 13, 2020, 12:33pm

By the way, it’s also worth emphasizing that it’s much easier to figure these things out if you @snoopi on single calls rather than, say, your package’s entire test suite. What I typically do is use it on runtests.jl to get a sense for the worse offender(s) but then restart my session and tackle them one-by-one. You end up with much shorter lists specifically needed to implement the call you’ve executed.

the culprit is not one of those, but a different function above them

Perhaps you’ll like this better, then:

ulia> function index_midsum(a::AbstractVector)
           ca = cumsum(vcat(0, a))   # cumulative sum of items in a, starting from 0
           s = ca[end]               # the sum of all elements
           return findfirst(x->x > s/2, ca)
       end
index_midsum (generic function with 1 method)

julia> using SnoopCompile

julia> tinf = @snoopi index_midsum([1,2,4,100])
4-element Vector{Tuple{Float64, Core.MethodInstance}}:
 (0.0005469322204589844, MethodInstance for cat_similar(::Int64, ::Type, ::Tuple{Int64}))
 (0.010178089141845703, MethodInstance for (::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vararg{Any, N} where N))
 (0.045697927474975586, MethodInstance for index_midsum(::Vector{Int64}))
 (0.06638789176940918, MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N))

julia> mi = tinf[end][2]
MethodInstance for __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)

julia> using Cthulhu

julia> ascend(mi)
Choose a call for analysis (q to quit):
 >   __cat(::Vector{Int64}, ::Tuple{Int64}, ::Tuple{Bool}, ::Int64, ::Vararg{Any, N} where N)
       _cat_t(::Val{1}, ::Type{Int64}, ::Int64, ::Vector{Int64})
         #cat_t#121(::Val{1}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vector{Int64})
           (::Base.var"#cat_t##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(Base.cat_t), ::Type{Int64}, ::Int64, ::Vector{Int64})
             _cat(::Val{1}, ::Int64, ::Vector{Int64})
               #cat#122(::Val{1}, ::typeof(cat), ::Int64, ::Vector{Int64})
                 (::Base.var"#cat##kw")(::NamedTuple{(:dims,), Tuple{Val{1}}}, ::typeof(cat), ::Int64, ::Vector{Int64})
                   vcat(::Int64, ::Vector{Int64})
                     index_midsum(::Vector{Int64})

One problem, though, is that a call made by runtime dispatch won’t have backedges, so the call chain will terminate.

So, it might make more sense to sum the times of each component to see if some parent method is the actual problem? Like a flame graph but not for compute time, but compiler latency?

Hah! Thanks to @NHDaly, we’re about to have that, perhaps even in the next day or so: Expose changes to Julia via `@snoopi_deep` to report timing profile breakdown of _all MethodInstances_ that are inferred during type inference by NHDaly · Pull Request #139 · timholy/SnoopCompile.jl · GitHub, which is based on https://github.com/JuliaLang/julia/pull/37749.

joa-quim · October 13, 2020, 9:44pm

Thanks for all your efforts to helps us tackle this despairing problem (spent hours in it and not managed to shave a single fraction of a second).

It happens that I also have those annoying cats

precompile(grdimage, (GMT.GMTgrid{Float32,2},))

julia> G = gmt("grdmath -R-15/15/-15/15 -I0.5 X Y HYPOT DUP 2 MUL PI MUL 8 DIV COS EXCH NEG 10 DIV EXP MUL =");

julia> tinf = @snoopi tmin=0.01 grdimage(G)
6-element Array{Tuple{Float64,Core.MethodInstance},1}:
 (0.01399993896484375, MethodInstance for __cat(::Array{Float64,1}, ::Tuple{Int64}, ::Tuple{Bool}, ::Array{Float64,1}, ::Vararg{Any,N} where N))
 (0.01699995994567871, MethodInstance for __cat(::Array{Any,2}, ::Tuple{Int64,Int64}, ::Tuple{Bool,Bool}, ::String, ::Vararg{Any,N} where N))
 (0.026000022888183594, MethodInstance for common_shade(::Dict{Symbol,Any}, ::String, ::GMT.GMTgrid{Float32,2}, ::GMT.GMTcpt, ::Nothing, ::Nothing, ::String))
 (0.06800007820129395, MethodInstance for GMTJL_Set_Object(::Ptr{Nothing}, ::GMT.GMT_RESOURCE, ::GMT.GMTcpt, ::Int64))
 (0.09299993515014648, MethodInstance for common_get_R_cpt(::Dict{Symbol,Any}, ::String, ::String, ::String, ::Int64, ::GMT.GMTgrid{Float32,2}, ::Nothing, ::Nothing, ::String))
 (0.39100003242492676, MethodInstance for finish_PS_module(::Dict{Symbol,Any}, ::String, ::String, ::Bool, ::Bool, ::Bool, ::GMT.GMTgrid{Float32,2}, ::Vararg{Any,N} where N))

but I don’t have the slightest idea where those cats are coming from. Neither the grdimage program nor the utilities file call hcat or vcat directly (and don’t remember to make [mat1 mat2] constructions.

Also, when tried to use Cthulhu, it errored with

julia> ascend(mi)
ERROR: RadioMenu must have at least two options
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] #RadioMenu#3 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\REPL\src\TerminalMenus\RadioMenu.jl:44 [inlined]
 [3] RadioMenu at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\REPL\src\TerminalMenus\RadioMenu.jl:44 [inlined]
 [4] ascend(::Core.MethodInstance) at C:\Users\joaqu\.julia\packages\Cthulhu\mQqk0\src\Cthulhu.jl:336
 [5] top-level scope at REPL[6]:1

tim.holy · October 13, 2020, 10:15pm

Sorry it’s proving frustrating. I get

You don’t seem to have GMT installed and I don’t currently install it automatically,

and installing it looks like more work than I’m willing to do, so I’m afraid I’ll have to limit myself to generic remarks.

First, the RadioMenu must have at least two options suggests you might be using an old version of Julia? You should do this analysis on nightly/master—that’s the future, and that’s where all the groundwork on eliminating invalidations (which allows precompilation to work better) has been done. That said, I suspect you’ll find that mi.backedges is undefined, indicating that it was called by runtime dispatch, and ascend won’t give you much useful information. If you want to find the caller(s), there are at least four options:

(maybe) There is a chance that https://github.com/timholy/SnoopCompile.jl/pull/139 will make this much easier, so just sitting tight for a few days might not be a bad option. (There’s also a chance that it won’t record the call chain either, since it “breaks” due to inference failures.)
(maybe) Home · MethodAnalysis.jl
(maybe) perhaps you can find it from the top via @descend
(certain) if you have a source build of Julia, edit that __cat method to add @show stacktrace(backtrace()). Revise.track(Base) will make this fairly painless.

That said, it looks like you have bigger fish to fry than __cat. Anything in the ballpark of 100ms and above, that’s a pretty juicy target to precompile. The easiest one might be common_get_R_cpt simply on the basis of it not being a Varargs method. Presumably whatever is calling it has something non-inferrable about it, and here @descend (or ascend) should be your friend. You could also directly add that as a precompile statement, unless it’s a matter of it not “taking.”

joa-quim · October 13, 2020, 10:56pm

Yes, I understand that but GMT is not an easy package to … package. Trivial to install on Windows and presumably not very hard on Linux as well. Your distro probably has it

Regarding the RadioMenu after ] up, although I didn’t see any message about updating Cthulhu, that error message disappeared.

I’m using Julia v1.5 because it’s faster that 1.6dev (14 days old). Although the dev shows less allocations, measuring effective time to FP, shows me that GMT is slower in dev. Not much but it is. But I’ll try with it.

I know the __cats are lower hanging but I just wanted to see a little bit of improvement.

I’ll try a bit more but waiting it’s also good option. Cthulhu it’s quite puzzling. The REPL just goes crazy.

Thanks for your efforts

tim.holy · October 13, 2020, 10:58pm

You saw the link to the video above, right?

joa-quim · October 13, 2020, 11:08pm

Right

joa-quim · December 28, 2020, 7:00pm

@sprintf seems to contribute in large to the inference issues. See first post of.
If I replace the @sprintf call here by a call to sprintf in the C lib, it shaves me 1 sec in Time to First Plot.

joa-quim · December 28, 2020, 7:16pm

And curiously, the problem with @sprintf occurs only in v1.6. In 1.5 there is no reference to it

julia> using SnoopCompile

julia> precompile(plot, (Array{Float64,2},));

julia> @snoopi tmin=0.01 plot([0.0 0; 1.1 1])
5-element Array{Tuple{Float64,Core.MethodInstance},1}:
 (0.010999917984008789, MethodInstance for ==(::Array{String,1}, ::Array{Any,1}))
 (0.014999866485595703, MethodInstance for check_caller(::Dict{Symbol,Any}, ::String, ::String, ::String, ::String, ::Nothing, ::Bool))
 (0.015000104904174805, MethodInstance for hvcat(::Tuple{Int64,Int64}, ::Float64, ::Vararg{Number,N} where N))
 (0.025999784469604492, MethodInstance for add_opt(::Dict{Symbol,Any}, ::String, ::Char, ::Array{Symbol,2}, ::Symbol, ::Array{Union{Nothing, Array{Float64,2}},1}, ::NamedTuple{(:outline, :fill),Tuple{String,String}}))
 (0.026999950408935547, MethodInstance for copyto!(::Array{Array{Symbol,N} where N,1}, ::Tuple{Array{Symbol,1},Array{Symbol,2},Array{Symbol,1},Array{Symbol,1},Array{Symbol,2}}))

tim.holy · December 31, 2020, 7:37pm

There’s a new blog post underway that I pinged you on that should explain a lot of the fundamentals. https://github.com/JuliaLang/www.julialang.org/pull/1111, with a preview at https://julialang.netlify.app/previews/pr1111/blog/2020/12/precompile_tutorial/.

And pretty comprehensive docs on the upcoming release of SnoopCompile, including a “testbed” package for demonstrating many of the latency-reduction tricks. https://github.com/timholy/SnoopCompile.jl/pull/192 and a preview at SnoopCompile.jl · SnoopCompile.