Profiling compilation/inference

I have a package for which running the test script results in around 20 minutes of compilation. Presumably somewhere in my package there’s something that’s giving inference a very hard time. I’ve done some work trying to make things as type stable as possible, and also tried adding annotations to guide the compiler, but it’s only gotten me this far. Moreover, making everything type stable doesn’t seem like the proper approach, given that a) it’s often hard, b) in many cases it doesn’t matter for runtime performance. The low-level bits that do matter for runtime performance I have already stabilized.

Besides, “never optimise without profiling” should also apply to compilation and inference. The question is, how do I profile, to find the bottlenecks?

I’ve tried using SnoopCompile, but it only gets me part of the way there. Something like @snoopi include("test/runtests.jl") tells me which of the really high-level methods take a long time to infer. The problem isn’t the body of these methods though, but rather the calls within them, and @snoopi doesn’t show inference times for those, since they are not top level. (By the way, can someone explain to me what exactly counts as “top level” in the sense used by the SnoopCompile docs, “Snooping on inference: @snoopi”?) I’ve gotten some mileage out of manually running @snoopi on each call in some problematic function and recursing on that, but it gets pretty tedious.
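
For concreteness, the kind of invocation I’ve been running looks roughly like this (the tmin cutoff is just an illustrative value):

using SnoopCompile   # in a fresh session, so nothing has been inferred yet
inf_timing = @snoopi tmin=0.01 include("test/runtests.jl")
# vector of (inference_time, method_instance) tuples
for (t, mi) in inf_timing
    println(round(t; digits=3), " s   ", mi)
end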

Does anyone have advice on tools to use or practices to follow to figure out where exactly inference is spending its 20 minutes?

2 Likes

Profiling inference is basically like other forms of profiling, with the exception that of course you can only do it in a fresh session. Let’s say that @snoopi identifies a particular call that you’re interested in; then restart Julia and do @profile f(args...) and then check the result. Because inference is recursive, this doesn’t always give you everything you might want, but if it turns out the bottleneck is in a very specific step then it will jump out at you.
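
Roughly, the workflow looks like this sketch (f and args standing in for whatever call you want to investigate):

# fresh session, so f still has to be inferred and compiled
using Profile
@profile f(args...)           # most samples land in inference, not in f's own runtime
Profile.print(mincount=100)   # prune rare frames; or use ProfileView.@profview f(args...)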

What “top level” means is basically the call from which inference was entered. If a call requires runtime method dispatch, inference is entered afresh for that call, so it gets counted as a new top-level entry.

2 Likes

Sorry, but I don’t think I quite understand. I ran @profile on a small script for which inference takes a long time. I got these huge stacks of calls to functions like typeinf and typeinf_ext, but I don’t see how this helps me figure out which parts of my code are causing inference to take so long. The only bits I found that relate to my code were things like

16933 (86.22%) samples spent in typeinf_ext
0 (ex.), 15173 (89.61%) (incl.) when called from top-level scope line 143
0 (ex.), 840 (4.96%) (incl.) when called from f line 112
0 (ex.), 114 (0.67%) (incl.) when called from f line 116
0 (ex.), 95 (0.56%) (incl.) when called from #random_MERA#21 line 251
.
.
.

f is the main function of my script where everything happens, and line 143 is where f is called. Apart from learning that 5% of typeinf calls came from a specific line of f, I don’t see anything useful here. All the usual granularity of profiling, being able to look at call stacks and flamegraphs to figure out where the bottlenecks are, is missing.

I thought your interest was in finding out where in inference the bottleneck was. If you’re writing valid Julia code, then ideally inference doesn’t take this long, so the real fix should be in Julia itself. Hence the “guts” of where it’s spending its time in inference is relevant to Julia developers (and should be reported as an issue), if not to you personally.

If you need info on what’s currently being inferred, I don’t know of external tools that will help. If you can build Julia from source, you could modify typeinf(interp::AbstractInterpreter, frame::InferenceState) so that it has a print_callstack(frame) at the beginning. It’s possible you might have to add this to more functions, like typeinf_local, typeinf_edge, etc.

Ah, I see. I assumed that the fault was in me writing bad code, but it is a flattering proposal that I could instead blame Julia. : ) I could try to see if I can boil the problem down to something simpler; filing an issue saying “the test script for my package takes ages to compile, please help” probably isn’t the most constructive thing.

Thanks, this in itself is helpful; at least I’m not simply unaware of some basic tool that I should be using. I may try inserting the print_callstack calls you proposed.

3 Likes

Hi @tim.holy; thanks for your help with this. I have also been analyzing this a bit further, without much success. The package in question is TensorKit.jl, and packages building on top of it.

Now TensorKit.jl is a package to deal with tensors (with symmetry structure), and is designed to be efficient for small, medium, and large tensors alike. So everything is concretely typed and all functions are made type stable, using tuple manipulations and lately also constant propagation. In fact, to test this, I wrote an improved @constinferred macro (source code) that tests for constant propagation of literal constants as well as interpolated values, and that — unlike Test.@inferred — also shows up as an actual test that can pass or fail. If people are interested, I can make this into a package or submit it as a PR against Julia/stdlib.

So the whole design seems to work well with respect to runtime performance. There is a basic type (TensorMap{...}) that is parameterized by the type of vector space/symmetry structure, the number of indices (in fact there are two different kinds of indices), the storage type of the underlying data (which includes the element type), and some more parameters that have to do with internal structure but are fixed by the previous ones. Several methods branch into different cases depending on the values of these type parameters and then call deeper functions, so that code irrelevant to a specific case is compiled away and efficient specialized code is obtained. Some of the high-level functions can take about 5 seconds of compilation time for a specific combination of arguments, but I think that is still reasonable, given the complexity of the underlying code. I have little to no experience with complicated C++ template code, but I assume the situation there is similar.
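
As a toy analogue of that branching pattern (hypothetical code, nothing from TensorKit itself):

struct Wrapper{S,N}
    data::NTuple{N,S}
end

isintegerlike(::Type{<:Integer}) = true
isintegerlike(::Type) = false

function total(w::Wrapper{S,N}) where {S,N}
    if isintegerlike(S)           # constant-folds once S is concrete
        return sum(w.data)        # only this branch survives for Wrapper{Int,3}
    else
        return reduce(+, w.data)
    end
end

total(Wrapper((1, 2, 3)))         # integer branch
total(Wrapper((1.0, 2.0, 3.0)))   # generic branch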

The real problem starts when a user writes some high-level routine which is not type stable, so that the compiler infers that objects are TensorMap instances but loses some or all of the type parameters. For run-time performance this should have little to no effect, thanks to the function barriers. However, when this user function is called, it is first compiled, and for that inference is run on it. The inference machinery then seems to go down the whole call stack, exploring all the different cases, and inference times blow up to tens of seconds, or even minutes or more (and worse still on older Julia versions).

Ideally, there should be some way to tell the compiler to give up as soon as some types are not concrete, or alternatively, to tell the compiler that this whole high-level user function should be run in interpreted mode. Evaluating the body of that high-level function directly at the REPL is fine, with the normal compilation time of any called method for the actual concrete run-time types of the given arguments.

Another idea might be that many functions are mutating a given destination argument, i.e. they are of the form

function f!(dest, other_args...)
    # several statements and branches depending on dest and args,
    # more calls to mutating functions on dest
    # but not a single statement of the form `dest = ...`
    return dest
end

In those cases, when inferring with respect to an abstract argument dest, my slow brain is much faster at deciding on the best estimate for the return type of that function than the Julia compiler is :). (Unless there are some exotic metaprogramming tricks that could actually change the type of dest, but I don’t think that is possible.) So maybe that pattern is sufficiently common that it could be special-cased in the Julia compiler, and inference could just return the type of the argument dest without inferring everything down the call stack, at least for abstract arguments, for which this particular method instance will never actually be compiled and run. Indeed, while Base.return_types or code_warntype on that function with those abstract arguments takes a long time, precompile just returns false almost instantaneously, which I interpret to mean that this method instance is not being compiled because it will never actually be run?
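
Schematically, with toy stand-ins rather than my actual code, the contrast I am seeing is:

mutable struct Dest{T}
    total::T
end

function addall!(dest::Dest, xs)
    for x in xs
        dest.total += x       # mutates dest but never rebinds it
    end
    return dest
end

# With an abstract dest, this is where (in my real code) all the time goes:
@time Base.return_types(addall!, (Dest, Vector{Any}))
# ...while precompile (in my real case) bails out almost instantly, returning false:
@time precompile(addall!, (Dest, Vector{Any}))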

1 Like

Great observations. Out of curiosity, can you tell us which Julia versions you’ve tried? The reason I ask is that the master branch (1) has some changes to heuristics that further limit the effort that the compiler goes to in cases of poor inference, and (2) as part of the effort to reduce invalidation, Julia now has more internal type assertions that reduce the propagation of type-uncertainty in cases of partial inference. As a simple example, in several places we’ve now annotated eof(io)::Bool so that you get good compilation even in cases where the precise type of io is unknown.
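
A minimal sketch of that kind of annotation (the function itself is made up, not actual Base code):

function first_nonempty_line(io::IO)
    while !(eof(io)::Bool)            # Bool even when the concrete type of io is unknown
        line = readline(io)::String   # same idea for the read itself
        isempty(line) || return line
    end
    return ""
end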

So maybe that pattern is sufficiently common that it could be special cased in the Julia compiler,

I’ve wondered about that same case but never done as much as you have to raise it. Have you looked to see if an issue has been filed? If not, an issue with a simple test case might speed progress on fixing the issue.

Just to make sure: say I have a type-stable function f_stable and a second function f_unstable which calls f_stable, but which has some type instabilities in its body, such that the concrete types of the arguments with which f_stable is called are unknown (though probably known to be more specific than Any, e.g. a concrete type with unknown type parameters, or some small union). Now if I add a type assertion to the return value of f_stable, even though this is not necessary for run-time performance, does this mean that when f_unstable is compiled, inference will not descend into the whole body of f_stable with abstract argument types, but rather just immediately use the asserted return type?
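
Schematically, I mean something like this (toy functions, not my real code):

f_stable(x)::Float64 = sum(abs2, x) / length(x)   # declared return type (or a call-site assertion)

function f_unstable(xs::Vector{Any})    # element types unknown to inference
    out = 0.0
    for x in xs
        out += f_stable(x)              # does inference still descend into f_stable here?
    end
    return out
end

f_unstable(Any[[1.0, 2.0], [3, 4]])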

Another practical question: is there a way to clear the method table, so that I can easily test these kinds of things without having to restart Julia every time?

Final question: while there was a time when I would use @generated functions for everything, these days I take pride in not having to use them anywhere (aside from the kernel-generating function in Strided.jl). However, it does seem that if I were to define some of these methods with problematic compilation times (for abstract types, not concrete types) using @generated, then when the compiler tries to infer them with abstract types it straight away returns Any without trying to be clever, since it doesn’t even have the information it needs to build the function body, let alone infer it. Does that sound correct? So maybe that is also a (workaround) solution.
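
That is, something in this spirit (a toy example, not the actual TensorKit function):

# The body is built from the concrete type parameters, so with abstract
# argument types there is no body to infer in the first place.
@generated function flatlength(x::NTuple{N,T}) where {N,T}
    return :($N * sizeof($T))
end

flatlength((1, 2, 3))   # 3 * sizeof(Int)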

As an example (though not one I expect anyone to actually try, or to investigate by going down its call stack):

julia> using TensorKit
[ Info: Precompiling TensorKit [07d1fe3e-3e46-537d-9eac-e9e13d0d4cec]

julia> function f(u, w, op)
           @tensor(
               scaled_op[-100 -200 -300; -400 -500 -600 -1000] :=
               w[5 6; -400] * w[9 8; -500] * w[16 15; -600] *
               u[1 2; 6 9] * u[10 12; 8 16] *
               op[3 4 14; 1 2 10 -1000] *
               u[7 13; 3 4] * u[11 17; 14 12] *
               w[-100; 5 7] * w[-200; 13 11] * w[-300; 17 15]
           )
           return scaled_op
       end

julia> @time Base.return_types(f, (TensorMap, TensorMap, TensorMap))

On Julia 1.5, this takes 220 seconds; on a freshly built Julia master, 48 seconds. So a definite improvement, but not sufficient. With actual concrete types,

julia> @time Base.return_types(f, (tensormaptype(Z2Space,2,2,Float64),tensormaptype(Z2Space,2,1,Float64),tensormaptype(Z2Space,3,4,Float64)))

this takes 18 seconds on Julia master and 21 seconds on 1.5. That’s also longer than what I said before, so actually quite long; but in that case I am 99.9% sure that there is not a single type instability throughout the whole call stack.

I have now tested making one particularly central function @generated instead; on Julia 1.5 this brings the inference time of the above example down from 220 seconds to 7 seconds with abstract types, and from 21 seconds to 11 seconds with concrete types. So maybe this really is the way to go.

I’m pleased by the progress on master but I see why you’d want more. Interesting that the generated functions are so much faster; you’re probably aware that they are a lot slower for certain operations, but you may have found a case where the converse holds.

Back to the issue of abstract inference, I don’t know TensorKit at all, but I’m guessing that concrete objects have a dimensionality? Can you pass TensorMap{N}? Again, you may have found your solution and this experiment may not be worth your time, but if you’re still poking at this and try it, I’d be curious to know the answer.

1 Like

Hi @tim.holy; thanks for your question. Could you shed any light on when @generated functions are slower? I did try not to use them because I remembered something about them being problematic for static compilation, whenever that becomes readily available. But I did not think that they would ever be slower?

As for your question/suggestion: So actually the TensorMap has two integer type parameters that have to do with dimensionality: a TensorMap object represents a linear map from a tensor product of N2 vector spaces to a tensor product of N1 vector spaces. These spaces can be reshuffled between domain and codomain, but that yields a TensorMap with different N1′ and N2′.

Here are a whole bunch of timings (for which one should restart Julia every time, unless there is a mechanism to clear the method table?)

On Julia 1.5

julia> @time Base.return_types(f, (TensorMap, TensorMap, TensorMap))
209.706046 seconds (375.58 M allocations: 28.486 GiB, 4.23% gc time)
1-element Array{Any,1}:
 TensorMap{S,N₁,N₂,G,A,F₁,F₂} where F₂ where F₁ where A<:(Union{TensorKit.SortedVectorDict{G,var"#s99"} where var"#s99"<:(DenseArray{T,2} where T), var"#s100"} where var"#s100"<:(DenseArray{T,2} where T)) where G<:Sector where N₂ where N₁ where S<:ElementarySpace

julia> S = Z2Space
ℤ₂Space

julia> @time Base.return_types(f, (TensorMap{S}, TensorMap{S}, TensorMap{S}))
 51.242267 seconds (116.99 M allocations: 7.747 GiB, 5.93% gc time)
  TensorMap{ℤ₂Space,_A,_B,ℤ₂,_C,_D,_E} where _E where _D where _C<:(Union{TensorKit.SortedVectorDict{ℤ₂,var"#s99"} where var"#s99"<:(DenseArray{T,2} where T), var"#s100"} where var"#s100"<:(DenseArray{T,2} where T)) where _B where _A

julia> @time Base.return_types(f, (TensorMap{<:IndexSpace,2,2}, TensorMap{<:IndexSpace,2,1}, TensorMap{<:IndexSpace,3,4}))
262.097488 seconds (432.13 M allocations: 32.961 GiB, 3.84% gc time)
1-element Array{Any,1}:
 TensorMap{S,N₁,N₂,G,A,F₁,F₂} where F₂ where F₁ where A<:(Union{TensorKit.SortedVectorDict{G,var"#s99"} where var"#s99"<:(DenseArray{T,2} where T), var"#s100"} where var"#s100"<:(DenseArray{T,2} where T)) where G<:Sector where N₂ where N₁ where S<:ElementarySpace

julia> S = Z2Space
ℤ₂Space
julia> @time Base.return_types(f, (TensorMap{S,2,2}, TensorMap{S,2,1}, TensorMap{S,3,4}))
 56.762282 seconds (120.42 M allocations: 8.163 GiB, 5.60% gc time)
 TensorMap{ℤ₂Space,_A,_B,ℤ₂,_C,_D,_E} where _E where _D where _C<:(Union{TensorKit.SortedVectorDict{ℤ₂,var"#s99"} where var"#s99"<:(DenseArray{T,2} where T), var"#s100"} where var"#s100"<:(DenseArray{T,2} where T)) where _B where _A

julia> @time Base.return_types(f, (tensormaptype(S,2,2,Float64),tensormaptype(S,2,1,Float64),tensormaptype(S,3,4,Float64)))
 25.287506 seconds (107.81 M allocations: 5.063 GiB, 11.22% gc time)
  TensorMap{ℤ₂Space,3,4,ℤ₂,TensorKit.SortedVectorDict{ℤ₂,Array{Float64,2}},FusionTree{ℤ₂,3,1,2,Nothing},FusionTree{ℤ₂,4,2,3,Nothing}}

On Julia 1.6/master

julia> @time Base.return_types(f, (TensorMap, TensorMap, TensorMap))
 46.389146 seconds (99.33 M allocations: 6.786 GiB, 4.18% gc time)
1-element Vector{Any}:
 TensorMap

julia> S = Z2Space
julia> @time Base.return_types(f, (TensorMap{S}, TensorMap{S}, TensorMap{S}))
 42.310939 seconds (87.23 M allocations: 6.132 GiB, 3.06% gc time)
1-element Vector{Any}:
 TensorMap{ℤ₂Space,_A,_B,ℤ₂,_C,_D,_E} where _E where _D where _C<:(Union{TensorKit.SortedVectorDict{ℤ₂,var"#s88"} where var"#s88"<:(DenseMatrix{T} where T), var"#s89"} where var"#s89"<:(DenseMatrix{T} where T)) where _B where _A
 
julia> @time Base.return_types(f, (TensorMap{<:IndexSpace,2,2}, TensorMap{<:IndexSpace,2,1}, TensorMap{<:IndexSpace,3,4}))
 48.019551 seconds (104.80 M allocations: 7.120 GiB, 4.45% gc time)
1-element Vector{Any}:
 TensorMap

julia> @time Base.return_types(f, (TensorMap{S,2,2}, TensorMap{S,2,1}, TensorMap{S,3,4}))
 41.746267 seconds (84.82 M allocations: 5.918 GiB, 3.65% gc time)
1-element Vector{Any}:
 TensorMap{ℤ₂Space,_A,_B,ℤ₂,_C,_D,_E} where _E where _D where _C<:(Union{TensorKit.SortedVectorDict{ℤ₂,var"#s88"} where var"#s88"<:(DenseMatrix{T} where T), var"#s89"} where var"#s89"<:(DenseMatrix{T} where T)) where _B where _A
 
julia> @time Base.return_types(f, (tensormaptype(S,2,2,Float64),tensormaptype(S,2,1,Float64),tensormaptype(S,3,4,Float64)))
 19.412151 seconds (95.24 M allocations: 4.940 GiB, 13.05% gc time)
1-element Vector{Any}:
 TensorMap{ℤ₂Space,3,4,ℤ₂,TensorKit.SortedVectorDict{ℤ₂,Matrix{Float64}},FusionTree{ℤ₂,3,1,2,Nothing},FusionTree{ℤ₂,4,2,3,Nothing}}

So the largest reduction comes from the first type parameter of TensorMap, which is the type of vector space. This makes sense, as there are several functions which have an explicit branch depending on static properties of that parameter (not directly on the parameter itself, which is why I did not define separate methods but rather wrote explicit branches, relying on dead-branch elimination by the compiler). So as soon as this parameter is known, the compiler does indeed seem to exclude several paths, and compilation speeds up. In a practical simulation, this first type parameter will be fixed and will typically not be lost even in sloppy code. The only place where it can sometimes not be inferred is in typical test functions, where you vary over different values of that parameter and then write a test function which does not specialize on that type. This happened on Julia 1.5 when the type was an argument of the test function without an explicit type parameter, since by default functions are not specialized on arguments which are types.
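
For instance (hypothetical toy functions, just to illustrate the specialization rule):

runtest(S) = one(S) + one(S)                                 # S is passed as a plain DataType; no specialization
runtest_specialized(::Type{S}) where {S} = one(S) + one(S)   # specialized for each S

runtest(Float64)              # works, but inference only sees S::DataType
runtest_specialized(Float64)  # compiled specifically for S === Float64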

Adding the dimensionality parameters N1 and N2 actually leads to worse compilation times if the first parameter is not known. If it is known, they only lead to a modest improvement in compilation time. The fully concrete type (which for convenience can be constructed using a call to tensormaptype) furthermore includes information about the storage type, and thus also the eltype of the tensor data, and of course leads to the fastest compilation time.

2 Likes