Understanding and optimizing compiler time (just a bit)

Have you tried experimenting with the ForwardDiff.jl chunk size? Lowering it can dramatically improve the compilation latency. I believe it comes from the compiler being slow with long tuples, both Svectors and the ForwardDiff.jl Chunk are essentially long tuples.

3 Likes

SnoopCompile sounds like it could help you. There’s also possibilities to supress reporting of functions with fast inference, so it should record compile times, but I don’t exactly know how to filter for the parts that take the most time. @tim.holy might be able to tell you if this is possible

2 Likes

There are big changes afoot in the SnoopCompile world; if you’re really brave see https://github.com/JuliaLang/www.julialang.org/pull/1093 (the preview is here: https://julialang.netlify.app/previews/pr1093/blog/2020/12/package_latency/) and try to follow along. Some names have already changed, and others likely will change in the coming days, so keep a close eye on PRs (both open and merged) to SnoopCompile. Or wait a week or two until this gets officially released.

13 Likes

Much of the focus is on inference, but LoopVectorization’s test suite is about 5% inference and 75% code gen:

640.984369 seconds (379.33 M allocations: 25.899 GiB, 1.58% gc time, 99.55% compilation time)
Test Summary:        |  Pass  Total
LoopVectorization.jl | 20429  20429
1572.714641 seconds (1.61 G allocations: 107.668 GiB, 2.26% gc time, 99.16% compilation time)
ROOT                      :  0.07 %   3378911825
GC                        :  2.26 %   106785484285
LOWERING                  :  2.82 %   133262113044
PARSING                   :  0.01 %   445595991
INFERENCE                 :  5.07 %   239589938800
CODEGEN                   : 75.02 %   3544299937574
METHOD_LOOKUP_SLOW        :  0.01 %   393193036
METHOD_LOOKUP_FAST        :  0.46 %   21748429238
LLVM_OPT                  : 10.95 %   517271539743
LLVM_MODULE_FINISH        :  0.11 %   5067346249
METHOD_MATCH              :  0.30 %   14070103600
TYPE_CACHE_LOOKUP         :  0.78 %   36937850996
TYPE_CACHE_INSERT         :  0.00 %   85812602
STAGED_FUNCTION           :  1.28 %   60412449084
MACRO_INVOCATION          :  0.00 %   167520658
AST_COMPRESS              :  0.37 %   17331975652
AST_UNCOMPRESS            :  0.30 %   14386285566
SYSIMG_LOAD               :  0.00 %   155666605
ADD_METHOD                :  0.01 %   677747888
LOAD_MODULE               :  0.00 %   149572157
INIT_MODULE               :  0.00 %   12006711
     Testing LoopVectorization tests passed

While 640s to run the testsuite locally isn’t bad, it takes about 10x longer on GitHub CI.

I think it’s worth building a copy of Julia with ENABLE_TIMINGS defined, to better understand where time is being spent.
I’ll try this with code heavy in ForwardDiff, StaticArrays, and DifferentialEquations to see how these compare.

11 Likes

Yes, the emphasis is definitely mostly on inference, and you’re right that for some packages inference is a minor component of the time. (Does LoopVectorization’s src/precompile.jl shave much time off? Once it’s precompiled, inference doesn’t have to run again, so the fact that it’s only 5% might be considered a triumph if it’s 15% when you disable the precompile file…)

But everything that has to get compiled must be inferred first, and our tools are much better for analyzing inference than code gen. So if you disable any precompile statements you have, you can get a pretty comprehensive understanding of “what’s happening at the level of the compiler” and from there make some decisions. “Gee, maybe this isn’t worth specializing after all…” saves both inference and codegen.

But it would definitely be good to have better tools for peering inside that codegen box.

3 Likes

Hmm:

654.318623 seconds (380.08 M allocations: 25.946 GiB, 1.55% gc time, 99.56% compilation time)
Test Summary:        |  Pass  Total
LoopVectorization.jl | 20429  20429
1610.347997 seconds (1.63 G allocations: 108.695 GiB, 2.21% gc time, 99.17% compilation time)
ROOT                      :  0.07 %   3621117210
GC                        :  2.21 %   106923027420
LOWERING                  :  2.73 %   132035152379
PARSING                   :  0.01 %   423049837
INFERENCE                 :  5.01 %   242481817151
CODEGEN                   : 75.82 %   3667567191185
METHOD_LOOKUP_SLOW        :  0.01 %   392322122
METHOD_LOOKUP_FAST        :  0.46 %   22169082841
LLVM_OPT                  : 10.72 %   518514550503
LLVM_MODULE_FINISH        :  0.10 %   5020289896
METHOD_MATCH              :  0.29 %   14233310191
TYPE_CACHE_LOOKUP         :  0.76 %   36875756678
TYPE_CACHE_INSERT         :  0.00 %   87588140
STAGED_FUNCTION           :  0.96 %   46499016301
MACRO_INVOCATION          :  0.00 %   177027321
AST_COMPRESS              :  0.35 %   17031523906
AST_UNCOMPRESS            :  0.30 %   14559510637
SYSIMG_LOAD               :  0.00 %   154678571
ADD_METHOD                :  0.01 %   672171975
LOAD_MODULE               :  0.00 %   123102907
INIT_MODULE               :  0.00 %   12363336

Inference is reported as taking an even smaller percent of the time now, somehow?
But tests took over 13 seconds longer. I don’t know what to make of it; perhaps that is noise.

I think it’s easy to be misled when judging the precompile file from this.
The vast majority of compilation time when running the test suite is in compiling @avx functions within that test suite, not the actual functions inside of LoopVectorization that could benefit from the file.

The first time I use @avx, it takes 5 seconds longer than it seems like it ought to. Subsequent uses are much faster. Those 5-10 seconds easily get lost in the noise of hundreds of seconds of compilation, but are a very real part of the user experience. (Also, how many of those seconds are actually in VectorizationBase?).

I’d need to take a much closer look to judge the value of the precompile file.

But at least in terms of the functions @avx is generating, code gen takes a lot longer than inference. So I would like to know more about how to possibly decrease that time (as long as it doesn’t negatively impact runtime performance!).

Still, I’m working on rewriting much of LoopVectorization (not yet for the AbstractInterpreter), and will try and pay more attention to this, being e.g. stricter about limiting methods to a single combination of concrete types to ensure single specializations and tricks like from your blog post in separating functions into pieces dealing with/avoiding objects that could have a large number of types (as many parts of the code inevitably must run into, dealing with Expr and all).

1 Like

Have you seen @aviatesk 's JET.jl https://twitter.com/kdwkshh/status/1337639010533928960 ?

It also detects type instabilities. Is there scope to share work/code?

1 Like

I have, and while I haven’t played with it, it looks wonderful. But it’s a completely different approach to “collecting the data” (JET uses https://github.com/JuliaLang/julia/pull/35845 whereas SnoopCompile uses https://github.com/JuliaLang/julia/pull/37749), and so there are going to be some differences in what they can tell you.

I’m writing the analysis code and the test suite now, and the timely reminder about JET prompts me to think I should try JET’s demos in both packages and compare.

7 Likes

I’m writing analysis tools right now; I think one important tool will be a scatter plot of “runtime vs compile time.” It might help direct attention to methods that just aren’t worth specializing. See https://github.com/timholy/SnoopCompile.jl/pull/169. But the whole point of a package like LoopVectorization is to specialize, and so I’m not sure it will be as useful to LV as it might to other packages.

Those 5-10 seconds easily get lost in the noise of hundreds of seconds of compilation, but are a very real part of the user experience.

Yep, this is what I’m really after. If you want to improve that user experience, better precompile files might help (especially once we can cache native code :wink:). See https://github.com/timholy/SnoopCompile.jl/pull/168. Docs will come eventually, but this is unfolding right now…

5 Likes

Is there an easy way to make Julia forget all the work it put into inference and compilation?

Restarting the process every time I want to measure things is a minor inconvenience.

1 Like

For the curious: 90% of excess inference time in my code can be reduced to an (over|ab)use of SArrays. Here is an

MWE (all latest package versions, Julia 1.6)

using StaticArrays, ForwardDiff, LinearAlgebra, BenchmarkTools

# SETUP
_dual(x, ::Val{N}) where N = ForwardDiff.Dual(x, ntuple(_ -> x, Val(N)))
K = 30 # too large
N = 6
T = Float64
s = _dual.(rand(SVector{K,T}), Val(6));
v = Vector(s);
V = rand(T, K);

# time in a fresh session --- it is mostly compilation time
@time dot(V, s)  # mixed case, the worst to infer
@time dot(V, v)  # easiest to infer, runs comparably fast

@btime dot($V, $s)
@btime dot($V, $v)

Output

julia> @time dot(V, s)  # mixed case, the worst to infer
  0.229272 seconds (515.34 k allocations: 31.705 MiB, 5.10% gc time, 99.92% compilation time)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)

julia> @time dot(V, v)  # easiest to infer, fastest to run
  0.027791 seconds (39.30 k allocations: 2.403 MiB, 97.29% compilation time)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)

julia> @btime dot($V, $s)
  40.695 ns (0 allocations: 0 bytes)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)

julia> @btime dot($V, $v)
  46.788 ns (0 allocations: 0 bytes)
Dual{Nothing}(7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407,7.860216810977407)

Background

The code is for estimating an economic model with a 3D state space using a Smolyak approximation. But most of the tests use a 2D state space and a lower-level Smolyak approximation (K=8 instead of 30–60).

Because it was a runtime improvement and the tests did not display any significant inference lag, we wrote it using SVectors, often nested 2–3 levels. Then expanding the (static) dimension and adding AD on top of it really taxes the compiler, for little tangible benefit — the actual problem is like the MWE above, but compounded by various type combinations.

Thanks!

Kudos to @timholy and others who contributed to the tooling for introspecting these issues. Learning them required about an afternoon of initial investment for me (for the first round, I am sure there is more to learn), but it really paid off. The workflow is well-documented in recent PRs to SnoopCompile and is comparable to runtime profiling. Having this available in Julia makes the language even more awesome.

14 Likes

BTW, the bottom line is that the compilation time of the original problem went from 2800 to 7 seconds after working through the code with @snoopi_deep. Yes, you are reading that right. And runtime got 3x faster as an incidental bonus, though it was not targeted.

(There is, of course, a downside: I was using the compilation time to practice fingerstyle guitar and got pretty good at some songs, now I will have to find other time for that.)

37 Likes

Dang. I think you hold the record for greatest benefit. I was happy with 2x-3x latency reductions, but this…

That’s great news, I’m very pleased the new tools have been useful! For everyone else, I’m planning to roll out the official release starting on Jan 5, and there will be a series of blog posts (which will be essentially expanded versions of documentation that will, as of tomorrow, already be up at SnoopCompile).

20 Likes

I think that happened because the original code was somewhat naively written. I am now facing the following dilemma: I want stack allocation (so, SArray) but no excessive specialization based on size (so, no SArray). Instead of continuing this discussion with a tangent, I opened a new one here:

1 Like
7 Likes

Amazing story. Since this calculations runs for weeks additional 3x improvement in runtime can be true blessing. Does this runtime payoff scale well with the size of problem?

Again, the problem was overenthusiastic use of nested SArrays. Converting the outer 1-2 into Array and similar fixed the compilation time issues (along with some other minor fixes), and now the compilation time is insignificant (< 1 min).

3 Likes

I belive both compilations time and runtime performance in this case is due problem that @ChrisRackauckas once mentioned. But this is issue in compilators technology about which I know nothing.

Possibly, I am not sure. In any case, perhaps having a variant of SArray that does not unroll would be worth exploring.

1 Like

I’d at least like a version of MArray that does not unroll.
SArray itself would be very difficult to make work without unrolling. Even Base.setindex(::NTuple{N,T}, ::T, ::Int) is unrolled.

It’s easy to reach extremely problematic levels of unrolling with nested dual numbers.

3 Likes