Potential performance regressions in Julia 1.8 for special un-precompiled type dispatches and how to fix them

ChrisRackauckas · August 26, 2022, 12:14pm

There’s two facts that collide to make this matter more now. One is:

github.com/JuliaLang/julia

Cache external CodeInstances

JuliaLang:master ← JuliaLang:teh/cache_external

opened 08:10PM - 30 Jan 22 UTC

timholy

+724 -96

Builds on #43793 and #43881. Tested each package with a task for which it was pr…ecompiled (I've devved most of these, see diffs that force precompilation [here](https://gist.github.com/timholy/3e7185761d151a9c21072e28b8804a3d); ModelingToolkit is [here](https://github.com/SciML/ModelingToolkit.jl/pull/1215#issuecomment-1026902709)). TTFT = Time To Finish Task EDIT: latest benchmarks are in https://github.com/JuliaLang/julia/pull/43990#issuecomment-1044612013 | Package | Load (nightly) | Load (PR) | Load update (PR) | TTFT (nightly) | TTFT (PR) | |:--- | ---:| ---:| ---:| ---:| ---:| | CSV | 2.6 | 2.0 | | 15.8 | 12.9 | | DataFrames | 1.3 | 1.5 | | 12.5 | 10.4 | | Plots | 4.1 | 6.7 | 4.6 | 8.9 | 9.5 | | GLMakie | 7.3 | 15.4 | 11.4 | 60.4 | 42.1 | | OrdinaryDiffEq | 6.8 | 8.8 | | 1.4 | 1.3 | | ModelingToolkit | 11.8 | 17.1 | | 36.7 | 20.3 | | Flux | 7.5 | 8.7 | 8.2 | 24.9 | 15.7 | | JuMP | 4.8 | 5.0 | | 2.5 | 1.6 | | ImageFiltering | 2.5 | 3.0 | | 1.2 | 1.2 | | LV (see below) | 2.5 | 2.9 | | 8.6 | 0.2 | "Load update" includes the changes in https://github.com/JuliaLang/julia/pull/43990#issuecomment-1026222643. (If not reported, they seemed unchanged from the previous measurement.) Looks like a clear win. (It's not obvious the TTFT for Plots is a significant difference, I've never replicated that gap.) Closes #42016 Fixes #35972 and possibly the other issues listed in https://github.com/JuliaGraphics/Colors.jl/issues/496 (I haven't checked yet).

This means that if a precompile is missing in package X that is used/needed to precompile a call in package Y, it will now precompile with the ownership of package Y. This has 2 effects: one is that more precompilation will happen, two is that if package Z also needs the missing precompile, then package Y and package Z will precompile their own versions of the call from package X.

The more precompilation will increase load times but normally decrease first solve time, if types tend to match etc. But, it will increase load times more if a call is precompiled multiple times. The solution then is to try and precompile “what we know is needed” in package X, and use the system of external precompilation as sparingly as possible. It’s required to make things work (for example, Base misses precompilation of Vector(::Uninitiaialized,::Tuple) so oops you might need that), but don’t overrely on it.

The next is how SnoopPrecompile changes the game:

The main fact is that uninferred calls can now do precompilation effectively. Go back and re-read this issue in full:

github.com/SciML/DifferentialEquations.jl

22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times!

opened 11:49PM - 12 Aug 21 UTC

closed 11:14AM - 18 Sep 23 UTC

ChrisRackauckas

# Take awhile to precompile with style all of the DiffEq stack, no denial Our… goal is to get the entire DiffEq stack compile times down. From now on we should treat compile time issues and compile time regressions just like we treat performance issues, 😢 😿 😭 and then fix them. For a very long time we did not do this because, unlike runtime performance, we did not have a good way to diagnose the causes, so it was 🤷 whatever we got is what we got, move on. But now, thanks to @timholy, we have the right tools and have learned how to handle compile time issues in a very deep way so the game is on. Our goal is to get compile times of all standard workflows to at least 0.1 seconds. That is quick enough that you wouldn't care all that much in interactive usage, but still not an unreasonable goal given the benchmarks. This issue is how to get there and how the community is can help. This will cover: - What have we done so far (so you can learn from our experience) - How much has it mattered and to what use cases - How can a user give helpful compile time issues - What are some helpful tricks and knowledge to share? - What are some of the next steps Let's dig in. ## What were our first strides? We have already made a great leap forward in the last week+ since the JuliaCon hackathon. The very long tl;dr i in https://github.com/SciML/DiffEqBase.jl/pull/698 . However, that doesn't capture the full scope of what was actually done: - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1460 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1467 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1468 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1469 - https://github.com/SciML/DiffEqBase.jl/pull/688 - https://github.com/SciML/DiffEqBase.jl/pull/696 - https://github.com/SciML/DiffEqBase.jl/pull/697 - https://github.com/SciML/DiffEqBase.jl/pull/698 - https://github.com/SciML/SciMLBase.jl/pull/95 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/147 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/29 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/30 - https://github.com/JuliaSIMD/TriangularSolve.jl/pull/8 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1470 (about to merge after a few more fixes) with some bonus PRs and issues like: - https://github.com/JuliaGraphs/LightGraphs.jl/pull/1581 - https://github.com/JuliaDiff/DiffRules.jl/pull/64 - https://github.com/JuliaDebug/Cthulhu.jl/issues/184 - https://github.com/JuliaDebug/Cthulhu.jl/pull/185 - https://github.com/JuliaLang/julia/issues/41750 - https://github.com/JuliaLang/julia/pull/41813 ## Show me some results The net result is something like this. On non-stiff ODEs, compile times dropped from about 5 seconds to sub 1 second, and on stiff ODEs compile times dropped from about 22 seconds to 2.5 seconds. The tests are things like: https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Tsit5() tinf = @snoopi_deep solve(prob,alg) itrigs = inference_triggers(tinf) itrig = itrigs[13] ascend(itrig) @time solve(prob,alg) ``` ```julia v5.60.2 InferenceTimingNode: 1.249748/4.881587 on Core.Compiler.Timings.ROOT() with 2 direct children Before InferenceTimingNode: 1.136504/3.852949 on Core.Compiler.Timings.ROOT() with 2 direct children Without `@turbo` InferenceTimingNode: 0.956948/3.460591 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` InferenceTimingNode: 0.941427/3.439566 on Core.Compiler.Timings.ROOT() with 2 direct children With `@turbo` InferenceTimingNode: 1.174613/11.118534 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` everywhere InferenceTimingNode: 0.760500/1.151602 on Core.Compiler.Timings.ROOT() with 2 direct children # Today, a week after that PR InferenceTimingNode: 0.634172/0.875295 on Core.Compiler.Timings.ROOT() with 1 direct children 🎉 (it automatically does the emoji because the computer is happy too) ``` You read this as, it used to take 1.25 seconds for inference and 4.88 seconds for compilation in full, but now it's 0.63 and 0.88. and https://github.com/SciML/DiffEqBase.jl/pull/698 ```julia function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) using OrdinaryDiffEq, SnoopCompile prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) ``` ```julia After basic precompilation (most of the above PRs): InferenceTimingNode: 1.460777/16.030597 on Core.Compiler.Timings.ROOT() with 46 direct children After fixing precompilation of the LU-factorization tools: InferenceTimingNode: 1.077774/2.868269 on Core.Compiler.Timings.ROOT() with 11 direct children ``` So that's the good news. The bad news. ## The precompilation results do not always generalize Take for example https://github.com/SciML/DifferentialEquations.jl/issues/785: ```julia using DifferentialEquations, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) InferenceTimingNode: 1.535779/13.754596 on Core.Compiler.Timings.ROOT() with 7 direct children ``` "But Chris, I thought you just said that was sub 3 seconds, not 13.75 seconds compile time!". Well, that's with `using OrdinaryDiffEq` instead of `using DifferentialEquations`. And we see this when DiffEqSensitivity.jl or DiffEqFlux.jl gets involved. So compile times are "a lot better"*, and the `"`+`*` are right now necessary when saying that. We need to fix that aspect of it. ## How can I as a user help? Good question, thanks for asking! Sharing profiles is extremely helpful. Take another look at https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-895152646 . What was ran was: ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) using ProfileView ProfileView.view(flamegraph(tinf)) ``` ![image](https://user-images.githubusercontent.com/1814174/129282082-ac51270f-5843-4bcc-a452-8aa663c458b8.png) What this was saying was that the vast majority of the compile time was because the `DEFAULT_LINSOLVE` calling RecursiveFactorization.jl was not precompiling. Since we use our own full Julia-based BLAS/LAPACK stack, that gave a full 13 seconds of compilation since it would compile RecursiveFactorization.jl, TriangularSolve.jl, etc. in sequence on each first solve call of a session. This allowed us to identify the issue can create a tizzy of PRs that finally made that get cached. If you check the DifferentialEquations.jl compile times, you'll see that part is back. Why won't it compile? Well that's a harder question, discussed in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-896984234, with the fix in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008, but apparently gets invalidated. If that all doesn't make sense to you, that's fine! But if you can help us narrow in on what the real issues are, that will help us immensely. And of course... you can always give us a sponsor and star the repos to help 😅 . But seriously though, I hope we can start using SciML funds to start scouring our repos and get a lot of people fixing compile time issues. More on that soon... very soon... 🤐 ## What are some helpful tricks and knowledge to share? Yeah, what were our tricks? Well the big one is forcing compilation in using calls through small solves. In many cases compilation is solved by just putting a prototype solve inside of the package itself. For example, take a look at the precompile section of OrdinaryDiffEq.jl (https://github.com/SciML/OrdinaryDiffEq.jl/blob/v5.61.1/src/OrdinaryDiffEq.jl#L175-L193) ```julia let while true function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end lorenzprob = ODEProblem(lorenz,[1.0;0.0;0.0],(0.0,1.0)) solve(lorenzprob,Tsit5()) solve(lorenzprob,Vern7()) solve(lorenzprob,Vern9()) solve(lorenzprob,Rosenbrock23())(5.0) solve(lorenzprob,TRBDF2()) solve(lorenzprob,Rodas4(autodiff=false)) solve(lorenzprob,KenCarp4(autodiff=false)) solve(lorenzprob,Rodas5()) break end end ``` Lorenz equation takes nanoseconds to solve, so we take this equation and solve it a few times at `using` time, which will then trigger precompilation of as many functions hit in that call stack as Julia will allow for. For some things, the reason why they are not precompiled is simply because they are never called during `using` time, so a quick fix for many of the compile time issues is to simply add a short little statement like this to `using` time and then Julia will cache its results. For everything else, there's Mastercard, and it'll be much more costly to solve, so those will need issues and the experts, and possibly some Base compiler changes. But we should at least grab all of the low hanging fruit ASAP. **This is actually a necessary condition for getting precompilation, since if a method is never called in using then it will never precompile, so this is a first step among many.** A major part of the solution was avoiding codegen when unnecessary. If you take a look at https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465, you'll see things like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache{<:Array}, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator uidx = eachindex(integrator.uprev) @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+a*k1[i] end f(k2, tmp, p, t+c1*dt) @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+dt*(a31*k1[i]+a32*k2[i]) end ... ``` i.e. new solver dispatches specifically for `Array`, to avoid dispatches like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @.. tmp = uprev+a*k1 f(k2, tmp, p, t+c1*dt) @.. tmp = uprev+dt*(a31*k1+a32*k2) ... ``` The reason is that compile time profiling showcased that the major contributor was these https://github.com/YingboMa/FastBroadcast.jl code generation steps. Base Broadcast too is a major contributor to compile times. So at least on the pieces of code that 99% of users are using, we just expanded them out by hand, forced them to precompile, and that gave the 5 seconds to 1 second compile time change. I wouldn't say I am recommending you shouldn't use broadcast, but this is something useful to keep in mind. Broadcast is more generic, so you have to pay in compile time to use it. If you're willing to maintain the code, doing `if u isa Array` or making a separate dispatch can remove that factor. Map is also a major contributor. We actually knew this was the case already, since there was a map call which changed Pumas compile times on a small example from 3 seconds to over 40 (https://github.com/SciML/SciMLBase.jl/pull/45). Removing map calls factored into these changes as well (https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149). Again, this is something that could/should be handled at the Base compiler level, but we wanted as much of a fix ASAP so this was a nice and easy change which was within our power to ship in a week, and you can do the same. The last thing, and the major step forward, was https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008 . As it explains, using a function barrier can cause inference to not know what functions it will need in a call, which makes Julia **seem to compile a whole lot of junk**. That can probably get fixed at the Base compiler level to some extent, as the example there was spending 13 seconds compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)` junk which nobody could ever call, since any call to that function would specialize on the second argument (a matrix type) and so it should've been compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Matrix{Float64}, ::Vector{Float64}, ::Bool)`. But since that call was already precompiled, if inference ends up good enough that it realized it would call that instead, then bingo compile times dropped from 16 seconds to sub 3. That's a nice showcase that the easiest way to fix compile time issues may be to fix inference issues in a package, because that can make Julia narrow down what methods it needs to compile more, and then it may have a better chance of hitting the pieces that you told it to precompile (via a using-time example in a precompile.jl). However, even if you have dynamism, you can still improve inference. The DiffEq problem was that ForwardDiff has a dynamically chosen `chunksize` based on the size of the input ODEs `u`. We previously would defer this calculation of the chunksize until the caches for the Jacobian were built, so then all dual number caches would have a `where N` on the chunk size from inference. But a function barrier makes the total runtime cost 100ns, so whatever that's solved right? But because all of the caches have this non-inference of the chunksize, Julia's compiler could not determine all of the `N`'s were the same, so inference would see far too many types in expressions and just give up. That's how `A::Matrix{Float64}` became `Any` in the inference passes. When Julia hits the function barrier, it will cause inference to trigger again, in which case inside the function it will now be type-stable, so again no runtime cost, but at the first compile it will compile all possible methods it may ever need to call... which is a much bigger universe than what we actually hit. We noticed this was the case because if the user manually put in a chunksize, i.e. `Rodas5(chunk_size = 3)`, then this all went away. So what we did was setup a hook so that the moment `solve` is called, it goes "have you defined the chunk size? If not, let's figure it out right away and then hit `__solve`. By doing this function barrier earlier, it at least knows that the chunk size should always match the `N` type parameter of the solver algorithm, so all of the `N`s are the same, which makes it realize there's less types and use more of the compile caches before. That one is harder than the others, but it makes the more general point that caching compilation is only good if inference is good enough. And of course, check out Tim Holy's workshop and JuliaCon video which is immensely helpful at getting started. https://www.youtube.com/watch?v=rVBgrWYKLHY https://www.youtube.com/watch?v=wXRMwJdEjX4 ## What are the next steps? Well, the next steps are to get the compile times below 0.1 seconds everywhere of course. I've already identified some issues to solve: - [x] https://github.com/SciML/DifferentialEquations.jl/issues/785 - [ ] https://github.com/SciML/DiffEqFlux.jl/pull/604 - [ ] https://github.com/SciML/DiffEqSensitivity.jl/pull/471 - [ ] How can we take the non-stiff ODE solvers even lower? There's pieces in the profile like Base.Logging that could probably get added to the Julia system image, etc. This needs a full investigation by someone who really knows their stuff. - [ ] Can we get effective CI for finding compile time regressions? Are there more things just missing some simple precompiles? Are there some major threads we're missing? Help us identify where all of this is so we can solve it in full. Compile times, here we come!

That was the old situation. The issue was, if we can get inference to happen higher, then precompilation will happen on RecursiveFactorization.jl and that will send the first solve time with implicit methods from 22 seconds to 3. Now with SnoopPrecompile, the pre-changed version probably already hits 3 (on current release, it’s now 0.5 seconds BTW).

Basically, this means a lot more precompiles. But because a lot more precompiles, doubling precompiles hurts more. And invalidating functions hurts even more. If you do @time using OrdinaryDiffEq, the real important stat is 75% of the time is recompilation. This is invalidations taking what was precompiled and throwing it away because loading a different package (Static.jl, LoopVectorization.jl) invalidates the precompiled version.

So in the end, a lot more gets precompiled so the load time is increased (because of the ownership and non-inferred help), this does have a major improvement on the first solve time, but it increases using time, which then explodes because invalidations throw away more than a majority of that precompile work.

Therefore, invalidations matter a whole lot more now. It’s time to fix as much as we can there.

Topic		Replies	Views
Shouldn't 1.8.0 be faster than Julia 1.7? Performance	30	2612	September 16, 2022
Taking TTFX seriously: Can we make common packages faster to load and use Performance ttfp	125	11889	June 20, 2022
22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times! Performance precompilation	6	5055	August 17, 2021
Package load time regressions in v1.8-beta3 Internals & Design	23	3237	June 1, 2022
Help building MWE for 1.8 latency regression Performance	11	973	November 2, 2022

Potential performance regressions in Julia 1.8 for special un-precompiled type dispatches and how to fix them

Related topics