Potential performance regressions in Julia 1.8 for special un-precompiled type dispatches and how to fix them

sloede · August 25, 2022, 8:27pm

TL;DR

With Julia 1.8.0 (as opposed to Julia 1.7.3), for OrdinaryDiffEq v6.24.0 and Trixi.jl v0.4.44, we observe that

package installation time increased by 20-50%
package loading time increased by 30-50%
compilation time increased by 7-10%
time-to-first-solution (without precompilation) increased by ~15%
runtime of our numerical simulation kernels
- with bounds checking increased by 7%
- without bounds checking decreased by 2%

Questions:

Is this a known/expected behavior?
What might be the causes of the observed performance regressions?
Is there something we can (easily) do about it?

Longer version

First of all, kudos and thanks a lot to the JuliaLang developers, maintainers, and contributors, who made Julia 1.8 happen! As one of the maintainers of the Trixi.jl numerical simulation framework, I was very excited about many incoming improvements in terms of speed and usability (Apple Silicon support ).

Naturally, we immediately wanted to test how the Trixi.jl performance behaves with Julia 1.8, and we ran some performance tests with a real-world numerical simulation setup, using the current versions of OrdinaryDiffEq (v6.24.0) and Trixi.jl (v0.4.44). To our surprise, we saw that there are number of areas where it seems like Julia 1.8 is actually slower than v1.7. Therefore, I wrote this post to ask for input on what might be the causes of the observed performance regressions, and if there is something we can do to (easily) fix them.

Just to emphasize: I do not want to critize the v1.8 release, I am just curious to find out how we can get our performance results from Julia 1.7 back, and maybe even improve upon them.

Since this is going to be a somewhat lengthy post, I will break it up into several sections.

Setup and measurements

We used the official binaries for Julia 1.7.3 and 1.8.0 on a headless Linux machine. Storage is a fast NVMe SSD. This is the output of versioninfo():

Julia 1.7.3:

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

Julia 1.8.0:

Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 128 virtual cores

For each different performance metric, a separate JULIA_DEPOT_PATH was used to avoid any interference. Each measurement was repeated at least three times, and the results were averaged (with very low mean deviation). All measurements are given in seconds (wall time) and were obtained with serial execution (no threading or MPI etc.).

A collection of the scripts and manifests we used can be found here:

However, please note that this repo is not (yet?) fully self-explanatory

Package installation

Here, we measured how long it takes to use Pkg to install each package into a project.

The increase is 17% for OrdinaryDiffEq and 43% for Trixi.

Package loading

Here, we measured how long it takes to load each package one after the other (OrdinaryDiffEq first, Trixi second). This was done after a full test run, i.e, these times are without precompilation.

The increase is 47% for OrdinaryDiffEq and 33% for Trixi.

Compilation

Here, we measured how long it takes to compile the Julia code (after the packages have already been loaded, i.e., again without precompilation). Compilation time is here computed as the runtime for the first execution of our test setup (a numerical simulation) minus the time for a second execution of the identical code.

The increase in compilation time is 7-10%.

Time-to-first-solution

Here, we measured how long it takes from after Julia has been started until the first simulation is finished (including using times). Again, these times are without precompilation. The results shown are for a run with bounds checking disabled, but the result with bounds checking enabled is very similar.

The increase in the time-to-first-solution is 13-16%.

Runtime of numerical simulation kernels

Here, we show a Trixi-internal performance metric, named “performance index”. It is essentially a measure for the performance of the core algorithms by computing the time of a semidiscretization update per each degree of freedom. Thus again, lower is better/faster.

Here, the story is split: When bounds checking is enabled (using the default), the runtime increases by 7%. When bounds checking is disabled, the runtime decreases by 2%.

Conclusions

It seems like that in our use case, a fairly complex scientific software package for numerical simulation (which has already been performance tuned, e.g., in this paper or this paper), for many of the performance metrics we look at, Julia 1.8 is slower than v1.7. The good news is that the ultimately relevant metric (run time without bounds checking) has improved (if only barely)! However, reiterating my questions from above: It would be great to figure out why this is the case, and what we can do to fix it.

cc @ranocha @jlchan @gregorgassner @tim.holy

giordano · August 25, 2022, 8:53pm

The only known effect in Julia v1.8 I’m aware of is that loading time of packages may increase a bit because more code is cached, which should however result in overall slightly faster TTFX. I’ve seen this in a few small packages.

What you’re observing instead sounds similar to

giordano · August 25, 2022, 9:27pm

Can you see if https://github.com/JuliaLang/julia/pull/46366 helps with loading time? Related issue (also opened by Marius): Large number of invalidations by this package and seems to really slow down certain jll loads · Issue #77 · SciML/Static.jl · GitHub

ChrisRackauckas · August 26, 2022, 1:40am

First of all. I’m writing a whole blog post on this, lots of work to do, etc. Will fill in all details. Short stuff now. But see.

no_jit_lag

That’s what load times are like now on OrdinaryDiffEq v6.24, so the start there is a bit misleading if someone doesn’t know what they are reading. Time to first solution is down an order of magnitude and now it fully precompiles in a system image. Now, I don’t think you did that on purpose. You saw this with Trixi.jl, and your statements are true while what I just showed is true. The question is how to reconcile both, and what to do about it.

I’m not going to address the runtime stuff because the 7% is just pure Julia changes, probably inlining and the effects system.

The real thing is time to first solution and package load time. The issue is that everything now fully precompiles on “standard types” so “most people” have much lower load times. Standard types being Vector{Float64} for u0, Float64 for tspan, and NullParameters or Vector{Float64} for parameters. I don’t think anyone will disagree that this means almost all users (more than 99%) will experience a faster first solution and all. But what it does mean is that everyone gets those precompiled. Trixi.jl uses its own parameter type, and thus it bypasses this system, and hence you see the increased precompiles time and load time without the benefit.

The answer to this is multi-fold. One,

github.com/SciML/SciMLBase.jl

Generalizing the no-recompile structure to use Preferences

opened 07:12PM - 23 Aug 22 UTC

ChrisRackauckas

https://github.com/SciML/JumpProcesses.jl/blob/v9.1.1/src/problem.jl#L230-L241 …https://github.com/SciML/DiffEqBase.jl/pull/808 https://github.com/SciML/SciMLSensitivity.jl/pull/713 https://github.com/SciML/SciMLSensitivity.jl/pull/712 https://github.com/SciML/ModelingToolkit.jl/pull/1767 https://github.com/SciML/JumpProcesses.jl/commit/8a8b8cb3308210cbb022a1d056b68f7637fea574 https://github.com/SciML/SciMLBase.jl/pull/224 Captures a lot of the different things required to make `Vector{Float64}` types supported in the no-recompile mode. To generalize this, we'd need to have an array of arguments that is then checked in all of these kinds of spots where wrapping and unwrapping functions occurs. That definition could live in SciMLBase, and then preferences would then change the list.

Using the preferences system to determine what to precompile. For now if you just comment out:

github.com

SciML/OrdinaryDiffEq.jl/blob/v6.24.0/src/OrdinaryDiffEq.jl#L195-L275


      
          SnoopPrecompile.@precompile_all_calls begin
              function lorenz(du,u,p,t)
                  du[1] = 10.0(u[2]-u[1])
                  du[2] = u[1]*(28.0-u[3]) - u[2]
                  du[3] = u[1]*u[2] - (8/3)*u[3]
              end
          
              function lorenz_oop(u,p,t)
                  [10.0(u[2]-u[1]),u[1]*(28.0-u[3]) - u[2],u[1]*u[2] - (8/3)*u[3]]
              end
          
              solver_list = [
                BS3(), Tsit5(), Vern7(), Vern9(),
          
                Rosenbrock23(), Rosenbrock23(autodiff=false),
                #Rosenbrock23(chunk_size = 1), Rosenbrock23(chunk_size = Val{1}()),
          
                Rodas4(), Rodas4(autodiff=false),
                #Rodas4(chunk_size = 1), Rodas4(chunk_size = Val{1}()),

This file has been truncated. show original

The load times should go back. Please test this. If so, then a preferences system to disable that is the solution.

Two, I have been calling for help to setup upstream SnoopCompiles, so please help. For example, https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/v0.2.11/src/RecursiveFactorization.jl that should be changed to snoopprecompile etc. That will reduce the number of repeated compilations and reduce the precompile and load times overall. Every upstream package should probably get a small representative workflow snooped.

Three, see the invalidation report from this week.

github.com/SciML/DifferentialEquations.jl

22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times!

opened 11:49PM - 12 Aug 21 UTC

closed 11:14AM - 18 Sep 23 UTC

ChrisRackauckas

# Take awhile to precompile with style all of the DiffEq stack, no denial Our… goal is to get the entire DiffEq stack compile times down. From now on we should treat compile time issues and compile time regressions just like we treat performance issues, 😢 😿 😭 and then fix them. For a very long time we did not do this because, unlike runtime performance, we did not have a good way to diagnose the causes, so it was 🤷 whatever we got is what we got, move on. But now, thanks to @timholy, we have the right tools and have learned how to handle compile time issues in a very deep way so the game is on. Our goal is to get compile times of all standard workflows to at least 0.1 seconds. That is quick enough that you wouldn't care all that much in interactive usage, but still not an unreasonable goal given the benchmarks. This issue is how to get there and how the community is can help. This will cover: - What have we done so far (so you can learn from our experience) - How much has it mattered and to what use cases - How can a user give helpful compile time issues - What are some helpful tricks and knowledge to share? - What are some of the next steps Let's dig in. ## What were our first strides? We have already made a great leap forward in the last week+ since the JuliaCon hackathon. The very long tl;dr i in https://github.com/SciML/DiffEqBase.jl/pull/698 . However, that doesn't capture the full scope of what was actually done: - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1460 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1467 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1468 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1469 - https://github.com/SciML/DiffEqBase.jl/pull/688 - https://github.com/SciML/DiffEqBase.jl/pull/696 - https://github.com/SciML/DiffEqBase.jl/pull/697 - https://github.com/SciML/DiffEqBase.jl/pull/698 - https://github.com/SciML/SciMLBase.jl/pull/95 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/147 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/29 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/30 - https://github.com/JuliaSIMD/TriangularSolve.jl/pull/8 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1470 (about to merge after a few more fixes) with some bonus PRs and issues like: - https://github.com/JuliaGraphs/LightGraphs.jl/pull/1581 - https://github.com/JuliaDiff/DiffRules.jl/pull/64 - https://github.com/JuliaDebug/Cthulhu.jl/issues/184 - https://github.com/JuliaDebug/Cthulhu.jl/pull/185 - https://github.com/JuliaLang/julia/issues/41750 - https://github.com/JuliaLang/julia/pull/41813 ## Show me some results The net result is something like this. On non-stiff ODEs, compile times dropped from about 5 seconds to sub 1 second, and on stiff ODEs compile times dropped from about 22 seconds to 2.5 seconds. The tests are things like: https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Tsit5() tinf = @snoopi_deep solve(prob,alg) itrigs = inference_triggers(tinf) itrig = itrigs[13] ascend(itrig) @time solve(prob,alg) ``` ```julia v5.60.2 InferenceTimingNode: 1.249748/4.881587 on Core.Compiler.Timings.ROOT() with 2 direct children Before InferenceTimingNode: 1.136504/3.852949 on Core.Compiler.Timings.ROOT() with 2 direct children Without `@turbo` InferenceTimingNode: 0.956948/3.460591 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` InferenceTimingNode: 0.941427/3.439566 on Core.Compiler.Timings.ROOT() with 2 direct children With `@turbo` InferenceTimingNode: 1.174613/11.118534 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` everywhere InferenceTimingNode: 0.760500/1.151602 on Core.Compiler.Timings.ROOT() with 2 direct children # Today, a week after that PR InferenceTimingNode: 0.634172/0.875295 on Core.Compiler.Timings.ROOT() with 1 direct children 🎉 (it automatically does the emoji because the computer is happy too) ``` You read this as, it used to take 1.25 seconds for inference and 4.88 seconds for compilation in full, but now it's 0.63 and 0.88. and https://github.com/SciML/DiffEqBase.jl/pull/698 ```julia function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) using OrdinaryDiffEq, SnoopCompile prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) ``` ```julia After basic precompilation (most of the above PRs): InferenceTimingNode: 1.460777/16.030597 on Core.Compiler.Timings.ROOT() with 46 direct children After fixing precompilation of the LU-factorization tools: InferenceTimingNode: 1.077774/2.868269 on Core.Compiler.Timings.ROOT() with 11 direct children ``` So that's the good news. The bad news. ## The precompilation results do not always generalize Take for example https://github.com/SciML/DifferentialEquations.jl/issues/785: ```julia using DifferentialEquations, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) InferenceTimingNode: 1.535779/13.754596 on Core.Compiler.Timings.ROOT() with 7 direct children ``` "But Chris, I thought you just said that was sub 3 seconds, not 13.75 seconds compile time!". Well, that's with `using OrdinaryDiffEq` instead of `using DifferentialEquations`. And we see this when DiffEqSensitivity.jl or DiffEqFlux.jl gets involved. So compile times are "a lot better"*, and the `"`+`*` are right now necessary when saying that. We need to fix that aspect of it. ## How can I as a user help? Good question, thanks for asking! Sharing profiles is extremely helpful. Take another look at https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-895152646 . What was ran was: ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) using ProfileView ProfileView.view(flamegraph(tinf)) ``` ![image](https://user-images.githubusercontent.com/1814174/129282082-ac51270f-5843-4bcc-a452-8aa663c458b8.png) What this was saying was that the vast majority of the compile time was because the `DEFAULT_LINSOLVE` calling RecursiveFactorization.jl was not precompiling. Since we use our own full Julia-based BLAS/LAPACK stack, that gave a full 13 seconds of compilation since it would compile RecursiveFactorization.jl, TriangularSolve.jl, etc. in sequence on each first solve call of a session. This allowed us to identify the issue can create a tizzy of PRs that finally made that get cached. If you check the DifferentialEquations.jl compile times, you'll see that part is back. Why won't it compile? Well that's a harder question, discussed in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-896984234, with the fix in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008, but apparently gets invalidated. If that all doesn't make sense to you, that's fine! But if you can help us narrow in on what the real issues are, that will help us immensely. And of course... you can always give us a sponsor and star the repos to help 😅 . But seriously though, I hope we can start using SciML funds to start scouring our repos and get a lot of people fixing compile time issues. More on that soon... very soon... 🤐 ## What are some helpful tricks and knowledge to share? Yeah, what were our tricks? Well the big one is forcing compilation in using calls through small solves. In many cases compilation is solved by just putting a prototype solve inside of the package itself. For example, take a look at the precompile section of OrdinaryDiffEq.jl (https://github.com/SciML/OrdinaryDiffEq.jl/blob/v5.61.1/src/OrdinaryDiffEq.jl#L175-L193) ```julia let while true function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end lorenzprob = ODEProblem(lorenz,[1.0;0.0;0.0],(0.0,1.0)) solve(lorenzprob,Tsit5()) solve(lorenzprob,Vern7()) solve(lorenzprob,Vern9()) solve(lorenzprob,Rosenbrock23())(5.0) solve(lorenzprob,TRBDF2()) solve(lorenzprob,Rodas4(autodiff=false)) solve(lorenzprob,KenCarp4(autodiff=false)) solve(lorenzprob,Rodas5()) break end end ``` Lorenz equation takes nanoseconds to solve, so we take this equation and solve it a few times at `using` time, which will then trigger precompilation of as many functions hit in that call stack as Julia will allow for. For some things, the reason why they are not precompiled is simply because they are never called during `using` time, so a quick fix for many of the compile time issues is to simply add a short little statement like this to `using` time and then Julia will cache its results. For everything else, there's Mastercard, and it'll be much more costly to solve, so those will need issues and the experts, and possibly some Base compiler changes. But we should at least grab all of the low hanging fruit ASAP. **This is actually a necessary condition for getting precompilation, since if a method is never called in using then it will never precompile, so this is a first step among many.** A major part of the solution was avoiding codegen when unnecessary. If you take a look at https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465, you'll see things like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache{<:Array}, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator uidx = eachindex(integrator.uprev) @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+a*k1[i] end f(k2, tmp, p, t+c1*dt) @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+dt*(a31*k1[i]+a32*k2[i]) end ... ``` i.e. new solver dispatches specifically for `Array`, to avoid dispatches like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @.. tmp = uprev+a*k1 f(k2, tmp, p, t+c1*dt) @.. tmp = uprev+dt*(a31*k1+a32*k2) ... ``` The reason is that compile time profiling showcased that the major contributor was these https://github.com/YingboMa/FastBroadcast.jl code generation steps. Base Broadcast too is a major contributor to compile times. So at least on the pieces of code that 99% of users are using, we just expanded them out by hand, forced them to precompile, and that gave the 5 seconds to 1 second compile time change. I wouldn't say I am recommending you shouldn't use broadcast, but this is something useful to keep in mind. Broadcast is more generic, so you have to pay in compile time to use it. If you're willing to maintain the code, doing `if u isa Array` or making a separate dispatch can remove that factor. Map is also a major contributor. We actually knew this was the case already, since there was a map call which changed Pumas compile times on a small example from 3 seconds to over 40 (https://github.com/SciML/SciMLBase.jl/pull/45). Removing map calls factored into these changes as well (https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149). Again, this is something that could/should be handled at the Base compiler level, but we wanted as much of a fix ASAP so this was a nice and easy change which was within our power to ship in a week, and you can do the same. The last thing, and the major step forward, was https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008 . As it explains, using a function barrier can cause inference to not know what functions it will need in a call, which makes Julia **seem to compile a whole lot of junk**. That can probably get fixed at the Base compiler level to some extent, as the example there was spending 13 seconds compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)` junk which nobody could ever call, since any call to that function would specialize on the second argument (a matrix type) and so it should've been compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Matrix{Float64}, ::Vector{Float64}, ::Bool)`. But since that call was already precompiled, if inference ends up good enough that it realized it would call that instead, then bingo compile times dropped from 16 seconds to sub 3. That's a nice showcase that the easiest way to fix compile time issues may be to fix inference issues in a package, because that can make Julia narrow down what methods it needs to compile more, and then it may have a better chance of hitting the pieces that you told it to precompile (via a using-time example in a precompile.jl). However, even if you have dynamism, you can still improve inference. The DiffEq problem was that ForwardDiff has a dynamically chosen `chunksize` based on the size of the input ODEs `u`. We previously would defer this calculation of the chunksize until the caches for the Jacobian were built, so then all dual number caches would have a `where N` on the chunk size from inference. But a function barrier makes the total runtime cost 100ns, so whatever that's solved right? But because all of the caches have this non-inference of the chunksize, Julia's compiler could not determine all of the `N`'s were the same, so inference would see far too many types in expressions and just give up. That's how `A::Matrix{Float64}` became `Any` in the inference passes. When Julia hits the function barrier, it will cause inference to trigger again, in which case inside the function it will now be type-stable, so again no runtime cost, but at the first compile it will compile all possible methods it may ever need to call... which is a much bigger universe than what we actually hit. We noticed this was the case because if the user manually put in a chunksize, i.e. `Rodas5(chunk_size = 3)`, then this all went away. So what we did was setup a hook so that the moment `solve` is called, it goes "have you defined the chunk size? If not, let's figure it out right away and then hit `__solve`. By doing this function barrier earlier, it at least knows that the chunk size should always match the `N` type parameter of the solver algorithm, so all of the `N`s are the same, which makes it realize there's less types and use more of the compile caches before. That one is harder than the others, but it makes the more general point that caching compilation is only good if inference is good enough. And of course, check out Tim Holy's workshop and JuliaCon video which is immensely helpful at getting started. https://www.youtube.com/watch?v=rVBgrWYKLHY https://www.youtube.com/watch?v=wXRMwJdEjX4 ## What are the next steps? Well, the next steps are to get the compile times below 0.1 seconds everywhere of course. I've already identified some issues to solve: - [x] https://github.com/SciML/DifferentialEquations.jl/issues/785 - [ ] https://github.com/SciML/DiffEqFlux.jl/pull/604 - [ ] https://github.com/SciML/DiffEqSensitivity.jl/pull/471 - [ ] How can we take the non-stiff ODE solvers even lower? There's pieces in the profile like Base.Logging that could probably get added to the Julia system image, etc. This needs a full investigation by someone who really knows their stuff. - [ ] Can we get effective CI for finding compile time regressions? Are there more things just missing some simple precompiles? Are there some major threads we're missing? Help us identify where all of this is so we can solve it in full. Compile times, here we come!

The convert overloads is already taken care of, but the Static ones are ongoing.

The big ones are:

github.com/JuliaDiff/ChainRulesCore.jl

Invalidations from ChainRulesCore Tangent overload on Tail

opened 10:25AM - 21 Aug 22 UTC

ChrisRackauckas

Found in https://github.com/SciML/DifferentialEquations.jl/issues/786 ```jul…ia julia> show(trees[end-3]) inserting tail(t::ChainRulesCore.Tangent{<:NamedTuple{<:Any, <:Tuple{}}}) in ChainRulesCore at C:\Users\accou\.julia\packages\ChainRulesCore\ctmSK\src\tangent_types\tangent.jl:110 invalidated: mt_backedges: 1: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base._cshp(::Int64, ::Tuple{Bool}, ::Tuple{Int64}, ::Any) (0 children) 2: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base._cshp(::Int64, ::Tuple{Bool}, ::Tuple{Any, Vararg{Any}}, ::Any) (0 children) 3: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.Iterators._zip_isdone(::Tuple, ::Any) (0 children) 4: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.Iterators._zip_iterate_some(::Tuple, ::Any, ::Tuple{Missing, Vararg{Any}}, ::Missing) (0 children) 5: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.Iterators._zip_iterate_some(::Tuple, ::Any, ::Tuple{Any, Vararg{Any}}, ::Missing) (0 children) 6: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.Iterators._zip_iterate_some(::Tuple, ::Any, ::Tuple{Bool, Vararg{Any}}, ::Bool) (0 children) 7: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.Iterators._zip_iterate_some(::Tuple, ::Any, ::Tuple{Any, Vararg{Any}}, ::Bool) (0 children) 8: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for iterate(::Base.Iterators.Enumerate{Vector{VersionNumber}}, ::Any) (0 children) 9: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for VSCodeServer.JuliaInterpreter.optimize!(::Core.CodeInfo, ::Method) (0 children) 10: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for JuliaInterpreter.optimize!(::Core.CodeInfo, ::Method) (0 children) 11: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for LoweredCodeUtils.step_through_methoddef(::Any, ::JuliaInterpreter.Frame, ::Any) (0 children) 12: signature Tuple{typeof(Base.tail), Any} triggered MethodInstance for Base.tail(::NamedTuple{names}) where names (450 children) false ````

github.com/SciML/Static.jl

Remove invalidating `!` overloads

SciML:master ← SciML:invalidating_!

opened 10:41AM - 21 Aug 22 UTC

ChrisRackauckas

+15 -13

Running things downstream, life seems to go on without them just fine. Found in …https://github.com/SciML/DifferentialEquations.jl/issues/786 and https://github.com/SciML/Static.jl/issues/77, these methods contribute to a ton of recompilation. Methods that cause lots of compilation but aren't used? Bye bye. Users of Static.jl can just manually handle `!`. It's safe because it just throws an error otherwise. We can put it in an FAQ if it's that much of an issue. But this is definitely not worth causing seconds of JIT lag downstream.

Please help with the second one by writing a Cassette pass for LoopVectorization so that it can do function replacement on ! → static_! so that ! does not need to be overloaded. That would remove most of the recompilation.

Also, it would be helpful to see a representative Trixi invalidation report if you can generate one. Just do exactly as from that DifferentialEquations.jl issue, and share what the top 10 or so invalidators are.

ranocha · August 26, 2022, 5:50am

Looking at invalidations, Static.jl with !(::False) is currently the worst when using Trixi.jl. This should hopefully be fixed by Remove invalidating `!` overloads by ChrisRackauckas · Pull Request #78 · SciML/Static.jl · GitHub.

julia> using SnoopCompileCore

julia> invalidations = @snoopr begin
           using Trixi
           trixi_include(default_example())
       end

julia> trees = invalidation_trees(invalidations);

Edit: But also see the list of PRs fixing related invalidations in the post below.

ranocha · August 26, 2022, 11:47am

Okay, so I went a bit invalidation hunting this morning:

Static.jl invalidates quite a lot, see
fix invalidations in logging by ranocha · Pull Request #46481 · JuliaLang/julia · GitHub,
fix invalidations for Dicts from Static.jl by ranocha · Pull Request #46490 · JuliaLang/julia · GitHub,
fix invalidations in sort! from Static.jl by ranocha · Pull Request #46491 · JuliaLang/julia · GitHub,
fix invalidations of `isinf` from Static.jl by ranocha · Pull Request #46493 · JuliaLang/julia · GitHub,
fix invalidations in REPLCompletions from Static.jl by ranocha · Pull Request #46494 · JuliaLang/julia · GitHub,
fix invalidations from Static.jl by ranocha · Pull Request #140 · JuliaIO/Tar.jl · GitHub,
fix API invalidations from Static.jl by ranocha · Pull Request #3179 · JuliaLang/Pkg.jl · GitHub
Unrolled.jl invalidates a bunch of stuff, see Unrolled.jl invalidates quite a lot · Issue #12 · cstjean/Unrolled.jl · GitHub
HDF5.jl invalidates a bunch of stuff in the REPL: hopefully fix invalidations of REPL from HDF5.jl by ranocha · Pull Request #46486 · JuliaLang/julia · GitHub
ChainRulesCore.jl invalidates a bunch of stuff, see Invalidations from ChainRulesCore Tangent overload on Tail · Issue #576 · JuliaDiff/ChainRulesCore.jl · GitHub

FixedPointNumbers.jl invalidates quite a bit, in particular from LoopVectorization calling sum(::Vector{Any})

inserting reduce_first(::typeof(Base.add_sum), x::FixedPointNumbers.FixedPoint) in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/HAGk2/src/FixedPointNumbers.jl:295 invalidated:
  backedges: 1: superseding reduce_first(::typeof(Base.add_sum), x) in Base at reduce.jl:394 with MethodInstance for Base.reduce_first(::typeof(Base.add_sum), ::Any) (309 children)

ArrayInterface.jl invalidates parts of Tar.jl: hopefully fix invalidations from ArrayInterface.jl by ranocha · Pull Request #138 · JuliaIO/Tar.jl · GitHub, hopefully fix more invalidations by ranocha · Pull Request #139 · JuliaIO/Tar.jl · GitHub

OrderedCollections.jl invalidates quite a bit

inserting convert(::Type{OrderedCollections.OrderedDict{K, V}}, d::OrderedCollections.OrderedDict{K, V}) where {K, V} in OrderedCollections at ~/.julia/packages/OrderedCollections/PRayh/src/ordered_dict.jl:110 invalidated:
 backedges: 1: superseding convert(::Type{T}, x::AbstractDict) where T<:AbstractDict in Base at abstractdict.jl:561 with MethodInstance for convert(::Type, ::AbstractDict) (134 children)

LoopVectorization.jl invalidates some code in HDF5.jl and indexing:

inserting convert(::Type{T}, i::LoopVectorization.UpperBoundedInteger) where T<:Number in LoopVectorization at /home/hendrik/.julia/packages/LoopVectorization/e7fJe/src/reconstruct_loopset.jl:25 invalidated:
 backedges: 1: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{UInt64}, ::Integer) (3 children)
            2: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int64}, ::Integer) (17 children)
            3: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int32}, ::Integer) (99 children)
 17 mt_cache

Geometry basics also invalidates code in HDF5.jl etc.

inserting convert(::Type{IT}, x::GeometryBasics.OffsetInteger) where IT<:Integer in GeometryBasics at /home/hendrik/.julia/packages/GeometryBasics/5Sb5M/src/offsetintegers.jl:40 invalidated:
 mt_backedges: 1: signature convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 (formerly convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7) triggered MethodInstance for Colors._precompile_() (1 children)
               2: signature convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 (formerly convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7) triggered MethodInstance for parse(::Type{ColorTypes.RGB{FixedPointNumbers.N0f8}}, ::String) (1 children)
               3: signature convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 (formerly convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7) triggered MethodInstance for Colors._parse_colorant(::String) (1 children)
 backedges: 1: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int64}, ::Integer) (9 children)
            2: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int32}, ::Integer) (91 children)

ForwardDiff.jl invalidates some string code: fix type instability/invalidations from `nextind` by ranocha · Pull Request #46489 · JuliaLang/julia · GitHub

StaticArrays.jl also invalidates quite a bit, e.g.,

inserting similar(::Type{A}, shape::Union{Tuple{SOneTo, Vararg{Union{Integer, Base.OneTo, SOneTo}}}, Tuple{Union{Integer, Base.OneTo}, SOneTo, Vararg{Union{Integer, Base.OneTo, SOneTo}}}, Tuple{Union{Integer, Base.OneTo}, Union{Integer, Base.OneTo}, SOneTo, Vararg{Union{Integer, Base.OneTo, SOneTo}}}}) where A<:AbstractArray in StaticArrays at ~/.julia/packages/StaticArrays/68nRv/src/abstractarray.jl:156 invalidated:
 mt_backedges: 1: signature Tuple{typeof(similar), Type{Array{Union{Int64, Symbol}, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Union{Int64, Symbol}, _A}}, ::Union{Integer, AbstractUnitRange}) where _A (0 children)
               2: signature Tuple{typeof(similar), Type{Array{Union{Int64, Symbol}, _A}} where _A, Any} triggered MethodInstance for Base._array_for(::Type{Union{Int64, Symbol}}, ::Base.HasShape, ::Any) (0 children)
               3: signature Tuple{typeof(similar), Type{Array{Any, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Any, _A}}, ::Union{Integer, AbstractUnitRange}) where _A (0 children)
               4: signature Tuple{typeof(similar), Type{Array{Any, _A}} where _A, Any} triggered MethodInstance for Base._array_for(::Type{Any}, ::Base.HasShape, ::Any) (0 children)
               5: signature Tuple{typeof(similar), Type{Array{Base.PkgId, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Base.PkgId, _A}}, ::Union{Integer, AbstractUnitRange}) where _A (0 children)
               6: signature Tuple{typeof(similar), Type{Array{Base.PkgId, _A}} where _A, Any} triggered MethodInstance for Base._array_for(::Type{Base.PkgId}, ::Base.HasShape, ::Any) (0 children)
               7: signature Tuple{typeof(similar), Type{Array{Union{Int64, Symbol}, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Union{Int64, Symbol}, _A}}, ::Tuple{Union{Integer, Base.OneTo}}) where _A (9 children)
               8: signature Tuple{typeof(similar), Type{Array{Base.PkgId, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Base.PkgId, _A}}, ::Tuple{Union{Integer, Base.OneTo}}) where _A (9 children)
               9: signature Tuple{typeof(similar), Type{Array{Any, _A}} where _A, Tuple{Union{Integer, AbstractUnitRange}}} triggered MethodInstance for similar(::Type{Array{Any, _A}}, ::Tuple{Union{Integer, Base.OneTo}}) where _A (136 children)

There are of course more invalidations, but they seem to be less severe, e.g.,
hopefully fix invalidations in API from AbstractFFTs by ranocha · Pull Request #3180 · JuliaLang/Pkg.jl · GitHub

However, I would have expected that these invalidations happen also with Julia v1.7 - or do I miss something?

ChrisRackauckas · August 26, 2022, 12:14pm

There’s two facts that collide to make this matter more now. One is:

github.com/JuliaLang/julia

Cache external CodeInstances

JuliaLang:master ← JuliaLang:teh/cache_external

opened 08:10PM - 30 Jan 22 UTC

timholy

+724 -96

Builds on #43793 and #43881. Tested each package with a task for which it was pr…ecompiled (I've devved most of these, see diffs that force precompilation [here](https://gist.github.com/timholy/3e7185761d151a9c21072e28b8804a3d); ModelingToolkit is [here](https://github.com/SciML/ModelingToolkit.jl/pull/1215#issuecomment-1026902709)). TTFT = Time To Finish Task EDIT: latest benchmarks are in https://github.com/JuliaLang/julia/pull/43990#issuecomment-1044612013 | Package | Load (nightly) | Load (PR) | Load update (PR) | TTFT (nightly) | TTFT (PR) | |:--- | ---:| ---:| ---:| ---:| ---:| | CSV | 2.6 | 2.0 | | 15.8 | 12.9 | | DataFrames | 1.3 | 1.5 | | 12.5 | 10.4 | | Plots | 4.1 | 6.7 | 4.6 | 8.9 | 9.5 | | GLMakie | 7.3 | 15.4 | 11.4 | 60.4 | 42.1 | | OrdinaryDiffEq | 6.8 | 8.8 | | 1.4 | 1.3 | | ModelingToolkit | 11.8 | 17.1 | | 36.7 | 20.3 | | Flux | 7.5 | 8.7 | 8.2 | 24.9 | 15.7 | | JuMP | 4.8 | 5.0 | | 2.5 | 1.6 | | ImageFiltering | 2.5 | 3.0 | | 1.2 | 1.2 | | LV (see below) | 2.5 | 2.9 | | 8.6 | 0.2 | "Load update" includes the changes in https://github.com/JuliaLang/julia/pull/43990#issuecomment-1026222643. (If not reported, they seemed unchanged from the previous measurement.) Looks like a clear win. (It's not obvious the TTFT for Plots is a significant difference, I've never replicated that gap.) Closes #42016 Fixes #35972 and possibly the other issues listed in https://github.com/JuliaGraphics/Colors.jl/issues/496 (I haven't checked yet).

This means that if a precompile is missing in package X that is used/needed to precompile a call in package Y, it will now precompile with the ownership of package Y. This has 2 effects: one is that more precompilation will happen, two is that if package Z also needs the missing precompile, then package Y and package Z will precompile their own versions of the call from package X.

The more precompilation will increase load times but normally decrease first solve time, if types tend to match etc. But, it will increase load times more if a call is precompiled multiple times. The solution then is to try and precompile “what we know is needed” in package X, and use the system of external precompilation as sparingly as possible. It’s required to make things work (for example, Base misses precompilation of Vector(::Uninitiaialized,::Tuple) so oops you might need that), but don’t overrely on it.

The next is how SnoopPrecompile changes the game:

The main fact is that uninferred calls can now do precompilation effectively. Go back and re-read this issue in full:

github.com/SciML/DifferentialEquations.jl

22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times!

opened 11:49PM - 12 Aug 21 UTC

closed 11:14AM - 18 Sep 23 UTC

ChrisRackauckas

# Take awhile to precompile with style all of the DiffEq stack, no denial Our… goal is to get the entire DiffEq stack compile times down. From now on we should treat compile time issues and compile time regressions just like we treat performance issues, 😢 😿 😭 and then fix them. For a very long time we did not do this because, unlike runtime performance, we did not have a good way to diagnose the causes, so it was 🤷 whatever we got is what we got, move on. But now, thanks to @timholy, we have the right tools and have learned how to handle compile time issues in a very deep way so the game is on. Our goal is to get compile times of all standard workflows to at least 0.1 seconds. That is quick enough that you wouldn't care all that much in interactive usage, but still not an unreasonable goal given the benchmarks. This issue is how to get there and how the community is can help. This will cover: - What have we done so far (so you can learn from our experience) - How much has it mattered and to what use cases - How can a user give helpful compile time issues - What are some helpful tricks and knowledge to share? - What are some of the next steps Let's dig in. ## What were our first strides? We have already made a great leap forward in the last week+ since the JuliaCon hackathon. The very long tl;dr i in https://github.com/SciML/DiffEqBase.jl/pull/698 . However, that doesn't capture the full scope of what was actually done: - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1460 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1467 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1468 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1469 - https://github.com/SciML/DiffEqBase.jl/pull/688 - https://github.com/SciML/DiffEqBase.jl/pull/696 - https://github.com/SciML/DiffEqBase.jl/pull/697 - https://github.com/SciML/DiffEqBase.jl/pull/698 - https://github.com/SciML/SciMLBase.jl/pull/95 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/147 - https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/29 - https://github.com/YingboMa/RecursiveFactorization.jl/pull/30 - https://github.com/JuliaSIMD/TriangularSolve.jl/pull/8 - https://github.com/SciML/OrdinaryDiffEq.jl/pull/1470 (about to merge after a few more fixes) with some bonus PRs and issues like: - https://github.com/JuliaGraphs/LightGraphs.jl/pull/1581 - https://github.com/JuliaDiff/DiffRules.jl/pull/64 - https://github.com/JuliaDebug/Cthulhu.jl/issues/184 - https://github.com/JuliaDebug/Cthulhu.jl/pull/185 - https://github.com/JuliaLang/julia/issues/41750 - https://github.com/JuliaLang/julia/pull/41813 ## Show me some results The net result is something like this. On non-stiff ODEs, compile times dropped from about 5 seconds to sub 1 second, and on stiff ODEs compile times dropped from about 22 seconds to 2.5 seconds. The tests are things like: https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465 ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Tsit5() tinf = @snoopi_deep solve(prob,alg) itrigs = inference_triggers(tinf) itrig = itrigs[13] ascend(itrig) @time solve(prob,alg) ``` ```julia v5.60.2 InferenceTimingNode: 1.249748/4.881587 on Core.Compiler.Timings.ROOT() with 2 direct children Before InferenceTimingNode: 1.136504/3.852949 on Core.Compiler.Timings.ROOT() with 2 direct children Without `@turbo` InferenceTimingNode: 0.956948/3.460591 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` InferenceTimingNode: 0.941427/3.439566 on Core.Compiler.Timings.ROOT() with 2 direct children With `@turbo` InferenceTimingNode: 1.174613/11.118534 on Core.Compiler.Timings.ROOT() with 2 direct children With `@inbounds @simd` everywhere InferenceTimingNode: 0.760500/1.151602 on Core.Compiler.Timings.ROOT() with 2 direct children # Today, a week after that PR InferenceTimingNode: 0.634172/0.875295 on Core.Compiler.Timings.ROOT() with 1 direct children 🎉 (it automatically does the emoji because the computer is happy too) ``` You read this as, it used to take 1.25 seconds for inference and 4.88 seconds for compilation in full, but now it's 0.63 and 0.88. and https://github.com/SciML/DiffEqBase.jl/pull/698 ```julia function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) using OrdinaryDiffEq, SnoopCompile prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) ``` ```julia After basic precompilation (most of the above PRs): InferenceTimingNode: 1.460777/16.030597 on Core.Compiler.Timings.ROOT() with 46 direct children After fixing precompilation of the LU-factorization tools: InferenceTimingNode: 1.077774/2.868269 on Core.Compiler.Timings.ROOT() with 11 direct children ``` So that's the good news. The bad news. ## The precompilation results do not always generalize Take for example https://github.com/SciML/DifferentialEquations.jl/issues/785: ```julia using DifferentialEquations, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) InferenceTimingNode: 1.535779/13.754596 on Core.Compiler.Timings.ROOT() with 7 direct children ``` "But Chris, I thought you just said that was sub 3 seconds, not 13.75 seconds compile time!". Well, that's with `using OrdinaryDiffEq` instead of `using DifferentialEquations`. And we see this when DiffEqSensitivity.jl or DiffEqFlux.jl gets involved. So compile times are "a lot better"*, and the `"`+`*` are right now necessary when saying that. We need to fix that aspect of it. ## How can I as a user help? Good question, thanks for asking! Sharing profiles is extremely helpful. Take another look at https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-895152646 . What was ran was: ```julia using OrdinaryDiffEq, SnoopCompile function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem(lorenz,u0,tspan) alg = Rodas5() tinf = @snoopi_deep solve(prob,alg) using ProfileView ProfileView.view(flamegraph(tinf)) ``` ![image](https://user-images.githubusercontent.com/1814174/129282082-ac51270f-5843-4bcc-a452-8aa663c458b8.png) What this was saying was that the vast majority of the compile time was because the `DEFAULT_LINSOLVE` calling RecursiveFactorization.jl was not precompiling. Since we use our own full Julia-based BLAS/LAPACK stack, that gave a full 13 seconds of compilation since it would compile RecursiveFactorization.jl, TriangularSolve.jl, etc. in sequence on each first solve call of a session. This allowed us to identify the issue can create a tizzy of PRs that finally made that get cached. If you check the DifferentialEquations.jl compile times, you'll see that part is back. Why won't it compile? Well that's a harder question, discussed in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-896984234, with the fix in https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008, but apparently gets invalidated. If that all doesn't make sense to you, that's fine! But if you can help us narrow in on what the real issues are, that will help us immensely. And of course... you can always give us a sponsor and star the repos to help 😅 . But seriously though, I hope we can start using SciML funds to start scouring our repos and get a lot of people fixing compile time issues. More on that soon... very soon... 🤐 ## What are some helpful tricks and knowledge to share? Yeah, what were our tricks? Well the big one is forcing compilation in using calls through small solves. In many cases compilation is solved by just putting a prototype solve inside of the package itself. For example, take a look at the precompile section of OrdinaryDiffEq.jl (https://github.com/SciML/OrdinaryDiffEq.jl/blob/v5.61.1/src/OrdinaryDiffEq.jl#L175-L193) ```julia let while true function lorenz(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end lorenzprob = ODEProblem(lorenz,[1.0;0.0;0.0],(0.0,1.0)) solve(lorenzprob,Tsit5()) solve(lorenzprob,Vern7()) solve(lorenzprob,Vern9()) solve(lorenzprob,Rosenbrock23())(5.0) solve(lorenzprob,TRBDF2()) solve(lorenzprob,Rodas4(autodiff=false)) solve(lorenzprob,KenCarp4(autodiff=false)) solve(lorenzprob,Rodas5()) break end end ``` Lorenz equation takes nanoseconds to solve, so we take this equation and solve it a few times at `using` time, which will then trigger precompilation of as many functions hit in that call stack as Julia will allow for. For some things, the reason why they are not precompiled is simply because they are never called during `using` time, so a quick fix for many of the compile time issues is to simply add a short little statement like this to `using` time and then Julia will cache its results. For everything else, there's Mastercard, and it'll be much more costly to solve, so those will need issues and the experts, and possibly some Base compiler changes. But we should at least grab all of the low hanging fruit ASAP. **This is actually a necessary condition for getting precompilation, since if a method is never called in using then it will never precompile, so this is a first step among many.** A major part of the solution was avoiding codegen when unnecessary. If you take a look at https://github.com/SciML/OrdinaryDiffEq.jl/pull/1465, you'll see things like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache{<:Array}, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator uidx = eachindex(integrator.uprev) @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+a*k1[i] end f(k2, tmp, p, t+c1*dt) @inbounds @simd ivdep for i in uidx tmp[i] = uprev[i]+dt*(a31*k1[i]+a32*k2[i]) end ... ``` i.e. new solver dispatches specifically for `Array`, to avoid dispatches like: ```julia @muladd function perform_step!(integrator, cache::Tsit5Cache, repeat_step=false) @unpack t,dt,uprev,u,f,p = integrator @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache a = dt*a21 @.. tmp = uprev+a*k1 f(k2, tmp, p, t+c1*dt) @.. tmp = uprev+dt*(a31*k1+a32*k2) ... ``` The reason is that compile time profiling showcased that the major contributor was these https://github.com/YingboMa/FastBroadcast.jl code generation steps. Base Broadcast too is a major contributor to compile times. So at least on the pieces of code that 99% of users are using, we just expanded them out by hand, forced them to precompile, and that gave the 5 seconds to 1 second compile time change. I wouldn't say I am recommending you shouldn't use broadcast, but this is something useful to keep in mind. Broadcast is more generic, so you have to pay in compile time to use it. If you're willing to maintain the code, doing `if u isa Array` or making a separate dispatch can remove that factor. Map is also a major contributor. We actually knew this was the case already, since there was a map call which changed Pumas compile times on a small example from 3 seconds to over 40 (https://github.com/SciML/SciMLBase.jl/pull/45). Removing map calls factored into these changes as well (https://github.com/JuliaDiff/SparseDiffTools.jl/pull/149). Again, this is something that could/should be handled at the Base compiler level, but we wanted as much of a fix ASAP so this was a nice and easy change which was within our power to ship in a week, and you can do the same. The last thing, and the major step forward, was https://github.com/SciML/DiffEqBase.jl/pull/698#issuecomment-897188008 . As it explains, using a function barrier can cause inference to not know what functions it will need in a call, which makes Julia **seem to compile a whole lot of junk**. That can probably get fixed at the Base compiler level to some extent, as the example there was spending 13 seconds compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Any, ::Vector{Float64}, ::Bool)` junk which nobody could ever call, since any call to that function would specialize on the second argument (a matrix type) and so it should've been compiling `(::DefaultLinSolve)(::Vector{Float64}, ::Matrix{Float64}, ::Vector{Float64}, ::Bool)`. But since that call was already precompiled, if inference ends up good enough that it realized it would call that instead, then bingo compile times dropped from 16 seconds to sub 3. That's a nice showcase that the easiest way to fix compile time issues may be to fix inference issues in a package, because that can make Julia narrow down what methods it needs to compile more, and then it may have a better chance of hitting the pieces that you told it to precompile (via a using-time example in a precompile.jl). However, even if you have dynamism, you can still improve inference. The DiffEq problem was that ForwardDiff has a dynamically chosen `chunksize` based on the size of the input ODEs `u`. We previously would defer this calculation of the chunksize until the caches for the Jacobian were built, so then all dual number caches would have a `where N` on the chunk size from inference. But a function barrier makes the total runtime cost 100ns, so whatever that's solved right? But because all of the caches have this non-inference of the chunksize, Julia's compiler could not determine all of the `N`'s were the same, so inference would see far too many types in expressions and just give up. That's how `A::Matrix{Float64}` became `Any` in the inference passes. When Julia hits the function barrier, it will cause inference to trigger again, in which case inside the function it will now be type-stable, so again no runtime cost, but at the first compile it will compile all possible methods it may ever need to call... which is a much bigger universe than what we actually hit. We noticed this was the case because if the user manually put in a chunksize, i.e. `Rodas5(chunk_size = 3)`, then this all went away. So what we did was setup a hook so that the moment `solve` is called, it goes "have you defined the chunk size? If not, let's figure it out right away and then hit `__solve`. By doing this function barrier earlier, it at least knows that the chunk size should always match the `N` type parameter of the solver algorithm, so all of the `N`s are the same, which makes it realize there's less types and use more of the compile caches before. That one is harder than the others, but it makes the more general point that caching compilation is only good if inference is good enough. And of course, check out Tim Holy's workshop and JuliaCon video which is immensely helpful at getting started. https://www.youtube.com/watch?v=rVBgrWYKLHY https://www.youtube.com/watch?v=wXRMwJdEjX4 ## What are the next steps? Well, the next steps are to get the compile times below 0.1 seconds everywhere of course. I've already identified some issues to solve: - [x] https://github.com/SciML/DifferentialEquations.jl/issues/785 - [ ] https://github.com/SciML/DiffEqFlux.jl/pull/604 - [ ] https://github.com/SciML/DiffEqSensitivity.jl/pull/471 - [ ] How can we take the non-stiff ODE solvers even lower? There's pieces in the profile like Base.Logging that could probably get added to the Julia system image, etc. This needs a full investigation by someone who really knows their stuff. - [ ] Can we get effective CI for finding compile time regressions? Are there more things just missing some simple precompiles? Are there some major threads we're missing? Help us identify where all of this is so we can solve it in full. Compile times, here we come!

That was the old situation. The issue was, if we can get inference to happen higher, then precompilation will happen on RecursiveFactorization.jl and that will send the first solve time with implicit methods from 22 seconds to 3. Now with SnoopPrecompile, the pre-changed version probably already hits 3 (on current release, it’s now 0.5 seconds BTW).

Basically, this means a lot more precompiles. But because a lot more precompiles, doubling precompiles hurts more. And invalidating functions hurts even more. If you do @time using OrdinaryDiffEq, the real important stat is 75% of the time is recompilation. This is invalidations taking what was precompiled and throwing it away because loading a different package (Static.jl, LoopVectorization.jl) invalidates the precompiled version.

So in the end, a lot more gets precompiled so the load time is increased (because of the ownership and non-inferred help), this does have a major improvement on the first solve time, but it increases using time, which then explodes because invalidations throw away more than a majority of that precompile work.

Therefore, invalidations matter a whole lot more now. It’s time to fix as much as we can there.

ranocha · August 26, 2022, 1:08pm

Yeah, right, that’s explains (at least a part of) this.

ranocha · August 26, 2022, 1:40pm

Out of curiosity: Are invalidation fixes usually backported (to release-1.8 in this case) or do we have to wait for Julia v1.9?

giordano · August 26, 2022, 3:10pm

But do we have tools that can be used for example in CI to make sure invalidations aren’t brought back again in the future? The fact is that waiting for someone enough pissed off to hunt down all the invalidations can work once, but isn’t much sustainable in the long run.

jishnub · August 26, 2022, 3:40pm

I think at a package level, one may assert that a PR doesn’t add invalidations. See e.g. https://github.com/JuliaArrays/OffsetArrays.jl/blob/master/.github/workflows/invalidations.yml

sloede · August 26, 2022, 3:42pm

First of all, thanks to everyone who offered some helpful suggestions!

Unfortunately, no. When I try to do using Trixi with the nightly build 36aab14a97, Julia segfaults with

[1693377] signal (11): Segmentation fault
in expression starting at REPL[1]:1
ijl_array_del_end at /cache/build/default-amdci5-3/julialang/julia-master/src/array.c:1144
jl_insert_method_instances at /cache/build/default-amdci5-3/julialang/julia-master/src/dump.c:2379 [inlined]
_jl_restore_incremental at /cache/build/default-amdci5-3/julialang/julia-master/src/dump.c:3273
[...]

Full error message

[1693377] signal (11): Segmentation fault
in expression starting at REPL[1]:1
ijl_array_del_end at /cache/build/default-amdci5-3/julialang/julia-master/src/array.c:1144
jl_insert_method_instances at /cache/build/default-amdci5-3/julialang/julia-master/src/dump.c:2379 [inlined]
_jl_restore_incremental at /cache/build/default-amdci5-3/julialang/julia-master/src/dump.c:3273
ijl_restore_incremental at /cache/build/default-amdci5-3/julialang/julia-master/src/dump.c:3333
_include_from_serialized at ./loading.jl:867
_require_search_from_serialized at ./loading.jl:1099
_require at ./loading.jl:1378
_require_prelocked at ./loading.jl:1260
macro expansion at ./loading.jl:1240 [inlined]
macro expansion at ./lock.jl:267 [inlined]
require at ./loading.jl:1204
jfptr_require_50250.clone_1 at /mnt/hd1/opt/julia/nightly-20220825-36aab14a97/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2447 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2629
jl_apply at /cache/build/default-amdci5-3/julialang/julia-master/src/julia.h:1854 [inlined]
call_require at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:466 [inlined]
eval_import_path at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:503
jl_toplevel_eval_flex at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:731
eval_body at /cache/build/default-amdci5-3/julialang/julia-master/src/interpreter.c:561
eval_body at /cache/build/default-amdci5-3/julialang/julia-master/src/interpreter.c:522
jl_interpret_toplevel_thunk at /cache/build/default-amdci5-3/julialang/julia-master/src/interpreter.c:751
jl_toplevel_eval_flex at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci5-3/julialang/julia-master/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
eval_user_input at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:152
repl_backend_loop at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:248
#start_repl_backend#46 at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:233
start_repl_backend##kw at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:230 [inlined]
#run_repl#59 at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:372
run_repl at /cache/build/default-amdci5-3/julialang/julia-master/usr/share/julia/stdlib/v1.9/REPL/src/REPL.jl:357
jfptr_run_repl_57495.clone_1 at /mnt/hd1/opt/julia/nightly-20220825-36aab14a97/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2447 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2629
#1007 at ./client.jl:413
jfptr_YY.1007_36884.clone_1 at /mnt/hd1/opt/julia/nightly-20220825-36aab14a97/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2447 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2629
jl_apply at /cache/build/default-amdci5-3/julialang/julia-master/src/julia.h:1854 [inlined]
jl_f__call_latest at /cache/build/default-amdci5-3/julialang/julia-master/src/builtins.c:774
#invokelatest#2 at ./essentials.jl:810 [inlined]
invokelatest at ./essentials.jl:807 [inlined]
run_main_repl at ./client.jl:397
exec_options at ./client.jl:314
_start at ./client.jl:514
jfptr__start_30331.clone_1 at /mnt/hd1/opt/julia/nightly-20220825-36aab14a97/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2447 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-3/julialang/julia-master/src/gf.c:2629
jl_apply at /cache/build/default-amdci5-3/julialang/julia-master/src/julia.h:1854 [inlined]
true_main at /cache/build/default-amdci5-3/julialang/julia-master/src/jlapi.c:567
jl_repl_entrypoint at /cache/build/default-amdci5-3/julialang/julia-master/src/jlapi.c:711
main at julia-nightly-20220825-36aab14a97 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
Allocations: 22294390 (Pool: 22283733; Big: 10657); GC: 12
Segmentation fault (core dumped)

Just to re-emphasize this from my original post: I do not mean to criticize any particular package.

However, at the moment your description does not match our observations from a (what I would call) regular user’s perspective: If we install OrdinaryDiffEq and Trixi into a fresh depot on a standard Linux machine, then the package installation time and the package loading time and the compilation time go up from 1.7.3 to 1.8.0.

These times do not just represent “convenience” issues for us: Longer package installation times means increased development times due to higher CI wait times. Longer loading times means it is harder to use these packages for quick demonstrations or for live experimentation when teaching university courses. Longer compilation times are problematic when running parallel jobs on supercomputers.

Actually, this is IMHO probably the biggest issue of all: I would assume that a language designed for high performance will - unless specifically announced - only ever have improved execution performance with each new release. Thus, these measurements were at least a surprise to us. I would be very interested in hearing from others if this regression has been observed for other use cases as well. Since we use Julia as an HPC language, a 7% regression in execution speed is non-negligible.

They improve, but unfortunately not completely back to Juila 1.7.3 levels:

P.S.: It seems like the title of my post was unilaterally changed by someone to something different from what I wrote. I think this is somewhat rude, especially since now the title does not fully reflect my original intent anymore (in my opinion, the performance regressions are only a question of type dispatches, given that compilation and execution performance is affected as well).

giordano · August 26, 2022, 3:52pm

Note that `Core.ifelse` calls to avoid invalidations from defining custom `Base.ifelse` methods by chriselrod · Pull Request #46366 · JuliaLang/julia · GitHub isn’t merged, so you’d have to compile julia yourself (maybe applying that patch on top of v1.8.0 tag, to minimise unrelated changes).

ChrisRackauckas · August 26, 2022, 4:00pm

You’re missing the huge caveat and the whole point. Your statement only for cases like Trixi.jl where special parameters types are used. That’s a huge deal. The first solve time is dramatically lower if that’s not the case. That’s pretty clear from the measurements. Regular users use Vector{Float64} or nothing for parameters: if you don’t believe that’s the case please provide evidence (I can tell you that from thousands of Discourse posts, more than 99% of them are in this case!). Yes, for the cases that people post about <1% of the time, which happens to be the case that you are looking at all of the time, there is this problem. I understand that increased Trixi compilation time, plus v1.8 changes and increased invalidations, but there is no reason to go doom and gloom beyond what’s actually true. Recognition of this fact is what will lead us to the real issues and the real solutions.

That is convenience. I’m sorry it’s now less convenient, but we will fix this. I need your help though on categorizing and profiling your case though in order to do this efficiently. My compute resources are swamped trying to survey the possible cases.

Note that you can cache the precompilation (or do a system image build that’s cached) that would remove this, so for CI infrastructure there are some easy fixes. If you need it, we can also get you some more CI compute on the AMDCI machines. In fact, I’m curious whether @giordano has any build scripts that do a single precompilation step for a multi-group CI test.

Yes, but that’s a completely different thread. Please create a separate thread on this. It’s different profiles, different causes and effects, etc. It would just be confusing to address it here because it has nothing to do with the compile times which is a whole discussion of its own. As I said, I think it’s due to the effects system, we can take a look at a thread with profiles and everything, but putting all of that into a thread about precompilation would be unreadable so it’s best to keep two completely separate topics in separate threads. I’d be happy to dig into this with you, but that’s 25+ posts with images etc. on its own. Handling this precompilation is already long.

The CI builds have prebuilt artifacts you can download. I just learned that the other day: it’s so much easier .

I did that and I’ll take full responsibility. I don’t think it’s rude because if someone finds this thread they will find 13 in-depth posts about v1.8 precompilation changes and how it adversely effects the special type cases which are not covered by the package snooping. They will find nothing about runtime changes in v1.8, which is a completely separate topic. Keeping things organized and searchable is helpful. But again, there’s no reason to not discuss v1.8 runtime changes, it’s just a separate thread and a tangent in a discussion about precompilation.

ChrisRackauckas · August 26, 2022, 4:03pm

Generally these aren’t cases that just come and go. These Static, LoopVectorization, Symbolic, and ChainRules core invalidations have been there for a long time. It’s just that precompilation never really did much, so no one really cared.

It would be good to add invalidation testing to infrastructure somehow, but solving the root cases in the core packages that cause the vast majority of issues is relatively maintainable. There just aren’t that many packages that are used by the majority of Julia users and which happen to overload something like !.

ChrisRackauckas · August 26, 2022, 4:06pm

Can you share the recompilation percentage on v1.8 that you’re seeing? @time on the using should show that.

Also share @time_imports Trixi

giordano · August 26, 2022, 9:07pm

Sure, “brought back” wasn’t the best word choice, I really meant how to not introduce new one in the future. It looks like the script suggested by Jishnu above can help with that.

ChrisRackauckas · August 27, 2022, 1:02pm

Oh I see what you were asking for now. Yeah, the script can work, but it’s overly sensitive. “Most” invalidations don’t mean very much, so getting a red test from adding one invalidation to a high level call is a bit too conservative. You almost want to pair it with a cost model of precompilation. We might adopt it in SciML if someone would help us slam the script around 100 repos, though there would need to be some judgement calls made it on (the output would help make said judgements though!)

ranocha · September 7, 2022, 3:26pm

To keep people up to date: We distributed an updated variant of the script mentioned above to all SciML repos and the Trixi framework. I also made quite a few PRs to other basic packages in the ecosystem. Let’s see how things evolve from here on and let’s work together to fix invalidations!

ChrisRackauckas · September 7, 2022, 4:38pm

And there was a major change to the function wrapping for late wrapping that should help first solve times for all downstream users, even Trixi, if Trixi snoop compiles. That’ll get a write-up soon. Also, other changes like A bunch of ambiguity fixes by ChrisRackauckas · Pull Request #1753 · SciML/OrdinaryDiffEq.jl · GitHub

Topic		Replies	Views
Shouldn't 1.8.0 be faster than Julia 1.7? Performance	30	2610	September 16, 2022
Taking TTFX seriously: Can we make common packages faster to load and use Performance ttfp	125	11888	June 20, 2022
22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times! Performance precompilation	6	5054	August 17, 2021
Package load time regressions in v1.8-beta3 Internals & Design	23	3237	June 1, 2022
Help building MWE for 1.8 latency regression Performance	11	973	November 2, 2022

TL;DR

Longer version

Setup and measurements

Package installation

Package loading

Compilation

Time-to-first-solution

Runtime of numerical simulation kernels

Conclusions

Related topics