One doubt/curiosity: if one loads a package in a fresh Julia session and then later opens Julia again in the same environment, should the load time now be very fast? That is, is there any reason for recompilation to occur between two subsequent loads of a package if nothing has changed in the meantime?
Invalidations happen when you load the package, not when you run the workload. You’re snooping the wrong step.
If the package needs precompilation, then yes, using MyPkg will be very slow the first time. But loading is still slow (less than in the first case) even once it's been precompiled. The second form of slowness has nothing to do with I/O (that's fast); instead, almost all of it goes to checking whether the loaded code is still valid.
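You can get a rough feel for this by timing the load in two fresh sessions (a sketch; Dates is a stdlib used here only so the snippet runs anywhere, and the effect is far larger for a heavy package like Plots):

```julia
# Time loading in a fresh session. For a package that needs
# precompilation, the first run in a fresh session is very slow; a
# second fresh session reuses the precompile cache but still pays for
# validating the cached code before it can be used.
t = @elapsed @eval using Dates
println("load time: ", t, " s")
```

Within a single session, repeating the `using` is nearly instant, because the validation work is only done once per session.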
For those who curse invalidations, keep in mind they are a feature, not a bug, even though they cause headaches. Julia has three remarkable properties:
- it’s aggressively compiled & automatically specialized
- it’s interactive
- we attempt to guarantee that the code that runs is the same no matter how you got there: no matter whether you loaded everything in your “universe of code” before triggering compilation, or whether you built up your system incrementally at the REPL with usage, you get the same outcome.
The last property is critical and is what makes the REPL actually work. Otherwise you’d get different outcomes when you used the REPL vs something “hardbaked” in a package. But it’s also what requires us to check for invalidations, and what makes package loading slow.
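A tiny REPL demonstration of why that guarantee forces invalidation (toy functions made up for illustration):

```julia
f(x) = 1
g(x) = f(x) + 1
g(2)            # compiles g; the call to f resolves to the only method
f(x::Int) = 2   # a more specific method appears: g's compiled code is stale
g(2)            # returns 3: Julia invalidated and recompiled g, exactly
                # as if f(::Int) had existed from the start
```

Without invalidation, the second `g(2)` would still return 2, and the result would depend on the order in which you defined things.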
Thanks. After this session is over (I'll be traveling after lunch) I will open an issue explaining everything I've tried and the results. You should now be able to install GMT automatically (unless you are also bitten by the libnetcdf/libcurl fight).
I just share this script with people, which IMO should be in the docs in some form.
using SnoopCompile
invalidations = @snoopr begin
    # Your code goes here!
    using OrdinaryDiffEq
    function lorenz(du, u, p, t)
        du[1] = 10.0(u[2] - u[1])
        du[2] = u[1] * (28.0 - u[3]) - u[2]
        du[3] = u[1] * u[2] - (8 / 3) * u[3]
    end
    u0 = [1.0; 0.0; 0.0]
    tspan = (0.0, 100.0)
    prob = ODEProblem{true,false}(lorenz, u0, tspan)
    alg = Rodas5()
    tinf = solve(prob, alg)
end;
trees = SnoopCompile.invalidation_trees(invalidations);
@show length(SnoopCompile.uinvalidated(invalidations)) # show total invalidations
show(trees[end]) # show the most invalidated method
# Count number of children (number of invalidations per invalidated method)
n_invalidations = map(trees) do methinvs
    SnoopCompile.countchildren(methinvs)
end
import Plots
Plots.plot(
    1:length(trees),
    n_invalidations;
    markershape=:circle,
    xlabel="i-th method invalidation",
    label="Number of children per method invalidation",
)
Just replace the OrdinaryDiffEq code with the code you want analyzed, and this will spit out an informative plot of the invalidation distribution plus the biggest invalidating method.
The script errors when there are no invalidations:
julia> include("using_SC.jl")
length(SnoopCompile.uinvalidated(invalidations)) = 0
ERROR: LoadError: BoundsError: attempt to access 0-element Vector{SnoopCompile.MethodInvalidations} at index [0]
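Until that's fixed, one way to make the script robust is to guard the indexing (a sketch, not part of SnoopCompile itself):

```julia
# trees is the Vector returned by SnoopCompile.invalidation_trees;
# indexing trees[end] throws a BoundsError when it is empty
if isempty(trees)
    println("No invalidations recorded; nothing to report.")
else
    show(trees[end])  # the most invalidated method
end
```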
Use the script from "Julia v1.9.0-beta2 is fast - #17 by tim.holy" instead. It shouldn't error, and it will also make your TODO list much, much smaller.
Getting an error when trying this.
Using an environment with julia-1.9.0-beta2 and only the three relevant packages installed.
julia> using SnoopCompileCore
julia> invs = @snoopr using ControlSystems;
julia> using SnoopCompile
julia> trees = invalidation_trees(invs)
ERROR: KeyError: key MethodInstance for LoopVectorization.avx_config_val(::Val{(false, 0, 0, 0, false, 0x0000000000000001, 1, true)}, ::Static.StaticInt{4}) not found
Stacktrace:
[1] getindex(h::Dict{Union{Int32, Core.MethodInstance}, Union{Tuple{Any, Vector{Any}}, SnoopCompile.InstanceNode}}, key::Core.MethodInstance)
@ Base ./dict.jl:484
[2] invalidation_trees(list::Vector{Any}; exclude_corecompiler::Bool)
@ SnoopCompile ~/.julia/packages/SnoopCompile/Q8zUg/src/invalidations.jl:403
[3] invalidation_trees(list::Vector{Any})
@ SnoopCompile ~/.julia/packages/SnoopCompile/Q8zUg/src/invalidations.jl:294
[4] top-level scope
@ REPL[4]:1
(testinv) pkg> st
Status `/tmp/testinv/Project.toml`
[a6e380b2] ControlSystems v1.5.3
[aa65fe97] SnoopCompile v2.9.5
[e2b509da] SnoopCompileCore v2.9.0
Sorry about that. I don't yet understand the error, but in the meantime there's a workaround available: pkg> add SnoopCompile#teh/circumvent_err (a real fix has since landed; use SnoopCompile 2.9.6 or higher). That should at least allow you to keep moving forward, with the understanding that there are no guarantees that the output is comprehensive (and if you see stuff that makes no sense, it would be worth suspecting a bug). I'll try to develop a clearer understanding in the next few days, but I wanted you to have something to play with.
The error has been fixed (correctly). You can get the fix with SnoopCompile 2.9.6 or higher. In case anybody is using the teh/circumvent_err branch, I'll try to remember to leave it lying around for a week or so, but it's better to upgrade to the latest release.
Thought I’d leave this here as well: I wanted to get a feel for how the user experience of the “original” TTFX - i.e. time-to-first-plot using Plots.jl - has evolved over the years, so I wrote a little benchmark script that times starting Julia and running using Plots; display(scatter(rand(1_000))) for all Julia versions since 0.7.
Here’s what the mean and minimum times look like across versions:
Initially I had included 0.6.4 as well, but with such an old Julia version one also has to fall back on a very old Plots version, which errors out on the display call. An alternative script with savefig instead of display suggests that 0.6 had similar TTFP to 0.7-1.2.
I think the plot makes clear the scale of the achievement here. While I'm sure there's more to come as the ecosystem starts to make best use of Julia's new abilities, it seems clear that this is the first time we've seen a dramatic qualitative shift in the sort of very basic experience that the modal user might have, without resorting to any tricks like PackageCompiler.
Some more boring details for those interested below:
Details
The benchmarks were run on an i7 Windows laptop; versioninfo() is:
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
Threads: 1 on 8 virtual cores
As Plots.jl has dropped compatibility for older Julia versions over time, going back in Julia versions also means going back to older Plots versions, so the benchmarks don’t just capture differences in precompilation and other language features, but changes in Plots itself.
The following table shows the Plots version used for every Julia version, as well as the average, minimum, and standard deviation of the 20 samples taken:
 Row │ version  Plots_version  average  minimum  std_dev
     │ String   String         Float64  Float64  Float64
─────┼───────────────────────────────────────────────────
   1 │ 0.6.4    0.17.4            11.5     10.7      0.9
   2 │ 0.7.0    0.19.3            21.5     19.8      1.3
   3 │ 1.0.5    1.4.3             18.5     18.1      0.5
   4 │ 1.1.1    1.4.3             21.3     18.3      4.4
   5 │ 1.2.0    1.4.4             22.8     21.4      1.5
   6 │ 1.3.1    1.6.12            27.3     26.2      0.7
   7 │ 1.4.2    1.6.12            35.0     30.0      5.4
   8 │ 1.5.4    1.24.3            35.7     31.9      3.3
   9 │ 1.6.5    1.27.6            24.6     21.7      2.4
  10 │ 1.7.3    1.37.2            24.3     22.6      2.0
  11 │ 1.8.3    1.38.0            20.6     17.7      2.8
  12 │ 1.9.0b   1.38.0-dev         7.6      4.1      1.1
Note that 1.9.0b uses a branch of Plots with a fix from Kristoffer that addresses an issue with the usage of cached code on Windows.
The whole benchmark code run is:
using DataFrames, ProgressMeter
path = "path/to/Julias"
julias = Dict(
    "0.6.4" => "Julia-0.6.4/bin/julia.exe",
    "0.7.0" => "Julia-0.7.0/bin/julia.exe",
    "1.0.5" => "Julia-1.0.5/bin/julia.exe",
    "1.1.1" => "Julia-1.1.1/bin/julia.exe",
    "1.2.0" => "Programs/Julia-1.2.0/bin/julia.exe",
    "1.3.1" => "Julia-1.3.1/bin/julia.exe",
    "1.4.2" => "Programs/Julia/Julia-1.4.2/bin/julia.exe",
    "1.5.4" => "Programs/Julia 1.5.4/bin/julia.exe",
    "1.6.5" => "Programs/Julia-1.6.5/bin/julia.exe",
    "1.7.3" => "Programs/Julia-1.7.3/bin/julia.exe",
    "1.8.3" => "Programs/Julia-1.8.3/bin/julia.exe",
    "1.9.0b" => "Programs/Julia-1.9.0-beta2/bin/julia.exe",
)
jdf = sort(DataFrame(version = collect(keys(julias)), location = collect(values(julias))), :version)
# Check versions
for r ∈ eachrow(jdf)
    println("Julia version: ", r.version)
    if r.version == "0.6.4"
        run(`$(normpath(path, r.location)) -e "println(Pkg.installed()[\"Plots\"])"`)
    else
        run(`$(normpath(path, r.location)) -e "import Pkg; println(Pkg.installed()[\"Plots\"])"`)
    end
    println()
end
n = 20
jdf.times = [zeros(n) for _ ∈ 1:nrow(jdf)]
# Run code
for r ∈ eachrow(jdf)
    println("Julia version: ", r.version)
    pn = normpath(path, r.location)
    @showprogress for i ∈ 1:n
        if r.version == "0.6.4"
            r.times[i] = @elapsed run(`$pn -e "using Plots; scatter(rand(1_000))"`)
        else
            r.times[i] = @elapsed run(`$pn -e "using Plots; display(scatter(rand(1_000)))"`)
        end
    end
    println()
end
using Plots, Statistics
jdf.average = mean.(jdf.times)
jdf.standard_dev = std.(jdf.times)
bar(jdf.version[2:end], jdf.average[2:end], label = "",
    linewidth = 0.1, ylabel = "Time (seconds)", xlabel = "Julia version",
    yticks = 0:5:50,
    title = "Mean/minimum time for 20 samples\nusing Plots; display(scatter(rand(1_000)))")
scatter!((1:nrow(jdf)-1) .- 0.5, minimum.(jdf.times)[2:end], label = "Minimum", markerstrokewidth = 0.1)
jdf.Plots_version = ["0.17.4", "0.19.3", "1.4.3", "1.4.3", "1.4.4", "1.6.12",
                     "1.6.12", "1.24.3", "1.27.6", "1.37.2", "1.38.0", "1.38.0-dev"]
select(jdf, :version, :average => (x -> round.(x, digits = 1)) => :average,
       :times => (x -> round.(minimum.(x), digits = 1)) => :minimum,
       :times => (x -> round.(std.(x), digits = 1)) => :std_dev)
That’s really cool! If you have the time and willingness, would you be able to check DifferentialEquations?
Here’s a one-liner that I believe should work even on ancient versions:
using DifferentialEquations; f(u,p,t) = 1.01*u; solve(ODEProblem(f,1/2,(0.0,1.0)), Tsit5(), abstol=1e-8)
Just wanted to say thanks to everybody who worked on this. On DFTK, with a simple SnoopPrecompile.@precompile_all_calls before a standard workflow in the package, TTFX went from 40s to 8s on my machine, and from 80s to 2s on somebody else's (not sure why the discrepancy…), which is pretty amazing. Precompilation costs did go through the roof, which is unfortunate, but we added an environment variable so precompilation can be disabled for developers.
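For anyone wanting to try the same, the pattern looks roughly like this (a sketch: MyPkg, its workload, and the MYPKG_PRECOMPILE variable are made up for illustration; only SnoopPrecompile's macro is real):

```julia
module MyPkg

using SnoopPrecompile

solve_workload(x) = 2x  # stand-in for the package's real API

# Let developers skip the (expensive) precompile workload by setting
# an environment variable before loading/precompiling the package
if get(ENV, "MYPKG_PRECOMPILE", "true") == "true"
    @precompile_all_calls begin
        # Run a representative workload; everything compiled while
        # executing this block gets cached in the package image
        solve_workload(1.0)
        solve_workload(1)
    end
end

end # module
```

Because the block actually executes at precompile time, it should be cheap enough and side-effect free (no file or network I/O).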
Well that was quite the Odyssey… Best to do these kinds of experiments with packages that take less than 10 minutes to install and precompile (I think on 0.6.4 it took 20 minutes just to download all the dependencies). I had also forgotten how annoying precompilation was before parallel precompilation and progress bars - just staring at a terminal with nothing happening for minutes on end isn’t fun.
And after all that, the results are… not exactly overwhelming.
Note that I couldn't get the package to precompile on 1.4 and 1.5, so I skipped those Julia versions.
Here we are covering 4 major DiffEq releases (from 4.5 to 7.6), so I assume we are comparing across some pretty major changes to the package itself, and during installation I noticed that later versions pull in a lot of heavy dependencies like LoopVectorization. On 1.9-beta2 it looks like virtually all the time is spent in using DifferentialEquations, so maybe there's not a lot to be gained on the actual first solve. But I don't really know anything about DifferentialEquations, so I'm not the right person to make sense of these results!
Here are the package versions used and summary statistics of the 20 samples for each Julia version:
 Row │ version  DiffEq_version  average  minimum  std_dev
     │ String   String          Float64  Float64  Float64
─────┼────────────────────────────────────────────────────
   1 │ 0.6.4    4.5.0               7.1      6.8      0.4
   2 │ 0.7.0    5.2.1              14.3     11.1      1.8
   3 │ 1.0.5    6.10.1             25.9     20.5      2.4
   4 │ 1.1.1    6.10.1             31.2     29.1      2.0
   5 │ 1.2.0    6.11.0             32.1     24.5      4.7
   6 │ 1.3.1    6.18.0             32.3     26.7      4.7
   7 │ 1.6.5    7.6.0              12.8     11.4      1.4
   8 │ 1.7.3    7.6.0              10.4      9.5      0.7
   9 │ 1.8.3    7.6.0              17.7     17.3      0.4
  10 │ 1.9.0b   7.6.0              10.7      9.9      0.8
Full code
using DataFrames, ProgressMeter
path = "path/to/julias"
julias = Dict(
    "0.6.4" => "Julia-0.6.4/bin/julia.exe",
    "0.7.0" => "Julia-0.7.0/bin/julia.exe",
    "1.0.5" => "Julia-1.0.5/bin/julia.exe",
    "1.1.1" => "Julia-1.1.1/bin/julia.exe",
    "1.2.0" => "Programs/Julia-1.2.0/bin/julia.exe",
    "1.3.1" => "Julia-1.3.1/bin/julia.exe",
    #"1.4.2" => "Programs/Julia/Julia-1.4.2/bin/julia.exe",
    #"1.5.4" => "Programs/Julia 1.5.4/bin/julia.exe",
    "1.6.5" => "Programs/Julia-1.6.5/bin/julia.exe",
    "1.7.3" => "Programs/Julia-1.7.3/bin/julia.exe",
    "1.8.3" => "Programs/Julia-1.8.3/bin/julia.exe",
    "1.9.0b" => "Programs/Julia-1.9.0-beta2/bin/julia.exe",
)
jdf = sort(DataFrame(version = collect(keys(julias)), location = collect(values(julias))), :version)
# Check versions
for r ∈ eachrow(jdf)
    println("Julia version: ", r.version)
    if r.version == "0.6.4"
        run(`$(normpath(path, r.location)) -e "println(Pkg.installed()[\"DifferentialEquations\"])"`)
    else
        run(`$(normpath(path, r.location)) -e "import Pkg; println(Pkg.installed()[\"DifferentialEquations\"])"`)
    end
    println()
end
n = 20
jdf.times = [zeros(n) for _ ∈ 1:nrow(jdf)]
# Run code
for r ∈ eachrow(jdf)
    println("Julia version: ", r.version)
    pn = normpath(path, r.location)
    @showprogress for i ∈ 1:n
        r.times[i] = @elapsed run(`$pn -e "using DifferentialEquations; f(u,p,t) = 1.01*u; solve(ODEProblem(f,1/2,(0.0,1.0)), Tsit5(), abstol=1e-8)"`)
    end
    println()
end
using Plots, Statistics
jdf.average = mean.(jdf.times)
jdf.standard_dev = std.(jdf.times)
bar(jdf.version, jdf.average, label = "",
    linewidth = 0.1, ylabel = "Time (seconds)", xlabel = "Julia version",
    yticks = 0:5:50,
    title = "Mean/minimum time for 20 samples\nTime-to-first-DiffEq solve")
scatter!((1:nrow(jdf)) .- 0.5, minimum.(jdf.times), label = "Minimum", markerstrokewidth = 0.1)
jdf.DiffEq_version = ["4.5.0", "5.2.1", "6.10.1", "6.10.1", "6.11.0", "6.18.0",
                      "7.6.0", "7.6.0", "7.6.0", "7.6.0"]
print(select(jdf, :version, :DiffEq_version, :average => (x -> round.(x, digits = 1)) => :average,
             :times => (x -> round.(minimum.(x), digits = 1)) => :minimum,
             :times => (x -> round.(std.(x), digits = 1)) => :std_dev))
We should also keep in mind that the amazing new pkgimage caching of native code only solves "time to first X after things are already imported", and only once invalidations are fixed by package developers. Developers are separately solving the invalidation issues (which had started to pay off even before pkgimages). However, there is still a lot of work that can be done on the "import time" side of things. The plots above show "import time + time to first X after import", which can be confusing because:
- pkgimages (and fixing invalidations) turn “time to first X after import” from many seconds to milliseconds
- pkgimages do not change “import-only time” much
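To see where the import-only time goes, Julia 1.8+ ships @time_imports, which breaks the load time down per dependency (shown here with a stdlib so it runs anywhere; point it at Plots or DifferentialEquations for a more interesting breakdown):

```julia
using InteractiveUtils  # provides @time_imports on Julia 1.8+

# Prints one line per loaded package with its load time, flagging
# any precompilation or recompilation that happened along the way
@time_imports using Dates
```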
Nonetheless, this is an amazing step forward, with many other exciting ideas being hinted at in offhand comments on GitHub.
This is a very cool graph. I think this subject should have its own blog post in addition to being mentioned in the highlights post and plots like this would be excellent to show the trajectory over time. I can add some interpretation here:
- 0.7 versus 1.0: The 1.0 release deleted a lot of deprecation code, making it significantly leaner and faster than 0.7.
- 1.0 through 1.5: The compiler got smarter and generated faster code, but at the expense of worsening compile times. Some compensatory latency reductions were done in this era, but they were outpaced by improved compiler output taking more time (and LLVM getting slower).
- 1.6 through 1.8: Incremental improvements to compile time due to focus on reducing latency in both Julia and LLVM. We were always acutely aware of this, and people started shaming LLVM for its compile times; LLVM added compile time benchmarks and now tracks them and has been improving somewhat.
- 1.9: Machine code caching is transformative at significantly reducing TTFP. Now that it works, more can be done here: packages can precompile more now that it’s more worthwhile to do so; fixing invalidations now has larger proportional effect; more effort can be made on reducing the remaining bottlenecks in load time.
Notable is that the latter two phases retain and even build on the runtime performance improvements gained in the previous era, so it's a win-win: much improved load latency and much improved runtime performance.
Oh, and we might see big improvements in load / compile time from the newly landed “weak dependencies” functionality, which can reduce the number of dependencies that packages have considerably by allowing them to convert some regular deps to weak deps so that you don’t have to pay all the time on the off chance that you might use them. That requires more of an ecosystem wide movement though.
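For reference, the new mechanism is declared in a package's Project.toml roughly like this (a sketch with made-up names; the UUID shown is a placeholder and must be the weak dependency's real UUID):

```toml
[weakdeps]
SomePlottingPkg = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

[extensions]
# MyPkgPlottingExt lives in ext/MyPkgPlottingExt.jl and is loaded
# only when the user also loads SomePlottingPkg
MyPkgPlottingExt = "SomePlottingPkg"
```

Users who never load SomePlottingPkg pay neither its install nor its load cost.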
I’d also add all the method invalidation hunting that happened during this period, happening in a concerted manner across the entire ecosystem, with new package versions that are dependent upon work in Julia itself. The best compile time improvements are compilations that don’t need to happen!
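For a flavor of what that hunting looks like in practice: many fixes simply tighten inference so downstream compiled code no longer depends on an abstractly-inferred result (a hypothetical example; type assertions are one of the fix patterns the SnoopCompile docs suggest):

```julia
options = Dict{String,Any}("n" => 10)

# Before: callers infer Any for the result, so their compiled code is
# vulnerable to invalidation whenever relevant new methods appear
get_n_loose(opts) = opts["n"]

# After: the type assertion gives callers a concrete Int, removing the
# abstract call sites that invalidation would otherwise have to track
get_n_tight(opts) = opts["n"]::Int

@assert get_n_loose(options) == 10
@assert get_n_tight(options) == 10
```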
I think that’s quite interesting, thanks! (I should have told you to use only OrdinaryDiffEq.jl; that would have had far fewer dependencies, sorry about that!)