I have a package that my collaborators want to use like a script. Specifically, they want it to load quickly. Once it’s compiled, my part of the code is fast — around 0.01s, compared to the several seconds taken by my equivalent Python package. But loading the code still takes almost 30 seconds, even with all the cool precompilation stuff in Julia 1.9 and PrecompileTools. (Compare this to the 0.15 seconds to load the Python code, and Julia comes out behind.)
The problem is that a few of the packages mine depends on just take a really long time to load. I don’t mean to criticize these packages: they are awesome in so many respects, and have so much going on that I’m sure I couldn’t do any better. But to be specific, they are OrdinaryDiffEq and Symbolics. Even in an otherwise empty project, leaving my own package out of it entirely, and using all the tricks I found in PrecompileTools about making a “Startup” package and “healing” invalidations, it still takes quite a while just to run
using OrdinaryDiffEq
and/or
using Symbolics
(On my 7-year-old MacBook Pro, it’s about 22 seconds and 6 seconds, respectively.) There’s not a whole lot of (re)compilation being reported by @time using ..., but the load time is undeniable.
One point seems to be that both of those packages have lots of capabilities I don’t actually use in my package, many of which seem to be part of the precompilation for them (thus leading to huge compilation caches, I think…). Would it be possible and/or useful for me to turn off precompilation of those packages, and just rely on my own precompilation workloads to precompile whatever I do actually use? I’ve tried something like this with the preferences method suggested in PrecompileTools, but it didn’t seem to stop the precompilation from happening. In particular, it didn’t speed up the loading, though I can imagine that’s just because I did something wrong. Is this an avenue I should pursue?
Is there any other option? I had claimed that Julia 1.9 was going to make things so much better for my collaborators (who are all Python users I’ve been trying to convert), but that seems to be falling flat.
Thanks for the tip about DaemonMode. It looks like something that could come in handy for me. Unfortunately, I don’t think my colleagues will ever use my code if they have to do that.
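For context, that workflow would look something like the following (a sketch of DaemonMode’s documented serve()/runargs() usage; myscript.jl is a stand-in):

> julia --startup-file=no -e 'using DaemonMode; serve()' &
> julia --startup-file=no -e 'using DaemonMode; runargs()' myscript.jl

Keeping a server session alive in one terminal and routing every run through a client is exactly the kind of extra step they won’t put up with.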
Also, I should add that one of my colleagues has a newer M1 Mac, and reports that it takes almost 30 seconds to load for him as well. But I’ve now tested it on a (Linux) cluster, where OrdinaryDiffEq and Symbolics only take ~5 seconds. Still too long, but an interesting data point.
As a side note, using OrdinaryDiffEq on my desktop (Ryzen 7850X) with Julia master takes 1.5 seconds… That’s a factor of 1.6 faster than with Julia 1.9.
PrecompileTools is not automatic, and many packages have not adopted it yet. It is not clear to me whether native code caching is slowing you down here, or whether it just isn’t being used extensively enough.
Looks like OrdinaryDiffEq makes extensive use of precompilation, and there is a lot of it. But that chunk of code also shows that it uses preferences to control what gets precompiled. So maybe I do just need to try harder to make it not precompile, and precompile my own stuff instead.
You probably only need one solver, right? The preferences let you turn compilation of specific solver families on or off, so turn it off for the ones you don’t use.
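For example, something like this should work (a sketch: the preference names “PrecompileStiff” and “PrecompileAutoSwitch” are my reading of the precompile block linked above, so double-check them against the OrdinaryDiffEq version you actually have installed):

using Preferences, UUIDs

# OrdinaryDiffEq's UUID, so this works without loading (and thereby
# precompiling) the package first
ode = UUID("1dea7af3-3e70-54e6-95c3-0bf5283fa5ed")

# Skip the solver families you don't use; this writes to the active
# project's LocalPreferences.toml, and the package precompiles once
# more (with the smaller workload) on the next `using`
set_preferences!(ode, "PrecompileStiff" => false; force = true)
set_preferences!(ode, "PrecompileAutoSwitch" => false; force = true)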
For end-user applications, I usually provide a script that performs the PackageCompiler step as part of “installation”.
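Roughly like this (a minimal sketch; sys_myapp.so and precompile_script.jl are placeholder names, and precompile_script.jl should exercise the code paths your users actually hit):

using PackageCompiler

# Bake the heavy dependencies into a custom sysimage
create_sysimage([:OrdinaryDiffEq, :Symbolics];
                sysimage_path = "sys_myapp.so",
                precompile_execution_file = "precompile_script.jl")

Users then start Julia with julia --sysimage sys_myapp.so, and the load time mostly disappears, at the cost of rebuilding the sysimage whenever the packages are updated.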
I did some analysis of the loading time for these two packages for you.
julia> @time @time_imports using OrdinaryDiffEq, Symbolics
1.5 ms DocStringExtensions
0.3 ms Reexport
0.1 ms SuiteSparse
0.2 ms Requires
2.6 ms ArrayInterface
3.5 ms StaticArraysCore
0.4 ms ArrayInterface → ArrayInterfaceStaticArraysCoreExt
17.9 ms FunctionWrappers
0.4 ms MuladdMacro
9.4 ms OrderedCollections
0.3 ms UnPack
0.4 ms Parameters
0.9 ms Statistics
0.1 ms IfElse
33.1 ms Static
0.3 ms Compat
0.2 ms Compat → CompatLinearAlgebraExt
16.8 ms Preferences
0.3 ms SnoopPrecompile
6.8 ms StaticArrayInterface
1.4 ms ManualMemory
49.0 ms ThreadingUtilities
0.3 ms SIMDTypes
5.1 ms LayoutPointers
5.1 ms CloseOpenIntervals
441.1 ms StrideArraysCore
0.3 ms BitTwiddlingConvenienceFunctions
0.9 ms CpuId
153.6 ms CPUSummary 95.26% compilation time
7.2 ms PolyesterWeave
0.8 ms Polyester
0.4 ms FastBroadcast
34.9 ms ChainRulesCore
0.3 ms PrecompileTools
46.6 ms RecipesBase
0.8 ms SymbolicIndexingInterface
0.3 ms Adapt
0.1 ms DataValueInterfaces
1.0 ms DataAPI
0.2 ms IteratorInterfaceExtensions
0.2 ms TableTraits
41.5 ms Tables
2.3 ms GPUArraysCore
0.2 ms ArrayInterface → ArrayInterfaceGPUArraysCoreExt
18.5 ms RecursiveArrayTools
12.2 ms MacroTools
0.3 ms TruncatedStacktraces
0.8 ms ZygoteRules
1.2 ms ConstructionBase
21.0 ms Setfield
7.2 ms IrrationalConstants
1.1 ms DiffRules
4.1 ms DiffResults
0.2 ms OpenLibm_jll
0.3 ms NaNMath
0.4 ms LogExpFunctions
0.5 ms LogExpFunctions → LogExpFunctionsChainRulesCoreExt
0.3 ms JLLWrappers
5.2 ms OpenSpecFun_jll 87.29% compilation time
14.4 ms SpecialFunctions
1.0 ms SpecialFunctions → SpecialFunctionsChainRulesCoreExt
0.3 ms CommonSubexpressions
82.1 ms ForwardDiff
0.5 ms EnumX
1.2 ms PreallocationTools
0.5 ms FunctionWrappersWrappers
0.2 ms CommonSolve
0.3 ms ExprTools
1.0 ms RuntimeGeneratedFunctions
0.3 ms Tricks
12.9 ms Lazy
18.6 ms SciMLOperators
190.4 ms SciMLBase
8.6 ms DiffEqBase
0.3 ms FastClosures
3.2 ms ArrayInterfaceCore
39.0 ms HostCPUFeatures
280.0 ms VectorizationBase
4.8 ms SLEEFPirates
61.1 ms OffsetArrays
1.3 ms StaticArrayInterface → StaticArrayInterfaceOffsetArraysExt
224.1 ms LoopVectorization
0.3 ms LoopVectorization → SpecialFunctionsExt
5.3 ms LoopVectorization → ForwardDiffExt
1.4 ms TriangularSolve
235.5 ms RecursiveFactorization
15.2 ms IterativeSolvers
34.5 ms KLU
3.7 ms Sparspak
6.7 ms FastLapackInterface
22.4 ms Krylov
52.5 ms KrylovKit
270.0 ms LinearSolve
696.5 ms StaticArrays
5.7 ms StaticArrayInterface → StaticArrayInterfaceStaticArraysExt
0.2 ms Adapt → AdaptStaticArraysExt
2.4 ms ConstructionBase → ConstructionBaseStaticArraysExt
0.7 ms ForwardDiff → ForwardDiffStaticArraysExt
5.8 ms FiniteDiff
104.8 ms SimpleNonlinearSolve
8.2 ms NLSolversBase
7.3 ms LineSearches
0.2 ms SimpleUnPack
93.8 ms DataStructures
38.0 ms GenericSchur
1544.0 ms ExponentialUtilities
3.1 ms SimpleTraits
4.4 ms ArnoldiMethod
1.0 ms Inflate
41.2 ms Graphs
0.6 ms VertexSafeGraphs
6.1 ms SparseDiffTools
255.3 ms NonlinearSolve
0.5 ms StatsAPI
5.8 ms Distances
2.9 ms NLsolve
0.9 ms SciMLNLSolve
1755.4 ms OrdinaryDiffEq
83.9 ms IntervalSets
0.8 ms ConstructionBase → ConstructionBaseIntervalSetsExt
2.0 ms CompositeTypes
224.1 ms DomainSets
0.5 ms Unityper
39.1 ms AbstractTrees
46.9 ms TimerOutputs
5.4 ms Combinatorics
565.9 ms MutableArithmetics
127.1 ms MultivariatePolynomials
35.1 ms DynamicPolynomials
2.0 ms Bijections
62.7 ms LabelledArrays
423.9 ms SymbolicUtils
0.4 ms TreeViews
41.3 ms RandomExtensions
12.7 ms GroupsCore
203.0 ms AbstractAlgebra
0.4 ms IntegerMathUtils
23.0 ms Primes
849.3 ms Groebner
0.7 ms SortingAlgorithms
20.8 ms Missings
31.8 ms StatsBase
48.6 ms PDMats
189.7 ms Rmath_jll 99.59% compilation time (100% recompilation)
1.1 ms Rmath
1.9 ms Calculus
33.3 ms DualNumbers
2.1 ms HypergeometricFunctions
5.5 ms StatsFuns
0.4 ms StatsFuns → StatsFunsChainRulesCoreExt
7.5 ms QuadGK
229.2 ms FillArrays
747.8 ms Distributions
1.2 ms Distributions → DistributionsChainRulesCoreExt
0.3 ms DiffEqBase → DiffEqBaseDistributionsExt
0.8 ms LaTeXStrings
1.1 ms Formatting
90.7 ms Latexify
10.1 ms LambertW
480.4 ms Symbolics
12.270529 seconds (12.59 M allocations: 775.836 MiB, 5.17% gc time, 3.68% compilation time: 42% of which was recompilation)
Here is the sorted list of the top 10 packages that take more than 300 ms to load.
1789.4 ms OrdinaryDiffEq
1578.9 ms ExponentialUtilities
872.6 ms Groebner
756.7 ms Distributions
687.8 ms StaticArrays
563.7 ms MutableArithmetics
485.4 ms Symbolics
427.1 ms SymbolicUtils
424.3 ms StrideArraysCore
312.6 ms VectorizationBase
These 10 packages account for about 7.9 of the 12.3 seconds of loading time (roughly 64%).
You might want to take a look under your ~/.julia/compiled/v1.9 directory to see how large the shared libraries (.so, .dll, .dylib) are for the respective packages. Scanning my .julia/compiled/v1.9 directory for those ten packages, the cache sizes correlate well with the load times above.
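If you want to do the same scan from within Julia, something like this works (a sketch; the package list is just an example, and the "v1.9" slug should match your Julia version):

# Print the size of each precompilation cache library for a few packages
for pkg in ("OrdinaryDiffEq", "Symbolics", "LoopVectorization")
    dir = joinpath(DEPOT_PATH[1], "compiled", "v1.9", pkg)
    isdir(dir) || continue
    for f in readdir(dir; join = true)
        any(endswith(f, ext) for ext in (".so", ".dylib", ".dll")) || continue
        println(rpad(basename(f), 50), round(filesize(f) / 2^20; digits = 1), " MiB")
    end
end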
As for your other idea of modifying the precompilation of your dependencies, you could also fork those packages, modify their top-level precompilation statements, and then provide your collaborators a Manifest.toml pointing at your forks.
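If you go that route, the Manifest entries can come straight from Pkg, e.g. (the URL and branch name here are placeholders):

using Pkg

# Point the project's Manifest.toml at a fork with a trimmed-down
# precompile workload (URL and branch are hypothetical)
Pkg.add(url = "https://github.com/yourfork/OrdinaryDiffEq.jl",
        rev = "trim-precompile")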
That’s a somewhat more fine-grained version of what I described as the “preferences method” that didn’t work for me. But I think I was actually putting the preferences in Startup, rather than in the project containing it.
Trying again, I now (correctly) turn off precompilation entirely for OrdinaryDiffEq, and in Startup I have a simple @compile_workload that’s just the first example from the OrdinaryDiffEq README (sketched below). It now takes ~6 s to load on both Mac and Linux for me:
> julia --project -e '@time using Startup; @time Startup.ode_example(); @time Startup.ode_example();'
6.081482 seconds (8.87 M allocations: 536.248 MiB, 5.27% gc time, 3.84% compilation time)
0.011447 seconds (17.48 k allocations: 1.132 MiB, 64.13% compilation time)
0.000092 seconds (97 allocations: 6.984 KiB)
That’s certainly a lot better, though still going to be a hurdle when I’m proselytizing for the church of Julia. (It’s also surprising that there’s still compilation to be done for the first run of Startup.ode_example even though that’s the @compile_workload. But it’s quick enough that I suppose I’m willing to move on.)
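For concreteness, the Startup package is little more than this (a sketch; ode_example wraps the README example):

module Startup

using PrecompileTools
using OrdinaryDiffEq

# First example from the OrdinaryDiffEq README: a scalar linear ODE
function ode_example()
    f(u, p, t) = 1.01 * u
    prob = ODEProblem(f, 1 / 2, (0.0, 1.0))
    solve(prob, Tsit5(); reltol = 1e-8, abstol = 1e-8)
end

# Run the example once at precompile time so its code is cached
@compile_workload begin
    ode_example()
end

end # module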
I just found a nice comparison with PackageCompiler in the README of PrecompileTools that would have told me that I can’t get all the way with just PrecompileTools:
- only PrecompileTools can be used by package developers to ensure a better out-of-box experience for your users
- only PrecompileTools allows you to update your packages without needing to rebuild Julia
- only PackageCompiler dramatically speeds up loading time (i.e., using ...) for all the packages
There are steps involved in loading Julia and packages that are still totally obscure to me; that last bullet point tells me that I don’t even have a high-level understanding of what’s happening or how I can control it.
EDIT: I now see that TTL was also discussed in this blog post, where this difference in capabilities was pointed out as well.
So, I suppose that’s about as well as I can do with precompilation alone. Here are my plans for moving forward. Some of my users will be new to Julia and will therefore be following my installation instructions carefully, including creating a new project just for my package, so I’ll suggest adding this preference to turn off precompilation of OrdinaryDiffEq (and Symbolics, if similar results follow). Basically all other users will be using my Julia code via one of my Python packages, and that Python package needs to run some Julia code on installation via JuliaCall, so my plan is to just set this preference as part of that Julia code, before precompilation happens. Maybe I’ll even be able to get really fancy and build a sysimage with JuliaCall…
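For the record, the preference in question is PrecompileTools’ per-package "precompile_workload" switch (distinct from the solver-specific OrdinaryDiffEq preferences mentioned above); check the PrecompileTools docs for the exact key. Setting it from code, e.g. in that JuliaCall install step, looks something like:

using Preferences, UUIDs

# Write precompile_workload = false for OrdinaryDiffEq into the active
# project's LocalPreferences.toml, before the package first precompiles
set_preferences!(UUID("1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"),
                 "precompile_workload" => false; force = true)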
Thanks for both of these very informative comments, @mkitti. I forgot about @time_imports, and I really appreciate your correlation of the times with the sizes. And your nice example and excellent results with PackageCompiler really inspire me to work something up to make this easy to do for my users. Especially in Python (via JuliaCall), I’m thinking this should be quite feasible.
I would argue that the buzz around TTFX in 1.9 made it appear that this would no longer be an issue. In particular, I’m not seeing anything like the results in the blog post, even when jumping through hoops.
Anyway, I think mkitti has given me a great path forward by pointing to MilesCranmer’s approach.
The time to load (TTL) of OrdinaryDiffEq there seems to be about 5–10 s, which is compatible with what we are seeing. And that is a huge improvement, but it doesn’t solve the issue in your case.
Well, on my old laptop I get (in performance mode):
5.9s load time for DifferentialEquations with Julia 1.9
16.3s load time with Julia 1.8.5
So this is an improvement by a factor of roughly 2.8…
My hardware:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core™ i7-10510U CPU @ 1.80GHz
It is known that load times on Windows are higher and also vary a lot, mainly because programs like virus scanners may be active there. When measuring load times I also avoid using VS Code, because the language server can negatively impact performance. I don’t know about macOS.
On Julia master the load times are again significantly lower, but they will never be as low as loading a library in Python. As far as I understand, one reason is that Julia allows inlining of code at global scope, which can give a performance boost of around a factor of two compared to C++ or Python, but that comes at the price of higher load times…
Can you provide details? Note that the blog post gives a link to code that should allow you to replicate the exact methodology used.
But in general terms, 1.9 does not really address TTL. We say that explicitly in the blog post. The only dramatic difference is for TTX, but wow is it dramatic (often ~100x for some previously long waits like Makie). It’s unlikely we’ll get a 100x gain in TTL, as loading compiled Julia code is much more complicated than anything Python has to do. But master already has some excellent gains and there are package ecosystem improvements that can make a substantial difference. I think the process is already underway with a lot of the SciML ecosystem.