Any way to speed up loading large precompiled packages?

I have a package that my collaborators want to use like a script. Specifically, they want it to load quickly. Once it’s compiled, my part of the code is fast — around 0.01s, compared to the several seconds taken by my equivalent Python package. But loading the code still takes almost 30 seconds, even with all the cool precompilation stuff in Julia 1.9 and PrecompileTools. (Compare this to the 0.15 seconds to load the Python code, and Julia comes out behind.)

The problem is that there are a few packages that my package depends on that just take a really long time to load. I don’t mean to criticize these packages because they are awesome in so many respects, and have so many things going on that I’m sure I couldn’t do any better. But to be specific, they are OrdinaryDiffEq and Symbolics. Even in an otherwise empty project, leaving my own package out of it entirely, and using all the tricks I found in PrecompileTools about making a “Startup” package and “healing” invalidations, it still takes quite a while just to run

using OrdinaryDiffEq

and/or

using Symbolics

(On my 7yo macbook pro, it’s about 22 seconds and 6 seconds, respectively.) There’s not a whole lot of (re)compilation being reported by @time using ..., but the load time is undeniable.

One issue seems to be that both of those packages have lots of capabilities I don’t actually use in my package, many of which appear to be included in their precompilation workloads (leading to huge compilation caches, I think…). Would it be possible and/or useful for me to turn off precompilation of those packages, and rely on my own precompilation workload to cover whatever I actually use? I’ve tried something like this with the preferences method suggested in PrecompileTools, but it didn’t seem to stop the precompilation from happening. In particular, it didn’t speed up the loading, though I can imagine that’s just because I did something wrong. Is this an avenue I should pursue?
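
(For concreteness, what I tried was roughly the following, run once from the project that depends on OrdinaryDiffEq. The precompile_workload key is the one I understand PrecompileTools to use for disabling a package’s workload from a downstream project; that detail is an assumption worth checking against its docs.)

# Run from the environment that has OrdinaryDiffEq as a dependency.
# Writes an [OrdinaryDiffEq] entry into LocalPreferences.toml next to the
# active Project.toml; the package then re-precompiles without its workload.
using Preferences
using OrdinaryDiffEq   # passing the module lets Preferences resolve its UUID

set_preferences!(OrdinaryDiffEq, "precompile_workload" => false; force=true)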

Is there any other option? I had claimed that Julia 1.9 was going to make things so much better for my collaborators (who are all Python users I’ve been trying to convert), but that seems to be falling flat.


Depending on the use case, DaemonMode.jl (https://github.com/dmolina/DaemonMode.jl, a client-daemon workflow to run scripts faster in Julia) can be useful.
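
Roughly, going from its README (check there for the exact invocations), the workflow is to start a daemon once and then send scripts to it:

# start the daemon once; it loads the heavy packages and keeps them warm
julia --startup-file=no -e 'using DaemonMode; serve()' &

# run scripts against the daemon; each run then starts almost instantly
julia --startup-file=no -e 'using DaemonMode; runargs()' myscript.jl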

(Just to note: on my laptop those two using statements take ~5 and ~3 seconds in a fresh environment, after the long one-time installation/precompilation.)

Thanks for the tip about DaemonMode. It looks like something that could come in handy for me. Unfortunately, I don’t think my colleagues will ever use my code if they have to do that.

Also, I should add that one of my colleagues has a newer M1 Mac, and reports that it takes him almost 30 seconds to load as well. But I’ve now tested it on a (linux) cluster, and both OrdinaryDiffEq and Symbolics only take ~5 seconds. Still too long, but an interesting data point.


As a side note, using OrdinaryDiffEq on my desktop (Ryzen 7850X) with Julia master takes 1.5 seconds… That is a factor of 1.6 faster than with Julia 1.9.

PrecompileTools is not automatic and many packages have not adopted it yet. It is not clear to me whether native caching here is slowing you down or whether it is not being used extensively enough.

If this does not work, you can also fallback to the older solution:
https://julialang.github.io/PackageCompiler.jl/stable/

Have you read this post? You may not need to load all of OrdinaryDiffEq.

PrecompileTools is not automatic and many packages have not adopted it yet. It is not clear to me whether native caching here is slowing you down or whether it is not being used extensively enough.

Looks like OrdinaryDiffEq makes extensive use of precompilation, and there is a lot of it. But that chunk of code also shows that it uses preferences to control what gets precompiled. So maybe I just need to try harder to make it not precompile, and precompile only my own stuff.

If this does not work, you can also fallback to the older solution:
https://julialang.github.io/PackageCompiler.jl/stable/

Again, I feel like that’s asking more of my prospective users than they would likely stomach.

Have you read this post? You may not need to load all of OrdinaryDiffEq.

My package does actually use a solver from OrdinaryDiffEq, so I think I really need it. Thanks, though.

You probably only need one solver, right? The preferences allow you to turn precompilation on or off for specific solvers; turn it off for the ones you don’t use. :blush:

https://docs.sciml.ai/DiffEqDocs/stable/features/low_dep/#Controlling-Function-Specialization-and-Precompilation
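
For example, something along these lines, run once in your project. This is illustrative only: the exact preference names are listed on that page, and the ones below are from memory, so verify them there.

# Sketch: keep only the precompile workloads you need.
# Preference names here are placeholders; check the linked SciML docs for the real ones.
using Preferences
using OrdinaryDiffEq

set_preferences!(OrdinaryDiffEq,
    "PrecompileNonStiff" => true,    # keep the workload for the solver family you use
    "PrecompileStiff" => false,      # skip the heavier stiff-solver workloads
    "PrecompileAutoSwitch" => false;
    force=true)
# OrdinaryDiffEq re-precompiles once with the reduced workload on next use.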

For end user applications, I usually provide a script that performs the package compiler step as part of “installation”.
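
A minimal sketch of such a script (the names install.jl, precompile_app.jl, and app_sysimage.so are placeholders, and PackageCompiler is assumed to be installed, e.g. in the default environment):

# install.jl: one-time setup step for end users, run as
#   julia --project=. install.jl
using Pkg
Pkg.instantiate()    # install the project's dependencies

using PackageCompiler
create_sysimage(["OrdinaryDiffEq", "Symbolics"];
    sysimage_path="app_sysimage.so",
    precompile_execution_file="precompile_app.jl")   # script exercising the hot paths

# afterwards users launch with:  julia --project=. -J app_sysimage.so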

I did some analysis of the loading time for these two packages for you.

julia> @time @time_imports using OrdinaryDiffEq, Symbolics
      1.5 ms  DocStringExtensions
      0.3 ms  Reexport
      0.1 ms  SuiteSparse
      0.2 ms  Requires
      2.6 ms  ArrayInterface
      3.5 ms  StaticArraysCore
      0.4 ms  ArrayInterface → ArrayInterfaceStaticArraysCoreExt
     17.9 ms  FunctionWrappers
      0.4 ms  MuladdMacro
      9.4 ms  OrderedCollections
      0.3 ms  UnPack
      0.4 ms  Parameters
      0.9 ms  Statistics
      0.1 ms  IfElse
     33.1 ms  Static
      0.3 ms  Compat
      0.2 ms  Compat → CompatLinearAlgebraExt
     16.8 ms  Preferences
      0.3 ms  SnoopPrecompile
      6.8 ms  StaticArrayInterface
      1.4 ms  ManualMemory
     49.0 ms  ThreadingUtilities
      0.3 ms  SIMDTypes
      5.1 ms  LayoutPointers
      5.1 ms  CloseOpenIntervals
    441.1 ms  StrideArraysCore
      0.3 ms  BitTwiddlingConvenienceFunctions
      0.9 ms  CpuId
    153.6 ms  CPUSummary 95.26% compilation time
      7.2 ms  PolyesterWeave
      0.8 ms  Polyester
      0.4 ms  FastBroadcast
     34.9 ms  ChainRulesCore
      0.3 ms  PrecompileTools
     46.6 ms  RecipesBase
      0.8 ms  SymbolicIndexingInterface
      0.3 ms  Adapt
      0.1 ms  DataValueInterfaces
      1.0 ms  DataAPI
      0.2 ms  IteratorInterfaceExtensions
      0.2 ms  TableTraits
     41.5 ms  Tables
      2.3 ms  GPUArraysCore
      0.2 ms  ArrayInterface → ArrayInterfaceGPUArraysCoreExt
     18.5 ms  RecursiveArrayTools
     12.2 ms  MacroTools
      0.3 ms  TruncatedStacktraces
      0.8 ms  ZygoteRules
      1.2 ms  ConstructionBase
     21.0 ms  Setfield
      7.2 ms  IrrationalConstants
      1.1 ms  DiffRules
      4.1 ms  DiffResults
      0.2 ms  OpenLibm_jll
      0.3 ms  NaNMath
      0.4 ms  LogExpFunctions
      0.5 ms  LogExpFunctions → LogExpFunctionsChainRulesCoreExt
      0.3 ms  JLLWrappers
      5.2 ms  OpenSpecFun_jll 87.29% compilation time
     14.4 ms  SpecialFunctions
      1.0 ms  SpecialFunctions → SpecialFunctionsChainRulesCoreExt
      0.3 ms  CommonSubexpressions
     82.1 ms  ForwardDiff
      0.5 ms  EnumX
      1.2 ms  PreallocationTools
      0.5 ms  FunctionWrappersWrappers
      0.2 ms  CommonSolve
      0.3 ms  ExprTools
      1.0 ms  RuntimeGeneratedFunctions
      0.3 ms  Tricks
     12.9 ms  Lazy
     18.6 ms  SciMLOperators
    190.4 ms  SciMLBase
      8.6 ms  DiffEqBase
      0.3 ms  FastClosures
      3.2 ms  ArrayInterfaceCore
     39.0 ms  HostCPUFeatures
    280.0 ms  VectorizationBase
      4.8 ms  SLEEFPirates
     61.1 ms  OffsetArrays
      1.3 ms  StaticArrayInterface → StaticArrayInterfaceOffsetArraysExt
    224.1 ms  LoopVectorization
      0.3 ms  LoopVectorization → SpecialFunctionsExt
      5.3 ms  LoopVectorization → ForwardDiffExt
      1.4 ms  TriangularSolve
    235.5 ms  RecursiveFactorization
     15.2 ms  IterativeSolvers
     34.5 ms  KLU
      3.7 ms  Sparspak
      6.7 ms  FastLapackInterface
     22.4 ms  Krylov
     52.5 ms  KrylovKit
    270.0 ms  LinearSolve
    696.5 ms  StaticArrays
      5.7 ms  StaticArrayInterface → StaticArrayInterfaceStaticArraysExt
      0.2 ms  Adapt → AdaptStaticArraysExt
      2.4 ms  ConstructionBase → ConstructionBaseStaticArraysExt
      0.7 ms  ForwardDiff → ForwardDiffStaticArraysExt
      5.8 ms  FiniteDiff
    104.8 ms  SimpleNonlinearSolve
      8.2 ms  NLSolversBase
      7.3 ms  LineSearches
      0.2 ms  SimpleUnPack
     93.8 ms  DataStructures
     38.0 ms  GenericSchur
   1544.0 ms  ExponentialUtilities
      3.1 ms  SimpleTraits
      4.4 ms  ArnoldiMethod
      1.0 ms  Inflate
     41.2 ms  Graphs
      0.6 ms  VertexSafeGraphs
      6.1 ms  SparseDiffTools
    255.3 ms  NonlinearSolve
      0.5 ms  StatsAPI
      5.8 ms  Distances
      2.9 ms  NLsolve
      0.9 ms  SciMLNLSolve
   1755.4 ms  OrdinaryDiffEq
     83.9 ms  IntervalSets
      0.8 ms  ConstructionBase → ConstructionBaseIntervalSetsExt
      2.0 ms  CompositeTypes
    224.1 ms  DomainSets
      0.5 ms  Unityper
     39.1 ms  AbstractTrees
     46.9 ms  TimerOutputs
      5.4 ms  Combinatorics
    565.9 ms  MutableArithmetics
    127.1 ms  MultivariatePolynomials
     35.1 ms  DynamicPolynomials
      2.0 ms  Bijections
     62.7 ms  LabelledArrays
    423.9 ms  SymbolicUtils
      0.4 ms  TreeViews
     41.3 ms  RandomExtensions
     12.7 ms  GroupsCore
    203.0 ms  AbstractAlgebra
      0.4 ms  IntegerMathUtils
     23.0 ms  Primes
    849.3 ms  Groebner
      0.7 ms  SortingAlgorithms
     20.8 ms  Missings
     31.8 ms  StatsBase
     48.6 ms  PDMats
    189.7 ms  Rmath_jll 99.59% compilation time (100% recompilation)
      1.1 ms  Rmath
      1.9 ms  Calculus
     33.3 ms  DualNumbers
      2.1 ms  HypergeometricFunctions
      5.5 ms  StatsFuns
      0.4 ms  StatsFuns → StatsFunsChainRulesCoreExt
      7.5 ms  QuadGK
    229.2 ms  FillArrays
    747.8 ms  Distributions
      1.2 ms  Distributions → DistributionsChainRulesCoreExt
      0.3 ms  DiffEqBase → DiffEqBaseDistributionsExt
      0.8 ms  LaTeXStrings
      1.1 ms  Formatting
     90.7 ms  Latexify
     10.1 ms  LambertW
    480.4 ms  Symbolics
 12.270529 seconds (12.59 M allocations: 775.836 MiB, 5.17% gc time, 3.68% compilation time: 42% of which was recompilation)

Here is the sorted list of the top 10 packages that take more than 300 ms to load.

   1789.4 ms  OrdinaryDiffEq
   1578.9 ms  ExponentialUtilities
    872.6 ms  Groebner
    756.7 ms  Distributions
    687.8 ms  StaticArrays
    563.7 ms  MutableArithmetics
    485.4 ms  Symbolics
    427.1 ms  SymbolicUtils
    424.3 ms  StrideArraysCore
    312.6 ms  VectorizationBase

These ten packages account for roughly 7.9 seconds of the 12.3-second load time (about 64%).

You might want to take a look under your ~/.julia/compiled/v1.9 directory to see how large the shared libraries are for the respective packages (.so, .dll, .dylib). Scanning my .julia/compiled/v1.9 directory for those ten packages, I get the following.

$ du -hcs OrdinaryDiffEq ExponentialUtilities Groebner Distributions StaticArrays MutableArithmetics Symbolics SymbolicUtils StrideArraysCore VectorizationBase
151M	OrdinaryDiffEq
16M	ExponentialUtilities
11M	Groebner
16M	Distributions
75M	StaticArrays
5.7M	MutableArithmetics
14M	Symbolics
15M	SymbolicUtils
600K	StrideArraysCore
19M	VectorizationBase
320M	total

As for your other idea of modifying the precompilation of your dependencies, you could also fork those packages, modify their top-level precompilation statements, and then provide your collaborators a Manifest.toml pointing at your forks.


I used PackageCompiler to package the following environment.

(@ode_sym_test) pkg> st
Status `~/.julia/environments/ode_sym_test/Project.toml`
  [1dea7af3] OrdinaryDiffEq v6.51.1
  [0c5d862f] Symbolics v5.3.1

To compile, I used the following commands.

julia> using PackageCompiler

julia> PackageCompiler.create_sysimage(; sysimage_path="ode_sym_test.so", precompile_execution_file="ode_symbolics.jl")
Precompiling environment...
  164 dependencies successfully precompiled in 267 seconds
[ Info: PackageCompiler: Executing /home/mkitti/test/ode_symbolics.jl => /tmp/jl_packagecompiler_R8pH88/jl_rU7HrL
[ Info: PackageCompiler: Done
⡆ [09m:46s] PackageCompiler: compiling incremental system image

Using the system image, loading the two packages now takes 26 ms.

$ julia -J ode_sym_test.so 

(@v1.9) pkg> activate @ode_sym_test
  Activating project at `~/.julia/environments/ode_sym_test`

julia> @time @time_imports using OrdinaryDiffEq, Symbolics
  0.026072 seconds (13.81 k allocations: 911.373 KiB, 96.39% gc time, 65.00% compilation time)

That’s a somewhat more fine-grained version of what I described as the “preferences method” that didn’t work for me. But I think I was actually putting the preferences in the Startup package rather than in the project that loads it.

Trying again, I now (correctly) turn off precompilation entirely for OrdinaryDiffEq, and in Startup I have a simple @compile_workload that is just the first example from the OrdinaryDiffEq README. It now takes ~6 s to load on both Mac and Linux for me (a sketch of the Startup module follows the timings below):

> julia --project -e '@time using Startup; @time Startup.ode_example(); @time Startup.ode_example();'
  6.081482 seconds (8.87 M allocations: 536.248 MiB, 5.27% gc time, 3.84% compilation time)
  0.011447 seconds (17.48 k allocations: 1.132 MiB, 64.13% compilation time)
  0.000092 seconds (97 allocations: 6.984 KiB)
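
(For reference, the Startup package is essentially just the following minimal sketch; the workload is roughly the first example from the OrdinaryDiffEq README.)

module Startup

using OrdinaryDiffEq
using PrecompileTools

# A representative workload: solve a trivial scalar ODE with Tsit5.
function ode_example()
    f(u, p, t) = 1.01 * u
    prob = ODEProblem(f, 1 / 2, (0.0, 1.0))
    return solve(prob, Tsit5(); reltol=1e-8, abstol=1e-8)
end

# Run the workload at precompile time so the call paths land in Startup's cache.
@compile_workload begin
    ode_example()
end

end # module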

That’s certainly a lot better, though still going to be a hurdle when I’m proselytizing for the church of Julia. (It’s also surprising that there’s still compilation to be done for the first run of Startup.ode_example even though that’s the @compile_workload. But it’s quick enough that I suppose I’m willing to move on.)


I just found a nice comparison with PackageCompiler in the README of PrecompileTools that would have told me that I can’t get all the way with just PrecompileTools:

  • only PrecompileTools can be used by package developers to ensure a better out-of-box experience for your users
  • only PrecompileTools allows you to update your packages without needing to rebuild Julia
  • only PackageCompiler dramatically speeds up loading time (i.e., using ...) for all the packages

There are steps involved in loading Julia and packages that are still totally obscure to me; that last bullet point tells me that I don’t even have a high-level understanding of what’s happening or how I can control it.

EDIT: I now see that TTL was also discussed in this blog post, which points out the same difference in capabilities.


So, I suppose that’s about as well as I can do with precompilation alone. Here are my plans for moving forward. Some of my users will be new to Julia and will follow my installation instructions carefully, including creating a new project just for my package, so I’ll suggest adding this preference to turn off precompilation of OrdinaryDiffEq (and Symbolics, if similar results follow). Essentially all other users will be using my Julia code via one of my Python packages, and that Python package already needs to run some Julia code on installation via JuliaCall, so my plan is to set this preference in that Julia code before precompilation. Maybe I’ll even get really fancy and build a sysimage with JuliaCall…

Thanks for both of these very informative comments, @mkitti. I forgot about @time_imports, and I really appreciate your correlation of the times with the sizes. And your nice example and excellent results with PackageCompiler really inspire me to work something up to make this easy to do for my users. Especially in Python (via JuliaCall), I’m thinking this should be quite feasible.

@MilesCranmer may have some advice on this.

See


Honestly, scripts that only take a few seconds to run are not Julia’s selling point, and trying to make it appear that this is not an issue can backfire.

Hopefully a workload that takes minutes or hours will come along for your collaborators soon :sweat_smile:.

I would argue that the buzz around TTFX in 1.9 made it appear that this would no longer be an issue. In particular, I’m not seeing anything like the results in the blog post, even when jumping through hoops.

Anyway, I think mkitti has given me a great path forward by pointing to MilesCranmer’s approach.

The time to load (TTL) of OrdinaryDiffEq there seems to be about 5-10 s, which is consistent with what we are seeing. That is a huge improvement, but it doesn’t solve the issue in your case.

Well, on my old Laptop I get (in performance mode):

  • 5.9s load time for DifferentialEquations with Julia 1.9
  • 16.3s load time with Julia 1.8.5

So this is an improvement by a factor of 2.7…

My hardware:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core™ i7-10510U CPU @ 1.80GHz

It is known that load times on Windows are higher and vary a lot, mainly because programs like a virus scanner may be active there. When measuring load times I also avoid using VSCode, because the language server can negatively impact performance. I don’t know about macOS.

On Julia master the load times are again significantly lower, but they will never be as low as loading a library in Python. As far as I understand, one reason is that Julia allows inlining of code at global scope, which can give you a performance boost of a factor of two compared to C++ or Python, but that comes at the price of higher load times…

Can you provide details? Note that the blog post gives a link to code that should allow you to replicate the exact methodology used.

But in general terms, 1.9 does not really address TTL. We say that explicitly in the blog post. The only dramatic difference is for TTX, but wow is it dramatic (often ~100x for some previously long waits like Makie). It’s unlikely we’ll get a 100x gain in TTL, as loading compiled Julia code is much more complicated than anything Python has to do. But master already has some excellent gains and there are package ecosystem improvements that can make a substantial difference. I think the process is already underway with a lot of the SciML ecosystem.