Optimizing a package: how to get started

mbaz · September 9, 2020, 6:03pm

The improvements in Julia v1.5.1, tools like SnoopCompile.jl, and the recent posts by Tim Holy have motivated me to see if I can extract further performance from my plotting package Gaston. Being fairly new to this kind of optimization work in Julia, and since the tools are not always trivial to use, I’d like to ask those with more experience for advice on how to get started and where I am likely to find the greatest gains.

Currently I have these timings in Julia 1.5.1. I’m using Base.Experimental.@optlevel 1 and __precompile__(true).

$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.1 (2020-08-25)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time using Gaston
  2.533726 seconds (9.71 M allocations: 540.790 MiB, 9.06% gc time)

julia> @time plot(1:10)
  0.415968 seconds (151.52 k allocations: 7.849 MiB, 1.98% gc time)

julia> @time plot(1:10)
  0.044536 seconds (16.50 k allocations: 822.115 KiB)

julia> @time plot(1:10)
  0.000114 seconds (108 allocations: 7.953 KiB)

julia> @time plot(1:10)
  0.000142 seconds (108 allocations: 7.953 KiB)

julia> @time plot(1.1:1.1:9.9)
  0.155749 seconds (39.43 k allocations: 2.102 MiB)

julia> @time plot(1.1:1.1:9.9)
  0.000131 seconds (134 allocations: 15.688 KiB)

Some thoughts:

I’ve tried different values of optlevel and it doesn’t seem to make much difference.
The time to load the package seems high to me, considering that it has fewer than 1400 lines and the code is (I think) fairly straightforward. The number of allocations also seems too large.
The time and allocations required by the first plot tell me that the code was not precompiled in a useful way.
I don’t understand why the second plot still takes longer and requires many allocations, and why things only settle down at the third plot. (I’m super happy with the microsecond timings, though.)
Changing the arguments to floats triggers a recompilation, again with a large number of allocations.
In the float case, the timing settles down to ~100 μs with the second plot (instead of the third).
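
One way to read the timings above is that the first call with a new argument type pays for inference and code generation, while later calls with the same type reuse the compiled code. Here is a minimal sketch of that effect, using a hypothetical stand-in function `sumsq` rather than Gaston's `plot`. It only illustrates the per-type compilation cost and does not reproduce the slower second call seen above.

```julia
# Hypothetical stand-in for a package function; not part of Gaston.
sumsq(xs) = sum(x -> x^2, xs)

# First call with a UnitRange{Int}: includes compilation for this argument type.
t_int_first = @elapsed sumsq(1:10)
# Second call with the same type: reuses the compiled code.
t_int_second = @elapsed sumsq(1:10)

# A float range has a different type, so a new specialization is compiled.
t_float_first = @elapsed sumsq(1.1:1.1:9.9)
t_float_second = @elapsed sumsq(1.1:1.1:9.9)

println("Int range:   first = ", t_int_first, " s, second = ", t_int_second, " s")
println("Float range: first = ", t_float_first, " s, second = ", t_float_second, " s")
```

The exact numbers depend on the machine and Julia version, but the first call for each argument type would normally be the slow one.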

I’d appreciate any advice and pointers.

BioTurboNick · September 9, 2020, 6:13pm

I just did some optimization work on a simpler hashing library. The biggest issue was removing unnecessary array allocations. Tuples, @views, or StaticArrays.jl can help.
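
As a rough illustration of the allocation point, here is a minimal sketch comparing array slices with `@views`. The function names are hypothetical and only Base is used. `@allocated` reports the bytes allocated by one call.

```julia
# Hypothetical example functions; the names are illustrative only.
function sum_halves_copy(v)
    n = length(v) ÷ 2
    # Each slice `v[a:b]` allocates a new array.
    return sum(v[1:n]) + sum(v[n+1:end])
end

function sum_halves_view(v)
    n = length(v) ÷ 2
    # `@views` turns the slices into views of `v` instead of copies.
    return @views sum(v[1:n]) + sum(v[n+1:end])
end

v = collect(1.0:1000.0)

# Warm up both functions so that compilation is not counted.
sum_halves_copy(v); sum_halves_view(v)

bytes_copy = @allocated sum_halves_copy(v)
bytes_view = @allocated sum_halves_view(v)
println("copy: ", bytes_copy, " bytes, view: ", bytes_view, " bytes")
```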

Also use BenchmarkTools.@btime instead of @time: it runs the code many times and reports the minimum time, which excludes compilation time.

EDIT: Oh, I see you actually want to work on startup time. Never mind.

mbaz · September 9, 2020, 6:25pm

Yeah I was just about to mention that. But thanks!

Palli · September 9, 2020, 8:39pm

The first thing to realize is the root cause: for the most part it is not your package, at least for the “using” part. So this will not be effective for it (and, as of Julia 1.6, I think not for the dependencies either):

Most of your startup time is taken up by:

julia> @time using ColorSchemes
  1.548290 seconds (2.57 M allocations: 186.804 MiB)

and, of that:

julia> @time using Colors
  0.506410 seconds (494.73 k allocations: 36.814 MiB)

Maybe you can lazy-load the former (the one you use directly). But that only shifts the startup cost to another place, unless the dependency is often not used at all.

I would look into this:

julia> using SnoopCompileCore

julia> invalidations = @snoopr begin
         using ColorSchemes
       end
372-element Vector{Any}:

That is exactly the same number you would get if you tried this for your package.

mbaz · September 9, 2020, 9:12pm

Whoa – those are interesting numbers. I need to look into that. Thanks a lot!

tim.holy · September 10, 2020, 11:39am

More helpful tips:

Don’t pay attention to the length of invalidations directly. Use length(uinvalidated(invalidations)) instead; you have to load the full SnoopCompile package to get uinvalidated.
Use precompilation. Below is a demo.

(Note that juliamns is my alias for julia-master --startup-file=no, since I don’t want any other packages to cloud this analysis. You should be able to do everything except the invalidation analysis above on 1.5, however.)

tim@diva:~/.julia/dev/Gaston$ juliamns -q
julia> @time (using Gaston; display(plot(1:10)))
  2.142460 seconds (6.28 M allocations: 399.821 MiB, 4.51% gc time)

julia> 
tim@diva:~/.julia/dev/Gaston$ juliamns -q
julia> using SnoopCompileCore

julia> @snoopi tmin=0.01 begin
           using Gaston
           display(plot(1:10))
       end
2-element Vector{Tuple{Float64, Core.MethodInstance}}:
 (0.08780097961425781, MethodInstance for display(::Gaston.Figure))
 (0.4707939624786377, MethodInstance for plot(::UnitRange{Int64}))

The first column is the inference time in seconds; the second is the MethodInstance being inferred. This implies you spend nearly 0.5 s inferring plot(::UnitRange{Int}). So add this:

diff --git a/src/Gaston.jl b/src/Gaston.jl
index 2731d24..c7ccb7b 100644
--- a/src/Gaston.jl
+++ b/src/Gaston.jl
@@ -91,4 +91,7 @@ function __init__()
     return nothing
 end
 
+@assert precompile(plot, (UnitRange{Int},))
+@assert precompile(display, (Figure,))
+
 end

and then try again:

tim@diva:~/.julia/dev/Gaston$ juliamns -q
julia> @time (using Gaston; display(plot(1:10)))
  1.651489 seconds (3.32 M allocations: 230.272 MiB, 2.84% gc time)

julia>
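
As a quick sanity check on these numbers, the time saved can be compared with the inference time that @snoopi reported for the two MethodInstances. This is plain arithmetic on the values from the transcripts above, written out as a small sketch:

```julia
# Timings copied from the transcripts above (seconds).
t_before = 2.142460   # using Gaston + first plot, without the precompile directives
t_after  = 1.651489   # same measurement, with the two precompile directives added

# Inference times reported by @snoopi for the two MethodInstances.
t_infer_plot    = 0.4707939624786377
t_infer_display = 0.08780097961425781

saved    = t_before - t_after
inferred = t_infer_plot + t_infer_display

println("saved:     ", round(saved, digits=3), " s")
println("inferred:  ", round(inferred, digits=3), " s")
println("reduction: ", round(100 * saved / t_before, digits=1), " %")
```

Roughly 0.49 s was saved against about 0.56 s of measured inference, so the two directives recovered most, but not all, of that inference cost.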

If it doesn’t work, double-check whether it actually “took”:

tim@diva:~/.julia/dev/Gaston$ juliamns -q
julia> using SnoopCompileCore

julia> @snoopi tmin=0.01 begin
           using Gaston
           display(plot(1:10))
       end
4-element Vector{Tuple{Float64, Core.MethodInstance}}:
 (0.01037907600402832, MethodInstance for #plot#17(::Nothing, ::Int64, ::Nothing, ::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(plot), ::UnitRange{Int64}, ::UnitRange{Int64}, ::Nothing, ::Axes))
 (0.010785102844238281, MethodInstance for Vector{Gaston.Curve}(::Vector{Gaston.Curve{UnitRange{Int64}, UnitRange{Int64}, Nothing, Nothing}}))
 (0.016611099243164062, MethodInstance for write_data(::Gaston.Curve{UnitRange{Int64}, UnitRange{Int64}, Nothing, Nothing}, ::Int64, ::String))
 (0.044039011001586914, MethodInstance for Vector{Union{Nothing, Gaston.Plot}}(::Vector{Gaston.Plot}))

All this is documented in SnoopCompile. Note that you might want to precompile for more than just UnitRange{Int}, of course.
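
Precompiling for several argument types could look like the following minimal sketch. The module `PrecompileSketch`, the function `toy_plot`, and the list of argument types are all hypothetical stand-ins, not Gaston code. In a real package the loop would sit at the end of the package module, like the directives in the diff above, so that the results are cached when the package is precompiled.

```julia
# Hypothetical toy module standing in for a plotting package.
module PrecompileSketch

# Illustrative function, not Gaston's real `plot`.
toy_plot(x::AbstractVector{<:Real}) = (length(x), Float64(sum(x)))

# Argument types that users are assumed to pass most often.
const COMMON_ARG_TYPES = (
    UnitRange{Int},          # e.g. 1:10
    typeof(1.1:1.1:9.9),     # float range
    Vector{Float64},         # plain array
)

for T in COMMON_ARG_TYPES
    # `precompile` returns false if it could not compile the signature;
    # the assertion makes such a failure visible.
    @assert precompile(toy_plot, (T,))
end

end # module

println(PrecompileSketch.toy_plot(1:10))
```

Using `typeof(1.1:1.1:9.9)` avoids spelling out the full range type by hand.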

EDIT: ooh, the plot thickens! See https://github.com/JuliaLang/julia/issues/37509. Thanks for asking this question, I had never traced this issue down though I’ve definitely noticed some weird things about precompilation. EDIT2: nvm, though it’s a funny read!

mbaz · September 10, 2020, 1:15pm

Thank you, Tim! This tells me that there are worthwhile gains to be obtained, and where to focus my attention. I know most of what you describe is in your blog posts, but I was looking at several different performance numbers and wasn’t sure where to begin.

Raf · September 10, 2020, 1:32pm

Also see:
https://github.com/JuliaGraphics/ColorSchemes.jl/issues/42
