Taking TTFX seriously: Can we make common packages faster to load and use

Yeah, it’s totally a time tradeoff, and I hear that this seems like arcane knowledge; maybe it is. Maybe Julia packages are just doomed to be slow until we have first-class glue packages and a compile cache.

In some cases I do think it’s worth seriously trying once to work out what the problems are, so you can generally get it right afterwards. After doing this for 10 packages or so there are patterns… it’s nearly always Requires.jl and type instability. That was my approach to Interact.jl on the weekend, and it has 5x faster TTFX from just that. Blink.jl is harder, because the design problem is harder to solve. But it is generally possible to do in a systematic way.

You don’t have to tune random functions, only those showing time in the profile. That said, fixing unstable struct fields, for example, can improve things a little everywhere they are used.
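
As a rough illustration of what an "unstable struct field" means, here is a minimal sketch. The type and function names (`LooseParams`, `TightParams`, `total`) are hypothetical and not taken from any package mentioned in this thread.

```julia
# Hypothetical types, for illustration only.
struct LooseParams
    scale::Real     # abstract field type: the concrete type is unknown to the compiler
    data::Vector    # `Vector` without an element type is abstract as well
end

# Parametric version: every field is concrete once T is known.
struct TightParams{T<:Real}
    scale::T
    data::Vector{T}
end

total(p) = p.scale * sum(p.data)

# Field types can be checked directly.
isconcretetype(fieldtype(LooseParams, :scale))           # false
isconcretetype(fieldtype(TightParams{Float64}, :scale))  # true

# Inference cannot pin down the result for the loose struct, but it can for the tight one.
Base.return_types(total, (LooseParams,))
Base.return_types(total, (TightParams{Float64},))
```

Every function that reads a field of the loose struct inherits the uncertainty, which is why fixing the struct tends to help in many places at once.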

3 Likes

Out of curiosity, there is this workshop from JuliaCon 2021.

So far, I haven’t found the motivation to work through this. Are those tips & tricks explained there? Is it worth it to watch and to understand the difficulties going on?

Edit: oops, there also seems to be a shorter/different one: https://youtu.be/rVBgrWYKLHY

1 Like

@antoine-levitt I didn’t look at your code, so I’m sorry if this doesn’t help. Do you happen to use StaticArrays.jl quite a bit? We experienced significant compilation latency for not-very-small static arrays, see https://github.com/trixi-framework/Trixi.jl/issues/516. We could fix that without decreasing runtime efficiency by switching to plain arrays in our case (see the PR linked in the issue above).

Thanks, that’s helpful! We only use static arrays for points and matrices in R^3, and the part of the code that takes a large TTFX is not the one that uses them, so I don’t think that’s the issue.

Maybe we can turn this discussion into a crowdsourced list of tips, and put that in a pinned thread or something? Here’s what I’ve gotten from this discussion, feel free to correct or add to it.

Don’t use large StaticArrays

https://github.com/trixi-framework/Trixi.jl/issues/516

Don’t use Requires

Enclosed code cannot be precompiled.

Careful about type instabilities

(this one is not clear to me)

Trigger precompilation in your package module

Call functions that don’t have a side effect. Use precompile on those that do. Info on the internet about __precompile__ is outdated, don’t use it.

Avoid invalidations

Profile your first run

Sometimes you get useful information out of it. SnoopCompile also gives a potentially more helpful view.

Use @nospecialize to disable specialization on a particular argument

“use it when a function is called multiple times with different argument types, compiling that function is expensive but specialising on argument types does not have a significant impact on run-time performance” (from the thread “When should we use `@nospecialize`?”)
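
For the last tip, a minimal sketch of what `@nospecialize` looks like in practice. The function `describe` is a hypothetical helper invented for this example; the assumption is that it gets called with many different argument types and that its run time does not matter.

```julia
# Hypothetical helper that is called with many different argument types.
# `@nospecialize` asks the compiler not to generate a separate native
# specialization of this method for each concrete type of `x`.
function describe(@nospecialize(x))
    return string("value of type ", typeof(x))
end

describe(1)
describe("a")
describe([1.0, 2.0])
```

Whether this pays off depends on the function: it trades some run-time performance for less compilation, so it is best checked against a profile of the first call.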

16 Likes

That sounds good to me :+1:

That’s a good list. Type instability causes long load times (in my understanding) largely because:

a. The code for handling boxed variables is much more complicated than for e.g. a known Int, so it takes longer to compile.
b. The Julia compiler tries really hard to resolve types. If your function is simple and type stable, this is very fast. But we can see that the towers of type inference calls in abstractinterpretation.jl often take most of the load time, and these often disappear with simple type-stable code. This is probably slightly wrong, but someone who works on the compiler can maybe explain this more…
c. The compiler is more likely to precompile the right methods if everything is type stable, instead of random things you don’t actually use: see “Why isn't `size` always inferred to be an Integer?”, post #22 by mbauman.

But the fact that removing type instability reduces load time isn’t controversial at all, although it is definitely misunderstood… every PR linked in this thread does that.
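
For anyone unsure what a type instability looks like, here is a small sketch. The two functions are made up for this example; `Base.return_types` and `@code_warntype` are the standard tools for inspecting what the compiler inferred.

```julia
using InteractiveUtils  # provides @code_warntype (loaded automatically in the REPL)

# Unstable: returns an Int on one branch and a Float64 on the other,
# so the return type depends on the value of `x`, not only on its type.
unstable_half(x::Int) = iseven(x) ? x ÷ 2 : x / 2

# Stable: always returns a Float64 for an Int argument.
stable_half(x::Int) = x / 2

# The inferred return type is a Union for the unstable version
# and a single concrete type for the stable one.
Base.return_types(unstable_half, (Int,))
Base.return_types(stable_half, (Int,))

# Prints the typed code; non-concrete types are highlighted in the REPL.
@code_warntype unstable_half(3)
```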

3 Likes

Good to know, but it’s entirely unclear to me that it needs to be that way. The good thing is that you often want to fix type instabilities anyway, since your code will be faster and allocate less, which means less (or no) GC pressure. It’s just unclear why such code can’t be precompiled. E.g. code in C with malloc can be compiled… I guess it’s just a matter of priorities for the compiler people, since you want to get rid of this inefficiency anyway. And I guess you mean it just breaks precompilation of that specific code (if you’re actually correct), not the whole module?!

Right, compilation takes longer (but it need not; if I understood Elrod correctly, the package could be improved). Do you mean the precompilation time only, or does it also affect using (since precompilation isn’t complete, some of the compilation is deferred)?

I’m talking about the reality of how it works now. By “breaks” precompilation I mean precompiling unstable functions seems to give less or no reduction in TTFX compared to precompiling stable functions.

The basic reason for this is that with type stable code, the compiler can trace all the functions your top level code depends on and precompile all of them, but if the compiler has to trace unstable functions, it loses the ability to know what methodInstances get called which means it can’t precompile the recursive dependencies.
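
For reference, the two ways of triggering precompilation mentioned in this thread look roughly like the sketch below. `TinyPkg` and `area` are hypothetical names; in a real package the module would live in `src/TinyPkg.jl` and these lines would run while the package is being precompiled.

```julia
module TinyPkg  # hypothetical package module

export area

area(r::Real) = pi * r^2

# Option 1: a precompile statement. Nothing is executed; the compiler
# is asked to compile `area` for a Float64 argument.
precompile(area, (Float64,))

# Option 2: call the function at the top level of the module.
# Only do this for code without side effects.
area(1.0)

end # module
```

In both cases the compiler has to work out which methods `area` calls for the given argument types, which is where type-stable code helps.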

6 Likes

The basic reason for this is that with type stable code, the compiler can trace all the functions your top level code depends on and precompile all of them, but if the compiler has to trace unstable functions, it loses the ability to know what methodInstances get called which means it can’t precompile the recursive dependencies.

I get that it applies to precompile statements, but does it also apply to function calls?

1 Like

My only experience in developing packages is a little toy one I created just to learn more about the process; I have not yet found the need to develop my own “serious” package. So, in a sense, I feel like this thread is directed towards people like myself who will need a good deal of help to avoid common mistakes/programming patterns that increase the TTFX. Rightly or wrongly, reading the comments gives the impression that there is a very high bar to clear.

Others (here and here) have come up with good “precompilation checklists” to go through for reducing start time. But now consider a “knowledge checklist” one may need in order to implement those suggestions (and the others above):

  • how to interpret the built-in profiler,
  • how to recognize type instabilities and understand @code_warntype,
  • how to read a ProfileView.jl flamegraph,
  • how to use Cthulhu.jl,
  • what causes an “invalidation”,
  • how to use SnoopCompile.jl,
  • how to “properly” use Requires.jl,
  • what makes code precompile-able in the first place,

and there are likely others I’ve missed. Some of these are fairly basic (interpreting the profiler and @code_warntype, reading flame graphs), but others may require more understanding.

Looking at the entirety of the thread, what would help the most is an easy-to-follow tutorial, perhaps with a dummy package repo on github, where each step is shown and explained in order of “low-hanging fruit” to “the hard stuff”. Because right now, again rightly or wrongly, I look at this thread and just think “wow, that’s a lot of work”, so anything to make the process seem more accessible would be huge.

17 Likes

I mean the first time a function is called.

Maybe this is stupid, but I noticed that @inferred was never mentioned in the thread. I know that other tools allow for more fine-grained inspection, but shouldn’t @inferred catch at least a few type instabilities?

(I’m asking since I use it all the time)
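
A small sketch of how `@inferred` from the Test standard library behaves, using two made-up functions. It compares the inferred return type of a call with the type of the value actually returned and throws an error when they differ.

```julia
using Test

# Made-up example functions.
half_unstable(x::Int) = iseven(x) ? x ÷ 2 : x / 2   # Int or Float64
half_stable(x::Int) = x / 2                          # always Float64

# Passes and returns the value of the call.
@inferred half_stable(3)

# Throws, because the inferred type is a Union while the actual
# result is a Float64. Here the error is captured by @test_throws.
@test_throws ErrorException @inferred half_unstable(3)
```

Note that it only checks the return type of the outermost call, so an instability inside a function that still returns a concrete type goes unnoticed.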

2 Likes

Inspired by this post on TTFX with CSV/DataFrames, I made a quick attempt at a function to run script files repeatedly.

using Statistics

"""
    ttfx(; code="sleep(1)", N=10, args = "", preview=false)

Compute the time to first X. 

`ttfx` will run `code` `N` times to determine the total 
startup cost of running certain packages/functions. `ttfx()`
with no arguments will simply run `sleep(1)` and may be used
to estimate the base julia runtime cost.

`code` can either be a short snippet that will run with 
the `-e` switch or a file containing the script to be run.

`args` can be used to set Julia runtime options

`preview = true` will show the final command without running it.

"""
function ttfx(; code="sleep(1)", N=10, args = "", preview=false)
    # If running a short snippet and not a file, add a -e
    if !isfile(code)
        code = "-e '$code'"
    end

    # Interpolating the string directly into backticks didn't work for me,
    # so `Base.shell_parse` and `Base.cmd_gen` are used as a workaround
    ex, = Base.shell_parse("julia $args $code")
    julia_cmd = Base.cmd_gen(eval(ex))

    # Return only the command that would have been run
    preview && return julia_cmd
    
    # Run the command N times
    times = Vector{Float64}(undef, N)
    for i = 1:N
        times[i] = @elapsed run(`$julia_cmd`)
    end

    return median(times), times
end

# Run the default timing with sleep
t = ttfx()

# Run `using CSV` 15 times in the current project with CSV.jl installed and 8 threads
t = ttfx(code="using CSV", N=15, args="-t 8 --project=@.")

For me ttfx() takes a median time of ~1.17 seconds, so there is a baseline julia runtime cost of 0.17 seconds on my machine. Then using CSV on my computer takes a median time of 3.4 seconds.

Feel free to edit and expand this (perhaps into the @ctime macro that was suggested). Code suggestions welcome!

8 Likes

You can clean up the Cmd stuff by doing it like this:

function ttfx(; code="sleep(1)", N=10, args = String[], preview=false)
    # If running a short snippet and not a file, add a -e
    if !isfile(code)
        code = `-e $code`
    end

    julia_cmd = `julia $args $code`

    # Return only the command that would have been run
    preview && return julia_cmd
    
    # Run the command N times
    times = Vector{Float64}(undef, N)
    for i = 1:N
        times[i] = @elapsed run(julia_cmd)
    end

    return median(times), times
end

With this version, pass args as a list of strings, e.g. ttfx(; args=["-O1","--compile=min"]).

6 Likes

Or ttfx(args=`-O1 --compile=min`) works too.

5 Likes

Thanks! I initially tried using both the backticks `` and @cmd to convert the string to Cmd, but I was getting a strange error from Base.shell_parse, which is the reason for that note in my original function. But for some reason your new version doesn’t give me the same error…

Strings, commands, and arrays all interpolate into commands differently, and when you tried, you probably had something as a string that should’ve been a command, or something like that. I often end up using trial-and-error although I really should just learn the rules. In the end, you should usually not need “manual quoting” like code = "-e '$code'", and definitely should not need eval or Base.shell_parse, etc.
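
A quick way to see those rules without running anything is to build a command and look at its `exec` field, which holds the final argument list. The variable names below are just for this sketch.

```julia
snippet = "println(1 + 1)"          # a String containing spaces and parentheses
flags   = ["-O1", "--compile=min"]  # a Vector of Strings
extra   = `--startup-file=no`       # another Cmd

# A String becomes exactly one argument (no manual quoting needed),
# an array becomes one argument per element, and a Cmd is spliced in.
cmd = `julia $flags $extra -e $snippet`

cmd.exec
# 6 arguments: "julia", "-O1", "--compile=min", "--startup-file=no", "-e", "println(1 + 1)"
```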

1 Like

Ah, that’s good to know. Trial-and-error is how I ended up with what I had initially.

Here’s a minor rework, with a macro:

using Statistics
function ttfx(; code="sleep(1)", N=2, args = String[], preview=false)
    # If running a short snippet and not a file, add a -e
    if !isfile(code)
        code = `-e $code`
    end
    julia_cmd = `julia $args $code`
    # Return only the command that would have been run
    preview && return julia_cmd
    # Measure bare Julia startup once so it can be subtracted below
    jl_startup = @elapsed run(`julia $args -e ""`)
    # Run the command N times
    times = Vector{Float64}(undef, N)
    for i = 1:N
        times[i] = @elapsed run(julia_cmd)
    end
    times .-= jl_startup
    return (mean=mean(times), times=times)
end
# Expand to a call so the timing happens at run time, not at macro-expansion time
macro ttfx(code::String, N=1)
    return :(ttfx(; code=$code, N=$(esc(N))))
end

@ttfx "your code here" 10
1 Like