Taking TTFX seriously: Can we make common packages faster to load and use

How much is this going to help?

1 Like

Yes, a lot of problems (of course not all) is that people don’t profile (or even time) e.g. package load time or first call latency.

Just as an example, ChainRulesCore dependency causes 4x load time regression · Issue #310 · JuliaMath/SpecialFunctions.jl · GitHub. This was supposed to add a “small lightweight core dependency” and it ended up making the package 4-5x slower to load. It wasn’t until someone complained that ForwardDiff.jl was getting slow to load that I did a @profile using ForwardDiff.jl and it was immediately obvious what was going on. If this is how we treat packages that are pretty much transitively depended on by everyone, no wonder things are slow.

Another one is lazily allocate the threaded buffers and allocate them on the thread that will access it by KristofferC · Pull Request #704 · JuliaWeb/HTTP.jl · GitHub. A single @profile and some easy rewrites and HTTP.jl is 4x faster to load.

So we can come quite far by people actually caring and measuring and putting in a little bit of work.

23 Likes

On the more pessimistic side of things, Julia nightly is currently getting slower to import things (which I am certain is related to improvements in other parts of julia, and there are probably already awesome volunteers working on reversing this slowdown):

# julia 1.7.1
@time using Makie
  7.797519 seconds (13.29 M allocations: 947.065 MiB, 5.52% gc time, 11.28% compilation time)

# julia nightly 1.8.0-DEV.1275 (2022-01-11)
@time using Makie
  9.354482 seconds (14.44 M allocations: 994.084 MiB, 4.89% gc time, 21.24% compilation time)

# julia with the UNFINISHED CodeInstance caching work by Tim Holy
# 1.8.0-DEV.1372 (2022-01-22)
# teh/relocatable_cis/7aac48347e (fork: 1 commits, 1 day)
# (this last comparison does not mean much as this is not finished yet)
@time using Makie
  9.647164 seconds (14.26 M allocations: 1001.128 MiB, 4.79% gc time, 19.08% compilation time)

On

Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver1)
1 Like

But… I honestly don’t know why you would focus on that. Compiler performance varies month to month for various things but has generally gotten a lot better in the last few years.

As you can see from the various posted example, it’s not always the compilers responsibility to fix things for us, there are 5x performance gains sitting around if you look. Maybe not in Makie. But I guess for at least half of the packages you would say are slow to load.

If we act like its only the compilers responsibility to fix everything, packages will inevitably remain slow to load despite the best efforts of the compiler team.

2 Likes

No, this one is a problem. Hopefully we can fix it before 1.8.

3 Likes

My bad if what I said was perceived as “the compiler is bad” or “this slowdown makes package profiling unworthy of attention” (or maybe the issue is that I derailed the conversation). I eagerly work on making my code easier to infer/compile. However, I do believe the above example of slowdown is just as important to address, as instilling a culture of “inference profiling” among us, casual package developers. And from listening to JuliaCon talks and reading Hacker News comments, I do know that core devs are taking that seriously (I was not trying to insinuate the opposite).

1 Like

I didn’t think you meant that… I mean of course that needs to be fixed as well and is good to know about.

But after it is fixed there will still be slow packages around that someone still needs to profile. That’s all we really have influence on unless we actually work on the compiler.

1 Like

Getting good MWE’s of compile time regressions are really useful. Whenever you find one, make an issue. We can’t fix regressions we don’t know about.

10 Likes

Can we hack together some list of best practices, which could become a page in the docs? Turning the initial list into instructions or alternatively a “debug guide”:

How to write packages with good startup times

  1. Write type stable code whose return types can be inferred
  2. Avoid using Requires.jl, because enclosed code cannot be precompiled
  3. If possible, put widely used calls into a let block in the package main source file
  4. Use precompile for functions with side effects, avoid it for functions without.
  5. Avoid unnecessary dependencies
  6. Use ProfileView.@profview using YourPackage to visualize the loading process.

i copied the list over here, where you can freely edit it and make corrections, if necessary, as I’m by no means an authority in this topic. Maybe we can turn this into a PR?

Regarding the last point: I just @profviewed one of my packages and the output is really hard to interpret. Are there any specific things to look out for? Currently, most of the time is spent in task.jl, poptask (80%) and loading.jl, _tryrequire_from_serialized (20%) - which tells me nothing, does it?

16 Likes

Use precompile when the function has side effects, like opening a window. Maybe we should also design packages so that most functions dont have side effects, like in many functional languages. To make precompilayion easier.

It tells you something… it sounds like the time is in some async tasks? That does make things a little harder. But we would really need to see the flame graph to know. What’s the package?

2 Likes

The package is here, but it’s in explorative state and the code quality is so bad that I’m a bit ashamed to share :smiley: (type instability etc). The flame graph looks like this:


corresponding data

The long flat task.jl on the right looks suspicious, but I don’t know where it comes from.

the right part comes from the (unused) threads. Should go away with a single thread.

4 Likes

Yes, having a simple checklist for package authors would be a good idea. What to think about when writing the code, and what to do as a “compile time optimization pass” after writing the code.

Maybe I was too pessimistic. My impression after having tried to use SnoopCompile on a few of my own packages is that reducing compilation time requires you to know all kinds of implementation details like backedges, and how precompilation actually works. And that you, beside having to know all this stuff, then need to carefully go through and analyze first invalidations, then time and analyze inference, and lastly go through your package to find all APIs and exercise them in precompile statements.

But maybe that was a wrong impression, and it’s actually possible to learn a series of quick checks/tweaks that can be done in say, 30 minutes for a small package. Such a checklist would be worth gold.

Maybe also copy some steps in from this checklist in SnoopCompile: SnoopCompile.jl · SnoopCompile

4 Likes

How does one check the effect of precompilation on TTFX? Just time using with and without?

9 posts were split to a new topic: Why isn’t size always inferred to be an Integer?

@PetrKryslUCSD im using TTFX like time-to-first-plot. The time to load a package + the time to do its main thing X that it does, using @time thefunction(x).

My PR above made blinks load in 1s, but Window() takes 15 so its still not a huge gain.

1 Like

Yeah SnoopCompile is more complicated. I mostly just use ProfileView.jl and Cthulhu.jl and give up after that. Anything really bad is pretty obvious in ProfileView. But I would also like to know how to use SnoopCompile better.

1 Like

So, inspired by this thread I decided to ProfileView the TTFX for my package, DFTK, which currently stands at a ridiculous 90s. About 1/5 of that time is in inference, without reference to package code. Some places I can recognize are functions that have nested functions, but that is a pattern I use a lot and I’m not sure how to do without (and not sure if that’s even the problem). A lot of the advice I see for reducing TTFP are about method redefinitions, but I don’t think I redefine many base functions. I’d appreciate any advice.

3 Likes

I wander if the compiler could read assertions, though, and use that information.

Question more on the subject: Can someone provide a clear step by step of what they are doing to produce these flamegraphs and benchmarking the TTFX? It is not clear to me how to do that, given that the first execution of the profile has to be discarded, and on the second run well, it is not anymore the first run (at least the one from VSCode these are the instructions).

So let it also solve this issue :smiling_imp: