Taking TTFX seriously: Can we make common packages faster to load and use

If I could use this to share my own "I tried, but the results were bad / made no sense" story: I tried adding a let block with a "typical code use" statement (and nothing else, with no change to the rest of the library), and that made the runtime (not the compile time) of the library worse. Removing precompilation leads to lower allocations!? Some methods started to allocate instead of being allocation-free (again, without any change to their source code).

I thought that should not be possible, and I still cannot figure out what is so special about this fairly boring simulation code that it leads to such weird behavior.
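
For concreteness, the pattern I added looks roughly like this (a minimal sketch; setup_problem and simulate are hypothetical stand-ins for the library's real entry points):

# At the bottom of the package's top-level module. Top-level code runs
# during precompilation, so the inference work triggered by the workload
# is done then and cached, instead of on the user's first call.
let
    prob = setup_problem(10)  # a "typical use" of the library
    simulate(prob)
end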

3 Likes

I’ve also had this experience. In the example here, precompiling Window for Blink.jl does essentially nothing. But I’m pretty sure that’s because it’s not type stable very far down the call chain, so the methods it will call don’t actually get compiled.

With NCDatasets.jl I experienced this too, but then I fixed the type stability of the objects so that all the fields were concrete. That alone improved compile time, and afterwards adding precompile also helped, because compilation reached much further down. But fixing the type stability was actually the most important part.
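
For anyone wondering what that kind of fix looks like, it is essentially this (a schematic sketch, not the actual NCDatasets.jl code):

# Before: Dict without type parameters is not a concrete type, so every
# access to attrib is type-unstable and inference stops propagating there.
struct DatasetBefore
    attrib::Dict
end

# After: the field type is concrete, accesses infer precisely, and
# compilation (and precompile statements) can reach much further down.
struct DatasetAfter
    attrib::Dict{String,Any}
end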

@jlapeyre yes, I could have phrased that better. It’s totally true that sometimes you can do a lot of work on compilation and get no benefit from it. So the situation is more that either nothing has been tried, or improving anything is actually quite difficult.

1 Like

This could be an instance of https://github.com/JuliaLang/julia/issues/35800, triggered by having Polyester and other JuliaSIMD tools in the stack. We’ve been talking about this inference bug for a while; it’s a rough one, but hopefully it will get resolved soon.

3 Likes

It’s such an awful bug; we wasted a lot of time on it with Accessors.jl. Revise recompilation can fix it too, giving you hope that you actually fixed something when you didn’t.

1 Like

Also @jlapeyre, with this post I was hoping we could start sharing problems like you describe, and workshop them here like we do with performance issues. We can all dev a package and run @profview using SomePackage without too much hassle. It’s also easy to push a branch with changes to compare and add to.

3 Likes

Exploring the linked issues led me to a (seemingly esoteric) thing that is being done with kwargs to help with type inference.

Instead of function f(...; kw::Bool=false), people seem to be writing function f(...; kw::Val{T}=Val(false)) where T.

I think I am seeing things like this in these OrdinaryDiffEq and Polyester changes:

https://github.com/SciML/OrdinaryDiffEq.jl/pull/1473/files#diff-8ce813ea8d7f370bc91b5ac1526a80f7fd354be769bafdd6b12b7368f3ae90a9L395

https://github.com/JuliaSIMD/Polyester.jl/commit/5e6ae4c2ae009b507bbcf90ff2c2b9b7d5e94559#diff-a523f7f63af3c48f517501d7e392926093393f7d033fb208f477568ec30bec38L72

Why is this necessary? Why is it better than saying the keyword will be of type Bool?

EDIT: discourse seems to be stripping out the anchors from the links above making it difficult to tell which lines I am talking about. Here are the SciML and Polyester links with anchors.

That’s to force specialization. Julia does not always specialize on arguments like Functions and DataTypes when they are passed into a function, but wrapping a value in Val puts it into the type domain, where the compiler must specialize on it.
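
A hedged sketch of the difference, with made-up names (accumulate_serial and accumulate_threaded stand in for a real serial/threaded pair):

accumulate_serial(u) = sum(u)
accumulate_threaded(u) = sum(u)  # imagine a multithreaded implementation

# Runtime flag: the Bool is an ordinary value, so (absent constant
# propagation) the branch survives into the compiled code.
acc(u; threaded::Bool=false) =
    threaded ? accumulate_threaded(u) : accumulate_serial(u)

# Type-domain flag: T is a compile-time constant, so each flag value gets
# its own branch-free specialization, and the choice propagates through
# inference into everything called below.
acc2(u; threaded::Val{T}=Val(false)) where {T} =
    T ? accumulate_threaded(u) : accumulate_serial(u)

acc2(rand(10))                      # compiles only the serial path
acc2(rand(10); threaded=Val(true))  # a separate specialization for Val(true)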

1 Like

Here’s a fun find in the Blink.jl (and Interact.jl) TTFX story: JSON.jl serialization seems to be responsible for half the TTFX of most JuliaGizmos packages! Swapping to JSON3.jl yields huge TTFX gains for Interact.jl, and should for WebIO.jl/Blink.jl:

https://github.com/JuliaGizmos/AssetRegistry.jl/pull/15
https://github.com/JuliaGizmos/WebIO.jl/issues/479
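
For context on what the swap looks like at a call site (a sketch, not the actual diff from those PRs):

using JSON
payload = JSON.json(Dict("id" => 1, "msg" => "ping"))    # before

using JSON3
payload = JSON3.write(Dict("id" => 1, "msg" => "ping"))  # after: same JSON
# output, but with far lower first-call latency per the links above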

5 Likes

There is an inherent problem with Julia's design as it stands.
I use it for anything that is a little more involved, and the developer experience gets worse and worse: huge start-up recompilation times, and the techniques for speeding them up, like the concretely typed style struct A{T1,T2,T3}, make the error messages I get hideous (Flux, anyone?).

I used to circumvent that problem by using my port of PackageCompiler, but recently all incremental builds fail for anything that is more than a textbook example.

I have pointed out on this forum in the past that multiple dispatch should work upside-down from how it does now: the dispatch table for a function should be determined by the context of the caller, thus making all binary code cacheable.

In my dream language I would also investigate Thorin / AnyDSL, which uses continuation-passing style under the hood, and which I think is the holy grail for a language like Julia: instead of propagating type instability during inference, resolve the instability at the point of contact and carry on compiling type-stable code.

4 Likes

I’m not sure I believe there is any way forward on this issue in general, except for the language itself to improve.

Part of the promise of Julia is that you can have a high-level programming language that is also fast, as long as you internalize some idioms. If the social convention becomes that library authors need to profile the precise inference/compilation behavior of their package and adjust the package accordingly, then, at least to me, Julia is no longer particularly convenient in the first place, and its advantage over a static language becomes less clear.

Realistically, it just won’t be possible to create a social norm around this kind of inference whack-a-mole if it stays as difficult and annoying as it is now. If there were a simple checklist for improving TTFX, similar to the Julia performance tips, then maybe there could be some limited traction.

Of course, nothing prevents enthusiastic individual authors from taking deep dives into the compilation process and improving the TTFX of their own packages, especially if they’ve authored a widely used package with large latency. But this kind of individual effort on specific packages is not the same as a blanket effort affecting the whole ecosystem.

By all means, do make PRs on individual packages where the latency annoys you. I just don’t see how we can meaningfully make a collective effort on this issue.

22 Likes

Isn’t that an unnecessarily depressing take? The whack-a-mole you are talking about isn’t needed for many packages to get much better than they are now. There are simple improvements available in hundreds of packages that have very little to do with the compiler nuances you and @TsurHerman discuss.

That’s what I’m trying to get at here. Half of the improvements are just easy, low-hanging fruit, but often also things the compiler will always struggle to optimise.

For example, Interact.jl has 450 stars. It’s super useful. And you can get the TTFX down by 80% with half a day’s work, fixing most of the JuliaGizmos packages at the same time. Probably by 90% with a few more hours.
https://github.com/piever/Widgets.jl/pull/48

This stuff isn’t hard; it’s just basic profiling and type stability, with ProfileView and Cthulhu. We just have to do it, because waiting for the compiler to fix everything isn’t going to work. It won’t.
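
For anyone who hasn’t tried it, the Cthulhu part of that workflow is only a couple of lines (sum(rand(10)) is just a placeholder for your package’s slow first call):

using Cthulhu
# Interactively descend through the call tree of a typical call; any
# non-concrete types in the signatures show where inference, and therefore
# precompilation, gives up.
@descend sum(rand(10))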

I never expected to avoid this in package dev. I’ve worked on R/C++ packages, and these problems are trivial compared to the amount of work you need to put in to make that stack fast. The promise of Julia is that you can do whatever you want in a script. Expecting the same in package dev is a big ask.

Making it collective involves improving awareness of known TTFX pitfalls and of the need to profile, and normalising that it’s OK to ask for help with TTFX problems, just as it is for performance optimisation.

I don’t know why you think we have to slave away solo and not share the experience, as I’m trying to do here, and as @ChrisRackauckas has also been doing really well from his experience.

21 Likes

https://github.com/JuliaLang/julia/pull/42016

How much is this going to help?

1 Like

Yes, a large part of the problem (though of course not all of it) is that people don’t profile (or even time) things like package load time or first-call latency.

Just as an example, ChainRulesCore dependency causes 4x load time regression · Issue #310 · JuliaMath/SpecialFunctions.jl · GitHub. This was supposed to add a “small lightweight core dependency”, and it ended up making the package 4-5x slower to load. It wasn’t until someone complained that ForwardDiff.jl was getting slow to load that I ran @profile using ForwardDiff, and it was immediately obvious what was going on. If this is how we treat packages that pretty much everyone transitively depends on, no wonder things are slow.

Another one is lazily allocate the threaded buffers and allocate them on the thread that will access it by KristofferC · Pull Request #704 · JuliaWeb/HTTP.jl · GitHub. A single @profile and some easy rewrites, and HTTP.jl is 4x faster to load.

So we can get quite far just by people actually caring, measuring, and putting in a little bit of work.
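
That workflow is easy to reproduce for any package you suspect is slow; a minimal sketch, to be run in a fresh session with HTTP swapped for your suspect:

using Profile, ProfileView

@profview using HTTP          # flame graph of where load time goes
# or, without a GUI:
# @profile using HTTP
# Profile.print(mincount=50)  # hide noise; look for wide towers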

24 Likes

On the more pessimistic side of things, Julia nightly is currently getting slower at importing things (I am certain this is related to improvements in other parts of Julia, and there are probably already awesome volunteers working on reversing the slowdown):

# julia 1.7.1
@time using Makie
  7.797519 seconds (13.29 M allocations: 947.065 MiB, 5.52% gc time, 11.28% compilation time)

# julia nightly 1.8.0-DEV.1275 (2022-01-11)
@time using Makie
  9.354482 seconds (14.44 M allocations: 994.084 MiB, 4.89% gc time, 21.24% compilation time)

# julia with the UNFINISHED CodeInstance caching work by Tim Holy
# 1.8.0-DEV.1372 (2022-01-22)
# teh/relocatable_cis/7aac48347e (fork: 1 commits, 1 day)
# (this last comparison does not mean much as this is not finished yet)
@time using Makie
  9.647164 seconds (14.26 M allocations: 1001.128 MiB, 4.79% gc time, 19.08% compilation time)

On

Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver1)
1 Like

But… I honestly don’t know why you would focus on that. Compiler performance varies month to month for various things but has generally gotten a lot better in the last few years.

As you can see from the various examples posted, it’s not always the compiler’s responsibility to fix things for us; there are 5x performance gains sitting around if you look. Maybe not in Makie, but I’d guess in at least half of the packages you would say are slow to load.

If we act like it’s only the compiler’s responsibility to fix everything, packages will inevitably remain slow to load despite the best efforts of the compiler team.

2 Likes

No, this one is a problem. Hopefully we can fix it before 1.8.

3 Likes

My bad if what I said came across as “the compiler is bad” or “this slowdown makes package profiling unworthy of attention” (or maybe the issue is that I derailed the conversation). I eagerly work on making my code easier to infer and compile. However, I do believe the slowdown shown above is just as important to address as instilling a culture of “inference profiling” among us casual package developers. And from listening to JuliaCon talks and reading Hacker News comments, I know that the core devs are taking it seriously (I was not trying to insinuate the opposite).

1 Like

I didn’t think you meant that… I mean of course that needs to be fixed as well and is good to know about.

But after it is fixed there will still be slow packages around that someone still needs to profile. That’s all we really have influence on unless we actually work on the compiler.

1 Like

Getting good MWEs of compile-time regressions is really useful. Whenever you find one, make an issue: we can’t fix regressions we don’t know about.

10 Likes

Can we hack together a list of best practices that could become a page in the docs? Turning the initial list into instructions, or alternatively a “debugging guide”:

How to write packages with good startup times

  1. Write type-stable code whose return types can be inferred
  2. Avoid using Requires.jl, because the enclosed code cannot be precompiled
  3. If possible, put typical calls into a let block in the package's main source file (see the sketch after this list)
  4. Use precompile for functions with side effects; avoid it for functions without them
  5. Avoid unnecessary dependencies
  6. Use ProfileView.@profview using YourPackage to visualize the loading process
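
A rough illustration of items 3 and 4 (a sketch with hypothetical names; corrections welcome):

module YourPackage

struct Config
    n::Int
end

process(c::Config) = sum(rand(c.n))
save(c::Config, path::String) = write(path, string(process(c)))

# Item 3: run a typical workload in a let block at the top level, so the
# inference work it triggers happens during precompilation and is cached.
let
    process(Config(10))
end

# Item 4: save writes to disk, so we don't want to actually call it at
# precompile time; instead, request compilation of a specific signature.
precompile(save, (Config, String))

end # module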

I copied the list over here, where you can freely edit it and make corrections if necessary, as I’m by no means an authority on this topic. Maybe we can turn this into a PR?

Regarding the last point: I just @profviewed one of my packages, and the output is really hard to interpret. Are there any specific things to look out for? Currently most of the time is spent in task.jl, poptask (80%) and loading.jl, _tryrequire_from_serialized (20%), which tells me nothing, does it?

16 Likes