Taking TTFX seriously: Can we make common packages faster to load and use

That seems to be overblown.

Its 80% of package load time in the example, so maybe not. And using Requires does exactly mean that the code behind the requires block is not precompiled. But thanks for the link.

This is not recommended. Instead.

Its not always possible to actually call functions, but feel free to add a PR that calls Blink.Window() to improve compile time :wink:

But I’ll alter the text to include both methods.

Definitely not all of the time, but using precompile should probably the exception rather than the rule. There are some pretty clear maintenance reasons to prefer never using precompile unless you have to.

Sure it can happen, but I dug through a ton of packages and just didn’t find it to be the case for most package load times. In fact, StaticArrays ended up contributing more to using time than Requires for pretty much any package I tried (which is what led to When should a package move to the standard library or system image? StaticArrays, what is it?). So there’s always exceptions, but I don’t think the exceptions should be highlighted as the norm.

I think we are probably optimizing different kinds of packages, and that is reflected in the approach and the problems we encounter.

The places where I have still found classic Julia load times are in things like file IO tools like HDF5.jl and NCDatasets.jl, and some other visualisation things like Blink.Window(), Interact.slider(x). These all still have pretty slow TTFX. Largely they would have to use precompile instead of running functions directly, because of IO. There are other package (e.g. Makie.plot and CSV.File) that are also pretty slow, but for harder to solve reasons.

Maybe it would be more productive to make a list of packages that are still much slower than they need to be, but a lot of people use. Key examples for me are:

Blink.jl
Interact.jl
HDF5.jl

FWIW, precompile statements - especially where there is IO:

GUIs and IO need to do some weird stuff, yes.

I put some effort into this every now and then, reading about packages for analyzing, and techniques, trying things. I have tried both precompile and running code during precompilation. I have never seen any benefit. In fact, running somefunction(...) in a let block as above made TTFX slower in one case. That is, running somefunction(...) the first time after using the module took longer and allocated more. I know there are ways to investigate to find what to call or precompile, etc. But, it’s fairly complicated.

So in my case, it’s definitely not “written without much thought to startup time”. It’s that reducing start up time takes a lot of learning and a lot of time.

8 Likes

If I could use this to share my own “I tried but the results were bad / made no sense” story: I tried to add let block with a “typical code use” statement (and nothing else, no change to the rest of the library), and that made the runtime (not compiletime) of the library worse: Removing precompilation leads to lower allocations!? Some methods started to allocate instead of being allocation free (again, without changing anything in their source code).

I thought that should not be possible, and I still can not figure out what is so special about this fairly boring simulation code, that led to such weird behavior.

3 Likes

I’ve also had this experience. In the example here, precompiling Window for Blink.jl does essentially nothing. But I’m pretty sure that’s because its not type stable very far, and the methods it will call don’t actually get compiled.

With NCDatasets.jl I experienced this too, but then fixed the type stability of the objects so all the fields were concrete. That alone improved compile time, and afterwards adding precompile also helped because it compiled much further down. But getting the type stability was actually the most important part.

@jlapeyre yes I could have phrased that better. Its totally true that sometimes you can do a lot of work on compilation and get no benefit from it. So the situation is more that either nothing has been tried, or improving anything is actually quite difficult.

1 Like

This could be an instance of sporadic type instability with mapreduce / norm · Issue #35800 · JuliaLang/julia · GitHub triggered by having Polyester and other JuliaSIMD tools in the stack. We’ve been talking about this inference bug for a bit, it’s a rough one but hopefully it will get resolved soon.

3 Likes

Its such an awful bug, we wasted a lot of time on it with Accessors.jl, Revise compilation can fix it too giving you hope that you actually fixed something, when you didn’t.

1 Like

Also @jlapeyre, with this post I was hoping we could start sharing problems like you describe, and workshop them here like we do with performance issues. We can all dev a package and run @profview using SomePackage without too much hassle. It’s also easy to push a branch with changes to compare and add to.

3 Likes

Exploring the linked issues, led me to a (seemingly esoteric) thing that is being done with kwargs to help with type inference.

Instead of function f(...; kw::Bool=false) people seem to be doing function f(...;kw::Val{T}=Val(false)) where T.

I think I am seeing things like this in these OrdinaryDiffEq and Polyester changes:

Why is this necessary? Why is it better than saying the keyword will be of type Bool?

EDIT: discourse seems to be stripping out the anchors from the links above making it difficult to tell which lines I am talking about. Here are the SciML and Polyester links with anchors.

That’s to force specialization. Functions and DataTypes do not always specialize when passed into a function.

1 Like

Here’s a fun find in the Blink.jl (and Interact.jl) TTFX story: JSON.jl serialization seems to be responsible for half the TTFX of most JuliaGizmos packages! Swapping to JSON3.jl has huge TTFX gains for Interact.jl, and should for WebIO.jl/Blink.jl:

5 Likes

There is an inherent problem with Julia design as it is now.
I use it for anything that is a little bit more involved… and the developer experience gets worse and worse with huge start-up time recompilations, and the methods to speed that like concrete typing
style: struct A{T1,T2,T3} makes the error messages I get hideous(Flux anyone?)

I used to circumvent that problem by using my port of package compiler , but recently all incremental builds fails for anything that is more than a text book example.

I pointed out in the past in this forum, that multiple dispatch should work upside-down to what is now… a dispatch table for a function should be determined by the context of the caller. thus making all binary code cacheable.

I would also in my dream language investigate Thorin - AnyDSL which uses Continuation Passing Style under the hood , which I think is the holy grail for language like Julia.
Instead of propagating type instability during inference … resolve the instability in the point of contact and carry on compiling a type stable code.

4 Likes

I’m not sure I believe there is any way forward on this issue in general, except for the language itself to improve.

Part of the promise of Julia is that you can have a high-level programming language that was also fast, as long as you internalize some idioms. If the social convention become that library authors need to profile the precise inference/compilation of the package and adjust the package accordingly, then at least to me, Julia is no longer particularly convenient in the first place, and its advantage over a static language becomes less clear.

Realistically, it just won’t be possible to create a social norm around this kind of inference whack-a-mole if it is as difficult or annoying as it is now. If there was a simple checklist to improve TTFX similar to the Julia performance tips, then maybe there could be some limited traction

Of course, nothing prevents enthusiastic individual authors to take deep dives into the compilation process and improve the TTPX of their own package, especially if they’ve authored a widely used package with large latency. But this kind of individual effort for specific packages is not the same as a blanket effort affecting the whole ecosystem

By all means, do make PRs on individual packages where the latency annoys you. I just don’t see how we can meaningfully make a collective effort on this issue.

22 Likes

Isn’t that an unnecessarily depressing take? The whack mole that you are talking about isn’t needed for many packages to be much better than they are now. There are simple improvements available in 100s of packages that have very little to do with the compiler nuances you and @TsurHerman discuss.

That’s what I’m trying to get at here. Half of the improvements are just easy, low hanging fruit, but often also things the compiler will always struggle to optimise.

For example, Interact.jl has 450 stars. Its super useful. And you can get the TTFX down by 80% in half a days work, fixing most of the JuliaGizmos packages at the same time. Probably by 90% with a few more hours.

This stuff isn’t hard, its just basic profiling and type stability, with ProfileView and Cthulhu. We just have to do it, and waiting for the compiler to fix everything isn’t going to work. It wont.

I never expected to avoid this in package dev. I’ve worked on R/C++ packages and these things are trivial compared to the amount of work you need to put in to make that fast. The promise of julia is that you can do whatever you want in a script. Expecting that in package dev is a big ask.

Making it collective involves improving awareness of known TTFX pitfalls, the need to profile, and normalising that its ok to ask for help on TTFX problems like it is for performance optimisation.

I don’t know why you think we have to slave away solo and not share the experience, as I’m trying to do here, and as @ChrisRackauckas has also been doing really well from his experience.

21 Likes

How much is this going to help?

1 Like

Yes, a lot of problems (of course not all) is that people don’t profile (or even time) e.g. package load time or first call latency.

Just as an example, ChainRulesCore dependency causes 4x load time regression · Issue #310 · JuliaMath/SpecialFunctions.jl · GitHub. This was supposed to add a “small lightweight core dependency” and it ended up making the package 4-5x slower to load. It wasn’t until someone complained that ForwardDiff.jl was getting slow to load that I did a @profile using ForwardDiff.jl and it was immediately obvious what was going on. If this is how we treat packages that are pretty much transitively depended on by everyone, no wonder things are slow.

Another one is lazily allocate the threaded buffers and allocate them on the thread that will access it by KristofferC · Pull Request #704 · JuliaWeb/HTTP.jl · GitHub. A single @profile and some easy rewrites and HTTP.jl is 4x faster to load.

So we can come quite far by people actually caring and measuring and putting in a little bit of work.

24 Likes

On the more pessimistic side of things, Julia nightly is currently getting slower to import things (which I am certain is related to improvements in other parts of julia, and there are probably already awesome volunteers working on reversing this slowdown):

# julia 1.7.1
@time using Makie
  7.797519 seconds (13.29 M allocations: 947.065 MiB, 5.52% gc time, 11.28% compilation time)

# julia nightly 1.8.0-DEV.1275 (2022-01-11)
@time using Makie
  9.354482 seconds (14.44 M allocations: 994.084 MiB, 4.89% gc time, 21.24% compilation time)

# julia with the UNFINISHED CodeInstance caching work by Tim Holy
# 1.8.0-DEV.1372 (2022-01-22)
# teh/relocatable_cis/7aac48347e (fork: 1 commits, 1 day)
# (this last comparison does not mean much as this is not finished yet)
@time using Makie
  9.647164 seconds (14.26 M allocations: 1001.128 MiB, 4.79% gc time, 19.08% compilation time)

On

Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver1)
1 Like