Taking TTFX seriously: Can we make common packages faster to load and use

A segfault is never normal; in Julia, assuming you run with:

julia --check-bounds=yes

So I would report the error (both with and without the flag above), and also check on Julia 1.6, especially if it is not a problem there.

[If your code calls C code (or other language), then all bets are off, segfaults can be expected, and it may not be Julia’s fault. I would still report, even if your Julia code, including for dependencies, isn’t pure.]

I’m not sure [de]serialization is an exception. I recall something about it not being safe across Julia major versions, but I doubt that applies here, and I don’t recall if that’s something that should lead to a segfault.

@antoine-levitt a good part of your package’s `using` time is also spent in Requires.jl, via Brillouin.jl, Setfield.jl and ArrayInterface.jl.

This is another problem with the “just run it” method of precompilation. You are running that code before Requires.jl runs, so a lot of the methods won’t be available. I don’t know what the result of that is.

What’s the example calculation you are trying to run?

@Palli thanks, I opened an issue at https://github.com/JuliaLang/julia/issues/43926. Weirdly, it only happens when I put the example code outside of the module. We are calling a C library, but not doing anything fancy with it (in particular we’re not keeping memory references to C objects)

@Raf thanks! How do you see that? By profiling `using`? Note that the `using` itself is only a small part of the TTFX: 9s for the `using`, 65s for the first run. The example I’m running is the snippet in https://github.com/JuliaLang/julia/issues/43926 (which is instantaneous without precompilation, but exercises the main code paths of “standard” calculations). If you’re talking about the @require in DFTK.jl, those are only to make the code play nice with extra packages, and are not run at all with the example code above. Commenting out all the @require in DFTK.jl doesn’t change the TTFX. In fact, commenting out lines 148-214 of DFTK.jl (which doesn’t use any require, and not even Brillouin.jl) does not change the TTFX much (68s).
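The split between first-run time and steady-state run time can be measured directly in a fresh session. A minimal sketch, assuming a hypothetical stand-in function `first_run_workload` in place of the real example calculation (it is not part of DFTK):

```julia
# Hypothetical stand-in for a package's "standard" calculation.
first_run_workload(xs) = sum(abs2, xs) / length(xs)

xs = rand(1000)

# First call in a fresh session: the measured time includes compiling
# the method for Vector{Float64}.
t_first = @elapsed first_run_workload(xs)

# Second call: the compiled code is reused, so this is run time only.
t_second = @elapsed first_run_workload(xs)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The gap between the two numbers is a rough estimate of the compilation part of the TTFX for that call; the `using` time has to be measured separately, for example with `@time using SomePackage`.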

The Requires.jl problem isn’t DFTK, it’s in the dependencies ArrayInterface.jl, Setfield.jl and Brillouin.jl.

If you profile `using DFTK` you will see the red lines calling require.jl in Requires.jl, and around them a line where the @require macro in ArrayInterface.jl etc. calls eval on some code.

This may contribute to your first-call compile time as well.

OK I see. It doesn’t appear to contribute much to the bulk of the 70s though.

Maybe not. But it will prevent you from fixing the 70s because you can’t precompile it.

You could test it by making the packages you use that are in the @require blocks actual dependencies of those packages (and move the code out of the macro). And see how that changes things.

Also add some function calls or `precompile` statements.

What I tried was to completely remove all the @require blocks (they are not needed for standard computations), remove the import of Requires and Brillouin, and add the example script to the DFTK module (which doesn’t result in a segfault, unlike when it’s outside), resulting in https://gist.github.com/antoine-levitt/31b539666c6b9efc2c73e973fa353c3b. This gives a marginal reduction in TTFX (to ~60s), at the expense of a much longer precompilation time, which to me is not worth it. That makes me think the real problem is somewhere else, but I don’t know how to diagnose it.

There are still at least 2 more @require blocks though. You use a few array packages in the ArrayInterface.jl require block. And StaticArrays is in the Setfield.jl block.

And yes, precompilation will get longer, but in all of this I’m making the assumption that we don’t care as much about precompile time, because for users it runs threaded on package updates rather than in the main thread on starting a new session. It’s more annoying for devs for sure.

There are probably other problems, like type stability issues, behind the 60s; Requires is just one of them. I doubt there is one “real problem” with a package this big.

Also note: your own require block is not part of the problem, you can keep that. It’s more of a problem in lower-level deps.

I guess we need to ship .ji files at some point. :slight_smile:

@antoine-levitt profiling your code in the gist, TimerOutputs.jl seems to cause a lot of problems. The type instabilities in default_symmetries, energy_ewald and AtomicNonLocal also seem to contribute.

I think longer precompilation times are OK if they lead to strong reduction in the TTFX, but here going from 70s to 60s is not worth it.

I doubt there is one “real problem” with a package this big

Profiling does show only a few big blocks, which originally made me hopeful that with one or two fixes it should be doable to halve the time easily…

@Raf thanks for taking the time to profile! Did you profile including the gist? The way I profile is slightly different: I precompile DFTK, then profile an example code (the one in the let block in the gist)

TimerOutputs just looks bad because its macro is used to time various parts of the code and so appears in stacktraces, but it’s not a factor (I just tested by disabling it completely, and it did not noticeably decrease TTFX). The other functions you mention are indeed type unstable, but is that really an issue? The big spikes I see in the profile appear unrelated.

The big blocks are compilation, which won’t necessarily show exactly where the function is. I guess that happens before the function runs.

The type instability is totally going to be a problem, especially because it breaks precompilation. You can almost certainly get the compile time down under 20s if you try, making all of these things worth it. But you would have to take all those little issues seriously; there is no silver bullet.

It feels a little difficult to convince you of this so I’ll leave it at that. I don’t have the time to actually do the work on a package I don’t use.

I appreciate you taking the time to try to explain and I didn’t want to derail the conversation with my particular case. It’s just a time investment tradeoff at that point: I’m willing to actually do work and complicate my code base if I have a more or less clear understanding of how to proceed. This is the case for performance optimization: equipped with ProfileView and the “performance tips” in the docs, I can just eyeball the bottlenecks and optimize them in my spare time, making the readability/performance tradeoffs I’m comfortable with.

Latency optimization definitely doesn’t feel that way: I have absolutely no idea what impacts latency, it’s not clear to me what patterns will impact it, and the resources I’ve looked at didn’t really clarify the situation (at least not without what feels to me to be a pretty significant time investment into a somewhat arcane topic). I can’t just make random functions type stable (which is often significant work and requires code reorganization; the whole point of using Julia is to be able to write type-unstable code for those parts of the code that are not performance-critical, otherwise I’d use a static language) in the hope that it’ll improve things. So I’ll just take the bad TTFX and live with it, which is fine. I’m guessing most people are in the same boat.

Yeah, it’s totally a time tradeoff, and I hear that this seems like arcane knowledge; maybe it is. Maybe Julia packages are just doomed to be slow until we have first-class glue packages and a compile cache.

In some cases I do think it’s worth trying seriously once to tune in to what the problems are, so you can generally get it right afterwards. After doing this for 10 packages or so there are patterns… it’s nearly always Requires.jl and type instability. That was my approach to Interact.jl on the weekend, and it has 5x faster TTFX from just that. Blink.jl is harder, because the design problem is harder to solve. But it is generally possible to do in a systematic way.

You don’t have to tune random functions, only those showing time in the profile. That said, fixing unstable struct fields, for example, can improve things a little bit everywhere they are used.
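To illustrate what an “unstable struct field” looks like, here is a minimal sketch. The types `LooseBox` and `TightBox` and the function `double` are hypothetical, and `Base.return_types` is used only as a convenient way to see what inference concludes (interactively, `@code_warntype` gives a more readable view):

```julia
# Field annotated with an abstract type: inference cannot know the
# concrete type of `x` from the struct type alone.
struct LooseBox
    x::Real
end

# Parametric version: the concrete type of `x` is part of the struct's type.
struct TightBox{T<:Real}
    x::T
end

double(b) = 2 * b.x

# What inference concludes about the return type in each case.
loose_ret = only(Base.return_types(double, (LooseBox,)))
tight_ret = only(Base.return_types(double, (TightBox{Float64},)))

println("LooseBox field concrete? ", isconcretetype(fieldtype(LooseBox, :x)))
println("TightBox field concrete? ", isconcretetype(fieldtype(TightBox{Float64}, :x)))
println("inferred for LooseBox: ", loose_ret)
println("inferred for TightBox: ", tight_ret)
```

Every function that reads `b.x` from a `LooseBox` inherits the uncertainty, which is why fixing the field can help in many places at once.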

Out of curiosity, there is this workshop from JuliaCon 2021.

So far, I haven’t found the motivation to work through this. Are those tips & tricks explained there? Is it worth it to watch and to understand the difficulties going on?

Edit: oops, there also seems to be a shorter/different one: https://youtu.be/rVBgrWYKLHY

@antoine-levitt I didn’t look at your code, so I’m sorry if this doesn’t help. Do you happen to use StaticArrays.jl quite a bit? We experienced significant compilation latency for not-very-small static arrays, see https://github.com/trixi-framework/Trixi.jl/issues/516. We could fix that without decreasing runtime efficiency by switching to plain arrays in our case (see the PR linked in the issue above).

Thanks, that’s helpful! We only use static arrays for points and matrices in R^3, and the part of the code that takes a large TTFX is not the one that uses them, so I don’t think that’s the issue.

Maybe we can turn this discussion into a crowdsourced list of tips, and put that in a pinned thread or something? Here’s what I’ve gotten from this discussion, feel free to correct or add to it.

Don’t use large StaticArrays

https://github.com/trixi-framework/Trixi.jl/issues/516

Don’t use Requires

Enclosed code cannot be precompiled.

Careful about type instabilities

(this one is not clear to me)

Trigger precompilation in your package module

Call functions that don’t have a side effect. Use precompile on those that do. Info on the internet about __precompile__ is outdated, don’t use it.
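A minimal sketch of this tip, assuming a hypothetical package module `TinyPkg` with one pure function and one function that has a side effect. The caching only happens when the module is a real package that gets precompiled; run as a script, the code below just executes:

```julia
module TinyPkg  # hypothetical package module

export norm2, report

# Pure function: safe to call while the module is being precompiled.
norm2(v::AbstractVector{<:Real}) = sqrt(sum(abs2, v))

# Function with a side effect (it prints): better not to call it at
# precompile time.
report(io::IO, v::Vector{Float64}) = println(io, "norm = ", norm2(v))

# Run a tiny workload so the methods it hits get compiled.
let
    norm2([1.0, 2.0, 3.0])
end

# Request compilation of a specific signature without running the function.
precompile(report, (IOBuffer, Vector{Float64}))

end # module
```

The argument types passed to `precompile` have to match what users will actually call, otherwise the cached code is not the code that gets used.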

Avoid invalidations

Profile your first run

Sometimes you get useful information out of it. SnoopCompile also gives a potentially more helpful view.
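A minimal sketch of profiling a first call with the `Profile` standard library, assuming a hypothetical function `workload` in place of the package’s real entry point. For a function this small there may be very few samples, so treat it as a template only:

```julia
using Profile

# Hypothetical workload; replace with the package's first real call.
workload(n) = sum(sqrt(i) for i in 1:n)

Profile.clear()

# Profiling the *first* call also samples the time spent compiling it.
@profile workload(10_000)

# Flat listing sorted by sample count. Inference appears as frames from the
# compiler's Julia code; pass C=true to also see native (C/LLVM) frames.
Profile.print(format=:flat, sortedby=:count, mincount=2)
```

Graphical viewers such as ProfileView, mentioned earlier in the thread, display the same data as a flame graph.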

Use @nospecialize to disable specialization on a particular argument

“use it when a function is called multiple times with different argument types, compiling that function is expensive but specialising on argument types does not have a significant impact on run-time performance” (from the thread “When should we use `@nospecialize`?”)
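A minimal sketch of the syntax, assuming a hypothetical helper `describe` that is called with many argument types and whose run time does not depend on specialization:

```julia
# Hypothetical helper called with many different argument types.
# @nospecialize asks the compiler not to generate a separate
# specialization for each concrete type of `x`.
function describe(@nospecialize(x))
    return string("value of type ", typeof(x))
end

println(describe(1))
println(describe("a"))
println(describe([1.0, 2.0]))
```

The tradeoff is that operations on `x` inside the function may go through dynamic dispatch, so this is a poor fit for hot inner loops.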

That sounds good to me :+1:

That’s a good list. Type instability causes long load times (in my understanding) largely because:

a. The code for handling boxed variables is much more complicated than for e.g. a known Int, so it takes longer to compile.
b. The Julia compiler tries really hard to resolve types. If your function is simple and type stable, this is very fast. But we can see that the towers of abstractinterpretation.jl type inference function calls often take most of the load time. Often these disappear with simple type-stable code. This is probably slightly wrong, but someone who works on the compiler can maybe explain this more…
c. The compiler is more likely to precompile the right methods if everything is type stable, instead of random things you don’t actually use: see “Why isn't `size` always inferred to be an Integer?” (post #22 by mbauman).
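For readers unsure what “type stable” means in practice, here is a minimal sketch of one common form of instability, a return type that depends on a runtime value. The functions `unstable` and `stable` are hypothetical, and `Base.return_types` is used only to show what inference concludes:

```julia
# Type-unstable: returns the Float64 `x` on one branch and the Int `0`
# on the other, so the return type depends on the value of `x`.
unstable(x::Float64) = x > 0 ? x : 0

# Type-stable: both branches return a Float64.
stable(x::Float64) = x > 0 ? x : zero(x)

println(Base.return_types(unstable, (Float64,)))
println(Base.return_types(stable, (Float64,)))
```

A small union like this one is relatively cheap for the compiler to handle; the costs described in the points above grow when the inferred type widens to something like `Any` and propagates through callers.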

But the fact that removing type instability reduces load time isn’t controversial at all, although definitely misunderstood… every PR linked in this thread does that.

Good to know, but it’s entirely unclear to me that it needs to be that way. The good thing is that you often want to fix type instabilities anyway, since your code will be faster and allocate less, which means less (or no) GC pressure. It’s just unclear why such code can’t be precompiled. E.g. code in C with malloc can be compiled… I guess it’s just a matter of priorities for the compiler people, since you want to get rid of this inefficiency anyway. And I guess you mean it just breaks precompilation of that specific code (if you’re actually correct), not of the whole module?!

Right, compilation takes longer (but need not, if I understood Elrod correctly; the package could be improved). Do you mean the precompilation time only, or does it also affect `using` (since precompilation isn’t full, some of the compilation is deferred)?