Ah, I didn’t realize TTL wasn’t just a subset of TTFX. (TBH, seems like cheating IMO.) So my results above do align with the blog post. Sorry for the confusion.
Anyway, I don’t mean to be raining on the parade. My experience certainly improved as soon as I jumped to 1.9-rc1, and I really appreciate all the work you (Tim) and everyone else have put into this. It’s just the one use case for my colleagues that I was mistaken about, but even that should be much better with Miles’s approach.
There’s a lot of diversity in usage, but when I use the terms, TTFX and TTL are completely independent of one another. (It’s not that TTL is a subset of TTFX; they are just two different things.) In my usage, TTFX is just about (pre)compilation; TTL is just about loading. From a mechanistic perspective, everything you do to improve one doesn’t directly make one iota of difference to the other. I always say TTFX + TTL when I want to talk about both. So it’s not cheating, it’s just being precise. But I acknowledge that this distinction may be lost on many people. (Indeed, these days we often discuss TTL in the Slack channel named #ttfx, lol, although there actually was discussion about renaming the channel because of that.)
Just to elaborate on this: my understanding is that if we look at first-order indirect effects, improving one usually comes at the expense of the other.
Brute-force precompiling more methods will reduce TTFX, but it means that when you load the package there’s more precompiled code to load and verify, so load times suffer.
Precompiling less code means there’s less work to do at load time, so loading is hopefully faster, but there’s more work to do when you actually want to call a function.
Those tradeoffs are why a lot of the advances to TTL are going to involve being smarter about how code is loaded, and smarter about what code specifically is precompiled.
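For package authors, the main lever here is the precompile workload. As a minimal sketch of the tradeoff, assuming a hypothetical `PackageX` and using PrecompileTools.jl (on 1.9-era SnoopPrecompile the equivalent macros were `@precompile_setup`/`@precompile_all_calls`):

```julia
module PackageX

using PrecompileTools  # the successor to SnoopPrecompile

f(x::Int) = 2x + 1

# Only the calls exercised inside this block get precompiled (and, on
# 1.9+, natively cached). A broader workload lowers TTFX for more
# signatures, but everything cached here must be loaded and verified by
# `using PackageX`, which is exactly the TTL cost described above.
@setup_workload begin
    @compile_workload begin
        f(1)
    end
end

end # module
```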
It seems like the confusion about “time to first x” is “from what starting point?” and I imagine most nonspecialists have in mind a bash prompt as the starting point.
Spot on, @Mason. They’re not strictly inverse-correlated, because you can improve the mechanisms of code-loading without hurting TTFX (i.e., there are cases where you can escape even the indirect effects). With 1.9 we’re also already at a place where both are lower than they were in either 1.8 or 1.7, so we’re making progress in an absolute sense. But you’re absolutely right that an indirect effect of reducing TTFX is typically a small increase in TTL.
Sure, I agree. But just as we in the Julia community tend to differentiate between the performance of the first call and the second call (we tend to view only the second as a true measure of Julia’s performance), those of us who work on latency cannot afford to be handicapped by imprecise language about a many-faceted problem. I thought we were reasonably clear about terms in that blog post, but if you can suggest language that would make this more obvious, I’m happy to make changes.
I’m using “x time” there to indicate duration rather than “time to x” because “time to x” more readily raises the “from?” question.
“invocation” is intended to sound more like it refers to a specific function call rather than the whole “start julia and do x” job, though it’s admittedly a bit clunky.
but maybe that’s too brief? This measure also doesn’t include Julia startup time, but I would bet that many users will think naturally of “time from first REPL prompt” (I would).
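For what it’s worth, here’s roughly how I’d separate the two measurements in a fresh session (`PackageX` is a stand-in; on recent versions `@time` also reports what fraction of the time was compilation):

```julia
@time using PackageX   # TTL: how long loading the package takes
@time PackageX.f()     # TTFX: the first call pays any remaining compilation
@time PackageX.f()     # second call: Julia's steady-state performance
```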
I know you’ve gone over this a million times, and I’m being obtuse, but this conversation made me realize I don’t fully understand.
Consider the following steps:
1. `Pkg.add("PackageX")`
2. `using PackageX`
3. `PackageX.f() # f included in precompile routine`
4. Restart the REPL
5. `using PackageX`
6. `PackageX.f()`
I had assumed native code was cached not only between calls but between sessions, so v1.9 would make step 5 above faster. But I’m wrong about that, right? Steps 2 and 5 should take the same amount of time? 1.9 improves the time it takes to run step 3 “only” (I am not trying to downplay the giant win there, just checking my understanding).
If I was right that 1.9 will probably make step 1 slower but trade that off with a faster step 2 and an even faster step 3, then TTL is a little confusing to me, because I could consider importing packages each time I start working as “load time”, and the very first time I set things up for a project as “install time”.
In general, precompiling and caching more method instances should make loading slower. If 1.9 is not strictly slower to load (given that more is being cached), it’s because there was also work on making loading itself faster.
My simplified understanding is as follows:
Let’s say that to run some code X you need to compile 1000 little method instances. Let’s assume they are all of the same “complexity” to keep it simple, so they all take the same amount of time to compile. Then you can just look at what happens for one little method instance.
The precompile duration difference between 1.9 and previous versions would depend on:
- time to precompile the method without native caching (less)
- time to precompile it with native caching (more)
But this happens once, at install time. What the user sees during regular usage is:
- time to load it without cached native code (less)
- time to load it with cached native code (more)
- time to compile it without a native cache (more)
- time to compile it with a native cache (much less, ideally nothing)
- time to execute it (should be the same, assuming no other optimizations between versions)
So the latency improvement depends on the ratio between these less/more. As long as the compilation ratio dominates the load ratio you’ll see a net improvement.
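To make those ratios concrete, here’s a toy calculation for a single method instance; every number below is invented purely for illustration:

```julia
# Hypothetical per-method-instance timings, in milliseconds.
load_old    = 1.0    # load without cached native code (less)
load_new    = 1.5    # load with cached native code (more)
compile_old = 50.0   # first-call compilation without a native cache (more)
compile_new = 1.0    # first call with cached native code (much less)

latency_old = load_old + compile_old   # 51.0 ms
latency_new = load_new + compile_new   # 2.5 ms
# Net win: the ~50x compilation saving dwarfs the 1.5x load-time cost.
```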
Precompilation and the subsequent native code caching are only done after step 1 in this case.
If this package were included in the environment via Pkg.develop, then precompilation might also happen after step 2, if the package had been modified; Julia would tell you that it is precompiling.
Executing PackageX.f() only results in the compiled function being cached to RAM for that REPL session, unless it was already precompiled. If f() is not run during module load or init for precompilation, PackageX.f() gets compiled twice, at steps 3 and 6.
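As an illustration of that last point, the old-school way to get f into the precompile cache is simply to exercise it at the top level of the module, since top-level code runs while the package is being precompiled. A minimal sketch with a hypothetical PackageX:

```julia
module PackageX

f() = sum(rand(10))

# This call executes during precompilation, so the compiled code for f()
# lands in the cache file (including native code on 1.9+) instead of
# being recompiled at steps 3 and 6.
f()

end # module
```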
To optimize for total time (TTL + TTFX), you want to precompile exactly what you will use and no more. Loading cached native code from disk is usually faster than compiling the code. The idea behind precompiling is that you increase TTL slightly to decrease TTFX a lot. However, if you load compiled code that you never use, then you have increased TTL without decreasing TTFX.
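A minimal sketch of “exactly what you will use” via explicit precompile directives (PackageX, f, and g are hypothetical):

```julia
module PackageX

f(x::Int) = 2x + 1
g(x::Float64) = x^2  # suppose no user ever calls this

# Cache only the signature users actually hit. A directive for g would
# grow what `using PackageX` has to load (TTL) without reducing any
# real-world TTFX.
precompile(f, (Int,))

end # module
```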
What is the difference here from import numpy or import matplotlib in Python? Don’t those have to load any code, or is the amount of code they load much smaller anyway?
Does the precompiled code have to be completely loaded into memory? All of it? Are all precompiled methods loaded into memory? That sounds strange (and not scalable).
I think a big part of what is slow is validating the loaded code. That is what creates the difference between loading code via pkgimages and via PackageCompiler: the latter doesn’t need to validate, so it’s very fast. I think that’s more comparable to what Python packages that load C libraries and the like are doing.
Since Python doesn’t compile, there’s nothing to validate. Likewise, any static language just throws an error if there is an unexpected duplication in compiled code. Julia, being far more dynamic and flexible, has a much more complicated problem to solve. That’s why package loading is slow.