Help building MWE for 1.8 latency regression

Motivated by the SnoopPrecompile announcement, I checked out my package’s TTFX on 1.8rc3. Using a sample script like:

# test_startup.jl
@time using CMBLensing
@time LenseFlow(FlatMap(randn(10,10))/1e6) * FlatMap(randn(10,10))

I get about 2X slower on 1.8 :cry:

$ julia +1.7.3 test_startup.jl
  3.949029 seconds (13.52 M allocations: 844.374 MiB, 7.40% gc time, 47.93% compilation time)
  5.599200 seconds (20.76 M allocations: 1.123 GiB, 4.16% gc time, 98.32% compilation time)
$ julia +1.8.0-rc3 test_startup.jl
  4.963559 seconds (13.48 M allocations: 855.347 MiB, 6.77% gc time, 53.88% compilation time: 38% of which was recompilation)
 11.640764 seconds (44.94 M allocations: 2.787 GiB, 4.42% gc time, 99.50% compilation time)

I’m wondering what I could do to distill this into some more MWE that could be useful for devs? Or perhaps this level of possible regression is a known issue that’s already being worked on? (or even fixed on nightly, although the package errors there so I can’t directly check)

It would be worthwhile to try nightly to see if the regression still exists there.

Unfortunately

specifically its dependencies error so I can’t try a quick fix.

Slightly larger loading time for packages is not unusual on v1.8, as this version is able to cache more code during precompilation (which means more code to read back from the cache when loading the package), but the about 20% of total loading time spent on recompilation doesn’t sound good (note that v1.7 doesn’t track recompilation time at all, so you can’t compare that figure with v1.7): it means something is invalidating previously defined methods, often one of your dependencies. This may or may not be related to your issue, but avoiding recompilation is something good anyway. @time_imports is a useful tool for this.

The increased loading time in v1.8 is usually related to a slightly shorter TTFX at runtime, but in your case that one also got worse. What’s significant is the increase of memory allocations. Do you have very type-unstable code? That’s often what can change performance more wildly between different releases, and that can affect memory allocations a lot.

Thanks. For the first thing, I got this out @time_imports (which I just learned about now, looks great), sorted and filtered for >100ms

    103.0 ms  FFTW 3.17% compilation time
    124.7 ms  FileIO 4.99% compilation time (100% recompilation)
    146.6 ms  FillArrays
    176.1 ms  Distributions
    181.9 ms  Tullio 16.42% compilation time
    232.2 ms  IntelOpenMP_jll 99.76% compilation time (100% recompilation)
    250.3 ms  JLD2
    451.9 ms  StaticArrays
    500.9 ms  CMBLensing 19.92% compilation time
    946.3 ms  Measurements 93.80% compilation time (65% recompilation)
   1622.5 ms  Zygote 95.24% compilation time (20% recompilation)

I have also done an invalidation check with SnoopCompile and although I’m a novice there, it kind of seemed like most things I saw were out of my hands, they were my dependencies unrelated to my types / functions. Not sure if that’s consistent with the above. Anyway, yea, 1 extra second loading the package is not my main concern, the command is.

As for the code, that should be fairly type stable aside from the top-level couple of functions calls. I can check a litle deeper though, maybe something somewhere slipped due to changes in inference?

Instead of sorting by loading time, you should leave the loading order: what triggers recompilation in a package is typically one of the packages which come before and invalidates methods needed by the following packages. For example, JLL packages should all have 0% recompilation (they’re very stable and we take they’re loading time extremely seriously, given the impact they have on the entire ecosystem), if you see something different, it’s likely something else’s fault. I can add that recompilation in Measurements is at least partially triggered by Requires, but on my laptop it has a less significant impact (note that this is v1.9, which shows the timings slightly differently, without nesting):

julia> @time_imports using Measurements
      2.9 ms  Calculus
      0.4 ms  Requires
     37.2 ms  RecipesBase
     31.0 ms  Measurements 41.44% compilation time (10% recompilation)

Again, this may very likely be unrelated to your core problem, but knowing where the this other problem lies can be helpful.

Yes, that’s totally possible, but as I said, that more usually happens with type unstable code: the inference is often tuned slightly in each version, resulting in it giving up earlier or later in some cases, but type-stable code shouldn’t be usually affected too much by these changes. Cthulhu.jl can usually be helpful to go deep into these details.

Still helpful, thanks. So, does all that mean that I could improve my package load time (in a fresh session) by simply changing the order of usings in my package? Like if I moved Measurements into a better place, I could go from almost a second for me to closer to your 30ms?

I don’t quite follow. What I was saying was that on 1.7 that code is well tested and quite type stable, but maybe on 1.8 something in inference changed and now thing are no longer inferred and hence the code is no longer type stable (maybe we’re using two slightly different meaning for type stable, mine is syonymous with “did inference succeed in concretely typing everything”).

And to confirm, regardless if inference failed on 1.8, the runtime is negligible, so is the theory that maybe inference failing still causes longer latency just because type unstable code in general takes longer to infer / compile?

Oh, no no, I’m saying that to find the culprit of the invalidations you should look at the order of loading. What I wanted to show is that Measurements alone, at least on my machine, has 10% recompilation because of Requires (but I’d encourage you to do the test on your machine), if you have 40% it likely means that there is yet another nasty package invalidating even more methods. We can’t do much about Requires at the moment, its entire job is to do something very precompilation-unfriendly, but hopefully in the not-so-distant feature we’ll have better mechanisms for dealing with optional dependencies.

Yes, it’s possible that inference failed, but usually rocky stable functions don’t have this problem, more borderline functions can be inferred perfectly on one version and become uninferred in another. I guess @code_warntype or Cthulhu.jl can shed some light on this.

Yes, and actually judging by the amount of compile time the runtime should be better in v1.8 (0.06 seconds vs 0.09), but maybe the difference isn’t significant.

Giving another update here, if I could entice any other help, since its a pretty depressing result.

Using Cthulhu, I confirmed stuff is inferred identically on 1.7.3 and 1.8.0-rc4, so its not a change in inference. The entire code whose latency is being timed is not perfectly inferred, but the run-time critical stuff is. This is basically intentional since I don’t want to prematurely optimize or complicate the code if it doesn’t matter for runtime. But maybe this is the cause of the really bad latency regression, and 1.8 for some reason is way worse at dealing with not-perfectly-inferred stuff?

I suppose I can try to make everything perfectly inferred and see if it removes the difference between versions? Any other suggestions for where to dig?

Here is an updated test case, reproducible here, ~2X slower package load, and almost 3X slower “time to first thing 1”:

image

3 Likes

Fwiw here’s a copy-pastable MWE for the package loading 2X regression on 1.8. Is this a file-able issue somewhere?

] activate --temp

] add CompositeStructs, DataStructures, DelimitedFiles, Distributions, EllipsisNotation, FileIO, IterTools, JLD2, KahanSummation, Loess, MacroTools, Match, Measurements, Memoization, NamedTupleTools, PDMats, Preferences, Printf, ProgressMeter, QuadGK, Roots, Requires, Setfield, SparseArrays, StaticArrays, StatsBase, TimerOutputs, Tullio, UnPack, Zygote

@time using CompositeStructs, DataStructures, DelimitedFiles, Distributions, EllipsisNotation, 
        FileIO, IterTools, JLD2, KahanSummation, Loess, MacroTools, Match, Measurements, Memoization, 
        NamedTupleTools, PDMats, Preferences, Printf, ProgressMeter, QuadGK, Roots, Requires, 
        Setfield, SparseArrays, StaticArrays, StatsBase, TimerOutputs, Tullio, UnPack, Zygote
1 Like

There are 5 open issues with the labels latency and regression: https://github.com/JuliaLang/julia/issues?q=is%3Aopen+label%3Aregression+label%3Alatency

But I think your MWE might be worth a new issue because it looks different…

Figured I’d check back in here with some time passed. I know some people have been working on various packages and Julia itself in response to some of this and related issues. I’m very happy to say as of latest package versions today and 1.8.2, my latency benchmarks now look like what you’d expect. Package load time increases by a little bit in 1.8 compared to 1.7, but TTFX decreased (rather than doubling like before). With this now working, excited to be able to try some of the 1.8 precompilation stuff.

Many thanks to the people involved! (can’t say I know the specific thing that did it, or if there even was just one)

image

3 Likes