Extremely long compilation time (> 2000 sec on Win32)

Hej!

I’m observing extremely long compilation times of more than 2000 sec on Appveyor (Win32) for the delay differential equation solver package DelayDiffEq.

A week ago, with version 5.7.0 the first test took around 160 sec on Win32 (the test failures are due to upstream issues; the timings are roughly the same as a week before). However, on version 5.8.0 we’re suddenly up to more than 2000 sec.

In version 5.8.0, we introduced a major change under the hood, that consists mainly of an easier and more consistent handling of solutions and a simpler setup of the solvers. Intuitively the simpler setup should also reduce the amount/layers of nested types. In our benchmark we could not observe any performance regressions, and also the times and allocations of all subsequent tests (apart from the first ones) are still similar to version 5.7.0 (and earlier).

@andreasnoack noticed that we’re accessing closed over variables in some closures, which apparently is undefined behaviour, so we fixed that in version 5.9.0 yesterday night. However, the compilation time issues persist.

I also noticed that we have some type inference problems in the initialization of our integrators. Reducing the number of non-inferred parameters improves the performance quite drastically (with respect to times and allocations) in our tests, but still the initial test takes more than 2000 sec to run.

Interestingly, the problems are a lot less severe on Win64 and Linux 64bit (Travis), but still compilation takes too long IMO, especially compared with the time it takes to run all subsequent tests.

Has anyone observed anything like this before? Does anyone have an idea what could cause these extremely long compilation times? And how to profile it?

2 Likes

Maybe related, I see larger(in the order of GBs vs a half of it around 2 weeks ago) memory usage spikes in OrdinaryDiffEq while compiling it.

I think that’s a good guess. Tests in DelayDiffEq are usually more sensitive than the ones in OrdinaryDiffEq and DiffEqBase, and usually problems in those upstream packages are exponentiated since we nest different ODE integrators.

I just looked through the recent Appveyor logs of OrdinaryDiffEq (only Win64, Win32 is not tested). It seems, exactly when PR 791 was merged, the times and allocations of the first test increased from around 180 sec and 5.4 GiB (see, e.g., the last build in another PR and the last build of the master branch before this PR was merged) to around 330 sec and 8.1 GiB (see, e.g., the build when the PR was merged and the build of the most recent commit). Again the times and allocations of the other tests seem to be not affected.

This PR added SparseDiffTools (and hence implicitly also Zygote) as dependencies. Due to the method warnings that this integration provoked (fixed shortly thereafter) my first guess was actually that the problem was caused by this PR. I tried to check this assumption by reverting OrdinaryDiffEq to an earlier commit and running some benchmarks - after some time I figured out that there was a change in DiffEqDevTools that changed the compilation time of the benchmarks significantly and didn’t investigate this further. I think I should just rerun the DelayDiffEq PR with an upper bound on OrdinaryDiffEq.

I’m just trying to understand why adding Zygote might have such a severe impact on the compilation time, both in OrdinaryDiffEq and, naturally to a much larger extent, in DelayDiffEq? Does it point to some problem in OrdinaryDiffEq (or even DiffEqBase) such as the undefined behaviour of the closures mentioned above? Or is it to be expected (in which case the dependency should maybe made optional)? I noticed that there are more of those closure issues in upstream packages such as DiffEqBase.

1 Like

It’s not undefined behavior. More undocumented I’d say. The main problem I can think of with it is that renaming a local variable can break code far away.

What version of julia is this?

Here are some things to try:

  • Enable ENABLE_TIMINGS in src/options.h
  • Try this patch:
diff --git a/base/compiler/params.jl b/base/compiler/params.jl
index 46c0862..2908b21 100644
--- a/base/compiler/params.jl
+++ b/base/compiler/params.jl
@@ -64,7 +64,7 @@ struct Params
                    #=inlining, ipo_constant_propagation, aggressive_constant_propagation, inline_cost_threshold, inline
                    inlining, true, false, 100, 1000,
                    #=inline_tupleret_bonus, max_methods, union_splitting, apply_union_enum=#
-                   400, 4, 4, 8,
+                   400, 1, 4, 8,
                    #=tupletype_depth, tuple_splat=#
                    3, 32)
     end
1 Like

Thanks for your suggestions!

So it’s fine to do that (apart from the possible issue you mentioned)? I thought due to the possible performance regressions it’s advisable to add type annotations or to use the “let block fix”?

The tests run on Julia 1.1.1 and Julia 1.3.0-DEV.557 on AppVeyor (32bit and 64bit) and on Julia 1.0.4 and Julia 1.3.0-DEV.557 on Travis (Linux 64bit). We observe compilation time regressions for all Julia versions and all systems, but they are most prominent on Win32 and there seems to be some improvement with the nightlies (e.g., in one run times on Win32 were around 2000 sec for the first test on Julia 1.1.1 and around 1600 sec on the nightly version).

I tried two things in the last hours with a fork of the DelayDiffEq PR in which we observed these regressions for the first time: first I fixed OrdinaryDiffEq to the commit right before the SparseDiffTools PR was merged, and then I made the dependency of SparseDiffTools on Cassette (and SpecialFunctions) optional and checked if removing this dependency fixes the issue.

When I upper bounded OrdinaryDiffEq and fully removed SparseDiffTools the times went back to around 280 sec (Win32 bit, Julia 1.1.1) and around 200 sec (Win32bit, Julia nightlies), so basically the same values that we observed in DelayDiffEq prior to this PR. Similarly, on Linux 64bit we were back at around 250 sec and 160 sec on Julia 1.0.4 and Julia nightly, respectively. That indicates that the issues are not (only) due to the PR in DelayDiffEq but rather due to upstream issues.

Unfortunately, only removing the dependency on Cassette does not resolve the problem. Using my modified fork, the times of the first test go up again to around 2300 sec (Win32, Julia 1.1.1), 390 sec (Linux 64bit, Julia 1.0.4), and 300 sec (Linux 64bit, Julia nightly) (the test on Win32 with Julia nightly is still compiling at the time of writing this post).

Since I’ve never compiled julia, I performed these two tests first, but I’ll try your suggestions as well. Not sure though if I can find the time for some thorough investigations in the next days, I still have to prepare my talk for JuliaCon :smile:

1 Like

Yes, there are other potential issues with closures, but “undefined behavior” is pretty much the dirtiest word in the programming languages world :slight_smile: That would be more akin to a segfault in C.

Often in these cases there is one “bad signature” that kicks off an explosion in type inference. That can cause seemingly small or unrelated changes to change the compilation time by a lot, so a black-box approach tends to be frustrating.

(me too…) Great, see you soon then! :slight_smile:

2 Likes

Just as a note (I haven’t tried the suggestions yet): a similar PR that would add SparseDiffTools functionality to StochasticDiffEq seems to show similar compilation time regressions.

So, for anyone who wants to follow along our debugging process, here are some news.

By adding @time to the testsets in my SparseDiffTools fork (with Cassette dependency removed), I noticed that the compilation time on Linux 64bit is around 2-3 minutes, i.e., around the increase in compilation time in OrdinaryDiffEq, and that the compilation on Win32 fails due to Int32/Int64 issues (after 34 minutes!) and the test on Win64 throws a EXCEPTION_ACCESS_VIOLATION.

After fixing the Int32/Int64 issues, the tests on Win32 and Win64 throw both a EXCEPTION_ACCESS_VIOLATION. Hereby the time of the first test on Win32 (Julia 1.1.1) is around 1700 sec whereas on Win64 (Julia 1.1.1) the first test takes around 170 sec.

Try it without Zygote in SparseDiffTools

It seems Zygote is the main culprit. After making the dependency on Zygote optional, the first tests of SparseDiffTools on Win32 (Julia 1.1.1) finish in about 117 sec instead of 1750 sec but when Zygote is loaded for another test set times there now increase to more than 1500 sec. As expected, the last test set still throws an EXCEPTION_ACCESS_VIOLATION for Julia 1.1.1 on Windows (both 32 and 64 bit).

I also ran tests of the master branch of DelayDiffEq against the fork of SparseDiffTools that does not contain a dependency on Zygote anymore. The times of the first tests are reduced on all tested platforms and Julia versions. However, even though, e.g., on Win32 the times are back to around 280 sec with around 7GiB allocations for the first test, with version 5.7.0 a week ago it only took around 160 sec and 6 GiB allocations. Maybe the remaining compilation time of SparseDiffTools could be responsible for this difference?

1 Like

What qualifies as a “bad signature”? Is there a way to debug it?

1 Like

Unfortunately there’s no way to tell just by looking at it. It will be caused by an abstract type, but the problem is non-local. For example one call site matches two methods, each of which leads to a call site that matches two methods, and so on. Usually the process terminates quickly, but you can get unlucky.

So this is fixed on master as far as we know?

Yes, the exception is gone and the tests pass with Julia 1.2 and the nightly version (but the compilation time issue is still there).

Is this something that is impossible to avoid, or will future versions of the compiler behave differently?

It’s very frustrating trying to design code which may or may not trigger a huge performance regression in an unpredictable way. So if it’s permanent perhaps there could be a compiler tag that recognises it (say, by users specifying a limit to the number of “call site” levels) and throws an error?