Delay Differential Equation networks with a different delay per edge

Here I reported that, for some reason, when using Julia 1.8.0-beta3 with DifferentialEquations.jl loaded, the reproducer of the Base inference issue gives different results (and seems to work well in that case). However, under those conditions the poor performance of the DDE with heterogeneous delays posted here remains the same. So maybe the suggested reproducer is not a good one for that Base inference issue, but it is also possible that the DDE performance problem has another cause.

@ChrisRackauckas, could you run the code that led you to the conclusion that the cause of the DDE issue is the Base inference issue again but this time on Julia 1.8.0-beta3 while loading DifferentialEquations.jl, please?

Dear @ChrisRackauckas, the delayed Kuramoto code still takes a lot more time with the suggested code than with the Python alternative JiTCDDE. Do you know if there is any fix coming soon, by any chance?

Can you try the v1.8-beta3 and nightly to see if the allocations from the interpolations are gone?

https://github.com/SciML/OrdinaryDiffEq.jl/issues/1502

Try the example from “Significant allocations with Callbacks (Tsit5)”. It’s the same problem as with callbacks: that interpolation inference bug makes every use of the interpolations allocate. In theory it was fixed by

https://github.com/JuliaLang/julia/pull/43990

It would be good to get a confirmation of that, since it should be in v1.8. If that’s fixed, then the rest should be much simpler.


Thanks for your reply!

TL;DR: Both for the Callbacks example and for the delayed Kuramoto, the allocations are significantly reduced but the computation times remain similar. For the delayed Kuramoto case, unfortunately, the Python JiTCDDE implementation is still at least 3 orders of magnitude faster.

You can find the details below:

Example from Significant allocations with Callbacks (Tsit5)

Using

[1dea7af3] OrdinaryDiffEq v6.10.0
[6e4b80f9] BenchmarkTools v1.3.1

Running on Julia versions 1.8.0-beta3 and 1.9.0-DEV.439 (current nightly)

@btime solve(prob,Tsit5()) # no callback

and

@btime solve(prob,Tsit5(), callback=cellFate) # with callback

I get

| case          | Julia version | time       | allocations |
|---------------|---------------|------------|-------------|
| no callback   | 1.8.0-beta3   | 661.347 μs | 997.27 KiB  |
| no callback   | 1.9.0-DEV.439 | 673.936 μs | 997.27 KiB  |
| with callback | 1.8.0-beta3   | 2.288 ms   | 1021.34 KiB |
| with callback | 1.9.0-DEV.439 | 2.292 ms   | 1021.34 KiB |

So there is no difference in performance or allocations between these Julia versions. Compared to the values reported in the original post (before the fix), there is a big improvement in allocations, but the performance ratio (time with callback / time without callback) remains similar.

Delayed Kuramoto

Using

[bcd4f6db] DelayDiffEq v5.35.1
[31c24e10] Distributions v0.25.58
[1dea7af3] OrdinaryDiffEq v6.10.0
[6e4b80f9] BenchmarkTools v1.3.1

the following code (wrapping up all your suggestions from the previous replies)

using DelayDiffEq
using Distributions
using Random
using BenchmarkTools

function f!(dθ, θ, h::H, p, t) where H  # ::H forces specialization on the history function
    ω, A = p
    n = length(θ)
    lags = reshape(lag, n, n)  # `lag` is a global constant defined below
    @inbounds for j in 1:n
        coupling = 0.0
        for i in 1:n
            coupling += A[i, j] * sin(h(p, t - lags[i, j]; idxs = i) - θ[j])
        end
        dθ[j] = ω[j] + coupling
    end
    nothing
end

n = 10
Random.seed!(1)
ω = rand(n)
A = rand(n,n)
const lag = rand(n*n)
θ₀ = rand(Uniform(0, 2π), n)
p = (ω, A)
const past = rand(Uniform(0, 2π), n)
h(p, t; idxs=nothing) = idxs isa Number ? past[idxs] : past  # constant history

prob = DDEProblem(f!, θ₀, h, (0.0, 1.0), p, constant_lags=lag)
@btime solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0.0, abstol=1e-5)

depending on the number of oscillators n, gives the following results:

@btime solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0.0, abstol=1e-5)
| n  | Julia version | time      | allocations |
|----|---------------|-----------|-------------|
| 10 | 1.8.0-beta3   | 10.754 s  | 213.43 MiB  |
| 10 | 1.9.0-DEV.439 | 10.063 s  | 213.43 MiB  |
| 20 | 1.8.0-beta3   | 112.689 s | 1.74 GiB    |
| 20 | 1.9.0-DEV.439 | 119.260 s | 1.49 GiB    |

For the case of n = 10, those are half the allocations you reported above, but with little difference in time.

On the other hand, when using n = 20 oscillators, the results from both Julia versions (which in this case differ slightly) are similar to those I posted above (before the fix):

@time solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0, abstol=1e-5)
119.585544 seconds (11.80 M allocations: 1.706 GiB, 3.26% gc time)

When solving with n = 20, the warning `Interrupted. Larger maxiters is needed.` was raised. I realized that the solver I was using (BS3, matching the JiTCDDE implementation) is a non-stiff solver, and a network of many Kuramoto oscillators with very different frequencies is probably a stiff system. Using the recommended TRBDF2 solver for stiff problems, these are the results:

@btime solve(prob, MethodOfSteps(TRBDF2()), saveat=0.01, reltol=0.0, abstol=1e-5)
| n  | Julia version | min time   | allocations |
|----|---------------|------------|-------------|
| 10 | 1.8.0-beta3   | 993.642 ms | 52.72 MiB   |
| 10 | 1.9.0-DEV.439 | 1.104 s    | 52.72 MiB   |
| 20 | 1.8.0-beta3   | 193.267 s  | 1.62 GiB    |
| 20 | 1.9.0-DEV.439 | 1566.749 s | 1.62 GiB    |

So you can see that TRBDF2 greatly improved the time and allocations for n = 10, but for n = 20 the performance got much worse (with an even greater difference on the nightly version). The maxiters warning was still raised for n = 20 with TRBDF2, so I’m still looking for its cause. If you have any insight, I’ll be very grateful.

Unfortunately, the Python implementation with JiTCDDE still seems more performant than the Julia implementation. For n = 10 we don’t get the maxiters warning, and the best I could get with TRBDF2 was ~1 s, while the Python JiTCDDE (even with its default Bogacki–Shampine pair) takes ~0.005 s for n = 10 and ~0.007 s for n = 20.

By the way, what is the right way to include the Jacobian for a DDE? Is it possible to compute the symbolic Jacobian automatically? I was trying with modelingtoolkitize without success. This won’t solve the problem causing the maxiters warning, but at least I wanted to see how much I could improve the performance for the n = 10 case.
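In case it is useful, this is the kind of analytic Jacobian I had in mind. It is only a sketch under two assumptions of mine: (a) within a step the delayed values `h(p, t - lag; idxs = i)` can be treated as fixed history, so only the diagonal terms survive, and (b) `DDEFunction` accepts a `jac` keyword with signature `(J, u, h, p, t)` like `ODEFunction` does, which I have not verified:

```julia
# Sketch of an analytic Jacobian for the delayed Kuramoto right-hand side.
# Assumption: the delayed values h(p, t - lag; idxs = i) are treated as
# history, i.e. constants w.r.t. the current state θ, so only the -θ[j]
# inside sin contributes and the Jacobian is diagonal.
function kuramoto_jac!(J, θ, h, p, t, lags)
    ω, A = p
    n = length(θ)
    fill!(J, 0.0)
    @inbounds for j in 1:n
        for i in 1:n
            # d/dθ[j] of A[i,j] * sin(h_i - θ[j]) is -A[i,j] * cos(h_i - θ[j])
            J[j, j] -= A[i, j] * cos(h(p, t - lags[i, j]; idxs = i) - θ[j])
        end
    end
    nothing
end
```

If assumption (b) holds, I imagine wiring it in roughly as `jac = (J, u, h, p, t) -> kuramoto_jac!(J, u, h, p, t, reshape(lag, n, n))` inside a `DDEFunction` wrapping `f!`, but I would be glad to know the actual signature DelayDiffEq expects.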

Can you share some flamegraphs made with StatProfilerHTML.jl?

For some reason I’m getting errors when trying to use StatProfilerHTML.jl:

ERROR: KeyError: key 0x0000000000000003 not found
Stacktrace:
[1] getindex
@ ./dict.jl:498 [inlined]
[2] StatProfilerHTML.Reports.Report(data::Vector{UInt64}, litrace::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}, from_c::Bool)
@ StatProfilerHTML.Reports ~/.julia/packages/StatProfilerHTML/FRimo/src/Reports.jl:106
[3] statprofilehtml(data::Vector{UInt64}, litrace::Dict{UInt64, Vector{Base.StackTraces.StackFrame}}; from_c::Bool, path::String)
@ StatProfilerHTML ~/.julia/packages/StatProfilerHTML/FRimo/src/StatProfilerHTML.jl:23
[4] statprofilehtml (repeats 3 times)
@ ~/.julia/packages/StatProfilerHTML/FRimo/src/StatProfilerHTML.jl:17 [inlined]
[5] top-level scope
@ ~/Documents/checking_Kuramoto_fix/checking_if_fixed.jl:94

Maybe it’s some compatibility issue with Julia 1.8 and the nightly. Nevertheless, I created some flamegraphs with ProfileSVG.jl, which you can find here. If you download those SVG files and open them in Chrome or another browser, you get some interactivity.

They are all quite similar. Most of the time appears to be spent in `task_done_hook` from task.jl, but that seems to be a known profiling bug when using more than one thread. It was supposed to be fixed here, but maybe it wasn’t. I’ll repeat the profiling later with a single thread.


Here is the profiling done right, with a single thread.


So yeah, if the number of steps is similar (is it?), then it all seems to be interpolation cost. Are you sure the number of steps is the same? To me it sounds like you found an error-estimate issue somewhere.

I looked into this some more for the case of n=10 oscillators.

First, note that the Python and Julia examples are not solving the exact same system, since the random number generators are different. For Kuramoto systems (without delay) I have seen very different runtimes depending on the initial conditions, even when the coupling is the same. However, Python is consistently fast across several random initializations, so for the moment I didn’t bother to export the matrices from Julia.

Second, JiTCDDE integrates “blindly” on the interval [0,1] (i.e. with fixed time steps) to avoid discontinuities in the history function. I tried to emulate that behaviour in Julia, but it didn’t significantly affect performance.
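Concretely, what I tried was along these lines (a sketch, not a verified benchmark; `remake` comes from the SciML common interface, and I’m assuming `adaptive = false` with a fixed `dt` is honoured by `MethodOfSteps` as in the common solver options):

```julia
using DelayDiffEq  # `prob` and `lag` as defined in the code above

# Fixed-step "blind" integration over the history interval, emulating
# JiTCDDE: restrict the time span to [0, maximum(lag)] and disable
# step-size adaptivity.
prob_warmup = remake(prob; tspan = (0.0, maximum(lag)))
sol_warmup = solve(prob_warmup, MethodOfSteps(BS3());
                   adaptive = false, dt = 0.01)
```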

JiTCDDE doesn’t have a nice solution object, so at first I did not know how to get the internal time steps, but this code did the trick:

times = []
DDE._initiate()
while DDE.t < 3:
    times.append(DDE.t)
    if DDE.try_single_step(DDE.dt):
        DDE.DDE.accept_step()

This snippet can be appended after the code posted by @Ger.
On the interval [2,3] the JiTCDDE integrator apparently takes just 18 steps:

>>> times
[2.0067283171685166, 2.050840295443998, 2.0949522737194797, 2.1440292173146496, 2.1931061609098195, 2.2421831045049894, 2.2964430256027533, 2.350702946700517, 2.404962867798281, 2.459222788896045, 2.513482709993809, 2.5677426310915727, 2.6293658236742443, 2.690989016256916, 2.7526122088395875, 2.814235401422259, 2.88362964704675, 2.953023892671241]

This seems very low, so I hope I understood the integrator interface right.

On the other hand, the Julia implementation takes 231,977 steps on the interval [0,1]. While the Python step count seemed very low, this seems excessive. Here is a scatter plot of dt.

[figure: scatter plot of dt over the integration interval]

The time steps get smaller and smaller. In fact, I couldn’t integrate on the whole interval [0,2] because of a maxiters error.

On the other hand, if the lags are left undeclared in the problem definition, the system solves in 10 ms with 270 time steps.


What are the delay sizes? I wonder how many declared points are in the discontinuity tree.

100 different delays (1 for each edge), minimum(lag) = 0.0043... and maximum(lag) = 0.9963....

To handle the delays correctly, wouldn’t it be enough to take the minimum lag as the time step?

No, it’s also all the combinations (sums) of the delays too.

OK, if I understand this correctly, the solver tries to step to each discontinuity caused by the 100 different lags, resulting in ~200,000 steps on [0,1]. The long solve time is then just a consequence of the problem’s complexity and not a bug?
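To get a feel for why the step count explodes, here is a back-of-the-envelope sketch of my own (not DelayDiffEq internals) that counts the propagated discontinuity points, i.e. all sums of the lags (with repetition) that land in the interval:

```julia
# Count the propagated discontinuity points in [0, T]: every sum of lags
# (with repetition) that lands in the interval. Assumes all lags exceed
# `tol`, so the search terminates.
function discontinuity_points(lags, T; tol = 1e-12)
    points = Set{Float64}()
    frontier = [0.0]
    while !isempty(frontier)
        t = pop!(frontier)
        for ℓ in lags
            s = t + ℓ
            # keep s if it is inside the interval and not a duplicate
            if s <= T && all(abs(s - q) > tol for q in points)
                push!(points, s)
                push!(frontier, s)
            end
        end
    end
    sort!(collect(points))
end
```

For example, `discontinuity_points([0.3, 0.4], 1.0)` yields the seven points 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0 (up to floating-point error). With 100 lags whose minimum is ~0.0043, the number of such points on [0,1] is enormous, which is in line with the ~200,000 steps observed.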

JiTCDDE has a different “philosophy”, explained in the JiTCDDE module documentation:

Namely, if only the long term behaviour is of interest and the initial conditions are somewhat arbitrary:

  • Integrate until maximum(lag) with a fixed timestep to obtain a sufficiently smooth initial history.
  • Ignore the discontinuities from then on.

How do I get this behaviour with DelayDiffEq.jl? Do I just leave the lags undeclared? Or is there a way to assert that the initial history is smooth?

It would still have a second derivative discontinuity, hence not smooth.

Yes, if you want to resolve the discontinuities exactly. If you’re fine with the induced error, then just don’t declare the lag.
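In code, “just don’t declare the lag” amounts to dropping the `constant_lags` keyword from the problem in the earlier post (a sketch, untested here):

```julia
using DelayDiffEq  # `f!`, `θ₀`, `h`, `p` as defined in the code above

# Same problem without `constant_lags`: the solver no longer steps to every
# propagated discontinuity and instead lets the error estimator absorb them.
prob_nolags = DDEProblem(f!, θ₀, h, (0.0, 1.0), p)
sol = solve(prob_nolags, MethodOfSteps(BS3()),
            saveat = 0.01, reltol = 0.0, abstol = 1e-5)
```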


Thank you for all your answers!

I am fine with errors as long as they get smoothed out after an initial period. From the JiTCDDE manual:

Fortunately, the discontinuities are quickly “smoothed out” (i.e., reduced in order) with time evolution and can then be ignored.

Is this the same behavior I get with undeclared lags, or are there further sources of error?

Yes indeed, that’s just a property of it being a standard (non-neutral) DDE.


It sounds like their “smoothing” idea is to just cut the lag-propagated discontinuities off after some order n lower than the method’s order? @devmotion, do we have an easy way to do that?

I repeated the benchmark with undeclared lags. It solves faster but still scales badly. At n = 35, JiTCDDE is 1000× faster.

[figure: benchmark of solve times vs n with undeclared lags]


Yes, we should probably add the “smoothing”, but the majority of the issue is still the interpolation issue that I referenced, and there will be a slowdown until that is fixed. No amount of benchmarking will help that; someone just needs to sit down and help inference.
