Thanks for your reply!

**TL;DR**: Both for the Callbacks example and for the delayed Kuramoto, the allocations are significantly reduced but the computation times remain similar. **For the delayed Kuramoto case, unfortunately the Python Jitcdde implementation is still at least 3 orders of magnitude more performant.**

You can find the details bellow:

Using

```
[1dea7af3] OrdinaryDiffEq v6.10.0
[6e4b80f9] BenchmarkTools v1.3.1
```

Running on Julia versions 1.8.0-beta3 and1.9.0-DEV.439 (current nightly)

```
@btime solve(prob,Tsit5()) # no callback
```

and

```
@btime solve(prob,Tsit5(), callback=cellFate) # with callback
```

I get

```
| n | Julia version | time | allocations |
|---------------|---------------|------------|-------------|
| no callback | 1.8.0-beta3 | 661.347 μs | 997.27 KiB |
| | 1.9.0-DEV.439 | 673.936 μs | 997.27 KiB |
|---------------|---------------|------------|-------------|
| with callback | 1.8.0-beta3 | 2.288 ms | 1021.34 KiB |
| | 1.9.0-DEV.439 | 2.292 ms | 1021.34 KiB |
```

So there is no difference in performance nor in allocations between these Julia versions. Compared to the values reported in the original post (before the fix), there is a big improvement in terms of allocations but the performance ratio (time with call back / time without callback) remains similar.

###
Delayed Kuramoto

Using

```
[bcd4f6db] DelayDiffEq v5.35.1
[31c24e10] Distributions v0.25.58
[1dea7af3] OrdinaryDiffEq v6.10.0
[6e4b80f9] BenchmarkTools v1.3.1
```

the following code (a wrap of all your suggestions from the previous replies)

```
using DelayDiffEq
using Distributions
using Random
using BenchmarkTools
function f!(dθ, θ, h::H, p, t) where H
ω, A = p
n = length(θ)
lags = reshape(lag, n,n)
@inbounds for j in 1:n
coupling = 0.0
@inbounds for i in 1:n
coupling += A[i,j]*sin(h(p, t-lags[i,j]; idxs=i) - θ[j])
end
dθ[j] = ω[j] + coupling
end
nothing
end
n = 10
Random.seed!(1)
ω = rand(n)
A = rand(n,n)
const lag = rand(n*n)
θ₀ = rand(Uniform(0, 2π), n)
p = (ω, A)
const past = rand(Uniform(0, 2π), n)
h(p, t; idxs=nothing) = typeof(idxs) <: Number ? past[idxs] : past
prob = DDEProblem(f!, θ₀, h, (0.0, 1.0), p, constant_lags=lag)
@btime solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0.0, abstol=1e-5)
```

depending on the number of oscillators `n`

, gives the follwing results:

```
@btime solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0.0, abstol=1e-5)
```

```
| n | Julia version | time | allocations |
|----|---------------|-----------|-------------|
| 10 | 1.8.0-beta3 | 10.754 s | 213.43 MiB |
| | 1.9.0-DEV.439 | 10.063 s | 213.43 MiB |
|----|---------------|-----------|-------------|
| 20 | 1.8.0-beta3 | 112.689 s | 1.74 GiB |
| | 1.9.0-DEV.439 | 119.260 s | 1.49 GiB |
```

For the case of `n = 10`

, those are half the allocations that you reported above but with not much difference in time.

On the other hand, when using `n = 20`

oscillators, the results from both Julia versions (which in this case are slightly different) are similar to those I posted above (before the fix):

```
@time solve(prob, MethodOfSteps(BS3()), saveat=0.01, reltol=0, abstol=1e-5)
119.585544 seconds (11.80 M allocations: 1.706 GiB, 3.26% gc time)
```

When solving with `n = 20`

, the warning of `Interrupted. Larger maxiters is needed.`

was raised. And I realized that the solver that I was using (BS3, matching the Jitcdde implementation) is a non-stiff solver and probably many Kuramotos with very different frequencies is a stiff system. When using the recommended `TRBDF2`

solver for stiff ODEs, these are the results:

```
@btime solve(prob, MethodOfSteps(TRBDF2()), saveat=0.01, reltol=0.0, abstol=1e-5)
```

```
| n | Julia version | min time | allocations |
|----|---------------|------------|-------------|
| 10 | 1.8.0-beta3 | 993.642 ms | 52.72 MiB |
| | 1.9.0-DEV.439 | 1.104 s | 52.72 MiB |
|----|---------------|------------|-------------|
| 20 | 1.8.0-beta3 | 193.267 s | 1.62 GiB |
| | 1.9.0-DEV.439 | 1566.749 s | 1.62 GiB |
```

So you can see that `TRBDF2`

greatly improved the time and allocations in the case of the `n = 10`

but for `n = 20`

the performance got much worse (with a greater difference in the nightly version). The maxiter warning was still raised for `n = 20`

when using `TRBDF2`

, so I’m still looking for the cause of that. If you have any insight, I’ll be very grateful.

Unfortunately, the Python implementation with Jitcdde still seems more performant that the Julia implementation. For `n = 10`

we don’t have the maxiter warning and the best I could get with `TRBDF2`

was ~ 1 s, while with the Python Jitcdde (even with the default Bogacki–Shampine pair solver) it takes ~ 0.005 s for the `n = 10`

case, and ~ 0.007 s for `n = 20`

.

By the way, what is the right way to include the jacobian for a DDE? Is it possible to calculate the symbolic jac automatically? I was trying with `modelingtoolkitize`

without success. This won’t solve the probem causeing the maxiter warining but at least I wanted to see how much I could improve the performance for the `n = 10`

case.