Large CPU time overhead RK4 ODE solver for parabolic problem

I am solving a simple periodic vector parabolic equation using the RK4() ODE solver from DifferentialEquations. The code below is self-contained.

For the parameters specified, the ODE solver on my computer takes 0.23 s (average time after compilation), during which it calls the function dydx! a total of 24261 times.

Separate benchmarking of dydx! indicates it takes 1.4 microseconds to execute. The function seems pretty well optimized (0 allocations…).

The overhead of the RK4() solver appears large. Specifically, the ratio 0.23 s / 24261 (the effective time per call inside RK4) is roughly 7 times 1.4 microseconds (the time for one call). Given the simplicity of the RK4 scheme (just a few additions of the results from dydx!), this ratio seems large.

Are there ODE parameters that can reduce this overhead? Thanks!
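For reference, the classical RK4 step I have in mind is just four evaluations of the right-hand side plus a weighted sum (a minimal sketch, independent of DifferentialEquations; `rk4_step` is my own throwaway name):

```julia
# Classical fixed-step RK4: four right-hand-side evaluations per step,
# combined with a weighted sum.
function rk4_step(f, u, t, h)
    k1 = f(u, t)
    k2 = f(u + (h / 2) * k1, t + h / 2)
    k3 = f(u + (h / 2) * k2, t + h / 2)
    k4 = f(u + h * k3, t + h)
    return u + (h / 6) * (k1 + 2k2 + 2k3 + k4)
end

# sanity check on u' = u, whose exact solution is exp(t)
u1 = rk4_step((u, t) -> u, 1.0, 0.0, 0.1)   # ≈ exp(0.1)
```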

using DifferentialEquations
using BenchmarkTools   # provides @btime, used below

ny = 1000    # # of points along y
dy = 0.05     # discretization step along y
xspan = 90.0   # size of domain along x
nhar = 10;      # number of periods along y

k = 2pi  # wavenumber
ky = nhar*k/(ny*dy)   # wavenumber along y
kx = sqrt(k^2-ky^2)   # wavenumber along x
phi = atan(ky/kx)
phi*180/pi            # angle of propagation
yin = [exp(-im*k*j*dy*sin(phi)) for j in 1:ny]  # periodic excitation

abstol = 1e-6
reltol = 1e-6

@fastmath @views @inbounds function dydx!(dydxa,y,p,x)
    ny,dy,k,neval = p
    neval[1] += 1   #counter to keep track of function calls
    ym1 = y[1]
    y0 = y[2]
    c2 = -im/(2.0*k*dy^2)
    for i in 2:ny-1
      yp1 = y[i+1]
      dydxa[i] = c2*(ym1-2y0+yp1)
      ym1 = y0
      y0 = yp1
    end
    dydxa[1] = c2*(y[ny] -2y[1] + y[2])
    dydxa[ny] = c2*(y[ny-1] - 2y[end] + y[1])
end


#time dydx routine
neval = [0]
p = ny,dy,k,neval
dydxa = similar(yin)
x = 0.1
@btime dydx!($dydxa,$yin,$p,$x)
# reveals time of 1.4 microsecond per call
tsingle = 1.4e-6

#time ODE solver
abstol = 1e-6
reltol = 1e-6
neval = [0]
p = ny,dy,k,neval
prob1 = ODEProblem(dydx!,yin,xspan,p)
@time sol1 = solve(prob1,RK4(),abstol=abstol,reltol=reltol,save_everystep=false)
neval[1]
#reveals time of 0.234 sec and 24261 calls to dydx
#timing was obtained after compilation
tpercallinODE = 0.234/24261

#Ratio approaches 7
tpercallinODE/tsingle

That sounds about right. You’re using a method with a residual adaptivity controller, so it takes about 6 f calls per step. I am not sure why you chose this method for this application though: residual adaptivity control is somewhat expensive and is more suitable for applications where you’re propagating discontinuities, like when it’s used for the method of steps in a delay differential equation.

Note that you can check this yourself by using destats:

@show sol1.destats

DiffEqBase.DEStats
Number of function 1 evaluations:                  24261
Number of function 2 evaluations:                  0
Number of W matrix evaluations:                    0
Number of linear solves:                           0
Number of Jacobians created:                       0
Number of nonlinear solver iterations:             0
Number of nonlinear solver convergence failures:   0
Number of rootfind condition calls:                0
Number of accepted steps:                          4041
Number of rejected steps:                          2

(24261-3)/4043 = 6, exactly (the +3 f-evaluations come from the 3 FSAL resets: 1 at the start and 2 due to the rejected steps).

So what’s the shadow 7th? Well, in adaptivity there’s a norm calculation on an error function, which is O(n) with the ODE size. It turns out that your ODE is a simple enough O(n) calculation that taking the norm of the state is comparable to the calculation of f:

@btime dydx!($dydxa,$yin,$p,$x) # 2.333 μs (0 allocations: 0 bytes)
@btime sqrt.(sum($abs2,$yin) ./ length($yin)) # 1.900 μs (0 allocations: 0 bytes)

And so the total cost of one step of residual-adaptive RK4 is close to 7 f evaluations for this ODE (if f is more expensive, the ratio goes back down toward 6).

For standard applications of ODEs like this, other methods like Tsit5 will be much more efficient. For example, Tsit5 is 6 evaluations per step (7, but a true 6 because of FSAL) but achieves 5th order and much smaller error coefficients, so it’ll take much larger steps for the same accuracy. And its error estimator is designed to be “free” (only the cost of a norm calculation of course), but it’s not discontinuity-robust like a residual adaptive method.

I think we’re talking different factors of 7… Using my timings:

The ODE solver runs for 0.23 s.
Each “function 1” evaluation (call to dydx!) takes 1.4 microseconds.

0.23 s divided by 1.4 microseconds = 164286, which is much bigger than the 24261 calls the solver makes for 4041 steps.

What am I missing?

There are 24261 total “function 1” calls.

Thanks for the Tsit5 suggestion though.

I see what you mean. I dug in a bit.

0.833414 seconds (16.22 k allocations: 248.988 MiB, 6.28% gc time)
0.902520 seconds (16.22 k allocations: 248.988 MiB, 5.74% gc time)

There was an allocation in the residual adaptivity calculation, so that was easy to remove: https://github.com/SciML/OrdinaryDiffEq.jl/pull/1176 . That brings it to:

0.699608 seconds (51 allocations: 254.281 KiB)
0.645772 seconds (51 allocations: 254.281 KiB)

Much better, but doesn’t close the gap. So I ran a profile:

using Profile
Profile.clear()
@profile for i in 1:10 solve(prob1,RK4(),abstol=abstol,reltol=reltol,save_everystep=false) end
Juno.profiler()

Turns out the norm is expensive: all of that extra time is spent calculating the absolute value of complex numbers as part of the adaptivity norm. I never knew that abs(z) was so expensive… in fact, it’s a bit absurd:

out = similar(yin, Float64)   # preallocated output for the abs benchmark
@btime dydx!($dydxa,$yin,$p,$x) # 1.570 μs (0 allocations: 0 bytes)
@btime sqrt.(sum($abs2,$yin) ./ length($yin)) # 1.018 μs (0 allocations: 0 bytes)
@btime map!($abs,$out,$yin) # 10.399 μs (0 allocations: 0 bytes)

So there you go, that’s the culprit.

@YingboMa can you think of a remedy for this? This is probably worth an issue.


I think a sqrt is needed to calculate abs(z) for a complex number z, which is rather expensive, especially compared to additions and multiplications. So getting rid of that, maybe by just using abs2, should save quite some time.
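Concretely, abs2 avoids the square root entirely and agrees with abs up to squaring (a small Base-only sketch):

```julia
z = 3.0 + 4.0im
a  = abs(z)    # 5.0: needs a square root (computed via hypot)
a2 = abs2(z)   # 25.0: just re^2 + im^2, no sqrt at all
@assert sqrt(a2) == a   # same value here, minus hypot's overflow guard
```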

Where it shows up is in the calculation of relative tolerance, i.e.:

@. error = tmp / (abstol + reltol * max(abs(uprev), abs(u)))

If you don’t sqrt then then it’s scaling the relative tolerance incorrectly. Since it’s just for an error norm approximation though, a relatively quicker but less accurate sqrt approximation would be okay here, since error is just a heuristic with like 3 digits of accuracy for the dt calculation.

The problem is not the sqrt; it’s that abs uses hypot, which has increased stability against overflow compared to sqrt(x^2 + y^2), but is super slow (and will definitely not vectorize). Perhaps fastmath could help.
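The overflow that hypot guards against is easy to demonstrate, and it is the price of the naive formula:

```julia
x, y = 1e200, 1e200
h = hypot(x, y)       # ≈ 1.414e200: finite and correct (hypot rescales internally)
n = sqrt(x^2 + y^2)   # Inf: x^2 overflows Float64 before the sqrt
```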

Yep, abs_fast is much faster for complex arguments, at the expense of the accuracy in some situations

Yes, @fastmath abs brings the runtime down to:

0.186665 seconds (55 allocations: 254.344 KiB)

So we’re close to a 5x speedup on this example. I knew we could better on complex numbers, but man, this has been a productive MWE.


Thank you so much for digging in! Great progress.

  1. How do I activate the updated DifferentialEquations in my environment? I tried Pkg.update but saw no improvement in timing. Also tried ] update DifferentialEquations#master (did not work so well – see below).
  2. Does the change apply to all ODE solvers? (I note that https://github.com/SciML/OrdinaryDiffEq.jl/pull/1176 refers to “RK4 adaptivity”)
  3. I assume I could specify my own internalnorm and play with a few different options?

Resolving package versions…
┌ Warning: julia version requirement for package DifferentialEquations not satisfied
└ @ Pkg.Operations C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.2\Pkg\src\Operations.jl:225
ERROR: Unsatisfiable requirements detected for package ParameterizedFunctions [65888b18]:
ParameterizedFunctions [65888b18] log:
├─possible versions are: [3.2.0, 3.3.0, 3.4.0, 3.5.0, 3.6.0, 3.7.0-3.7.2, 3.8.0, 3.9.0, 3.10.0, 3.11.0, 3.12.0-3.12.1, 4.0.0, 4.1.0-4.1.1, 4.2.0-4.2.1, 5.0.0-5.0.3, 5.1.0, 5.2.0, 5.3.0] or uninstalled
├─restricted to versions 5.0.0-5 by DifferentialEquations [0c46a032], leaving only versions [5.0.0-5.0.3, 5.1.0, 5.2.0, 5.3.0]
│ └─DifferentialEquations [0c46a032] log:
│ ├─possible versions are: 6.14.0 or uninstalled
│ └─DifferentialEquations [0c46a032] is fixed to version 6.14.0
└─restricted by compatibility requirements with ModelingToolkit [961ee093] to versions: [3.2.0, 3.3.0, 3.4.0, 3.5.0, 3.6.0, 3.7.0-3.7.2, 3.8.0, 3.9.0, 3.10.0, 3.11.0, 3.12.0-3.12.1, 4.0.0, 4.1.0-4.1.1, 4.2.0-4.2.1] or uninstalled — no versions left
└─ModelingToolkit [961ee093] log:
├─possible versions are: [0.0.1-0.0.2, 0.1.0, 0.2.0, 0.4.0, 0.5.0, 0.6.0-0.6.1, 0.6.3-0.6.4, 0.7.0-0.7.2, 0.8.0, 0.9.0-0.9.1, 0.10.0, 1.0.0-1.0.3, 1.1.0-1.1.3, 1.2.0-1.2.10, 1.3.0-1.3.3, 1.4.0-1.4.3, 2.0.0, 3.0.0-3.0.2, 3.1.0-3.1.1, 3.2.0, 3.3.0, 3.4.0, 3.5.0, 3.6.0-3.6.4, 3.7.0-3.7.1, 3.8.0, 3.9.0] or uninstalled
└─restricted to versions 0.8.0 by an explicit requirement, leaving only versions 0.8.

Is there a reason you’re restricting ModelingToolkit to v0.8? Or maybe some weird package bound somewhere else?

Don’t worry about adding masters though: I’ll get those both tagged today.

The RK4 adaptivity allocation was a very specific issue that only occurred in the RK4 adaptivity, so that isn’t a broad change. The only real issue that would show up in other solves is this complex abs, so the default on all solvers was changed in DiffEqBase, and that should help Tsit5 as well.

Yes exactly. This is only changing the default internalnorm, but you can specify your own to see how things work out.
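A custom norm can be passed like this (a sketch; the two-argument (u, t) form is my assumption of what the internalnorm keyword expects, so check it against the common solver options docs):

```julia
# RMS norm built on abs2, so there is no per-element hypot/sqrt
mynorm(u::AbstractArray, t) = sqrt(sum(abs2, u) / length(u))
mynorm(u::Number, t) = abs(u)

# usage with the problem from the original post (requires DifferentialEquations):
# solve(prob1, RK4(), abstol=abstol, reltol=reltol, internalnorm=mynorm)
```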


My output for destats appears different (no change to the code):

sol1.destats = DiffEqBase.DEStats
Number of function 1 evaluations: -1
Number of function 2 evaluations: -1
Number of W matrix evaluations: -1
Number of linear solves: -1
Number of Jacobians created: -1
Number of nonlinear solver iterations: -1
Number of nonlinear solver convergence failures: -1
Number of rootfind condition calls: -1
Number of accepted steps: 4040
Number of rejected steps: 1

Julia v1.4.2?

Julia 1.2.0
Time to upgrade I guess.

Yes, some of the latest versions have a lower bound requiring Julia v1.3 for multithreading support.

Upgraded to 1.4.2 and ran Pkg.update for DifferentialEquations; I’m now using v6.14.0. DEStats now works, but the speedup issue of two days ago is still not resolved. Is the fixed code available for download? (My code will run 2 weeks without the fix, hopefully <3 days with!) Thanks!