Comparing the Diffeq.jl Euler stepper with a by-hand Euler stepper

I am trying to understand the behavior of Diffeq.jl in a toy example, with the eventual goal of implementing networks of integrate-and-fire neurons. These questions are probably very basic, but any clarification would be much appreciated! The vector field of interest (parameters are pretty much random) is

function vf!(du, u,p,t)
    du .= (2 .- u) ./ 3.0 .+ 0.2
end

I am comparing the behavior of two Euler stepping functions, one written by hand and one using the Diffeq.jl integrator interface.

function mystep!(du,u)
    vf!(du,u,nothing,nothing)
    u .= u .+ 0.01*du
end

versus

u0 = rand(10000)
prob = ODEProblem(vf!, u0, (0.0,1000.0))
integ = init(prob, Euler(),dt=.01,save_everystep=false)
step!(integ)

When I run @benchmark mystep!(du0,u0) and @benchmark step!(integ), I get mean results 21.107 μs ± 42.136 μs and 27.882 μs ± 2.594 μs respectively. The increased variability in the first case seems to be attributable to one step in every 500 or so taking on the order of ~1 ms. This can’t be chalked up to any function compilation, as far as I understand.

Question 1: why does the by-hand step function occasionally take much longer, and how does the integrator in Diffeq.jl avoid this?

Question 2: what exactly is the additional overhead introduced by the integrator interface such that each step! takes longer on average than each mystep!?

0.01*du is allocating a vector, so garbage collection gets called occasionally. You wanted 0.01.*du

Ah, thanks! That makes sense. However, this seems to result in the difference between step! and mystep! becoming even more dramatic (30 microsecs per step for step! vs 10 microsecs for mystep!). So my remaining question is: what is Diffeq.jl doing with those extra 20 microsecs?

Yeah, this is a fun one. If you want to know, just take a look at the profile:

using Profile
@profile for i in 1:100000 step!(integ) end
using ProfileView
ProfileView.view()

In the profile you will notice that OrdinaryDiffEq.Euler has 5 O(n) operations:

@muladd @.. u = uprev + dt*integrator.fsalfirst
f(fsallast,u,p,t+dt)
any(isnan,u)
recursivecopy!(uprev,integrator.u)
recursivecopy!(fsalfirst,fsallast)

That’s simplified a bit, but you catch the drift. So first of all, this computation is so cheap that any(isnan,u) is the most expensive thing in the list :tada:. You can always turn that off:

integ = init(prob, Euler(),dt=.01,save_everystep=false,unstable_check = (du,u,p,t)->false)

but you can guess why a production ODE solver has defaults for exiting on NaN. But there’s 2 other extra O(n) kernel calls, why? This is explained in a very old video:

Basically what’s going on is that if you do this inverted form, update then calculate f, then you can have the derivative at both sides which defines a smooth interpolation over the solve. Note that the video explains how for most RK methods that are used there is a separate trick that actually makes it free. It’s really just Euler (and a few other low order less used methods), that require this trick, which then results in 2 extra copyto! operations.

So yes, if your f is only some trivial O(n) calculation, you can handwrite an Euler method to be 2n operations per step, while the default here is 4n operations + error checking.

So oh no, why is this generally not a problem? Let’s take a look at the stepping code for the standard “simple” method: Tsit5

@muladd function perform_step!(integrator, cache::Tsit5Cache, repeat_step=false)
  @unpack t,dt,uprev,u,f,p = integrator
  @unpack c1,c2,c3,c4,c5,c6,a21,a31,a32,a41,a42,a43,a51,a52,a53,a54,a61,a62,a63,a64,a65,a71,a72,a73,a74,a75,a76,btilde1,btilde2,btilde3,btilde4,btilde5,btilde6,btilde7 = cache.tab
  @unpack k1,k2,k3,k4,k5,k6,k7,utilde,tmp,atmp = cache
  a = dt*a21
  @.. tmp = uprev+a*k1
  f(k2, tmp, p, t+c1*dt)
  @.. tmp = uprev+dt*(a31*k1+a32*k2)
  f(k3, tmp, p, t+c2*dt)
  @.. tmp = uprev+dt*(a41*k1+a42*k2+a43*k3)
  f(k4, tmp, p, t+c3*dt)
  @.. tmp = uprev+dt*(a51*k1+a52*k2+a53*k3+a54*k4)
  f(k5, tmp, p, t+c4*dt)
  @.. tmp = uprev+dt*(a61*k1+a62*k2+a63*k3+a64*k4+a65*k5)
  f(k6, tmp, p, t+dt)
  @.. u = uprev+dt*(a71*k1+a72*k2+a73*k3+a74*k4+a75*k5+a76*k6)
  f(k7, u, p, t+dt)
  integrator.destats.nf += 6
  if integrator.alg isa CompositeAlgorithm
    g7 = u
    g6 = tmp
    # Hairer II, page 22
    @.. utilde = k7 - k6
    ϱu = integrator.opts.internalnorm(utilde,t)
    @.. utilde = g7 - g6
    ϱd = integrator.opts.internalnorm(utilde,t)
    integrator.eigen_est = ϱu/ϱd
  end
  if integrator.opts.adaptive
    @.. utilde = dt*(btilde1*k1 + btilde2*k2 + btilde3*k3 + btilde4*k4 + btilde5*k5 + btilde6*k6 + btilde7*k7)
    calculate_residuals!(atmp, utilde, uprev, u, integrator.opts.abstol, integrator.opts.reltol,integrator.opts.internalnorm,t)
    integrator.EEst = integrator.opts.internalnorm(atmp,t)
  end
  return nothing
end

In this code you see that a single step of the 5th order RK method, with eigenvalue checks for stiffness detection, takes 13 O(n) operations and 6 f calls (6 instead of 7 due to FSAL). The FSAL gives one more O(n) operation by cutting out 1 f call. Then you have the O(n) operation for error checking / unstable check. That is how the other pieces usually get masked: the more complicated the method, and the complicated the f, the less the other details matter. Also (as mentioned in the video), here the FSAL handling gets rid of an f call so it’s cheaper than not doing it (as opposed to an extra O(n) operation), and the interpolant construction is free by design. Thus if you think about it like this, the worst case scenario is to minimize the number of internal stages and minimize the complexity of f, which is precisely Euler’s method and your kind of example.

But really what’s going on here is that it doesn’t make sense for the simplest of simplest cases to use all of the machinery, which is why for this edge case we created SimpleDiffEq.jl

There is no SimpleEuler(), but if you wanted to add one that would be nice (in fact, there’s actually one built into DiffEqBase itself: DiffEqBase.jl/internal_euler.jl at master · SciML/DiffEqBase.jl · GitHub).

Now to finish off, other interesting things can come into play if n is really small, and that’s its own can of worms which is interesting too…

It’s sometimes fun to see if you can get the fully generic version down to match the most basic implementation when u is a scalar and f is identity (@Oscar_Smith do you know if there’s some way to do a faster isnan?). But yeah, I hope that gives a bit of insight into how the simplest ODEs are somewhat of a corner case and why SimpleDiffEq.jl exists.

https://diffeq.sciml.ai/stable/solvers/ode_solve/#SimpleDiffEq.jl

11 Likes

Don’t believe so. isnan(x) is just x!=x (ie a single vectorizable instruction). You can’t get much faster than that

I think any(isnan,u) isn’t SIMD’ing though. We should look into this a bit more.