Time integration in OrdinaryDiffEq.jl in Julia is slower than calling it from Python

Hi,

I work on software that allows calling numerical solvers across different languages; it is implemented using libffi (the Foreign Function Interface library).
I have recently been working on a performance study to see how using a mediator library affects performance compared with just using a numerical solver from the language in which it is written (i.e., calling, say, a Julia solver from a Julia script).

I got somewhat strange results: calling OrdinaryDiffEq.jl with the integrator DP5() directly from Julia is /slower/ than calling it from Python via this software.

Specifically, I solve the inviscid Burgers’ equation using the method of lines, where the time integration is done via OrdinaryDiffEq.jl. I solve it multiple times to average the runtime and report the mean value with a 95% confidence interval.
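Concretely, the semi-discretization used in the scripts below is a finite-volume scheme with a Lax–Friedrichs-type numerical flux:

$$
\frac{du_i}{dt} = -\frac{\hat{f}_{i+1/2} - \hat{f}_{i-1/2}}{\Delta x},
\qquad
\hat{f}_{i+1/2} = \frac{f_i + f_{i+1}}{2} - \frac{c}{2}\,(u_{i+1} - u_i),
\qquad
f = \frac{u^2}{2},
\quad
c = \max_i |u_i|,
$$

with periodic boundary conditions.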

Here is the solution directly in Julia:

using OrdinaryDiffEq
using Plots
using Printf
using Statistics

function compute_rhs(udot, u, p, t)
    dx, = p

    f = 0.5 * u.^2
    c = maximum(abs.(u))  # Local sound speed.
    # Lax–Friedrichs-type numerical flux at the interior cell interfaces.
    f_hat = 0.5 * (f[1:end-1] + f[2:end]) .- 0.5 * c * (u[2:end] - u[1:end-1])
    f_plus = f_hat[2:length(f_hat)]
    f_minus = f_hat[1:length(f_hat) - 1]
    
    udot[2:end-1] = -1.0 / dx * (f_plus - f_minus)

    # Periodic boundary: flux through the shared left/right boundary interface.
    local_ss_rb = max(abs(u[1]), abs(u[end]))
    f_rb = 0.5 * (f[1] + f[end]) - 0.5 * local_ss_rb * (u[1] - u[end])
    f_lb = f_rb

    udot[1] = -1.0 / dx * (f_minus[1] - f_lb)
    udot[end] = -1.0 / dx * (f_rb - f_plus[end])
end

function run()
    N = 4000
    x = collect(range(0, 2, N + 1))
    dx = 2 / N
    u0 = 0.5 .- 0.25 * sin.(pi * x)

    CFL = 0.5
    dt_max = dx * CFL

    t0 = 0.0
    tfinal = 2.0

    odeProblem = ODEProblem(compute_rhs, u0, (t0, tfinal), (dx,))

    odeProblemStep = ODEProblem(compute_rhs, u0, (t0, Inf), (dx,))

    times = collect(range(t0, tfinal, 11))

    local solution_last::Vector{Float64}

    # for i = 1:5
    #     K = 30
    #     @printf "Trial %d: estimating runtime from %d runs\n" i K
    #     elapsed_times = []
    #     for k = 1:K
    #         solver = init(odeProblemStep, DP5(), reltol = 1e-6, abstol = 1e-12)
    #         tic = time_ns()
    #         for t in times[2:end]
    #             step!(solver, t - solver.t, true)
    #         end
    #         toc = time_ns()
    #         push!(elapsed_times, (toc - tic) / 1e9)
    #         solution_last = solver.u
    #     end

    #     runtime_mean = mean(elapsed_times)
    #     runtime_std = std(elapsed_times; corrected=true, mean=runtime_mean)
    #     sem = runtime_std / sqrt(length(elapsed_times))
    #     ci = 2 * sem
    #     @printf "Runtime, sec: %.4f ± %.4f\n" runtime_mean ci
    #     @printf "Solution second point from the left value: %.16f\n" solution_last[2]
    # end

    for i = 1:5
        K = 30
        @printf "Trial %d: estimating runtime from %d runs\n" i K
        elapsed_times = []
        for k = 1:K
            tic = time_ns()
            solution = solve(odeProblem, DP5(), reltol = 1e-6, abstol = 1e-12)
            toc = time_ns()
            push!(elapsed_times, (toc - tic) / 1e9)
            solution_last = solution.u[end]
        end

        runtime_mean = mean(elapsed_times)
        runtime_std = std(elapsed_times; corrected=true, mean=runtime_mean)
        sem = runtime_std / sqrt(length(elapsed_times))
        ci = 2 * sem
        @printf "Runtime, sec: %.4f ± %.4f\n" runtime_mean ci
        @printf "Solution second point from the left value: %.16f\n" solution_last[2]
    end

    p = plot(x, solution_last, linewidth=3)
    display(p)
end

run()

with the following results:

Trial 1: estimating runtime from 30 runs
Runtime, sec: 1.0588 ± 0.0553
Solution second point from the left value: 0.5003054078352265
Trial 2: estimating runtime from 30 runs
Runtime, sec: 1.0851 ± 0.0565
Solution second point from the left value: 0.5003054078352265
Trial 3: estimating runtime from 30 runs
Runtime, sec: 1.1060 ± 0.0591
Solution second point from the left value: 0.5003054078352265
Trial 4: estimating runtime from 30 runs
Runtime, sec: 1.1252 ± 0.0597
Solution second point from the left value: 0.5003054078352265
Trial 5: estimating runtime from 30 runs
Runtime, sec: 1.1070 ± 0.0600
Solution second point from the left value: 0.5003054078352265

And here is the script that runs the same OrdinaryDiffEq.jl integrator DP5 from Python; that is, the ODE system is defined in Python and passed to Julia as a C callback (which wraps the Python callback):

import time
import numpy as np
from oif.interfaces.ivp import IVP
from scipy import integrate

class BurgersEquationProblem:
    r"""
    Problem class for inviscid Burgers' equation:
    $$
            u_t + 0.5 * (u^2)_x = 0, \quad x \in [0, 2], \quad t \in [0, 2]
    $$
    with initial condition :math:`u(x, 0) = 0.5 - 0.25 * sin(pi * x)`
    and periodic boundary conditions.

    Parameters
    ----------
    N : int
        Grid resolution.
    """

    def __init__(self, N=101):
        self.N = N

        self.x, self.dx = np.linspace(0, 2, num=N + 1, retstep=True)
        self.u = 0.5 - 0.25 * np.sin(np.pi * self.x)
        self.invariant = np.sum(np.abs(self.u))

        self.cfl = 0.5
        self.dt_max = self.dx * self.cfl

        self.t0 = 0.0
        self.tfinal = 2.0

        self.udot = np.empty_like(self.u)

    def compute_rhs(self, t, u: np.ndarray, udot: np.ndarray, ___) -> None:
        dx = self.dx

        f = 0.5 * u**2
        local_ss = np.max(np.abs(u))  # Max wave speed, as in the Julia version.
        f_hat = 0.5 * (f[0:-1] + f[1:]) - 0.5 * local_ss * (u[1:] - u[0:-1])
        f_plus = f_hat[1:]
        f_minus = f_hat[0:-1]
        udot[1:-1] = -1.0 / dx * (f_plus - f_minus)

        local_ss_rb = np.maximum(np.abs(u[0]), np.abs(u[-1]))
        f_rb = 0.5 * (f[0] + f[-1]) - 0.5 * local_ss_rb * (u[0] - u[-1])
        f_lb = f_rb

        udot[0] = -1.0 / dx * (f_minus[0] - f_lb)
        udot[-1] = -1.0 / dx * (f_rb - f_plus[-1])

    def compute_rhs_scipy(self, t, u):
        self.compute_rhs(t, u, self.udot, None)
        return self.udot


def main():
    problem = BurgersEquationProblem(N=4000)
    s = IVP("jl_diffeq")
    t0 = 0.0
    y0 = problem.u
    s.set_initial_value(y0, t0)

    times = np.linspace(problem.t0, problem.tfinal, num=11)

    for i in range(1, 6):
        K = 30
        print(f"Trial {i}: estimating runtime from {K} runs")
        print(f"Measure performance {K} times")
        elapsed_times = []
        solution_last = None
        for k in range(K):
            s.set_initial_value(y0, t0)
            tic = time.perf_counter()
            for t in times[1:]:
                s.integrate(t)
            toc = time.perf_counter()
            elapsed_times.append(toc - tic)
            solution_last = s.y

        mean = np.mean(elapsed_times)
        sem = np.std(elapsed_times, ddof=1) / np.sqrt(len(elapsed_times))
        ci = 2.0 * sem  # Coefficient corresponds to the 95% Confidence Interval.
        print(f"Runtime, sec: {mean:.4f} ± {ci:.4f}")
        print(f"Solution second point from the left value: {solution_last[1]:.16f}")


if __name__ == "__main__":
    main()

with the following output:

Trial 1: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.9764 ± 0.0475
Solution second point from the left value: 0.5003054078322573

Trial 2: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.9422 ± 0.0489
Solution second point from the left value: 0.5003054078322573

Trial 3: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.9329 ± 0.0620
Solution second point from the left value: 0.5003054078322573

Trial 4: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.9138 ± 0.0491
Solution second point from the left value: 0.5003054078322573

Trial 5: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.8905 ± 0.0451
Solution second point from the left value: 0.5003054078322573

Note how running Julia’s time integrator from Python, with the overhead of wrapping a Python callback in C and calling it from Julia through a mediator library, somehow outperforms calling Julia’s code directly.

The question is: why does this happen? Can anybody point me to something that I do not understand in Julia that could help improve its performance? I need this comparison for a scientific paper, therefore, I’d like to be maximally sure that I do things as accurately as possible.

Also, I ran SciPy’s dopri5 from Python (which in principle should be the same method as DP5 in OrdinaryDiffEq.jl), and it is about 30% faster than the Julia code, although it is not even native code but wrapped Fortran from, AFAIK, the 1980s:

Trial 1: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.7381 ± 0.0124
Solution second point from the left value: 0.5003054078346969

Trial 2: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.7433 ± 0.0094
Solution second point from the left value: 0.5003054078346969

Trial 3: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.7466 ± 0.0086
Solution second point from the left value: 0.5003054078346969

Trial 4: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.7382 ± 0.0111
Solution second point from the left value: 0.5003054078346969

Trial 5: estimating runtime from 30 runs
Measure performance 30 times
Runtime, sec: 0.7367 ± 0.0109
Solution second point from the left value: 0.5003054078346969

Note that I report everywhere the solution at the same grid point to make sure that I solve the same problem with the same method (tolerances, etc.), which all agree to 16 significant digits when I use DP5 from Julia and Python, and to about 12 significant digits when I call dopri5 in Python.


I’m pretty sure the issue is just that you are solving the Julia model with reltol=1e-6, abstol=1e-12, but solving the Python models with the default tolerances (reltol=1e-3, abstol=1e-6).

Also, as a side note, DP5 is pretty much strictly worse than Tsit5.
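If you want to try it, it is a one-line change in the original script (a sketch, keeping the same tolerances):

solution = solve(odeProblem, Tsit5(), reltol = 1e-6, abstol = 1e-12)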


No, I set Julia’s tolerances to reltol=1e-6, abstol=1e-12 because these are the default tolerances of dopri5 in SciPy. This is an excerpt from the dopri5 class in the SciPy source on GitHub:

class dopri5(IntegratorBase):
    runner = getattr(_dop, 'dopri5', None)
    name = 'dopri5'
    ...  # omitted

    def __init__(self,
                 rtol=1e-6, atol=1e-12,
                 nsteps=500,
                 max_step=0.0,
                 first_step=0.0,  # determined by solver
                 safety=0.9,
                 ifactor=10.0,
                 dfactor=0.2,
                 beta=0.0,
                 method=None,
                 verbosity=-1,  # no messages if negative
                 ):

Link: scipy/scipy/integrate/_ode.py at main · scipy/scipy · GitHub

Good point. The actual difference is how much output you are requesting. The Python code is closer to solve(odeProblem, DP5(), reltol = 1e-6, abstol = 1e-12, saveat=times), which for me takes 0.75 seconds, compared to 1.391952 seconds for solve(odeProblem, DP5(), reltol = 1e-6, abstol = 1e-12) (the difference is how often you save output).
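For clarity, here are the two calls side by side (times is the 11-point vector from the original script):

# Saves every internal step and keeps interpolation data: more output, slower.
sol_dense = solve(odeProblem, DP5(), reltol = 1e-6, abstol = 1e-12)

# Saves only at the 11 requested times, like the Python driver: faster.
sol_saveat = solve(odeProblem, DP5(), reltol = 1e-6, abstol = 1e-12, saveat = times)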

As an aside, your Julia code is allocating a lot in the ODE derivative function, which can greatly impact performance. You should try to write compute_rhs so that it does not allocate any memory, as that is the more standard way to write an in-place method for DifferentialEquations.jl. See Julia’s performance tips, which also discuss benchmarking Julia code: Performance Tips · The Julia Language

These also show implementation recommendations for PDEs / stiff ODEs: Solving Large Stiff Equations · DifferentialEquations.jl and GPU-Accelerated Stochastic Partial Differential Equations · Overview of Julia's SciML (the latter says it is for GPUs, but gives guidance for CPUs too)

That’s a good point. The following is far from optimal, but still ~2x faster:

function compute_rhs_2(udot, u, p, t)
    dx, = p

    f = 0.5 .* u.^2
    c = maximum(abs, u)  # Local sound speed.
    f_hat = @views 0.5 .* (f[1:end-1] .+ f[2:end] .- c .* (u[2:end] .- u[1:end-1]))
    f_plus = @views f_hat[2:length(f_hat)]
    f_minus = @views f_hat[1:length(f_hat) - 1]
           
    udot[2:end-1] = (-1.0 / dx) .* (f_plus .- f_minus)

    local_ss_rb = max(abs(u[1]), abs(u[end]))
    f_rb = 0.5 * (f[1] + f[end]) - 0.5 * local_ss_rb * (u[1] - u[end])
    f_lb = f_rb

    udot[1] = -1.0 / dx * (f_minus[1] - f_lb)
    udot[end] = -1.0 / dx * (f_rb - f_plus[end])
end

OK, with the saveat=times argument to the solve function, I get pretty much the same runtimes as when driving OrdinaryDiffEq.jl from Python:

Trial 1: estimating runtime from 30 runs
Runtime, sec: 0.9391 ± 0.0932
Solution second point from the left value: 0.5003054078352265
Trial 2: estimating runtime from 30 runs
Runtime, sec: 0.8958 ± 0.0057
Solution second point from the left value: 0.5003054078352265
Trial 3: estimating runtime from 30 runs
Runtime, sec: 0.9064 ± 0.0065
Solution second point from the left value: 0.5003054078352265
Trial 4: estimating runtime from 30 runs
Runtime, sec: 0.9127 ± 0.0111
Solution second point from the left value: 0.5003054078352265
Trial 5: estimating runtime from 30 runs
Runtime, sec: 0.9377 ± 0.0181
Solution second point from the left value: 0.5003054078352265

Nice, thank you!

However, in practice, I want to have an API where users write their own time-integration loop, and for that I’d like to use the step! method (see the commented-out code in the Julia script above). With the step! method, I get 3x slower runtimes:

Trial 1: estimating runtime from 30 runs
Runtime, sec: 2.9005 ± 0.0280
Solution second point from the left value: 0.5003054078322574
Trial 2: estimating runtime from 30 runs
Runtime, sec: 2.9158 ± 0.0444
Solution second point from the left value: 0.5003054078322574
Trial 3: estimating runtime from 30 runs
Runtime, sec: 2.9298 ± 0.0346
Solution second point from the left value: 0.5003054078322574
Trial 4: estimating runtime from 30 runs
Runtime, sec: 2.8926 ± 0.0240
Solution second point from the left value: 0.5003054078322574
Trial 5: estimating runtime from 30 runs
Runtime, sec: 2.9191 ± 0.0267
Solution second point from the left value: 0.5003054078322574

It just does not make sense to me. How can writing my own time-stepping loop with step! be three times slower than calling solve?

When I add the @views macro as you suggested (to all three lines), the runtime decreases from ~2.9 to ~2.5 seconds:

Runtime, sec: 2.4495 ± 0.0438

step! being slower definitely doesn’t make sense…

The difference in performance was due to the dense output.

The step! function was slower than the solve function because, before using step!, one needs to initialize a solver with the init function. By default, this initialization enables dense output; however, it can be disabled with the save_everystep = false keyword argument. For solve, setting saveat = times also disables the dense output.
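For reference, a minimal sketch of the step! loop with saving disabled (reusing odeProblemStep, times, and the tolerances from the script above):

solver = init(odeProblemStep, DP5(),
              reltol = 1e-6, abstol = 1e-12,
              save_everystep = false)  # also disables dense output
for t in times[2:end]
    step!(solver, t - solver.t, true)  # `true`: stop exactly at the requested time
end
# solver.u now holds the solution at tfinal.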

In my previous message, I was comparing solve without dense output with step! with dense output, hence 3x performance penalty in step!.

Personally, I find it somewhat counterintuitive that init enables dense output by default, as with step! I get only one solution instance anyway, not a sequence; I am not sure how to retrieve this dense output.


Here is a fully in-place, allocation-free version which just uses a single loop (two passes over the data, since one is required to compute c). (I’ve verified that it gives the same results as the original code on random inputs.)

function compute_rhs!(udot, u, p, t)
    dx, = p
    dx⁻¹ = inv(dx)
    
    c = maximum(abs, u)  # Local sound speed
    local_ss_rb = max(abs(u[1]), abs(u[end]))
        
    f_cur = 0.5 * u[1]^2
    f̂_lb = 0.5 * (f_cur + 0.5 * u[end]^2) - 0.5 * local_ss_rb * (u[1] - u[end])
    f̂_prev = f̂_lb
    @inbounds for i = 1:length(udot)-1
        f_next = 0.5 * u[i+1]^2
        f̂_cur = 0.5 * ((f_cur+f_next) - c * (u[i+1]-u[i]))
        udot[i] = dx⁻¹ * (f̂_prev - f̂_cur)
        f̂_prev, f_cur = f̂_cur, f_next
    end
    f̂_rb = f̂_lb
    udot[end] = dx⁻¹ * (f̂_prev - f̂_rb)
end

(Note that in both Python 3 and in Julia you can use f̂ as a variable name. No need to have variables named f_hat these days. If I wanted to be even fancier, I could have used f̂ᵢ₋₁ instead of f̂_prev, etc.)

PS. Note that udot[2:end-1] = (-1.0 / dx) .* (f_plus .- f_minus) still allocates unless you use .=.
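That is, the non-allocating form of that line is:

udot[2:end-1] .= (-1.0 / dx) .* (f_plus .- f_minus)  # broadcast assignment writes in place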


Hi @stevengj, thank you for the suggested loop fusion (and also @inbounds, which I was not aware of).

I was working on similar performance improvements and on a comparison with Python/NumPy + Numba (a JIT compiler for Python), and I have noticed the following:

  • If I replace the vectorized version of the code with each “vector expression” expanded into a loop, then Julia computes twice as fast (0.74 sec vs 1.83 sec in my tests on this problem, the inviscid Burgers’ equation).
  • This performance is comparable with what Numba gives with a similar expansion of the loops; actually, Numba was a bit faster (0.74 sec for Julia and 0.72 sec for Numba).
  • When I do loop fusion + @inbounds as suggested by @stevengj in the previous message, then Julia becomes twice as fast again (0.34 sec vs 0.78 sec).
  • Numba with loop fusion lags significantly behind Julia (0.34 sec for Julia versus 0.50 sec for Numba), although I did not find a way to disable bounds checking in Numba.

Probably you got rid of some temporaries; you can generally achieve the same thing by judicious use of dots and views.

The fully fused version is harder to automate well because I had to introduce some inter-iteration dependencies to avoid computing f and f_hat twice; I doubt Numba can do this automatically. You can try removing the @inbounds… it probably only helps a little.