Python vs. Julia ODE Solver

Hello! Brand new to Julia. I have a very stiff system of ODEs that I’m trying to optimize for large N. I’ve achieved reasonable performance in Python using SciPy + Numba, but I wanted to see if Julia would be superior. So far it’s been 2-3x slower (and it gets worse as N increases to, say, 100), and I can’t tell what’s going wrong. I’ve run @time and @profile and noticed that memory allocation is quite severe, which is confusing to me because, as far as I can tell, I’m just mutating my arrays.

Here is my MWE in Julia:

using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile

N = 20

parameters = [2, N,
              -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5, 
              25.0, 6.0, 12.0, 10.0,
              2.0*1.4 .* (rand(Float64, (N, 2)) .- 0.5), 0.5 .* (rand(Float64, (N, N)) .- 0.8), 0.14 .* (rand(Float64, (N, 1)) .- 0.5),
              rand(Float64, (2, 4))]

function forward_RHS(du, u, p, t)
    v_s = view(u, 1:p[2])
    u_s = view(u, p[2]+1:2*p[2])
    w_s = view(u, 2*p[2]+1:3*p[2])
    s = view(u, 3*p[2]+1:4*p[2])

    du[1:p[2]] = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
    du[p[2]+1:2*p[2]] = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
    du[2*p[2]+1:3*p[2]] = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
    du[3*p[2]+1:4*p[2]] = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(du[1:p[2]] .> 0.0).*du[1:p[2]]./0.05)./p[19]
end

prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
@time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8)
plot(solution, vars = (1:N), plotdensity = 10000, lw = 2.0)

On my laptop, @time gives
“6.487295 seconds (13.95 M allocations: 1.183 GiB, 1.89% gc time)”

Here it is in Python:

import numba
import numpy
import time as timer
from scipy import integrate
from matplotlib import pyplot

N = 20

parameters = (0, 2, N,
              -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5, 
              25.0, 6.0, 12.0, 10.0,
              2.0*1.4 * (numpy.random.rand(N, 2) - 0.5), 0.5 * (numpy.random.rand(N, N) - 0.8), 0.14 * (numpy.random.rand(N, ) - 0.5),
              numpy.random.rand(2, 4))

@numba.njit
def forward_RHS(u, t, *p):
    v_s = u[0:p[2]]
    u_s = u[p[2]:2*p[2]]
    w_s = u[2*p[2]:3*p[2]]
    s = u[3*p[2]:4*p[2]]

    du = numpy.zeros((4*p[2], ))

    du[0:p[2]] = ((p[3]*(v_s - p[4])**2.0 + p[5])*(v_s - 1.0) - p[12]*u_s*(v_s - p[14]) - p[13]*w_s*(v_s - p[15]) + p[22] + p[20]@numpy.sin(p[23][:, 1]*t) + p[21]@s)/p[16]
    du[p[2]:2*p[2]] = (numpy.exp(p[6]*(v_s - p[7])) + p[8] - u_s)/p[17]
    du[2*p[2]:3*p[2]] = (p[9]*(v_s - p[10])**2.0 + p[11] - w_s)/p[18]
    du[3*p[2]:4*p[2]] = (-s + (v_s > 0.25)*(v_s < 0.3)*(du[0:p[2]] > 0.0)*du[0:p[2]]/0.05)/p[19]    
    
    return du
            
tic = timer.time()
            
y = integrate.odeint(forward_RHS, numpy.zeros(4*N, ), numpy.linspace(0.0, 500.0, 5000), args = parameters, rtol = 1e-8, atol = 1e-8)
            
print(str(timer.time() - tic) + ' seconds')

pyplot.plot(y[:, 0:N])

On my laptop this takes ~2.4 seconds.

I’ve looked through the performance tips, but I remain confused. Am I missing something obvious?

1 Like

Every du[1:p[2]] = ... assignment allocates a new array for its right-hand side, and reading du[1:p[2]] on the right of the last line allocates a copy too. Try explicitly using views for these slices as well.
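
For example, roughly this pattern (a toy sketch, not your actual equations):

du = zeros(8)
v  = rand(8)

dv  = view(du, 1:4)          # a view writes straight into du, no copy
v_s = view(v, 1:4)
dv .= 2.0 .* v_s .- 1.0      # fused in-place broadcast, no temporary array on the right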

2 Likes

The problem is that parameters is massively type-unstable (it’s a Vector{Any}). If you use a struct or similar, it should be a few hundred times faster.

7 Likes

Ah, I should have mentioned that I tried to use views for du as well, like so:

function forward_RHS(du, u, p, t)
    v_s = view(u, 1:p[2])
    u_s = view(u, p[2]+1:2*p[2])
    w_s = view(u, 2*p[2]+1:3*p[2])
    s = view(u, 3*p[2]+1:4*p[2])
    
    dv_s = view(du, 1:p[2])
    du_s = view(du, p[2]+1:2*p[2])
    dw_s = view(du, 2*p[2]+1:3*p[2])
    ds = view(du, 3*p[2]+1:4*p[2])

    dv_s = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
    du_s = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
    dw_s = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
    ds = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(dv_s .> 0.0).*dv_s./0.05)./p[19]
end

However, now my kernel seems to stall in Jupyter.

1 Like

Amazing, I’ll try this and report back. Thank you!

One option to do what @Oscar_Smith suggests is to simply change the square brackets to parentheses in the definition of parameters. Then you get a type-stable tuple. The execution time here goes from ~4s to 0.3s by doing that (remember that you should run the problem twice so as not to measure compilation time).

There are still quite a lot of allocations in the forward_RHS function, which are harder to track. They get smaller if you use views in place of slices, but I could not easily eliminate all of them (all that broadcasting is somewhat confusing to me).

5 Likes

Basic question: does ODEProblem care what forward_RHS returns? Currently it is returning the right-hand side of the last line.

One smaller thing I’d change is to replace (v_s .> 0.25).*(v_s .< 0.3) with (0.25 .<= v_s .< 0.3). I’m not sure it will be faster, but it is (at least to me) much clearer.
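
For example, both forms give the same mask (throwaway values, just as a sanity check):

v_s = [0.2, 0.27, 0.35]
(v_s .> 0.25) .* (v_s .< 0.3)   # elementwise product of the two comparison masks
0.25 .<= v_s .< 0.3             # chained broadcast comparison, same result here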

4 Likes

The following code gives a time of about 130ms on my machine. I can’t compare the Python code right now because pip install numba is failing.

using OrdinaryDiffEq, BenchmarkTools, LSODA

function runprob()
    N = 20
    parameters = (2, N,
              -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
              25.0, 6.0, 12.0, 10.0,
              2.0*1.4 .* (rand(Float64, (N, 2)) .- 0.5), 0.5 .* (rand(Float64, (N, N)) .- 0.8), 0.14 .* (rand(Float64, (N, 1)) .- 0.5),
                  rand(Float64, (2, 4)))
    prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
    solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8);
end

function forward_RHS(du, u, p, t)
    v_s = view(u, 1:p[2])
    u_s = view(u, p[2]+1:2*p[2])
    w_s = view(u, 2*p[2]+1:3*p[2])
    s = view(u, 3*p[2]+1:4*p[2])

    du[1:p[2]] = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
    du[p[2]+1:2*p[2]] = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
    du[2*p[2]+1:3*p[2]] = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
    du[3*p[2]+1:4*p[2]] = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(du[1:p[2]] .> 0.0).*du[1:p[2]]./0.05)./p[19]
end

@btime runprob();

nothing

1 Like

It’s been a while since I looked at the performance tips page; it’s getting kind of big. I scanned it and didn’t see anything about preferring tuples to arrays when possible (although I may have missed it). I did get perhaps a factor of 2 by wrapping everything in a function. That’s an easy one: it doesn’t always help, but it removes a common potential source of inefficiency.
I hope this helps!

1 Like

I don’t think it does. I prefer a style rule that says you must always use return if you care about the value. Then, if you see a blob at the bottom of a function in a code base and there’s no return, you can strongly guess that the return value is not needed. Stronger still is to require return nothing in this case, but that can be verbose.
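
For the RHS here, that would mean something along these lines (just a sketch of the convention, body elided):

function forward_RHS(du, u, p, t)
    # ... fill du in place as above ...
    return nothing   # make it explicit that the return value is never used
end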

1 Like

The tips here are all super helpful, and changing my parameters to a tuple did indeed improve the speed by a factor of 10!

I’d still expect the Julia code to be faster as N gets larger, say 500. Is the poor scaling with N due to my allocations into du? As I mentioned, I tried using views instead of slices, and now my Julia code takes ~100 seconds with N = 1!

Thanks for all the help so far, even if I can’t eke out any more performance this is a big improvement.

There are some other things that I have fiddled around with:

  1. rand(Float64, (N, 1))
    This is more clearly written as rand(Float64, N) (the Float64 is redundant there, but ok).
    The thing is that it was generating a two-dimensional array instead of a vector, and that got in the way of broadcasting the result in the first of the four lines there.

  2. The remaining allocations come from, as far as I could see:
    (a) -s which can be written as -1 .* s
    (b) Two matrix-vector multiplications: p[21]*s and p[20]*sin.(p[23][:, 1].*t)

To solve those maybe someone has better advice, but I think you have to use an in-place matrix multiplication routine such as mul! from LinearAlgebra. For that, though, you need auxiliary arrays to store the results.

I have done that in the code below (added some auxiliary arrays to the parameters). That code does not allocate anything inside the function. It is possible that I have done something wrong, but if I didn’t, the code runs in:

julia> @time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8);
  0.135555 seconds (166.55 k allocations: 16.639 MiB)

for N=20

and for N=500 it runs in:

 71.583810 seconds (945.29 k allocations: 149.516 MiB, 0.11% gc time)

Is that good?

In the code below I also added some views, but I think those didn’t make much of a difference. Something to check.

using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile
using LinearAlgebra

N = 20

parameters = (2, N,
              -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
              25.0, 6.0, 12.0, 10.0,
              2.0*1.4 .* (rand(Float64,N,2) .- 0.5), #20 
              0.5 .* (rand(Float64,N,N) .- 0.8), #21
              0.14 .* (rand(Float64,N) .- 0.5), #22
              rand(Float64,2,4),  #23
              zeros(N), #24
              zeros(N), #25 
              zeros(2)) #26

function forward_RHS(du, u, p, t)
    v_s = view(u, 1:p[2])
    u_s = view(u, p[2]+1:2*p[2])
    w_s = view(u, 2*p[2]+1:3*p[2])
    s = view(u, 3*p[2]+1:4*p[2])

    p[26] .= t .* view(p[23],:,1)
    p[26] .= sin.(p[26])
    mul!(p[24],p[20],p[26])

    du[1:p[2]] .= ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0)
                  .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15])
                  .+ p[22]
                  .+ p[24]
                  .+ mul!(p[25],p[21],s))./p[16]
    du[p[2]+1:2*p[2]] .= (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
    du[2*p[2]+1:3*p[2]] .= (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
    du[3*p[2]+1:4*p[2]] .= (-1 .* s .+ (v_s .> 0.25).*(v_s .< 0.3).*(view(du,1:p[2]) .> 0.0).*view(du,1:p[2])./0.05)./p[19]

end

prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
@time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8)

One note: In Julia you do not need to write everything in vectorized form. Maybe you are comfortable with that, but I would have split those operations over many lines to make things clearer and errors easier to debug.
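
For example, the third of those lines could just as well be a plain loop inside forward_RHS (untested sketch, same names as in the code above):

for i in 1:p[2]
    du[2*p[2]+i] = (p[9]*(v_s[i] - p[10])^2 + p[11] - w_s[i])/p[18]
end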

ps: There is an auxiliary vector of dimension 2 which could be replaced by a StaticArray, but that, I think, is for another moment :-).

7 Likes

The problem was not array vs. tuple, but the fact that the array was an array of elements of type Any, because each element had a different type:

julia> parameters = [ 1, rand(), zeros(2) ]
3-element Array{Any,1}:
 1
 0.11504105156893152
  [0.0, 0.0]

The tuple, on the other hand, carries the type of every element and is thus type-stable:

julia> parameters = ( 1, rand(), zeros(2) )
(1, 0.9903147864959447, [0.0, 0.0])

julia> typeof(parameters)
Tuple{Int64,Float64,Array{Float64,1}}

Anyway, that was the quick fix. But the nicer way of solving it, as pointed out above, is to use a struct where each field has a meaningful name; with the Parameters package this is much cleaner:

using Parameters
@with_kw struct ODEParameters
   M :: Int = 2
   N :: Int
   parameter22 :: Vector{Float64} = rand(N) 
end

julia> parameters = ODEParameters(N=20)
ODEParameters
  M: Int64 2
  N: Int64 20
  parameter22: Array{Float64}((20,)) [0.2369215315953066, 0.1388373787467665, 0.8089684496645597, 0.06784046734646032, 0.24206961930726845, 0.10706828302340621, 0.09521317845577193, 0.246334342773739, 0.3818388215713078, 0.3805627173340529, 0.9603070026894776, 0.765075927380056, 0.6539639815092488, 0.3804045078213889, 0.3783914306777776, 0.2322091159091748, 0.8992389331423147, 0.04990408047193573, 0.0318994329238822, 0.7820434538220526]


5 Likes

This section alludes to it, although maybe we should say it more explicitly:
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-abstract-container

julia> [1 1.0 "sfsdf"]
1×3 Array{Any,2}:
 1  1.0  "sfsdf"

julia> typeof(ans)
Array{Any,2}

julia> (1, 1.0, "sfsdf")
(1, 1.0, "sfsdf")

julia> typeof(ans)
Tuple{Int64,Float64,String}

An array of Any is abstract, which makes it hard for the compiler to generate optimized code for it. See @stevengj’s lecture to understand the problem there.

This code runs wonderfully on my end! I’m impressed you caught the -s allocations. Thank you for the help. I did end up writing a struct for my parameters and it also seems to have sped things up a bit.

1 Like

I use this approach for that (the manual one):
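
Something along these lines, i.e. a plain hand-written struct (the field names below are just placeholders for illustration):

struct MyParameters
    M::Int
    N::Int
    drive::Matrix{Float64}      # the N×2 matrix that was p[20]
    coupling::Matrix{Float64}   # the N×N matrix that was p[21]
    offset::Vector{Float64}     # the length-N vector that was p[22]
end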

2 Likes

Point of curiosity: I’ve noticed that the runtime increases dramatically from, say, N = 90 to N = 100 on my end. Results from a second run of @time:

N = 80: 2.894335 seconds (2.22 M allocations: 529.516 MiB, 3.83% gc time)

N = 90: 2.986387 seconds (2.09 M allocations: 540.191 MiB, 4.80% gc time)

N = 95: 3.144457 seconds (1.87 M allocations: 520.397 MiB, 4.14% gc time)

N = 100: 28.765728 seconds (1.95 M allocations: 542.425 MiB, 0.42% gc time)

N = 110: 42.906974 seconds (2.78 M allocations: 842.570 MiB, 0.99% gc time)

That’s odd, right?

Also note the following difference:

function foo(x, y)
    x[2:5] = y .+ y
    nothing
end

function bar(x, y)
    x[2:5] .= y .+ y
    nothing
end

y = randn(4)
x = randn(10)
@time foo(x,y)
@time bar(x,y)
  0.000006 seconds (1 allocation: 112 bytes) # foo

  0.000006 seconds # bar

The dot on the assignment operator is required to avoid the final allocation of the broadcasted expression.
I also return nothing; otherwise the last slice is heap-allocated and returned.

5 Likes

I think there are still many things to improve, but a simple thing that sticks out to me is squaring with ^2.0. Avoid this and use ^2 instead. The latter is a simple multiplication, while the former is a complicated floating-point power function.
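
A quick illustration (x is just a throwaway value):

x = 1.3
x^2     # integer literal exponent: lowered to x*x via Base.literal_pow
x^2.0   # floating-point exponent: goes through the generic power routine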

4 Likes