# Python vs. Julia ODE Solver

Hello! Brand new to Julia. I have a very stiff system of ODEs that I’m trying to optimize for large N. I’ve achieved reasonable performance in Python using SciPy+Numba, but I wanted to see if Julia would be superior. So far it’s been 2-3x slower (and worse as N increases to, say, 100), and I can’t tell what’s going wrong. I’ve run time and profile and noticed that the memory allocation is quite severe, which is confusing to me because as far as I can tell I’m just mutating my arrays.

Here is my MWE in Julia:

``````using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile

N = 20

parameters = [2, N,
-500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
25.0, 6.0, 12.0, 10.0,
2.0*1.4 .* (rand(Float64, (N, 2)) .- 0.5), 0.5 .* (rand(Float64, (N, N)) .- 0.8), 0.14 .* (rand(Float64, (N, 1)) .- 0.5),
rand(Float64, (2, 4))]

function forward_RHS(du, u, p, t)
v_s = view(u, 1:p[2])
u_s = view(u, p[2]+1:2*p[2])
w_s = view(u, 2*p[2]+1:3*p[2])
s = view(u, 3*p[2]+1:4*p[2])

du[1:p[2]] = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
du[p[2]+1:2*p[2]] = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
du[2*p[2]+1:3*p[2]] = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
du[3*p[2]+1:4*p[2]] = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(du[1:p[2]] .> 0.0).*du[1:p[2]]./0.05)./p[19]
end

prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
@time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8)
plot(solution, vars = (1:N), plotdensity = 10000, lw = 2.0)
``````

On my laptop, time gives
“6.487295 seconds (13.95 M allocations: 1.183 GiB, 1.89% gc time)”

Here it is in Python:

``````import numba
import numpy
import time as timer
from scipy import integrate
from matplotlib import pyplot

N = 20

parameters = (0, 2, N,
-500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
25.0, 6.0, 12.0, 10.0,
2.0*1.4 * (numpy.random.rand(N, 2) - 0.5), 0.5 * (numpy.random.rand(N, N) - 0.8), 0.14 * (numpy.random.rand(N, ) - 0.5),
numpy.random.rand(2, 4))

@numba.njit
def forward_RHS(u, t, *p):
v_s = u[0:p[2]]
u_s = u[p[2]:2*p[2]]
w_s = u[2*p[2]:3*p[2]]
s = u[3*p[2]:4*p[2]]

du = numpy.zeros((4*p[2], ))

du[0:p[2]] = ((p[3]*(v_s - p[4])**2.0 + p[5])*(v_s - 1.0) - p[12]*u_s*(v_s - p[14]) - p[13]*w_s*(v_s - p[15]) + p[22] + p[20]@numpy.sin(p[23][:, 1]*t) + p[21]@s)/p[16]
du[p[2]:2*p[2]] = (numpy.exp(p[6]*(v_s - p[7])) + p[8] - u_s)/p[17]
du[2*p[2]:3*p[2]] = (p[9]*(v_s - p[10])**2.0 + p[11] - w_s)/p[18]
du[3*p[2]:4*p[2]] = (-s + (v_s > 0.25)*(v_s < 0.3)*(du[0:p[2]] > 0.0)*du[0:p[2]]/0.05)/p[19]

return du

tic = timer.time()

y = integrate.odeint(forward_RHS, numpy.zeros(4*N, ), numpy.linspace(0.0, 500.0, 5000), args = parameters, rtol = 1e-8, atol = 1e-8)

print(str(timer.time() - tic) + ' seconds')

pyplot.plot(y[:, 0:N])
``````

On my laptop this takes ~2.4 seconds.

I’ve looked through the performance tips, but I remain confused. Am I missing something obvious?

1 Like

This allocates a new array. Try explicitly using views for these slices as well.

2 Likes

The problem is that `Parameters` is massively type unstable (It’s a `Vector{Any}`). If you use a `struct` or similar, it should be a few hundred times faster.

7 Likes

Ah, I should have mentioned that I tried to use views for du as well, like so:

``````function forward_RHS(du, u, p, t)
v_s = view(u, 1:p[2])
u_s = view(u, p[2]+1:2*p[2])
w_s = view(u, 2*p[2]+1:3*p[2])
s = view(u, 3*p[2]+1:4*p[2])

dv_s = view(du, 1:p[2])
du_s = view(du, p[2]+1:2*p[2])
dw_s = view(du, 2*p[2]+1:3*p[2])
ds = view(du, 3*p[2]+1:4*p[2])

dv_s = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
du_s = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
dw_s = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
ds = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(dv_s .> 0.0).*dv_s./0.05)./p[19]
end
``````

However, now my kernel seems to stall in Jupyter.

1 Like

Amazing, I’ll try this and report back. Thank you!

One option to do what @Oscar_Smith suggests is just change the brackets by parenthesis in the definition of parameters. Then you get a type-stable tuple. The time here of the execution goes from `~4s` to `0.3s` by doing that. (remember that you should run the problem twice to not measure compilation time).

There are still quite a lot allocations in the `forward_RHS` function which are harder to to track. They become smaller if using some views in place of slices, but I could not easily eliminate all of them (all that broadcasting is somewhat confusing to me).

4 Likes

Basic question: Does `ODEProblem` care what `forward_RHS` returns? Currently it is returning the right hand side of the last line.

One smaller thing I’d change is replace `(v_s .> 0.25).*(v_s .< 0.3)` with `(0.25 .<= v_s .< 0.3)`. I’m not sure it will be faster, but it is (at least to me) much clearer.

4 Likes

The following code gives a time of about 130ms on my machine. I can’t compare the python code right now because pip install numba is failing.

``````using OrdinaryDiffEq, BenchmarkTools, LSODA

function runprob()
N = 20
parameters = (2, N,
-500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
25.0, 6.0, 12.0, 10.0,
2.0*1.4 .* (rand(Float64, (N, 2)) .- 0.5), 0.5 .* (rand(Float64, (N, N)) .- 0.8), 0.14 .* (rand(Float64, (N, 1)) .- 0.5),
rand(Float64, (2, 4)))
prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8);
end

function forward_RHS(du, u, p, t)
v_s = view(u, 1:p[2])
u_s = view(u, p[2]+1:2*p[2])
w_s = view(u, 2*p[2]+1:3*p[2])
s = view(u, 3*p[2]+1:4*p[2])

du[1:p[2]] = ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0) .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15]) .+ p[22] .+ p[20]*sin.(p[23][:, 1].*t) .+ p[21]*s)./p[16]
du[p[2]+1:2*p[2]] = (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
du[2*p[2]+1:3*p[2]] = (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
du[3*p[2]+1:4*p[2]] = (-s .+ (v_s .> 0.25).*(v_s .< 0.3).*(du[1:p[2]] .> 0.0).*du[1:p[2]]./0.05)./p[19]
end

@btime runprob();

nothing
``````
1 Like

It’s been a while since I looked at the performance tips page, it’s getting kind of big. I scanned it and didn’t see anything about preferring tuples to arrays when possible (although I may have missed it). I did get a factor of 2 perhaps by wrapping everything in a function. This is an easy one, it doesn’t always help, but if it removes a common potential source of inefficiency.
I hope this helps!

1 Like

I don’t think it does. I prefer a style rule that says you always must use return if you care about the value. Then if you see a blob at the bottom of a function in a code base, and there’s no return you can strongly guess that the return value is not needed. Stronger still is to require `return nothing` in this case. But, that can be verbose.

1 Like

The tips here are all super helpful, and changing my parameters to a tuple did indeed improve the speed by a factor of 10!

I’d still expect the Julia code to be faster as N gets larger - say, 500. Is the poor scale-up with N due to my allocations to du? Like I mentioned, I tried using views instead of slices and now my Julia code takes ~100 seconds with N = 1!

Thanks for all the help so far, even if I can’t eke out any more performance this is a big improvement.

There are some other things that I have fiddled around:

1. `rand(Float64, (N, 1))`
These are more clearly written as `rand(Float64,N)` (the `Float64` is redundant there, but ok).
The thing is that that was generating an two-dimensional array instead of a vector, and that didn’t allow the broadcasting of the result of the first of the four lines there.

2. The remaining allocations come from, as far as I could see:
(a) `-s` which can be written as `-1 .* s`
(b) Two matrix-vector multiplications: `p[21]*s` and `p[20]*sin.(p[23][:, 1].*t)`

To solve those maybe someone has better advice, but I think you have to use an inplace matrix multiplication routine as `mul!` from `LinearAlgebra`. But for that you need to store auxiliary arrays to store the results.

I have done that in the code bellow (added some auxiliary arrays to the parameters). That code does not allocate anything in that function. It is possible that I have done something wrong, but if I didn’t the code runs in:

``````julia> @time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8);
0.135555 seconds (166.55 k allocations: 16.639 MiB)
``````

for `N=20`

and for `N=500` it runs in:

`````` 71.583810 seconds (945.29 k allocations: 149.516 MiB, 0.11% gc time)
``````

Is that good?

In the code below I also added some views, but I think those didn’t make much of a difference. To check.

``````using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile
using LinearAlgebra

N = 20

parameters = (2, N,
-500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
25.0, 6.0, 12.0, 10.0,
2.0*1.4 .* (rand(Float64,N,2) .- 0.5), #20
0.5 .* (rand(Float64,N,N) .- 0.8), #21
0.14 .* (rand(Float64,N) .- 0.5), #22
rand(Float64,2,4),  #23
zeros(N), #24
zeros(N), #25
zeros(2)) #26

function forward_RHS(du, u, p, t)
v_s = view(u, 1:p[2])
u_s = view(u, p[2]+1:2*p[2])
w_s = view(u, 2*p[2]+1:3*p[2])
s = view(u, 3*p[2]+1:4*p[2])

p[26] .= t .* view(p[23],:,1)
p[26] .= sin.(p[26])
mul!(p[24],p[20],p[26])

du[1:p[2]] .= ((p[3].*(v_s .- p[4]).^2.0 .+ p[5]).*(v_s .- 1.0)
.- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15])
.+ p[22]
.+ p[24]
.+ mul!(p[25],p[21],s))./p[16]
du[p[2]+1:2*p[2]] .= (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
du[2*p[2]+1:3*p[2]] .= (p[9].*(v_s .- p[10]).^2.0 .+ p[11] .- w_s)./p[18]
du[3*p[2]+1:4*p[2]] .= (-1 .* s .+ (v_s .> 0.25).*(v_s .< 0.3).*(view(du,1:p[2]) .> 0.0).*view(du,1:p[2])./0.05)./p[19]

end

prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)
@time solution = solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8)

``````

One note: In Julia you do not need to write everything in vectorized form. Maybe you are comfortable with that, but I would have split those operations in a lot of lines to makes things more clear and debug errors easier.

ps: There is an auxiliary vector of dimension 2 which could be substituted by an StaticArray, but that I think is for another moment :-).

7 Likes

The problem was not array vs: tuple, but the fact that the array was an array of elements of type `Any`, because each element had a different type:

``````julia> parameters = [ 1, rand(), zeros(2) ]
3-element Array{Any,1}:
1
0.11504105156893152
[0.0, 0.0]

``````

While the tuple is type-annotated for every element, thus type-stable:

``````julia> parameters = ( 1, rand(), zeros(2) )
(1, 0.9903147864959447, [0.0, 0.0])

julia> typeof(parameters)
Tuple{Int64,Float64,Array{Float64,1}}

``````

Anyway, that was the quick fix. But the nice way of solving that, as pointed above, was to use a `struct` where each field had a meaningful name, with the `Parameters` package much clearer:

``````using Parameters
@with_kw struct ODEParameters
M :: Int = 2
N :: Int
parameter22 :: Vector{Float64} = rand(N)
end

julia> parameters = ODEParameters(N=20)
ODEParameters
M: Int64 2
N: Int64 20
parameter22: Array{Float64}((20,)) [0.2369215315953066, 0.1388373787467665, 0.8089684496645597, 0.06784046734646032, 0.24206961930726
845, 0.10706828302340621, 0.09521317845577193, 0.246334342773739, 0.3818388215713078, 0.3805627173340529, 0.9603070026894776, 0.7650759
27380056, 0.6539639815092488, 0.3804045078213889, 0.3783914306777776, 0.2322091159091748, 0.8992389331423147, 0.04990408047193573, 0.03
18994329238822, 0.7820434538220526]

``````
5 Likes

This section alludes to it although maybe we should say it more explicitly.
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-abstract-container

``````julia> [1 1.0 "sfsdf"]
1×3 Array{Any,2}:
1  1.0  "sfsdf"

julia> typeof(ans)
Array{Any,2}

julia> (1, 1.0, "sfsdf")
(1, 1.0, "sfsdf")

julia> typeof(ans)
Tuple{Int64,Float64,String}
``````

An array of `Any` is abstract and is hard for the compiler to write optimized code for. See @stevengj 's lecture to understand the problem there:

This code runs wonderfully on my end! I’m impressed you caught the -s allocations. Thank you for the help. I did end up writing a struct for my parameters and it also seems to have sped things up a bit.

1 Like

I use this approach for that (the manual one):

https://m3g.github.io/JuliaNotes.jl/stable/memory/

2 Likes

Point of curiosity: I’ve noticed that the runtime increases dramatically from, say, N = 90 to N = 100 on my end. Results from second run of time:

N = 80: 2.894335 seconds (2.22 M allocations: 529.516 MiB, 3.83% gc time)

N = 90: 2.986387 seconds (2.09 M allocations: 540.191 MiB, 4.80% gc time)

N = 95: 3.144457 seconds (1.87 M allocations: 520.397 MiB, 4.14% gc time)

N = 100: 28.765728 seconds (1.95 M allocations: 542.425 MiB, 0.42% gc time)

N = 110: 42.906974 seconds (2.78 M allocations: 842.570 MiB, 0.99% gc time)

That’s odd, right?

Also note the following difference

``````function foo(x, y)
x[2:5] = y .+ y
nothing
end

function bar(x, y)
x[2:5] .= y .+ y
nothing
end

y = randn(4)
x = randn(10)
@time foo(x,y)
@time bar(x,y)
``````
``````  0.000006 seconds (1 allocation: 112 bytes) # foo

0.000006 seconds # bar
``````

the dot on the assignment operator is required to avoid the last allocation of the broadcasted expressions.
I also return `nothing`, otherwise the last slice is heap allocated and returned.

5 Likes

I think there are still many things to improve, but a simple thing that sticks out to me is squaring with `^2.0`. Avoid this and use `^2` instead. The latter is a simple multiplication, while the former is a complicated floating point power function.

4 Likes