Python vs. Julia ODE Solver

It is. But you are not running exactly “my” version, because mine has far fewer allocations than yours.

Just to be clear, it was not possible to do this in the first version because of the rand(Float64,(N,1)) definition I mentioned above:

julia> v = rand(Float64,(5,1));

julia> y = rand(5);

julia> z = zeros(5);

julia> z[1:5] .= v .+ y
ERROR: DimensionMismatch("cannot broadcast array to have fewer dimensions")

The problem is that v here is a two-dimensional array (even though the second dimension has size 1):

julia> v
5×1 Array{Float64,2}:
 0.23460063971299383
 0.977599317555971
 0.11413962191900962
 0.9259610441668678
 0.0253857101609829

That is solved if v is a vector:

julia> v = rand(Float64,5);

julia> z[1:5] .= v .+ y
5-element view(::Array{Float64,1}, 1:5) with eltype Float64:
 1.7707818802990465
 1.5662031379445822
 1.1979284388950557
 0.7127751475367243
 1.1919315206952428

Whether that first broadcast should work at all is perhaps a separate discussion.
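For completeness: if you are stuck with the N×1 shape, vec reshapes it into an ordinary vector without copying the data, and the broadcasted assignment then works:

julia> v = rand(Float64,(5,1));

julia> z[1:5] .= vec(v) .+ y;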

The final code I am running is below; it still seems to be faster than what you are running.

Code
using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile
using LinearAlgebra

function forward_RHS(du, u, p, t)
    v_s = view(u, 1:p[2])
    u_s = view(u, p[2]+1:2*p[2])
    w_s = view(u, 2*p[2]+1:3*p[2])
    s = view(u, 3*p[2]+1:4*p[2])

    p[26] .= t .* view(p[23],:,1)
    p[26] .= sin.(p[26])
    mul!(p[24],p[20],p[26])

    du[1:p[2]] .= ((p[3].*(v_s .- p[4]).^2 .+ p[5]).*(v_s .- 1.0)
                  .- p[12].*u_s.*(v_s .- p[14]) .- p[13].*w_s.*(v_s .- p[15])
                  .+ p[22]
                  .+ p[24]
                  .+ mul!(p[25],p[21],s))./p[16]
    du[p[2]+1:2*p[2]] .= (exp.(p[6].*(v_s .- p[7])) .+ p[8] .- u_s)./p[17]
    du[2*p[2]+1:3*p[2]] .= (p[9].*(v_s .- p[10]).^2 .+ p[11] .- w_s)./p[18]
    du[3*p[2]+1:4*p[2]] .= (-1 .* s .+ (v_s .> 0.25).*(v_s .< 0.3).*(view(du,1:p[2]) .> 0.0).*view(du,1:p[2])./0.05)./p[19]

    nothing

end

function runprob(N)

 parameters = (2, N,
              -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
              25.0, 6.0, 12.0, 10.0,
              2.0*1.4 .* (rand(Float64,N,2) .- 0.5), #20 
              0.5 .* (rand(Float64,N,N) .- 0.8), #21
              0.14 .* (rand(Float64,N) .- 0.5), #22
              rand(Float64,2,4),  #23
              zeros(N), #24
              zeros(N), #25 
              zeros(2)) #26

  prob = ODEProblem(forward_RHS, zeros(4*N), (0.0, 500.0), parameters)

  solve(prob, lsoda(), saveat=0.1, reltol=1e-8, abstol=1e-8)

end

#@time solution = runprob(N)

I get:

julia> @time runprob(20);
  0.132728 seconds (195.93 k allocations: 18.890 MiB)

julia> @time runprob(90);
  1.365390 seconds (512.74 k allocations: 54.033 MiB)

julia> @time runprob(100);
  5.902604 seconds (563.84 k allocations: 59.490 MiB, 0.33% gc time)

julia> @time runprob(120);
  6.846018 seconds (559.88 k allocations: 62.312 MiB, 0.26% gc time)

julia> @time runprob(200);
 16.374404 seconds (837.08 k allocations: 96.083 MiB, 0.10% gc time)


Perhaps there are indeed some “jumps” in the performance?

3 Likes

Maybe that is the size at which the cache starts thrashing?
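A rough back-of-the-envelope check of that guess (8 bytes per Float64): the dense N×N recurrent-weight matrix alone takes

julia> for N in (90, 100, 120, 200)
           println("N = ", N, ": ", N^2 * 8 / 1024, " KiB")
       end
N = 90: 63.28125 KiB
N = 100: 78.125 KiB
N = 120: 112.5 KiB
N = 200: 312.5 KiB

which straddles typical L1/L2 cache sizes (32 KiB and a few hundred KiB), so a cache effect somewhere in this range is at least plausible, though the exact threshold depends on the machine.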

1 Like

I implemented the changes recommended here, which made a big difference in runtime and made it realistic for me to start considering larger N. I have been benchmarking against Numba, and once we get to N > 200 or so, Numba outpaces Julia. Profiling indicates this is probably due to the matrix multiplication steps. Is mul! the fastest way to do this in Julia? I tried rewriting it as an explicit loop (Julia 1 code below), which is very fast for smaller N but whose runtime grows quickly.

Numba:

import numba
import numpy
import time as timer
from scipy import integrate
from matplotlib import pyplot

@numba.njit
def forward_RHS(u, t, *p):
    N = p[2]
    
    v_s = u[0:N]
    u_s = u[N:2*N]
    w_s = u[2*N:3*N]
    s = u[3*N:4*N]

    du = p[24]

    du[0:N] = ((p[3]*(v_s - p[4])**2.0 + p[5])*(v_s - 1.0) 
               - p[12]*u_s*(v_s - p[14]) - p[13]*w_s*(v_s - p[15]) 
               + p[20]@(numpy.sin(p[23][:, 0]*t) + numpy.sin(p[23][:, 1]*t) - numpy.sin(p[23][:, 2]*t) - numpy.sin(p[23][:, 3]*t))
               + p[21]@s
               + p[22])/p[16]
    du[N:2*N] = (numpy.exp(p[6]*(v_s - p[7])) + p[8] - u_s)/p[17]
    du[2*N:3*N] = (p[9]*(v_s - p[10])**2.0 + p[11] - w_s)/p[18]
    du[3*N:4*N] = (-s + (v_s > 0.25)*(v_s < 0.3)*(du[0:N] > 0.0)*du[0:N]/0.05)/p[19]    
    
    return du

@numba.njit
def forward_jacobian(u, t, *p):
    N = p[2]
    
    v_s = u[0:N]
    u_s = u[N:2*N]
    w_s = u[2*N:3*N]
    s = u[3*N:4*N]
    
    J = numpy.zeros((4*N, 4*N))
    
    mask = numpy.identity(N)
    
    J[0:N, 0:N] = (p[5] + 2.0*p[3]*(v_s - 1.0)*(v_s - p[4]) + p[3]*(v_s - p[4])**2.0 - p[12]*u_s - p[13]*w_s)/p[16]*mask
    J[0:N, N:2*N] = -(p[12]*(v_s - p[14]))/p[16]*mask
    J[0:N, 2*N:3*N] = -(p[13]*(v_s - p[15]))/p[16]*mask
    J[0:N, 3*N:4*N] = p[21].sum(axis = 1)*mask  # p[21] holds the recurrent weights

    J[N:2*N, 0:N] = p[6]*numpy.exp(p[6]*(v_s - p[7]))/p[17]*mask
    J[N:2*N, N:2*N] = -mask/p[17]

    J[2*N:3*N, 0:N] = 2.0*(p[9]*(v_s - p[10]))/p[18]*mask
    J[2*N:3*N, 2*N:3*N] = -mask/p[18]

    J[3*N:4*N, 3*N:4*N] = -mask/p[19]
                            
    return J

def runprob(N):

    parameters = (0, 2, N,
                  -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5, 
                  25.0, 6.0, 12.0, 10.0,
                  2.8*(numpy.random.rand(N, 2) - 0.5), 0.5*(numpy.random.rand(N, N) - 0.8), 0.14*(numpy.random.rand(N, ) - 0.5),
                  numpy.random.rand(2, 4),
                  numpy.zeros(4*N),
                  numpy.zeros((4*N, 4*N)),
                  numpy.identity(N))
    
    y = integrate.odeint(forward_RHS, numpy.zeros(4*N), numpy.linspace(0.0, 500.0, 5000), args = parameters, Dfun = forward_jacobian, rtol = 1e-8, atol = 1e-8, mxstep = 50000)

    return y

%timeit runprob(50)

Julia 1 (for loop):

using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile
using LinearAlgebra
using Parameters

@with_kw struct ODEParameters
    number_of_inputs::Int
    N::Int
    a_s::Float64
    b_s::Float64
    c_s::Float64
    d_s::Float64
    e_s::Float64
    f_s::Float64
    g_s::Float64
    h_s::Float64
    i_s::Float64
    j_s::Float64
    k_s::Float64
    l_s::Float64
    m_s::Float64
    cm_s::Float64
    tau_u_s::Float64
    tau_w_s::Float64
    tau_s::Float64
    input_weights_s::Array{Float64,2}
    recurrent_weights_s::Array{Float64,2}
    tonic_inputs_s::Vector{Float64}
    frequencies::Array{Float64,2}
    network_inputs_s::Vector{Float64}
    recurrent_inputs_s::Vector{Float64}
    input_cache::Vector{Float64}
end

function forward_RHS_1(du, u, p, t)
    v_s = view(u, 1:p.N)
    u_s = view(u, p.N+1:2*p.N)
    w_s = view(u, 2*p.N+1:3*p.N)
    s = view(u, 3*p.N+1:4*p.N)

    p.input_cache .= sin.(t.*view(p.frequencies, :, 1)) .+ sin.(t.*view(p.frequencies, :, 2)) .- sin.(t.*view(p.frequencies, :, 3)) .- sin.(t.*view(p.frequencies, :, 4))
    
    # explicit matrix-vector products, accumulating over j
    @inbounds for i = 1:p.N
        net = 0.0
        for j = 1:p.number_of_inputs
            net += p.input_weights_s[i, j]*p.input_cache[j]
        end
        p.network_inputs_s[i] = net

        rec = 0.0
        for j = 1:p.N
            rec += p.recurrent_weights_s[i, j]*s[j]
        end
        p.recurrent_inputs_s[i] = rec
    end
    
    du[1:p.N] .= ((p.a_s.*(v_s .- p.b_s).^2 .+ p.c_s).*(v_s .- 1.0)
                  .- p.j_s.*u_s.*(v_s .- p.l_s) .- p.k_s.*w_s.*(v_s .- p.m_s)
                  .+ p.network_inputs_s
                  .+ p.recurrent_inputs_s
                  .+ p.tonic_inputs_s)./p.cm_s
    du[p.N+1:2*p.N] .= (exp.(p.d_s.*(v_s .- p.e_s)) .+ p.f_s .- u_s)./p.tau_u_s
    du[2*p.N+1:3*p.N] .= (p.g_s.*(v_s .- p.h_s).^2 .+ p.i_s .- w_s)./p.tau_w_s
    du[3*p.N+1:4*p.N] .= (-1 .* s .+ (v_s .> 0.25).*(v_s .< 0.3).*(view(du, 1:p.N) .> 0.0).*view(du, 1:p.N)./0.05)./p.tau_s

    nothing
end

function runprob1(N)
    p = ODEParameters(2, N,
                   -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
                   25.0, 6.0, 12.0, 10.0,
                   2.8.*(rand(Float64, (N, 2)) .- 0.5), 0.5.*(rand(Float64, (N, N)) .- 0.8), 0.14.*(rand(Float64, (N, )) .- 0.5),
                   rand(Float64, (2, 4)),
                   zeros(N), zeros(N),
                   zeros(2))

    prob = ODEProblem(forward_RHS_1, zeros(4*N), (0.0, 500.0), p)
    
    solution = solve(prob, lsoda(), saveat = 0.1, reltol = 1e-8, abstol = 1e-8)
   
    return solution
end

N = 50

#@time runprob1(N);
#@time runprob1(N);
@btime runprob1($N);

Julia 2:

using OrdinaryDiffEq, BenchmarkTools, LSODA, Plots, Profile
using LinearAlgebra
using Parameters

@with_kw struct ODEParameters
    number_of_inputs::Int
    N::Int
    a_s::Float64
    b_s::Float64
    c_s::Float64
    d_s::Float64
    e_s::Float64
    f_s::Float64
    g_s::Float64
    h_s::Float64
    i_s::Float64
    j_s::Float64
    k_s::Float64
    l_s::Float64
    m_s::Float64
    cm_s::Float64
    tau_u_s::Float64
    tau_w_s::Float64
    tau_s::Float64
    input_weights_s::Array{Float64,2}
    recurrent_weights_s::Array{Float64,2}
    tonic_inputs_s::Vector{Float64}
    frequencies::Array{Float64,2}
    network_inputs_s::Vector{Float64}
    recurrent_inputs_s::Vector{Float64}
    input_cache::Vector{Float64}
end

function forward_RHS_2(du, u, p, t)
    v_s = view(u, 1:p.N)
    u_s = view(u, p.N+1:2*p.N)
    w_s = view(u, 2*p.N+1:3*p.N)
    s = view(u, 3*p.N+1:4*p.N)

    p.input_cache .= sin.(t.*view(p.frequencies, :, 1)) .+ sin.(t.*view(p.frequencies, :, 2)) .- sin.(t.*view(p.frequencies, :, 3)) .- sin.(t.*view(p.frequencies, :, 4))

    mul!(p.network_inputs_s, p.input_weights_s, p.input_cache)
    mul!(p.recurrent_inputs_s, p.recurrent_weights_s, s)
    
    du[1:p.N] .= ((p.a_s.*(v_s .- p.b_s).^2 .+ p.c_s).*(v_s .- 1.0)
                  .- p.j_s.*u_s.*(v_s .- p.l_s) .- p.k_s.*w_s.*(v_s .- p.m_s)
                  .+ p.network_inputs_s
                  .+ p.recurrent_inputs_s
                  .+ p.tonic_inputs_s)./p.cm_s
    du[p.N+1:2*p.N] .= (exp.(p.d_s.*(v_s .- p.e_s)) .+ p.f_s .- u_s)./p.tau_u_s
    du[2*p.N+1:3*p.N] .= (p.g_s.*(v_s .- p.h_s).^2 .+ p.i_s .- w_s)./p.tau_w_s
    du[3*p.N+1:4*p.N] .= (-1 .* s .+ (v_s .> 0.25).*(v_s .< 0.3).*(view(du, 1:p.N) .> 0.0).*view(du, 1:p.N)./0.05)./p.tau_s

    nothing
end

function runprob2(N)
    p = ODEParameters(2, N,
                   -500, 0.2, 0.0, 100.0, 0.7, 0.0, 15.0, 0.24, 0.02, 1700.0, 8.0, 0.19, -0.5,
                   25.0, 6.0, 12.0, 10.0,
                   2.8.*(rand(Float64, (N, 2)) .- 0.5), 0.5.*(rand(Float64, (N, N)) .- 0.8), 0.14.*(rand(Float64, (N, )) .- 0.5),
                   rand(Float64, (2, 4)),
                   zeros(N), zeros(N),
                   zeros(2))

    prob = ODEProblem(forward_RHS_2, zeros(4*N), (0.0, 500.0), p)
    
    solution = solve(prob, lsoda(), saveat = 0.1, reltol = 1e-8, abstol = 1e-8)
   
    return solution
end

N = 50

#@time runprob2(N);
#@time runprob2(N);
@btime runprob2($N);

Numba+SciPy
N = 50: 1.96 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
N = 100: 4.6 s ± 220 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
N = 300: 18.8 s ± 769 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
N = 500: 35.5 s ± 2.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Julia 1 (for loop)
N = 50: 792.417 ms (426951 allocations: 41.67 MiB)
N = 100: 3.925 s (842437 allocations: 80.75 MiB)
N = 300: 58.075 s (1994026 allocations: 200.40 MiB)
N = 500: 262.799 s (2912560 allocations: 303.50 MiB)

Julia 2
N = 50: 834.853 ms (540793 allocations: 50.35 MiB)
N = 100: 15.337 s (962251 allocations: 89.89 MiB)
N = 300: 31.940 s (1420379 allocations: 156.64 MiB)
N = 500: 55.922 s (1731853 allocations: 213.42 MiB)

If this is the most I can expect, that's probably okay; I'll just set up a switch between Numba and Julia in my final code for when N is large enough.

Once you are limited by matrix multiplications, all that matters is which BLAS library you are linking to (since both Numba and Julia's mul! call BLAS for double-precision matrix-vector multiplications). If you link the same BLAS with the same number of threads in both cases, the matrix multiplications should run at the same speed.

By default, Julia uses OpenBLAS, but you might also try MKL. And check that the BLAS is set up to use the same number of threads in both cases. Note also that Julia's arrays are stored column-major while NumPy's are row-major (by default), which might affect performance in some cases (e.g. you can try transposing the matrix to see if that changes things).
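For example, on the Julia side (a sketch; using MKL assumes the MKL.jl package is added, and BLAS.get_num_threads needs a reasonably recent Julia version):

using MKL               # optional: swap the default OpenBLAS for MKL
using LinearAlgebra

BLAS.set_num_threads(4)   # pin the BLAS thread count to match the Python side
# BLAS.get_num_threads()  # query the current setting on recent Julia versions
# On the Python side the analogous knobs are the OPENBLAS_NUM_THREADS /
# MKL_NUM_THREADS / OMP_NUM_THREADS environment variables.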

PS. You can use the @. macro to get rid of the dots and make your expressions a bit more readable. You can also use the @views macro to make your whole function use views for slices, so that you don’t need to call view explicitly.
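For illustration, here is an (untested) sketch of forward_RHS_2 rewritten with @views on the whole function and @. on each assignment; forward_RHS_3 is just a name for this sketch, and the fields are the ones from the ODEParameters struct above:

# assumes the same packages and ODEParameters struct as in the listings above
@views function forward_RHS_3(du, u, p, t)
    v_s = u[1:p.N]
    u_s = u[p.N+1:2*p.N]
    w_s = u[2*p.N+1:3*p.N]
    s   = u[3*p.N+1:4*p.N]

    @. p.input_cache = sin(t*p.frequencies[:, 1]) + sin(t*p.frequencies[:, 2]) -
                       sin(t*p.frequencies[:, 3]) - sin(t*p.frequencies[:, 4])

    mul!(p.network_inputs_s, p.input_weights_s, p.input_cache)
    mul!(p.recurrent_inputs_s, p.recurrent_weights_s, s)

    @. du[1:p.N] = ((p.a_s*(v_s - p.b_s)^2 + p.c_s)*(v_s - 1.0)
                    - p.j_s*u_s*(v_s - p.l_s) - p.k_s*w_s*(v_s - p.m_s)
                    + p.network_inputs_s + p.recurrent_inputs_s + p.tonic_inputs_s)/p.cm_s
    @. du[p.N+1:2*p.N] = (exp(p.d_s*(v_s - p.e_s)) + p.f_s - u_s)/p.tau_u_s
    @. du[2*p.N+1:3*p.N] = (p.g_s*(v_s - p.h_s)^2 + p.i_s - w_s)/p.tau_w_s
    @. du[3*p.N+1:4*p.N] = (-s + (v_s > 0.25)*(v_s < 0.3)*(du[1:p.N] > 0.0)*du[1:p.N]/0.05)/p.tau_s

    nothing
end

The computation is the same as before; only the surface syntax changes.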

11 Likes

This isn’t a performance tip, but I noticed you set up your Python version with a variable-length argument/unpacking *p, and Julia has that too, though it is written p... instead. I’m not positive, but I think that’s as good as a fully concretely annotated struct as far as type inference goes, because p is packed in a tuple whose type is entirely determined by the types of its values.
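For example, a parameter tuple mixing scalars and arrays still has a fully concrete type:

julia> p = (2, 3.5, rand(2,2));

julia> typeof(p)
Tuple{Int64,Float64,Array{Float64,2}}

julia> isconcretetype(typeof(p))
true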

2 Likes

That cracked it: I switched to MKL and Julia now beats Numba. Thank you all very much!

7 Likes

If speed is critical and you don’t mind getting near the bleeding edge, you may want to try Octavian.jl, a Julia-native BLAS which can be faster than MKL in some cases (particularly for AMD processors, since Intel puts a ton of work into optimizing MKL for their own chips).

6 Likes

Note though that I haven’t implemented gemv! yet, just gemm!.
While you could try treating the vector as a single-column matrix, the gemm won't currently multithread a matrix that skinny because of a hard-coded check (which I could remove), and of course gemv! kernels look fairly different from gemm! kernels.

That said, MKL doesn’t seem to use/benefit from multithreading at that size yet:

using LinearAlgebra, MKL, LoopVectorization, BenchmarkTools

julia> A = rand(200,200);b = rand(200); c = similar(b);

julia> function Amulb!(y, A, x)
           @avx for m ∈ axes(A, 1)
               ym = zero(eltype(y))
               for n ∈ axes(A,2)
                   ym += A[m,n] * x[n]
               end
               y[m] = ym
           end
       end
Amulb! (generic function with 1 method)

julia> @benchmark Amulb!($c,$A,$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.793 μs (0.00% GC)
  median time:      1.893 μs (0.00% GC)
  mean time:        1.921 μs (0.00% GC)
  maximum time:     5.008 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> d = similar(c);

julia> @benchmark mul!($d,$A,$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.036 μs (0.00% GC)
  median time:      2.150 μs (0.00% GC)
  mean time:        2.212 μs (0.00% GC)
  maximum time:     22.008 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> d ≈ c
true

But MKL does do a great job of handling alignment, while LoopVectorization doesn't. Thus LoopVectorization's performance falls off a cliff for gemv when the number of rows of A isn't a multiple of 8, because gemv is entirely bound by how quickly we can load from A: when stride(A,2) isn't a multiple of 8, most of the loads are unaligned, and loads that cross a 64-byte boundary suffer a 2x penalty on most x86 CPUs. This CPU loads 64 bytes at a time, so 7 out of 8 loads cross such a boundary when isodd(stride(A,2)), hence the 2x regression:

julia> A = rand(199,199);b = rand(199); c = similar(b);

julia> d = similar(c);

julia> @benchmark mul!($d,$A,$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.981 μs (0.00% GC)
  median time:      2.079 μs (0.00% GC)
  mean time:        2.125 μs (0.00% GC)
  maximum time:     17.188 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark Amulb!($c,$A,$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.317 μs (0.00% GC)
  median time:      3.405 μs (0.00% GC)
  mean time:        3.431 μs (0.00% GC)
  maximum time:     15.349 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> Apadded = view(Matrix{Float64}(undef, 200, 199), 1:199, :) .= A;

julia> @benchmark Amulb!($c,$Apadded,$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.771 μs (0.00% GC)
  median time:      1.867 μs (0.00% GC)
  mean time:        1.887 μs (0.00% GC)
  maximum time:     4.811 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

But you can work around this issue by padding out the number of rows (and ignoring the extra rows, e.g. with a view), as in the Apadded example above.

julia> stride(A,2), stride(A,2) % 8
(199, 7)

julia> stride(Apadded,2), stride(Apadded,2) % 8
(200, 0)

Adding the padding gives us back all our performance.

5 Likes

If you relax your type constraints to allow AbstractArray (using type parameterization), then you can use StaticArrays.jl for small array sizes, which could yield very large speedups for the array ops.
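For instance, something along these lines (a hypothetical, abbreviated version of the struct, just to show the parameterization; the RHS code would not need to change):

using StaticArrays

struct ODEParametersGeneric{Mi<:AbstractMatrix{Float64},
                            Mr<:AbstractMatrix{Float64},
                            V<:AbstractVector{Float64}}
    N::Int
    input_weights_s::Mi
    recurrent_weights_s::Mr
    tonic_inputs_s::V
end

# Ordinary heap-allocated arrays still work:
p_dense = ODEParametersGeneric(4, rand(4, 2), rand(4, 4), rand(4))

# For small, fixed N, statically sized arrays can be swapped in:
p_static = ODEParametersGeneric(4, @SMatrix(rand(4, 2)), @SMatrix(rand(4, 4)), @SVector(rand(4)))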

2 Likes

Note that you can give all those parameters default values, which makes construction much cleaner later: only N (and optionally a few other parameters) has to be set.
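Something like this (an abbreviated, hypothetical field list, just to show the idea; with Parameters.jl a default can refer to fields defined above it):

using Parameters

@with_kw struct ODEParametersDefaults
    number_of_inputs::Int = 2
    N::Int
    cm_s::Float64 = 25.0
    recurrent_weights_s::Matrix{Float64} = 0.5 .* (rand(N, N) .- 0.8)
    network_inputs_s::Vector{Float64} = zeros(N)
    input_cache::Vector{Float64} = zeros(number_of_inputs)
end

p = ODEParametersDefaults(N = 50)   # only N is required; everything else takes its default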

1 Like

I’m not sure why, but this added extra allocations when I implemented it. Weird?