Trying to optimize system of SDE's

Hello!
I’ve been introduced to Julia very recently, and I’ve been trying to adapt my already tested system of equations, with additive noise, based on a Takens-Bogdanov bifurcation.

I’ve got this far trying to improve performance as much as possible, but I need help, I think it’s still lacking.

The original code I coded in Matlab performs an EM integration with dt=0.01, and does it in ~0.7 secs (same conditions as below)
Originally I was hoping to get at least 1 order of magnitude of improvement, I was surprised that I wasn’t achieving that.

I attach the code, but since the SC matrix is sensible information, I’ve attached a random matrix that yields same performance time.


using Plots
using DelimitedFiles

using DifferentialEquations
using BenchmarkTools
using StaticArrays
using LinearAlgebra

#generate artificial coupling matrix
SC=abs.(randn(90,90))
SC=0.2*SC./maximum(SC)

#initial condition
u0=0.1*ones(90,2);

#initial-final time
t0=0.0;
tend=1200.0;

#define parameters
p0=[-1.5 , 2]

p=p0'.*ones(90,2);
SC=SC;
p=[p SC]
p=[p sum(SC,dims=2)]
#col 1 :alfa
#col 2 : beta
#col 3-93 : SC
#col 94: sumSC




function stakens!(du,u,p,t)
    #A = similar(u)
   
    @views(A = p[:, 3:end-1] * u[:,1])
    #LinearAlgebra.mul!(A, @view(p[:, 3:end-1]), u)  # in-place matrix product
    
    for i in axes(u, 1)
          du[i,1] = u[i,2] + (0.5 * p[i, end] * u[i, 1]) + (0.5 * A[i,1])

          du[i,2] = 100.0 * p[i,1] - 100.0 * p[i,2] * u[i,1] + 100.0 * u[i,1] ^2 - 10.0 * u[i,1] * u[i,2] - 100.0 * u[i,1] ^3 -
                10.0 * u[i,1] ^2 * u[ i , 2]

    end
end


function σ_STakens(du,u,p,t)

 du[1:90, 1] .= 0.04
 du[1:90, 2] .= 0.0

end


prob_sde_stakens = SDEProblem(stakens!,σ_STakens,u0,(t0,tend),p)

@benchmark solve(prob_sde_stakens,EM(),dt=0.01,save_every_step=false)

BenchmarkTools.Trial:
memory estimate: 307.65 MiB
allocs estimate: 720115

minimum time: 401.633 ms (0.00% GC)
median time: 456.275 ms (11.87% GC)
mean time: 444.711 ms (9.64% GC)
maximum time: 480.100 ms (17.26% GC)

samples: 12
evals/sample: 1

I did the following to get about another 2x speedup:

  • Cached the A for the matmuls
  • Fixed save_every_step -> save_everystep
  • BLAS.set_num_threads(4) to fix the BLAS default.

The new version is:

using Plots
using DelimitedFiles

using StochasticDiffEq
using BenchmarkTools
using StaticArrays
using LinearAlgebra

#generate artificial coupling matrix
SC=abs.(randn(90,90))
SC=0.2*SC./maximum(SC)

#initial condition
u0=0.1*ones(90,2);

#initial-final time
t0=0.0;
tend=1200.0;

#define parameters
p0=[-1.5 , 2]

p=p0'.*ones(90,2);
SC=SC;
p=[p SC]
p=[p sum(SC,dims=2)]
#col 1 :alfa
#col 2 : beta
#col 3-93 : SC
#col 94: sumSC

const A = p[:, 3:end-1] * u0[:,1]

function stakens!(du,u,p,t)
    LinearAlgebra.mul!(A, @view(p[:, 3:end-1]), @view(u[:,1]))  # in-place matrix product

    @inbounds for i in axes(u, 1)
          du[i,1] = u[i,2] + (0.5 * p[i, end] * u[i, 1]) + (0.5 * A[i,1])

          du[i,2] = 100.0 * p[i,1] - 100.0 * p[i,2] * u[i,1] + 100.0 * u[i,1] ^2 - 10.0 * u[i,1] * u[i,2] - 100.0 * u[i,1] ^3 -
                10.0 * u[i,1] ^2 * u[ i , 2]

    end
end


function σ_STakens(du,u,p,t)
 du[1:90, 1] .= 0.04
 du[1:90, 2] .= 0.0
end

prob_sde_stakens = SDEProblem(stakens!,σ_STakens,u0,(t0,tend),p)

using LinearAlgebra
BLAS.set_num_threads(4)
@benchmark solve(prob_sde_stakens,EM(),dt=0.01,save_everystep=false)

The old timing was:

BenchmarkTools.Trial:
  memory estimate:  307.65 MiB
  allocs estimate:  720113
  --------------
  minimum time:     492.269 ms (10.34% GC)
  median time:      538.707 ms (9.45% GC)
  mean time:        568.998 ms (15.18% GC)
  maximum time:     727.507 ms (32.93% GC)
  --------------
  samples:          9
  evals/sample:     1

The new timing is:

BenchmarkTools.Trial:
  memory estimate:  22.01 MiB
  allocs estimate:  480109
  --------------
  minimum time:     296.526 ms (0.00% GC)
  median time:      308.266 ms (0.00% GC)
  mean time:        318.210 ms (0.46% GC)
  maximum time:     381.942 ms (0.00% GC)
  --------------
  samples:          16
  evals/sample:     1

About 60% of the remaining time is in the mul! which is quite small, so @Elrod or @YingboMa might have a small matrix matmul to demonstrate another speedup here. Or you can try MKL.jl. MATLAB uses MKL which performs matmuls a bit better for small matrices, so this might be where MATLAB is making up from some of the expected difference. Then about 25%-30% of the rest is computing randn, which there isn’t much you can do about it.

This is an intriguing problem though because it’s quite stiff but the noise is small, so it’s not so well-optimized for the methods made for SDEs. We have a new solver that will be specifically for stiff problems with small noise and this might be a good problem to demonstrate it on.

On my computer at these sizes (90x90), MKL > LoopVectorization > OpenBLAS:

julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> A = Vector{Float64}(undef, 90);

julia> p = rand(90,93); u = rand(90,2);

julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @avx for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)

julia> @benchmark jgemvavx!($A, @view($p[:,3:end-1]), @view($u[:,1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     817.274 ns (0.00% GC)
  median time:      819.643 ns (0.00% GC)
  mean time:        820.530 ns (0.00% GC)
  maximum time:     1.050 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     84

julia> @benchmark mul!($A, @view($p[:,3:end-1]), @view($u[:,1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     794.092 ns (0.00% GC)
  median time:      796.908 ns (0.00% GC)
  mean time:        797.829 ns (0.00% GC)
  maximum time:     1.177 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     98

While OpenBLAS is slower:

julia> using LinearAlgebra, LoopVectorization, BenchmarkTools
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> BLAS.set_num_threads(1); BLAS.vendor()
:openblas64

julia> A = Vector{Float64}(undef, 90);

julia> p = rand(90,93); u = rand(90,2);

julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @avx for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)

julia> @benchmark jgemvavx!($A, @view($p[:,3:end-1]), @view($u[:,1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     815.488 ns (0.00% GC)
  median time:      817.837 ns (0.00% GC)
  mean time:        818.713 ns (0.00% GC)
  maximum time:     1.177 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     86

julia> @benchmark mul!($A, @view($p[:,3:end-1]), @view($u[:,1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     898.186 ns (0.00% GC)
  median time:      902.186 ns (0.00% GC)
  mean time:        903.214 ns (0.00% GC)
  maximum time:     1.529 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     43

Maybe we can?

julia> using VectorizedRNG, Random

julia> @benchmark randn!($A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     224.943 ns (0.00% GC)
  median time:      231.640 ns (0.00% GC)
  mean time:        231.699 ns (0.00% GC)
  maximum time:     271.341 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     545

julia> @benchmark randn!(local_rng(), $A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     136.820 ns (0.00% GC)
  median time:      137.727 ns (0.00% GC)
  mean time:        137.675 ns (0.00% GC)
  maximum time:     185.754 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     878

while Random uses the Ziggurat algorithm, VectorizedRNG uses a SIMD Box-Muller.
The sin approximation could use some performance work (right now it’s just using Remez’s algorithm).

Let’s continue this: https://github.com/SciML/StochasticDiffEq.jl/pull/304

Thanks a lot for your reply!
I’m amazed at the helpfull input I’ve received.
I’ll implement your suggestions,
Could there be a computation efficiency boost if I use nlsolve?
Still haven’t got to reading how to use it here.

Thanks !

How so? Are you trying to calculate steady states of the system or something?

Now that I’ve optimized all I can the code, yes, I’m interested in timeseries after the transitory solution disipates, originally I simulated for 1000 seconds, and took the next 200 seconds.
For now, I wrote saveat=1000:1:1200 , in order to take last 200 seconds (that speeds up much of the process, as I said, my original purpose was to optimize first, reduce points taken later.