# Trying to optimize system of SDE's

Hello!
I’ve been introduced to Julia very recently, and I’ve been trying to adapt my already tested system of equations, with additive noise, based on a Takens-Bogdanov bifurcation.

I’ve got this far trying to improve performance as much as possible, but I need help, I think it’s still lacking.

The original code I coded in Matlab performs an EM integration with dt=0.01, and does it in ~0.7 secs (same conditions as below)
Originally I was hoping to get at least 1 order of magnitude of improvement, I was surprised that I wasn’t achieving that.

I attach the code, but since the SC matrix is sensible information, I’ve attached a random matrix that yields same performance time.

``````
using Plots
using DelimitedFiles

using DifferentialEquations
using BenchmarkTools
using StaticArrays
using LinearAlgebra

#generate artificial coupling matrix
SC=abs.(randn(90,90))
SC=0.2*SC./maximum(SC)

#initial condition
u0=0.1*ones(90,2);

#initial-final time
t0=0.0;
tend=1200.0;

#define parameters
p0=[-1.5 , 2]

p=p0'.*ones(90,2);
SC=SC;
p=[p SC]
p=[p sum(SC,dims=2)]
#col 1 :alfa
#col 2 : beta
#col 3-93 : SC
#col 94: sumSC

function stakens!(du,u,p,t)
#A = similar(u)

@views(A = p[:, 3:end-1] * u[:,1])
#LinearAlgebra.mul!(A, @view(p[:, 3:end-1]), u)  # in-place matrix product

for i in axes(u, 1)
du[i,1] = u[i,2] + (0.5 * p[i, end] * u[i, 1]) + (0.5 * A[i,1])

du[i,2] = 100.0 * p[i,1] - 100.0 * p[i,2] * u[i,1] + 100.0 * u[i,1] ^2 - 10.0 * u[i,1] * u[i,2] - 100.0 * u[i,1] ^3 -
10.0 * u[i,1] ^2 * u[ i , 2]

end
end

function σ_STakens(du,u,p,t)

du[1:90, 1] .= 0.04
du[1:90, 2] .= 0.0

end

prob_sde_stakens = SDEProblem(stakens!,σ_STakens,u0,(t0,tend),p)

@benchmark solve(prob_sde_stakens,EM(),dt=0.01,save_every_step=false)

``````

## minimum time: 401.633 ms (0.00% GC) median time: 456.275 ms (11.87% GC) mean time: 444.711 ms (9.64% GC) maximum time: 480.100 ms (17.26% GC)

samples: 12
evals/sample: 1

I did the following to get about another 2x speedup:

• Cached the `A` for the matmuls
• Fixed `save_every_step` -> `save_everystep`
• `BLAS.set_num_threads(4)` to fix the BLAS default.

The new version is:

``````using Plots
using DelimitedFiles

using StochasticDiffEq
using BenchmarkTools
using StaticArrays
using LinearAlgebra

#generate artificial coupling matrix
SC=abs.(randn(90,90))
SC=0.2*SC./maximum(SC)

#initial condition
u0=0.1*ones(90,2);

#initial-final time
t0=0.0;
tend=1200.0;

#define parameters
p0=[-1.5 , 2]

p=p0'.*ones(90,2);
SC=SC;
p=[p SC]
p=[p sum(SC,dims=2)]
#col 1 :alfa
#col 2 : beta
#col 3-93 : SC
#col 94: sumSC

const A = p[:, 3:end-1] * u0[:,1]

function stakens!(du,u,p,t)
LinearAlgebra.mul!(A, @view(p[:, 3:end-1]), @view(u[:,1]))  # in-place matrix product

@inbounds for i in axes(u, 1)
du[i,1] = u[i,2] + (0.5 * p[i, end] * u[i, 1]) + (0.5 * A[i,1])

du[i,2] = 100.0 * p[i,1] - 100.0 * p[i,2] * u[i,1] + 100.0 * u[i,1] ^2 - 10.0 * u[i,1] * u[i,2] - 100.0 * u[i,1] ^3 -
10.0 * u[i,1] ^2 * u[ i , 2]

end
end

function σ_STakens(du,u,p,t)
du[1:90, 1] .= 0.04
du[1:90, 2] .= 0.0
end

prob_sde_stakens = SDEProblem(stakens!,σ_STakens,u0,(t0,tend),p)

using LinearAlgebra
@benchmark solve(prob_sde_stakens,EM(),dt=0.01,save_everystep=false)
``````

The old timing was:

``````BenchmarkTools.Trial:
memory estimate:  307.65 MiB
allocs estimate:  720113
--------------
minimum time:     492.269 ms (10.34% GC)
median time:      538.707 ms (9.45% GC)
mean time:        568.998 ms (15.18% GC)
maximum time:     727.507 ms (32.93% GC)
--------------
samples:          9
evals/sample:     1
``````

The new timing is:

``````BenchmarkTools.Trial:
memory estimate:  22.01 MiB
allocs estimate:  480109
--------------
minimum time:     296.526 ms (0.00% GC)
median time:      308.266 ms (0.00% GC)
mean time:        318.210 ms (0.46% GC)
maximum time:     381.942 ms (0.00% GC)
--------------
samples:          16
evals/sample:     1
``````

About 60% of the remaining time is in the `mul!` which is quite small, so @Elrod or @YingboMa might have a small matrix matmul to demonstrate another speedup here. Or you can try MKL.jl. MATLAB uses MKL which performs matmuls a bit better for small matrices, so this might be where MATLAB is making up from some of the expected difference. Then about 25%-30% of the rest is computing `randn`, which there isn’t much you can do about it.

This is an intriguing problem though because it’s quite stiff but the noise is small, so it’s not so well-optimized for the methods made for SDEs. We have a new solver that will be specifically for stiff problems with small noise and this might be a good problem to demonstrate it on.

On my computer at these sizes (90x90), MKL > LoopVectorization > OpenBLAS:

``````julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

:mkl

julia> A = Vector{Float64}(undef, 90);

julia> p = rand(90,93); u = rand(90,2);

julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
@avx for i ∈ eachindex(𝐲)
𝐲ᵢ = zero(eltype(𝐲))
for j ∈ eachindex(𝐱)
𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
end
𝐲[i] = 𝐲ᵢ
end
end
jgemvavx! (generic function with 1 method)

julia> @benchmark jgemvavx!(\$A, @view(\$p[:,3:end-1]), @view(\$u[:,1]))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     817.274 ns (0.00% GC)
median time:      819.643 ns (0.00% GC)
mean time:        820.530 ns (0.00% GC)
maximum time:     1.050 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     84

julia> @benchmark mul!(\$A, @view(\$p[:,3:end-1]), @view(\$u[:,1]))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     794.092 ns (0.00% GC)
median time:      796.908 ns (0.00% GC)
mean time:        797.829 ns (0.00% GC)
maximum time:     1.177 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     98
``````

While OpenBLAS is slower:

``````julia> using LinearAlgebra, LoopVectorization, BenchmarkTools
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

:openblas64

julia> A = Vector{Float64}(undef, 90);

julia> p = rand(90,93); u = rand(90,2);

julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
@avx for i ∈ eachindex(𝐲)
𝐲ᵢ = zero(eltype(𝐲))
for j ∈ eachindex(𝐱)
𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
end
𝐲[i] = 𝐲ᵢ
end
end
jgemvavx! (generic function with 1 method)

julia> @benchmark jgemvavx!(\$A, @view(\$p[:,3:end-1]), @view(\$u[:,1]))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     815.488 ns (0.00% GC)
median time:      817.837 ns (0.00% GC)
mean time:        818.713 ns (0.00% GC)
maximum time:     1.177 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     86

julia> @benchmark mul!(\$A, @view(\$p[:,3:end-1]), @view(\$u[:,1]))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     898.186 ns (0.00% GC)
median time:      902.186 ns (0.00% GC)
mean time:        903.214 ns (0.00% GC)
maximum time:     1.529 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     43
``````
``````julia> using VectorizedRNG, Random

julia> @benchmark randn!(\$A)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     224.943 ns (0.00% GC)
median time:      231.640 ns (0.00% GC)
mean time:        231.699 ns (0.00% GC)
maximum time:     271.341 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     545

julia> @benchmark randn!(local_rng(), \$A)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     136.820 ns (0.00% GC)
median time:      137.727 ns (0.00% GC)
mean time:        137.675 ns (0.00% GC)
maximum time:     185.754 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     878
``````

while `Random` uses the Ziggurat algorithm, VectorizedRNG uses a SIMD Box-Muller.
The `sin` approximation could use some performance work (right now it’s just using Remez’s algorithm).

Let’s continue this: https://github.com/SciML/StochasticDiffEq.jl/pull/304