Reduce allocations (increase the running speed) for large scale ODEs (136 equations)

Hi, I am wondering how to improve the speed of large scale ODEs, which includes 136 equations. I already use the staticarray, and for loop instead of broadcast to speed up the program and it’s already 10 times faster than Matlab. I also tested and found that without saving everystep, it only has 53 allocations, which is tspan independent, that means the program already quite good. I am just wondering anyone can give me more advices to improve it. Any suggestions are welcome. Here is my program, the Constants.jl doesn’t need to change, and the long Test3DMolasses.jl can be get in this link
https://github.com/xspeng1999/Test3DMolasses/blob/main/Test3DMolasses
I know it’s quite long, I just wondering whether I make obvious stupid mistakes, thanks in advance~

using DifferentialEquations, LinearAlgebra, StaticArrays
using Random, Statistics
include("Test3DMolasses.jl")
include("Constants.jl")

# parameters defination
vx, vy, vz = rand(3)
rx, ry, rz = rand(3)
theta1, theta2 = 2pi * rand(2)
para01 = @SVector [vx, vy, vz, rx, ry, rz, theta1, theta2]

delta = 20rand(4, 12)
delta = SMatrix{4,12,Float64}(delta)

s = @SVector [1.0, 1.0, 1.0, 1.0]
Polar = @SVector [1.0, 1.0, 1.0, -1.0]

Len = length(Polar)
a = complex(zeros(Len))
b = complex(zeros(Len))
c = complex(zeros(Len))

p = (para01, delta, s, Polar, a, b, c, Len)

# initial value
II = zeros(136)
II[1] = 1.0 / 12.0
II[13] = 1.0 / 12.0
II[24] = 1.0 / 12.0
II[34] = 1.0 / 12.0
II[43] = 1.0 / 12.0
II[51] = 1.0 / 12.0
II[58] = 1.0 / 12.0
II[64] = 1.0 / 12.0
II[69] = 1.0 / 12.0
II[73] = 1.0 / 12.0
II[76] = 1.0 / 12.0
II[78] = 1.0 / 12.0
u0 = complex(II)

tend = 1e-5
tstart = 0
tspan = (tstart * Gamma, tend * Gamma)

prob = ODEProblem(Test3DMolasses, u0, tspan, p)
@time y = solve(prob, Tsit5(), save_everystep=false)  # 53 allocations in my computer 


The following program is saved in a seperate file Constants.jl

## Physical constants
const h = 6.626e-34
Mu = 1.660538e-27
Mn = 59
const Mm = Mu * Mn     
const lamda = 606.3e-9
const k = 2pi / lamda
const Gamma = 2pi * 8.3e6
const velfact = Gamma / k
ZXtemp = [0, 0, 0, 76.2939, 122.9512, 122.9512, 122.9512, 147.8177, 147.8177, 147.8177, 147.8177, 147.8177]   # importdata('X_ZeemanCaF.mat');    # 0:1:1000G
const ZX = round.((ZXtemp .- ZXtemp[1]) / 8.3, digits=2)   # unit Gamma  relative to the lowest level round to 0.01 Gamma
ZBtemp = [-5 * 2 * pi * 1e6, 0, 0, 0] / Gamma
const ZB = round.(ZBtemp, digits=2)        # round(ZB*100)/100;     # round to 0.01Gamma

#=
#Fp  0           1            1        1
#    13          14          15       16
#mFp 0           1            0       -1
ftemp = [-0.1355 0.4649 -0.4645 0.0000    #  1  1  1
    -0.1355 0.4650 -0.0001 -0.4641    #  0  2  1
    -0.1354 -0.0000 0.4646 -0.4642    # -1  3  1
    -0.0020 -0.4705 0.4714 -0.4722    #  0  4  0
    0.5614 0.0000 -0.1853 0.1846     # -1  5  1
    -0.5612 0.1865 -0.0010 -0.1832    #  0  6  1
    -0.5610 0.1851 -0.1844 0.0000    #  1  7  1
    -0.0000 0.0000 -0.0000 0.4083    # -2  8  2
    0.0009 -0.0000 0.2880 0.2894     # -1  9  2
    -0.0011 -0.1659 -0.3334 -0.1675    #  0  10 2
    -0.0009 -0.2880 -0.2894 0.0000    #  1  11 2
    0.0000 0.4083 0.0000 0.0000]   #  2  12 2
=#

ftemp = [-0.1354 0.4646 0.4646 0
    0.1354 -0.4646 0 0.4646
    -0.1354 0 -0.4646 -0.4646
    0 0.4714 0.4714 0.4714
    0.5612 0 0.1849 0.1849
    -0.5612 0.1849 0 -0.1849
    0.5612 -0.1849 -0.1849 0
    0 0 0 0.4082
    0 0 0.2887 -0.2887
    0 0.1667 -0.3333 0.1667
    0 -0.2887 0.2887 0
    0 0.4082 0 0]   #  from Yang's method

for i = 1:4
    ftemp[:, i] = ftemp[:, i] ./ sqrt(sum(ftemp[:, i] .^ 2))
end

const f = ftemp
const r = ftemp   # the branching ratio

You have a lot of repeated computations, that could be done only once in your Test3DMolasses function. For example:

dy[1] += -factor * s[jj] * (f[1, 1] * a[jj] * cis((delta[jj, 1] + ZB[1]) * t) * conj(y[89]) + f[1, 2] * b[jj] * cis((delta[jj, 1] + ZB[2]) * t) * conj(y[90]) + f[1, 3] * a[jj] * cis((delta[jj, 1] + ZB[3]) * t) * conj(y[91]) - conj(f[1, 1] * a[jj]) * cis(-(delta[jj, 1] + ZB[1]) * t) * y[89] - conj(f[1, 2] * b[jj]) * cis(-(delta[jj, 1] + ZB[2]) * t) * y[90] - conj(f[1, 3] * a[jj]) * cis(-(delta[jj, 1] + ZB[3]) * t) * y[91])
dy[2] += -factor * s[jj] * (f[1, 1] * a[jj] * cis((delta[jj, 1] + ZB[1]) * t) * conj(y[93]) + f[1, 2] * b[jj] * cis((delta[jj, 1] + ZB[2]) * t) * conj(y[94]) + f[1, 3] * a[jj] * cis((delta[jj, 1] + ZB[3]) * t) * conj(y[95]) - conj(f[2, 1] * b[jj]) * cis(-(delta[jj, 2] + ZB[1]) * t) * y[89] - conj(f[2, 2] * c[jj]) * cis(-(delta[jj, 2] + ZB[2]) * t) * y[90] - conj(f[2, 3] * b[jj]) * cis(-(delta[jj, 2] + ZB[3]) * t) * y[91] - conj(f[2, 4] * a[jj]) * cis(-(delta[jj, 2] + ZB[4]) * t) * y[92])
...

If you compute the common parts into factors and do not repeat them, your computation will very likely be faster.

Good suggestion,I will try ~

Regarding the inital topic: “reducing allocations”: you already did a pretty good job here - I looked at the code and didn’t see where they could come from. After reading that i’ts only 53 independent of tspan, I think that allocation-wise it is optimal and I guess there are no allocations in your Test3DMolasses. Your 53 allocations have negligible influence on the runtime.

So yeah I would second @lmiq - the main reserves probably are in the pre-calculation of the common subexpressions.

1 Like