Increase the speed of For Loop in julia

When I solve the ODEs, I try to write the 136 equations with a for loop, which makes the program more clean. However, compare with the initial program (in which I expand each equation completely), the new program needs twice the time to run, the allocation are the same for both editions, and with save_everystep=false both have 53 allocations, so I think the allocation shouldn’t be a probrem. I am wondering how to increase its running speed, any suggestion is welcome.
Here, ee, gg, ge are defined to be const arrays to locate the y[Number], which should also not be a problem

using DifferentialEquations, Plots, LinearAlgebra, StaticArrays, Dates
using Random, LoopVectorization, Statistics, Distributed, Distributions
include("Test3DMolassesTest.jl")
include("Constants.jl")

const wmin = 1e-2          #  the common factor
const Power = 200.0        #  mW
const beamsize = 10.0      #  10 mm radius
const Is = 4.86            #  saturation intensity, X-A 4.86 mW/cm2;  X - B 5.5 mW/cm2
const Ip = 2Power / pi / (beamsize / 10)^2  # laser intensity
vp = 1
para = [0.15 69 74.5 34.83 73 1 2 2 1 7.56 6.56 -5.25 2.22]

P = zeros(5)
Polar = ones(5)
Delta = zeros(1, 5)
s = zeros(5)

P[1:4] = para[2:5]
P[5] = 0.0         # 1500.0  mW
Polar[1:4] = Polartemp[Int.(para[6:9])]  # Polartemp is constant array defined in Constants.jl
Polar[5] = -1.0
Delta[1:4] = para[10:13]
Delta[5] = 100.0   # 100Gamma relative to the F = 2 state
Delta = round.(Delta, digits=2)              #round(Delta*100)/100;   # normalized to 0.01 Gamma
s[1:4] = Ip / Is * P[1:4] / sum(P[1:4])
s[5] = 2P[5] / pi / (beamsize / 10)^2 / Is     # from the bottom up, F = 1 0 1 2, 2
s = sqrt.(s)

delta = deltatemp .- Delta
delta = delta'

delta = SMatrix{5,12,Float64}(delta)
s = SVector{5,Float64}(s)
Polar = SVector{5,Float64}(Polar)       # turn into Static arrays

Len = length(Polar)
atemp = MVector{Len,ComplexF64}(complex(zeros(Len)))
btemp = similar(atemp)
ctemp = similar(atemp)

ft = complex(zeros(Len, 12, 4))
ftc = similar(ft)

u0 = IIF # IIF is a constant array defined in Constants.jl
tend = 2pi / wmin / Gamma
tstart = -tend
tspan = (tstart * Gamma, tend * Gamma)

vx, vy, vz = [1 2 3]
rx, ry, rz = [1 2 3] * lamda
theta1, theta2 = [pi / 2 pi]
para01 = @SVector [vx, vy, vz, rx, ry, rz, theta1, theta2]
p = (para01, delta, s, Polar, Len, atemp, btemp, ctemp, ft, ftc)  #Len cannot be put into para01, that will make Len Float64 type
prob = ODEProblem(Test3DMolasses, u0, tspan, p)

@time y = solve(prob, Tsit5(), save_everystep=false)

And the Constants.jl here

## Physical constants
const h = 6.626e-34
Mu = 1.660538e-27
Mn = 59.0
const Mm = Mu * Mn
const lamda = 606.3e-9
const k = 2pi / lamda
const Gamma = 2pi * 8.3e6
const velfact = Gamma / k
ZXtemp = [0, 0, 0, 76.2939, 122.9512, 122.9512, 122.9512, 147.8177, 147.8177, 147.8177, 147.8177, 147.8177]   # importdata('X_ZeemanCaF.mat');    # 0:1:1000G
const ZX = round.((ZXtemp .- ZXtemp[1]) / 8.3, digits=2)   # unit Gamma  relative to the lowest level round to 0.01 Gamma
ZBtemp = [-5 * 2 * pi * 1e6, 0, 0, 0] / Gamma
const ZB = round.(ZBtemp, digits=2)        # round(ZB*100)/100;     # round to 0.01Gamma

#=
#Fp  0           1            1        1
#    13          14          15       16
#mFp 0           1            0       -1
ftemp = [-0.1355 0.4649 -0.4645 0.0000    #  1  1  1
    -0.1355 0.4650 -0.0001 -0.4641    #  0  2  1
    -0.1354 -0.0000 0.4646 -0.4642    # -1  3  1
    -0.0020 -0.4705 0.4714 -0.4722    #  0  4  0
    0.5614 0.0000 -0.1853 0.1846     # -1  5  1
    -0.5612 0.1865 -0.0010 -0.1832    #  0  6  1
    -0.5610 0.1851 -0.1844 0.0000    #  1  7  1
    -0.0000 0.0000 -0.0000 0.4083    # -2  8  2
    0.0009 -0.0000 0.2880 0.2894     # -1  9  2
    -0.0011 -0.1659 -0.3334 -0.1675    #  0  10 2
    -0.0009 -0.2880 -0.2894 0.0000    #  1  11 2
    0.0000 0.4083 0.0000 0.0000]   #  2  12 2
=#

ftemp = [-0.1354 0.4646 0.4646 0
    0.1354 -0.4646 0 0.4646
    -0.1354 0 -0.4646 -0.4646
    0 0.4714 0.4714 0.4714
    0.5612 0 0.1849 0.1849
    -0.5612 0.1849 0 -0.1849
    0.5612 -0.1849 -0.1849 0
    0 0 0 0.4082
    0 0 0.2887 -0.2887
    0 0.1667 -0.3333 0.1667
    0 -0.2887 0.2887 0
    0 0.4082 0 0]   #  from Yang's method

for i = 1:4
    ftemp[:, i] = ftemp[:, i] ./ sqrt(sum(ftemp[:, i] .^ 2))
end

const f = ftemp
const r = ftemp   # the branching ratio

const gl = [1, 0, -1, 0, -1, 0, 1, -2, -1, 0, 1, 2]
const el = [0, 1, 0, -1]

II = zeros(136)
II[1] = 1.0 / 12.0
II[13] = 1.0 / 12.0
II[24] = 1.0 / 12.0
II[34] = 1.0 / 12.0
II[43] = 1.0 / 12.0
II[51] = 1.0 / 12.0
II[58] = 1.0 / 12.0
II[64] = 1.0 / 12.0
II[69] = 1.0 / 12.0
II[73] = 1.0 / 12.0
II[76] = 1.0 / 12.0
II[78] = 1.0 / 12.0
const IIF = complex(II)

const Polartemp = [1.0, -1.0]  # use Float64 types can reduce a lot of problems

#laser      1                2           3               4               Level
const deltatemp = -[ZX[1]-ZX[2] ZX[1]-ZX[4] ZX[1]-ZX[6] ZX[1]-ZX[10] ZX[1]-ZX[10]  #  1  1   1
    ZX[2]-ZX[2] ZX[2]-ZX[4] ZX[2]-ZX[6] ZX[2]-ZX[10] ZX[2]-ZX[10]  #  0  2   1
    ZX[3]-ZX[2] ZX[3]-ZX[4] ZX[3]-ZX[6] ZX[3]-ZX[10] ZX[3]-ZX[10]  # -1  3   1
    ZX[4]-ZX[2] ZX[4]-ZX[4] ZX[4]-ZX[6] ZX[4]-ZX[10] ZX[4]-ZX[10]  #  0  4   0
    ZX[5]-ZX[2] ZX[5]-ZX[4] ZX[5]-ZX[6] ZX[5]-ZX[10] ZX[5]-ZX[10]  # -1  5   1
    ZX[6]-ZX[2] ZX[6]-ZX[4] ZX[6]-ZX[6] ZX[6]-ZX[10] ZX[6]-ZX[10]  #  0  6   1
    ZX[7]-ZX[2] ZX[7]-ZX[4] ZX[7]-ZX[6] ZX[7]-ZX[10] ZX[7]-ZX[10]  #  1  7   1
    ZX[8]-ZX[2] ZX[8]-ZX[4] ZX[8]-ZX[6] ZX[8]-ZX[10] ZX[8]-ZX[10]  # -2  8   2
    ZX[9]-ZX[2] ZX[9]-ZX[4] ZX[9]-ZX[6] ZX[9]-ZX[10] ZX[9]-ZX[10]  # -1  9   2
    ZX[10]-ZX[2] ZX[10]-ZX[4] ZX[10]-ZX[6] ZX[10]-ZX[10] ZX[10]-ZX[10]  #  0  10  2
    ZX[11]-ZX[2] ZX[11]-ZX[4] ZX[11]-ZX[6] ZX[11]-ZX[10] ZX[11]-ZX[10]  #  1  11  2
    ZX[12]-ZX[2] ZX[12]-ZX[4] ZX[12]-ZX[6] ZX[12]-ZX[10] ZX[12]-ZX[10]]  #  2  12  2

const q = [-1, 0, 1]   # define spherical vector component


ggtemp = zeros(12, 12)
eetemp = zeros(4, 4)
getemp = zeros(12, 4)

icount = 0
for ga = 1:12, gb = ga:12
    global icount += 1
    ggtemp[ga, gb] = icount
end
for ea = 1:4, eb = ea:4
    global icount += 1
    eetemp[ea, eb] = icount
end
for ga = 1:12, eb = 1:4
    global icount += 1
    getemp[ga, eb] = icount
end

const gg=Int.(ggtemp)
const ee=Int.(eetemp)
const ge=Int.(getemp)
function Test3DMolasses(dy, y, p, t)
    para01, delta, s, Polar, Len, a, b, c, ft, ftc = p
    vx, vy, vz, rx, ry, rz, theta1, theta2 = para01

    sx, cx = sincos(k * (rx + vx * (t / Gamma)))
    sy, cy = sincos(k * (ry + vy * (t / Gamma)))
    sz, cz = sincos(k * (rz + vz * (t / Gamma)))
    cisth1 = cis(theta1)
    cisth2 = cis(theta2)

    aplus = -1.0im * cisth1 * cx + cisth2 * sy + cz - 1.0im * sz
    bplus = sqrt(2.0) * cisth1 * sx + sqrt(2.0) * cisth2 * cy
    cplus = -1.0im * cisth1 * cx - cisth2 * sy - cz - 1.0im * sz      #Circularly��?1 0 -1 in epision��?-1 0 +1 in f_q
    # sigma- means the laser component polarization from the positive direction is +
    aminus = -1.0im * cisth1 * cx - cisth2 * sy + cz + 1.0im * sz
    bminus = -sqrt(2.0) * cisth1 * sx + sqrt(2.0) * cisth2 * cy
    cminus = -1.0im * cisth1 * cx + cisth2 * sy - cz + 1.0im * sz     #Circularly��?1 0 -1 in epision��?-1 0 +1 in f_q

    @inbounds for ii in 1:Len
        if Polar[ii] == 1.0
            a[ii] = aplus
            b[ii] = bplus
            c[ii] = cplus
        elseif Polar[ii] == -1.0
            a[ii] = aminus
            b[ii] = bminus
            c[ii] = cminus
        end
    end

    fill!(dy, false)  # both are OK
    factor = 1.0im / 2.0sqrt(2.0)

    @inbounds for jj in 1:Len
        for i = 1:12, j = 1:4
            if el[j] - gl[i] == -1
                ft[jj, i, j] = f[i, j] * a[jj] * cis((delta[jj, i] + ZB[j]) * t)
            elseif el[j] - gl[i] == 0
                ft[jj, i, j] = f[i, j] * b[jj] * cis((delta[jj, i] + ZB[j]) * t)
            elseif el[j] - gl[i] == 1
                ft[jj, i, j] = f[i, j] * c[jj] * cis((delta[jj, i] + ZB[j]) * t)
            end
        end
    end
    @. ftc = conj(ft)

    ## Main part of the ODEs
    @inbounds for jj in 1:Len

        icount = 0
        @inbounds for ga in 1:12, gb in ga:12
            icount += 1
            @inbounds for ecp in 1:4
                dy[icount] += -factor * s[jj] * (ft[jj, ga, ecp] * conj(y[ge[gb, ecp]]) - ftc[jj, gb, ecp] * y[ge[ga, ecp]])
            end
        end

        @inbounds for ea in 1:4, eb in ea:4
            icount += 1
            @inbounds for gc in 1:12
                dy[icount] += factor * s[jj] * (ft[jj, gc, eb] * conj(y[ge[gc, ea]]) - ftc[jj, gc, ea] * y[ge[gc, eb]])
            end
        end

        @inbounds for ga in 1:12, eb in 1:4
            icount += 1
            @inbounds for gc in 1:12
                if ga <= gc
                    dy[icount] += factor * s[jj] * ft[jj, gc, eb] * y[gg[ga, gc]]
                else
                    dy[icount] += factor * s[jj] * ft[jj, gc, eb] * conj(y[gg[gc, ga]])
                end
            end
            @inbounds for ecp in 1:4
                if ecp <= eb
                    dy[icount] -= factor * s[jj] * ft[jj, ga, ecp] * y[ee[ecp, eb]]
                else
                    dy[icount] -= factor * s[jj] * ft[jj, ga, ecp] * conj(y[ee[eb, ecp]])
                end
            end
        end

    end

    icount = 0
    @inbounds for ga in 1:12, gb in ga:12
        icount += 1
        @inbounds for ecp in 1:4, ecpp in 1:4
            if r[ga, ecp] != 0 && r[gb, ecpp] != 0 && el[ecp] - gl[ga] == el[ecpp] - gl[gb] && ecp <= ecpp
                dy[icount] += r[ga, ecp] * r[gb, ecpp] * cis((ZB[ecp] - ZB[ecpp] + ZX[gb] - ZX[ga]) * t) * y[ee[ecp, ecpp]]
            elseif r[ga, ecp] != 0 && r[gb, ecpp] != 0 && el[ecp] - gl[ga] == el[ecpp] - gl[gb] && ecp > ecpp
                dy[icount] += r[ga, ecp] * r[gb, ecpp] * cis((ZB[ecp] - ZB[ecpp] + ZX[gb] - ZX[ga]) * t) * conj(y[ee[ecpp, ecp]])
            end
        end
    end

    @inbounds for jj in 79:88
        dy[jj] -= y[jj]
    end

    @inbounds for jj in 89:136
        dy[jj] -= y[jj] / 2.0
    end
end

The initial program looks like this

function Test3DMolasses(dy, y, p, t)
    para01, delta, s, Polar, Len, a, b, c, ft, ftc = p
    vx, vy, vz, rx, ry, rz, theta1, theta2 = para01

    sx, cx = sincos(k * (rx + vx * (t / Gamma)))
    sy, cy = sincos(k * (ry + vy * (t / Gamma)))
    sz, cz = sincos(k * (rz + vz * (t / Gamma)))
    cisth1 = cis(theta1)
    cisth2 = cis(theta2)

    aplus = -1.0im * cisth1 * cx + cisth2 * sy + cz - 1.0im * sz
    bplus = sqrt(2.0) * cisth1 * sx + sqrt(2.0) * cisth2 * cy
    cplus = -1.0im * cisth1 * cx - cisth2 * sy - cz - 1.0im * sz      #Circularly��?1 0 -1 in epision��?-1 0 +1 in f_q
    # sigma- means the laser component polarization from the positive direction is +
    aminus = -1.0im * cisth1 * cx - cisth2 * sy + cz + 1.0im * sz
    bminus = -sqrt(2.0) * cisth1 * sx + sqrt(2.0) * cisth2 * cy
    cminus = -1.0im * cisth1 * cx + cisth2 * sy - cz + 1.0im * sz     #Circularly��?1 0 -1 in epision��?-1 0 +1 in f_q

    for ii = 1:Len
        if Polar[ii] == 1.0
            a[ii] = aplus
            b[ii] = bplus
            c[ii] = cplus
        elseif Polar[ii] == -1.0
            a[ii] = aminus
            b[ii] = bminus
            c[ii] = cminus
        end
    end

    # g g 1
    #dy .= 0.0
    fill!(dy, false)  # both are OK
    factor = 1.0im / 2.0sqrt(2.0)

    @inbounds for jj = 1:Len
        for i = 1:12, j = 1:4
            if el[j] - gl[i] == -1
                ft[jj, i, j] = f[i, j] * a[jj] * cis((delta[jj, i] + ZB[j]) * t)
            elseif el[j] - gl[i] == 0
                ft[jj, i, j] = f[i, j] * b[jj] * cis((delta[jj, i] + ZB[j]) * t)
            elseif el[j] - gl[i] == 1
                ft[jj, i, j] = f[i, j] * c[jj] * cis((delta[jj, i] + ZB[j]) * t)
            end
        end
    end
    @. ftc = conj(ft)

    @inbounds for jj = 1:Len
        dy[1] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[89]) + ft[jj, 1, 2] * conj(y[90]) + ft[jj, 1, 3] * conj(y[91]) - ftc[jj, 1, 1] * y[89] - ftc[jj, 1, 2] * y[90] - ftc[jj, 1, 3] * y[91])

        dy[2] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[93]) + ft[jj, 1, 2] * conj(y[94]) + ft[jj, 1, 3] * conj(y[95]) - ftc[jj, 2, 1] * y[89] - ftc[jj, 2, 2] * y[90] - ftc[jj, 2, 3] * y[91] - ftc[jj, 2, 4] * y[92])

        dy[3] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[97]) + ft[jj, 1, 2] * conj(y[98]) + ft[jj, 1, 3] * conj(y[99]) - ftc[jj, 3, 1] * y[89] - ftc[jj, 3, 3] * y[91] - ftc[jj, 3, 4] * y[92])

        dy[4] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[101]) + ft[jj, 1, 2] * conj(y[102]) + ft[jj, 1, 3] * conj(y[103]) - ftc[jj, 4, 1] * y[89] - ftc[jj, 4, 2] * y[90] - ftc[jj, 4, 3] * y[91] - ftc[jj, 4, 4] * y[92])

        dy[5] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[105]) + ft[jj, 1, 2] * conj(y[106]) + ft[jj, 1, 3] * conj(y[107]) - ftc[jj, 5, 1] * y[89] - ftc[jj, 5, 3] * y[91] - ftc[jj, 5, 4] * y[92])

        dy[6] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[109]) + ft[jj, 1, 2] * conj(y[110]) + ft[jj, 1, 3] * conj(y[111]) - ftc[jj, 6, 1] * y[89] - ftc[jj, 6, 2] * y[90] - ftc[jj, 6, 3] * y[91] - ftc[jj, 6, 4] * y[92])

        dy[7] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[113]) + ft[jj, 1, 2] * conj(y[114]) + ft[jj, 1, 3] * conj(y[115]) - ftc[jj, 7, 1] * y[89] - ftc[jj, 7, 2] * y[90] - ftc[jj, 7, 3] * y[91])

        dy[8] += -factor * s[jj] * (ft[jj, 1, 1] * conj(y[117]) + ft[jj, 1, 2] * conj(y[118]) + ft[jj, 1, 3] * conj(y[119]) - ftc[jj, 8, 4] * y[92])

I cannot show all of it since it’s too long

Did you try to profile it? Where is most of the time spent? Also, please provide enough info so that the problem can be reproduced. For example, what are the input values for the functions that you use?

One things I can see is:

   @inbounds for ecp in 1:4
                dy[icount] += -factor * s[jj] * (ft[jj, ga, ecp] * conj(y[ge[gb, ecp]]) - ftc[jj, gb, ecp] * y[ge[ga, ecp]])
            end

Looking at ft[jj, ga, ecp] you are looping in the wrong order. You want to loop over the firstmost index the fastest. See the performance tips: Performance Tips · The Julia Language

So you want it to be:

@inbounds for jj in 1:Len
    for ga in 1:12, gb in ga:12
        for ecp in 1:4
            ... ft[ecp, ga, jj] ...
        end
    end
end

Also, you don’t need to put @inbounds on all the inner loops, it is enough to put it on the most outer one.

2 Likes

Thanks for your advice, I will try and I also put the whole test program here so that you can run it, it’s just too long~

Can you provide a summary of the system you are solving?

It’s Optical Bloch Equations, typically used to solve the interaction between lasers and the atoms or moleucules so that we can extract the interaction force

did you try loopvectorization, some unrolling might be needed

Could you show more detail?

Maybe this was missed but fixing the loop order as described in Increase the speed of For Loop in julia - #2 by kristoffer.carlsson is crucial. No point doing anything else until that is done.

I tried that and it increases the speed just a litttle bit. Not big improvement~

I meant what are the equations and how do you discretize them and arrive at the code you have. Also, it might help to refactor the code into smaller functions and profile them.