Multithreading for nested loops

Hello everyone,

I am relatively new to Julia (from a Fortran background). I have a similar problem trying to use multithreads in Julia for 3 nested loops. Improving performance of Integral equations (3-Nested loops) (Mathematical details in previous post)
I had help from leandromartinez98 who suggested that I use Static Arrays for the inner-most loops which improved the memory allocation problem.
I am also using @simd @inbounds for the inner_most_loop which improved the computation time slightly.
Then, I wanted to use multithreading where I got stuck. I have some questions that I will be grateful if you can help answering.

  1. Why does @simd @inbounds not improve performance for the the two outer loops. Is it because Julia has already vectorised them?
  2. I am not sure if using @inbounds everywhere in front of the matrix RM in the outer_middle_loops is necessary.
  3. Incorrectly using @threads I think my matrix (RM) whose elements are being summed is being accessed by different threads inconsistently. Every time I run the program I get a different answer. This is the same problem that happens in Fortran when variables are not properly categorised into shared, private or reduction variables.
    PS: I also trying using @FLoops but I think it only works for scalar reduction (I could be very wrong).

I think this is similar to the problem in the threaded_sum1 function in this tutorial Multithreading | JuliaAcademy Since, I am populating a matrix rather a scalar as in this tutorial, I am not entirely sure how to do this correctly.

Here is the correct output without multithreading:

julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
  184.797 μs (62 allocations: 38.02 KiB)
0.009699171821160191

Here are two incorrect outputs from two different runs when using @threads:

julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
  112.016 μs (83 allocations: 40.05 KiB)
0.009692427660927668
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
  109.603 μs (83 allocations: 40.05 KiB)
0.008853178975538143

Codes below :

using FastGaussQuadrature, Plots, LinearAlgebra, BenchmarkTools, TimerOutputs,
StaticArrays, .Threads

function sgf_3d_fs(x,y,z,x0,y0,z0)
    dx = x-x0
    dy = y-y0
    dz = z-z0
    dxx = dx^2
    dxy = dx*dy
    dxz = dx*dz
    dyy = dy^2
    dyz = dy*dz
    dzz = dz^2

    r  = sqrt(dxx+dyy+dzz);
    r3 = r*r*r
    ri  = 1.0/r
    ri3 = 1.0/r3

    g11 = ri  + dxx*ri3
    g12 = dxy*ri3
    g13 = dxz*ri3

    g21 = g12
    g22 = ri + dyy*ri3
    g23 = dyz*ri3

    g31 = g13
    g32 = g23
    g33 = ri + dzz*ri3

    GEk = @SMatrix [ g11 g12 g13
                     g21 g22 g23
                     g31 g32 g33 ]
    return GEk
end

function inner_loop(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
    # x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
    # xc, yc are the coordinates of collocation point
    GE = zeros(SMatrix{3,3,Float64}) #
    GEself = zeros(SMatrix{3,3,Float64})
    d = 0.001 # regularisation parameter
     @inbounds @simd for k = 1:ng
        xi = xg[k]
        xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
        yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
        sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
        hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
       # Call Green's function for Stokes equation
        GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
        @inbounds GE += GEk*hl*wg[k]
        ds = abs(sc-sint)
        dss = sqrt(ds^2 + d^2)
        Gs11 = ((1.0 + tcx*tcx)/dss)*hl*wg[k]
        Gs12 = ((0.0 + tcx*tcy)/dss)*hl*wg[k]
        Gs21 = ((0.0 + tcy*tcx)/dss)*hl*wg[k]
        Gs22 = ((1.0 + tcy*tcy)/dss)*hl*wg[k]
        @inbounds  GEself += @SMatrix [ Gs11 Gs12 0
                              Gs21 Gs22 0
                              0    0    0 ]
    end
    return GE, GEself
end

function outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
   # Initialize the matrix and right hand side
    RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
    rhs = zeros(2*(nx-1) + 1,1)

    @threads for j = 1:nx-1   # using Threads
    #for j = 1:nx-1 # looping over columns first
        j1 = (j-1)*2 + 1
        j2 = (j-1)*2 + 2

        tcx = tmx[j]
        tcy = tmy[j]

        # Local operator terms( \Lambda [f_h] )
        @inbounds  RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
        @inbounds  RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
        @inbounds  RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
        @inbounds  RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π

        @inbounds  rhs[j1] = 0
        @inbounds  rhs[j2] = vm[j]
        @inbounds  RM[2*(nx-1) + 1,j1] = ls[j]
        @inbounds  RM[j1,2*(nx-1) + 1] = 1

        for i = 1:nx-1
            i1 = (i-1)*2 + 1
            i2 = (i-1)*2 + 2
            xc = xm[i]
            yc = ym[i]
            tcx = tmx[i]
            tcy = tmy[i]
            sc = sm[i]
            # Single element k-Loop Integration using quadratures
            GE,GEself = inner_loop(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)

            # Non-local operator terms( K [f_h] )

            @inbounds  RM[i1,j1] += GE[1,1]/8π
            @inbounds  RM[i1,j2] += GE[1,2]/8π
            @inbounds  RM[i2,j1] += GE[2,1]/8π
            @inbounds  RM[i2,j2] += GE[2,2]/8π

            @inbounds  RM[i1,i1] += -GEself[1,1]/8π
            @inbounds  RM[i1,i2] += -GEself[1,2]/8π
            @inbounds  RM[i2,i1] += -GEself[2,1]/8π
            @inbounds  RM[i2,i2] += -GEself[2,2]/8π
        end
    end
    return RM, rhs
end

function filamentefficiency(nx,ng,A,t)
    # Empty grad vector
    # Specify the number of nodes
    # nx = 101
    # Note that the number of elements = nx-1
    # How many points needed for integration using Gauss Quadrature?
    # ng = 20
    # Specify the odd-numbered mo20
    # A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
    # time
    # t = 0
    # Points and weights "using FastGaussQuadrature"
    xg, wg = gausslegendre(ng)
    # Generate a sin curve
    # x and y are nodes while x_m and y_m are mid points of an element
    x = LinRange(-π,π,nx)
    nm = 1:1:length(A)
    y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
    # Mid-points (collocation points) and tangents at the mid points.
    xm = 0.5*(x[2:nx] .+ x[1:nx-1])
    ym = 0.5*(y[2:nx] .+ y[1:nx-1])
    ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
    tmx = (x[2:nx] .- x[1:nx-1])./ls
    tmy = (y[2:nx] .- y[1:nx-1])./ls
    sqrt.(tmx.^2 + tmy.^2)
    # Arc length of node points
    s = zeros(nx,1)
    for i = 1:nx-1
        s[i+1] = s[i] + ls[i]
    end
    # Arc length of mid points
    sm = 0.5*(s[2:nx] .+ s[1:nx-1])
   # Filament velocity at mid points (forcing for the equations)
    vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
    ρ = 0.001
    c = 2*log(ρ/s[nx]) + 1
    RM, rhs = outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
    fh = RM\rhs
    Effc = - (fh[2*(nx-1)+1])^2/(sum(fh[2:2:2*(nx-1)].*ls.*vm))
    return Effc
    # println("The x-velocity is ", fh[2*(nx-1)+1])
     display(plot(x,y, aspect_ratio=:equal))
    #display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end