Note the documentation for @threads:
A barrier is placed at the end of the loop which waits for all tasks to finish execution.
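A minimal illustration of that implicit barrier (my own toy example, not from the docs): by the time the macro returns, every iteration has already run, so no explicit wait is needed.

a = zeros(Int, 100)
Threads.@threads for i in 1:100
    a[i] = i
end
@assert sum(a) == 5050  # all iterations finished before this line runs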
ok, thank you! Therefore it is unnecessary!
Note that, since @threads’ chunking is rather simple-minded (it always uses linear indexing), it is very inefficient with multi-dimensional arrays. On the other hand, FLoops.jl efficiently supports various collections, including multi-dimensional arrays, via the SplittablesBase.jl interface.
julia> function f!(ys, xs)
@assert axes(ys) == axes(xs)
Threads.@threads for i in CartesianIndices(xs)
@inbounds ys[i] = xs[i]
end
ys
end;
julia> using FLoops
julia> function g!(ys, xs)
@assert axes(ys) == axes(xs)
@floop ThreadedEx() for i in CartesianIndices(xs)
@inbounds ys[i] = xs[i]
end
ys
end;
julia> xs = randn(100, 100, 100, 100);
julia> @btime f!($(similar(xs)), $xs);
231.146 ms (42 allocations: 3.70 KiB)
julia> @btime g!($(similar(xs)), $xs);
33.356 ms (104 allocations: 8.67 KiB)
julia> Threads.nthreads()
8
Hello everyone,
I am relatively new to Julia (coming from a Fortran background). I have a similar problem trying to use multithreading in Julia for 3 nested loops: Improving performance of Integral equations (3-Nested loops) (mathematical details in that previous post).
I had help from leandromartinez98, who suggested that I use static arrays for the innermost loops, which improved the memory allocation problem.
I am also using @simd and @inbounds for the innermost loop, which improved the computation time slightly.
Then I wanted to use multithreading, and this is where I got stuck. I have some questions that I would be grateful for help with.
I think this is similar to the problem in the threaded_sum1 function in this tutorial: Multithreading | JuliaAcademy. Since I am populating a matrix rather than a scalar as in that tutorial, I am not entirely sure how to do this correctly.
Here is the correct output without multithreading:
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
184.797 μs (62 allocations: 38.02 KiB)
0.009699171821160191
Here are two incorrect outputs from two different runs when using @threads:
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
112.016 μs (83 allocations: 40.05 KiB)
0.009692427660927668
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
109.603 μs (83 allocations: 40.05 KiB)
0.008853178975538143
Code below:
using FastGaussQuadrature, Plots, LinearAlgebra, BenchmarkTools, TimerOutputs,
StaticArrays, .Threads
function sgf_3d_fs(x,y,z,x0,y0,z0)
dx = x-x0
dy = y-y0
dz = z-z0
dxx = dx^2
dxy = dx*dy
dxz = dx*dz
dyy = dy^2
dyz = dy*dz
dzz = dz^2
r = sqrt(dxx+dyy+dzz);
r3 = r*r*r
ri = 1.0/r
ri3 = 1.0/r3
g11 = ri + dxx*ri3
g12 = dxy*ri3
g13 = dxz*ri3
g21 = g12
g22 = ri + dyy*ri3
g23 = dyz*ri3
g31 = g13
g32 = g23
g33 = ri + dzz*ri3
GEk = @SMatrix [ g11 g12 g13
g21 g22 g23
g31 g32 g33 ]
return GEk
end
function inner_loop(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(SMatrix{3,3,Float64}) #
GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
@inbounds @simd for k = 1:ng
xi = xg[k]
xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
@inbounds GE += GEk*hl*wg[k]
ds = abs(sc-sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx*tcx)/dss)*hl*wg[k]
Gs12 = ((0.0 + tcx*tcy)/dss)*hl*wg[k]
Gs21 = ((0.0 + tcy*tcx)/dss)*hl*wg[k]
Gs22 = ((1.0 + tcy*tcy)/dss)*hl*wg[k]
@inbounds GEself += @SMatrix [ Gs11 Gs12 0
Gs21 Gs22 0
0 0 0 ]
end
return GE, GEself
end
function outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
# Initialize the matrix and right hand side
RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
rhs = zeros(2*(nx-1) + 1,1)
@threads for j = 1:nx-1 # using Threads
#for j = 1:nx-1 # looping over columns first
j1 = (j-1)*2 + 1
j2 = (j-1)*2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
@inbounds RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
@inbounds RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
@inbounds RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
@inbounds RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π
@inbounds rhs[j1] = 0
@inbounds rhs[j2] = vm[j]
@inbounds RM[2*(nx-1) + 1,j1] = ls[j]
@inbounds RM[j1,2*(nx-1) + 1] = 1
for i = 1:nx-1
i1 = (i-1)*2 + 1
i2 = (i-1)*2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE,GEself = inner_loop(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)
# Non-local operator terms( K [f_h] )
@inbounds RM[i1,j1] += GE[1,1]/8π
@inbounds RM[i1,j2] += GE[1,2]/8π
@inbounds RM[i2,j1] += GE[2,1]/8π
@inbounds RM[i2,j2] += GE[2,2]/8π
@inbounds RM[i1,i1] += -GEself[1,1]/8π
@inbounds RM[i1,i2] += -GEself[1,2]/8π
@inbounds RM[i2,i1] += -GEself[2,1]/8π
@inbounds RM[i2,i2] += -GEself[2,2]/8π
end
end
return RM, rhs
end
function filamentefficiency(nx,ng,A,t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π,π,nx)
nm = 1:1:length(A)
y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5*(x[2:nx] .+ x[1:nx-1])
ym = 0.5*(y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1])./ls
tmy = (y[2:nx] .- y[1:nx-1])./ls
sqrt.(tmx.^2 + tmy.^2)
# Arc length of node points
s = zeros(nx,1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5*(s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
ρ = 0.001
c = 2*log(ρ/s[nx]) + 1
RM, rhs = outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
fh = RM\rhs
Effc = - (fh[2*(nx-1)+1])^2/(sum(fh[2:2:2*(nx-1)].*ls.*vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
Indeed, lines like RM[i1,i1] += -GEself[1,1]/8π indicate that your code has data races. We can’t expect deterministic behavior.
FYI, there are a couple of examples in the FLoops documentation for writing reductions of mutable objects without data races. See: In-place mutation and Initialization with the @reduce(acc = op(init, x)) syntax. There’s also private variables support.
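To see why such in-place += updates are a race, here is a small self-contained illustration (my own toy, not the original code): several threads read, modify, and write the same memory location without synchronization, so updates get lost.

acc = zeros(1)
Threads.@threads for i in 1:100_000
    acc[1] += 1.0      # unsynchronized read-modify-write, like RM[i1, i1] += ...
end
acc[1]                 # with more than one thread, typically below 100000 and different on every run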
Dear Takafumi,
Thank you for the help. I am trying to learn FLoops. I tried using it directly in my big code but failed, so I am trying to learn its usage through minimal examples. As you can see in my main code, there are multiple reductions happening on the elements of my main matrix RM, so I tried to make a minimal code that mimics this. Why is Floop 3 wrong? Perhaps I don’t understand how the @reduce() do ... end syntax works.
using FLoops
n = 2
# Serial 1
ys01 = zeros(3)
ys02 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
ys01 .+= xs
ys02 .+= 2xs
end
println("Serial 1")
display([ys01 ys02])
# Floop 1
@floop for x in 1:n
xs = [x, 2x, 3x]
@reduce(ys1 .+= 1.0*xs)
@reduce(ys2 .+= 2.0*xs)
end
println("Floop 1")
display([ys1 ys2])
# Serial 2
ys01 = zeros(3)
ys02 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
xs1 = 2*[x, 2x, 3x]
ys01 .+= xs
ys02 .+= xs1
end
println("Serial 2")
display([ys01 ys02])
# Floop 2
@floop for x in 1:n
xs = [x, 2x, 3x]
xs1 = 2*[x, 2x, 3x]
@reduce() do (ys3 = zeros(3); xs)
ys3 .+= xs
end
@reduce() do (ys4 = zeros(3); xs1)
ys4 .+= xs1
end
end
println("Floop 2")
display([ys3 ys4])
# Serial 3
ys5 = zeros(3)
ys6 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
ys5 .+= xs
ys6 .+= 2*xs1
end
println("Serial 3")
display([ys5 ys6])
# Floop 3
@floop for x in 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
@reduce() do (ys5 = zeros(3); xs), (ys6 = zeros(3); xs1)
ys5 .+= xs
ys6 .+= 2*xs1
end
end
println("Floop 3")
display([ys5 ys6])
The outputs are
Serial 1
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 1
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Serial 2
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 2
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Serial 3
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 3
3×2 Matrix{Float64}:
3.0 10.0
6.0 20.0
9.0 30.0
This is a very good question! It produced an unintended result because you do “too much” inside @reduce. A correct implementation is:
julia> @floop for x = 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
zs = xs
zs1 = 2 .* xs1
@reduce() do (ys5 = zeros(3); zs), (ys6 = zeros(3); zs1)
ys5 .+= zs
ys6 .+= zs1
end
end
[ys5 ys6]
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Notice that 2 .* xs1 is outside @reduce. In general, we need to pull the “per iteration” computation out of @reduce. This is because @reduce is used to generate two pieces of code: (1) the base-case sequential code (i.e., the code you see when you strip off the @reduce() do ... lines and the corresponding end); and (2) a function that merges the results (ys5 and ys6 above) from two tasks into one. In your Floop 3 example, you were doubling the ys6 from (say) Task 2 when merging it into the ys6 of (say) Task 1. However, Task 2 had already doubled the elements before accumulating them into its ys6. This is why ys6 was larger than intended.
(A concise explanation: the @reduce body was not associative.)
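To make those two generated pieces concrete, here is a hand-written sketch (my own illustration of the concept, not FLoops’ actual generated code) for the corrected loop above, pretending there are exactly two tasks handling x = 1 and x = 2:

# Base case: what each task runs on its own chunk; per-iteration work stays outside the reduction body.
function basecase(chunk)
    ys5 = zeros(3); ys6 = zeros(3)
    for x in chunk
        zs  = [x, 2x, 3x]        # per-iteration computation
        zs1 = 2 .* [x, 2x, 3x]
        ys5 .+= zs               # the @reduce body
        ys6 .+= zs1
    end
    return ys5, ys6
end

# Merge: combine two tasks' results into one; it must not redo per-iteration work.
merge2((a5, a6), (b5, b6)) = (a5 .+= b5, a6 .+= b6)

ys5, ys6 = merge2(basecase(1:1), basecase(2:2))
[ys5 ys6]   # matches the serial result
# If the merge doubled b6 again (as the original Floop 3 body effectively did),
# Task 2's already-doubled contribution would be doubled twice.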
Hello Takafumi,
While I understand the toy models, I have been unable to modify my original code to do the reduction on the arrays GE, GEself, and RM, probably because I don’t understand the syntax completely. I wish to use @floop for the j, i, and k loops in my code below. When I tried to implement it like in the toy model, it gave me an error saying j1 is not defined. Can you kindly point me in the right direction?
I will greatly appreciate any help.
using FastGaussQuadrature,
Plots, LinearAlgebra, BenchmarkTools, TimerOutputs, StaticArrays, FLoops
function sgf_3d_fs(x, y, z, x0, y0, z0)
dx = x - x0
dy = y - y0
dz = z - z0
dxx = dx^2
dxy = dx * dy
dxz = dx * dz
dyy = dy^2
dyz = dy * dz
dzz = dz^2
r = sqrt(dxx + dyy + dzz)
r3 = r * r * r
ri = 1.0 / r
ri3 = 1.0 / r3
g11 = ri + dxx * ri3
g12 = dxy * ri3
g13 = dxz * ri3
g21 = g12
g22 = ri + dyy * ri3
g23 = dyz * ri3
g31 = g13
g32 = g23
g33 = ri + dzz * ri3
GEk = @SMatrix [
g11 g12 g13
g21 g22 g23
g31 g32 g33
]
return GEk
end
function inner_loop(xc, yc, tcx, tcy, sc, x1, y1, x2, y2, s1, s2, xg, wg, ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(SMatrix{3,3,Float64})
GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
#@inbounds @simd for k = 1:ng
@floop for k = 1:ng
xi = xg[k]
xint = 0.5 * ((x2 + x1) + (x2 - x1) * xi)
yint = 0.5 * ((y2 + y1) + (y2 - y1) * xi)
sint = 0.5 * ((s2 + s1) + (s2 - s1) * xi)
hl = 0.5 * sqrt((x2 - x1)^2 + (y2 - y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint, yint, 0, xc, yc, 0) # two dimensional problem: z=0
# REDUCE ME
GE += GEk * hl * wg[k]
ds = abs(sc - sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx * tcx) / dss) * hl * wg[k]
Gs12 = ((0.0 + tcx * tcy) / dss) * hl * wg[k]
Gs21 = ((0.0 + tcy * tcx) / dss) * hl * wg[k]
Gs22 = ((1.0 + tcy * tcy) / dss) * hl * wg[k]
# REDUCE ME
GEself += @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
return GE, GEself
end
function outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
# Initialize the matrix and right hand side
RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
rhs = zeros(2 * (nx - 1) + 1, 1)
@floop for j = 1:nx-1 # looping over columns first
j1 = (j - 1) * 2 + 1
j2 = (j - 1) * 2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
# REDUCE ME
RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
rhs[j1] = 0
rhs[j2] = vm[j]
RM[2*(nx-1)+1, j1] = ls[j]
RM[j1, 2*(nx-1)+1] = 1
@floop for i = 1:nx-1
i1 = (i - 1) * 2 + 1
i2 = (i - 1) * 2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE, GEself = inner_loop(
xc,
yc,
tcx,
tcy,
sc,
x[j],
y[j],
x[j+1],
y[j+1],
s[j],
s[j+1],
xg,
wg,
ng,
)
# Non-local operator terms( K [f_h] )
# REDUCE ME
RM[i1, j1] += GE[1, 1] / 8π
RM[i1, j2] += GE[1, 2] / 8π
RM[i2, j1] += GE[2, 1] / 8π
RM[i2, j2] += GE[2, 2] / 8π
RM[i1, i1] += -GEself[1, 1] / 8π
RM[i1, i2] += -GEself[1, 2] / 8π
RM[i2, i1] += -GEself[2, 1] / 8π
RM[i2, i2] += -GEself[2, 2] / 8π
end
end
return RM, rhs
end
function filamentefficiency(nx, ng, A, t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π, π, nx)
nm = 1:1:length(A)
y = (sin.((x .- t) .* (2nm' .- 1))) * A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5 * (x[2:nx] .+ x[1:nx-1])
ym = 0.5 * (y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]) .^ 2 .+ (y[2:nx] .- y[1:nx-1]) .^ 2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1]) ./ ls
tmy = (y[2:nx] .- y[1:nx-1]) ./ ls
sqrt.(tmx .^ 2 + tmy .^ 2)
# Arc length of node points
s = zeros(nx, 1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5 * (s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm' .- 1) .* cos.((xm .- t) .* (2nm' .- 1))) * A
ρ = 0.001
c = 2 * log(ρ / s[nx]) + 1
RM, rhs = outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
fh = RM \ rhs
Effc = -(fh[2*(nx-1)+1])^2 / (sum(fh[2:2:2*(nx-1)] .* ls .* vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
# display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
Are nx = 101 and ng = 20 the typical sizes? If so, I’m not sure it is worth parallelizing inner_loop and maybe the for i = 1:nx-1 loop. It’d be helpful to know the typical problem size and the number of CPUs.
That said, inner_loop is easy to parallelize. You can just slap @reduce in front of GE += ... etc. To improve performance, you might want to configure the initial value explicitly, as in
# No need:
# GE = zeros(SMatrix{3,3,Float64})
# GEself = zeros(SMatrix{3,3,Float64})
...
@floop for k = 1:ng
...
@reduce GE = zeros(SMatrix{3,3,Float64}) + GEk * hl * wg[k]
...
@reduce GEself = zeros(SMatrix{3,3,Float64}) + @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
The outer loops are very tricky. There might be a better way to do this, but I think the easiest approach is to use a separate matrix for each j iteration:
@floop for j = 1:nx-1 # looping over columns first
...
@init RMj = Matrix{Float64}(undef, (2 * (nx - 1) + 1, 2 * (nx - 1) + 1))
RMj .= 0
RMj[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RMj[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RMj[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RMj[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
...
RMj[2*(nx-1)+1, j1] = ls[j]
RMj[j1, 2*(nx-1)+1] = 1
# Since there is no `@reduce` inside, you'd need to specify
# `ThreadedEx()` to use threading
@floop ThreadedEx() for i = 1:nx-1
...
RMj[i1, j1] += GE[1, 1] / 8π
RMj[i1, j2] += GE[1, 2] / 8π
RMj[i2, j1] += GE[2, 1] / 8π
RMj[i2, j2] += GE[2, 2] / 8π
RMj[i1, i1] += -GEself[1, 1] / 8π
RMj[i1, i2] += -GEself[1, 2] / 8π
RMj[i2, i1] += -GEself[2, 1] / 8π
RMj[i2, i2] += -GEself[2, 2] / 8π
end
@reduce() do (RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1); RMj)
RM .+= RMj
end
end
You can also shave off the RM .+= RMj for each j iteration with:
@init RMj = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
# no `RMj .= 0`
...
RMj′ = Some(RMj)
@reduce() do (RM = nothing; RMj′)
if RMj′ isa Some
RM = something(RMj′)
elseif RM === nothing
RM = RMj′
else
RM .+= RMj′
end
end
(It’s a bit ugly so I probably should add a syntax for this in FLoops.jl)
@tkf thank you very much.
The number of cores on my desktop is 48.
Well, the size is determined by the physics. This is a small problem consisting of one filament in a fluid. In this problem, the filament will show kinks, so I would like to go up to nx = 501.
(However, we can generalize this to many filaments interacting in a fluid. Think of many sedimenting spaghetti; in that case N_filaments may be 100 or so, giving nx = 100*501.)
ng is 20 here, but it can be decreased to about 10.
I am getting a HasBoxedVariableError. Did it work for you?
julia> include("filamentstokesserial.jl"); filamentefficiency(2^8,20,[1.00, -0.01, 0.01, 0.005, 0.015, -0.008],0)
ERROR: HasBoxedVariableError: Closure __##reducing_function#400 (defined in Main) has 3 boxed variables: RMj, tcy, tcx
HINT: Consider adding declarations such as `local RMj` at the narrowest possible scope required.
NOTE: This is very likely required for avoiding data races. If boxing the variables is intended, use `Ref{Any}(...)` instead.
Stacktrace:
[1] verify_no_boxes(f::var"#__##reducing_function#400#56"{Int64, Int64, Int64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64})
@ FLoops ~/.julia/packages/FLoops/yli1G/src/checkboxes.jl:5
[2] macro expansion
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:630 [inlined]
[3] macro expansion
@ ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:123 [inlined]
[4] __
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:620 [inlined]
[5] (::InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}})(x::Tuple{FLoops._FLoopInit, FLoops._FLoopInit}, y::Int64)
@ InitialValues ~/.julia/packages/InitialValues/EPz1F/src/InitialValues.jl:305
[6] next
@ ~/.julia/packages/Transducers/oLdU1/src/combinators.jl:261 [inlined]
[7] next
@ ~/.julia/packages/Transducers/oLdU1/src/core.jl:289 [inlined]
[8] macro expansion
@ ~/.julia/packages/Transducers/oLdU1/src/core.jl:181 [inlined]
[9] _foldl_array
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:187 [inlined]
[10] __foldl__
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:182 [inlined]
[11] foldl_nocomplete
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:356 [inlined]
[12] _reduce_basecase(rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/threading_utils.jl:56
[13] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:170
[14] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:179
[15] _transduce_assoc_nocomplete
@ ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:159 [inlined]
[16] transduce_assoc(xform::Transducers.IdentityTransducer, step::Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, coll0::UnitRange{Int64}; simd::Val{false}, basesize::Nothing, stoppable::Nothing, nestlevel::Nothing)
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:128
[17] transduce
@ ~/.julia/packages/Transducers/oLdU1/src/executors.jl:50 [inlined]
[18] transduce
@ ~/.julia/packages/Transducers/oLdU1/src/executors.jl:62 [inlined]
[19] _fold
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:653 [inlined]
[20] _fold
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:651 [inlined]
[21] macro expansion
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:631 [inlined]
[22] outer_middle_loops(xm::LinRange{Float64}, ym::Vector{Float64}, tmx::Vector{Float64}, tmy::Vector{Float64}, sm::Vector{Float64}, vm::Vector{Float64}, ls::Vector{Float64}, s::Matrix{Float64}, c::Float64, x::LinRange{Float64}, y::Vector{Float64}, xg::Vector{Float64}, wg::Vector{Float64}, ng::Int64, nx::Int64)
@ Main ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:94
[23] filamentefficiency(nx::Int64, ng::Int64, A::Vector{Float64}, t::Int64)
@ Main ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:214
[24] top-level scope
@ none:1
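Following the HINT in that error, the pattern it asks for is a local declaration at the narrowest scope, so that the generated closure does not capture a shared, boxed binding. A minimal runnable sketch of that pattern on a toy loop (my own example; whether it resolves the error in the full code below is untested):

using FLoops

function local_demo(n)
    @floop ThreadedEx() for j in 1:n
        local t = 2j              # explicitly a fresh, per-iteration binding
        @reduce(acc = 0 + t)      # sums 2j over j = 1:n
    end
    return acc
end

local_demo(10)  # == 110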
using FastGaussQuadrature,
Plots, LinearAlgebra, BenchmarkTools, TimerOutputs, StaticArrays, FLoops
function sgf_3d_fs(x, y, z, x0, y0, z0)
dx = x - x0
dy = y - y0
dz = z - z0
dxx = dx^2
dxy = dx * dy
dxz = dx * dz
dyy = dy^2
dyz = dy * dz
dzz = dz^2
r = sqrt(dxx + dyy + dzz)
r3 = r * r * r
ri = 1.0 / r
ri3 = 1.0 / r3
g11 = ri + dxx * ri3
g12 = dxy * ri3
g13 = dxz * ri3
g21 = g12
g22 = ri + dyy * ri3
g23 = dyz * ri3
g31 = g13
g32 = g23
g33 = ri + dzz * ri3
GEk = @SMatrix [
g11 g12 g13
g21 g22 g23
g31 g32 g33
]
return GEk
end
function inner_loop(xc, yc, tcx, tcy, sc, x1, y1, x2, y2, s1, s2, xg, wg, ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
#GE = zeros(SMatrix{3,3,Float64})
#GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
#@inbounds @simd for k = 1:ng
@floop for k = 1:ng
xi = xg[k]
xint = 0.5 * ((x2 + x1) + (x2 - x1) * xi)
yint = 0.5 * ((y2 + y1) + (y2 - y1) * xi)
sint = 0.5 * ((s2 + s1) + (s2 - s1) * xi)
hl = 0.5 * sqrt((x2 - x1)^2 + (y2 - y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint, yint, 0, xc, yc, 0) # two dimensional problem: z=0
# REDUCE ME
@reduce GE = zeros(SMatrix{3,3,Float64}) + GEk * hl * wg[k]
ds = abs(sc - sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx * tcx) / dss) * hl * wg[k]
Gs12 = ((0.0 + tcx * tcy) / dss) * hl * wg[k]
Gs21 = ((0.0 + tcy * tcx) / dss) * hl * wg[k]
Gs22 = ((1.0 + tcy * tcy) / dss) * hl * wg[k]
# REDUCE ME
@reduce GEself = zeros(SMatrix{3,3,Float64}) + @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
return GE, GEself
end
function outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
# Initialize the matrix and right hand side
#RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
rhs = zeros(2 * (nx - 1) + 1, 1)
@floop for j = 1:nx-1 # looping over columns first
j1 = (j - 1) * 2 + 1
j2 = (j - 1) * 2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
# REDUCE ME
@init RMj = Matrix{Float64}(undef, (2 * (nx - 1) + 1, 2 * (nx - 1) + 1))
RMj .= 0
#RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
#RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
#RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
#RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
#RM[2*(nx-1)+1, j1] = ls[j]
#RM[j1, 2*(nx-1)+1] = 1
RMj[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RMj[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RMj[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RMj[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
RMj[2*(nx-1)+1, j1] = ls[j]
RMj[j1, 2*(nx-1)+1] = 1
rhs[j1] = 0
rhs[j2] = vm[j]
@floop ThreadedEx() for i = 1:nx-1
i1 = (i - 1) * 2 + 1
i2 = (i - 1) * 2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE, GEself = inner_loop(
xc,
yc,
tcx,
tcy,
sc,
x[j],
y[j],
x[j+1],
y[j+1],
s[j],
s[j+1],
xg,
wg,
ng,
)
# Non-local operator terms( K [f_h] )
# REDUCE ME
#RM[i1, j1] += GE[1, 1] / 8π
#RM[i1, j2] += GE[1, 2] / 8π
#RM[i2, j1] += GE[2, 1] / 8π
#RM[i2, j2] += GE[2, 2] / 8π
#RM[i1, i1] += -GEself[1, 1] / 8π
#RM[i1, i2] += -GEself[1, 2] / 8π
#RM[i2, i1] += -GEself[2, 1] / 8π
#RM[i2, i2] += -GEself[2, 2] / 8π
RMj[i1, j1] += GE[1, 1] / 8π
RMj[i1, j2] += GE[1, 2] / 8π
RMj[i2, j1] += GE[2, 1] / 8π
RMj[i2, j2] += GE[2, 2] / 8π
RMj[i1, i1] += -GEself[1, 1] / 8π
RMj[i1, i2] += -GEself[1, 2] / 8π
RMj[i2, i1] += -GEself[2, 1] / 8π
RMj[i2, i2] += -GEself[2, 2] / 8π
end
@reduce() do (RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1); RMj)
RM .+= RMj
end
end
return RM, rhs
end
function filamentefficiency(nx, ng, A, t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π, π, nx)
nm = 1:1:length(A)
y = (sin.((x .- t) .* (2nm' .- 1))) * A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5 * (x[2:nx] .+ x[1:nx-1])
ym = 0.5 * (y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]) .^ 2 .+ (y[2:nx] .- y[1:nx-1]) .^ 2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1]) ./ ls
tmy = (y[2:nx] .- y[1:nx-1]) ./ ls
sqrt.(tmx .^ 2 + tmy .^ 2)
# Arc length of node points
s = zeros(nx, 1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5 * (s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm' .- 1) .* cos.((xm .- t) .* (2nm' .- 1))) * A
ρ = 0.001
c = 2 * log(ρ / s[nx]) + 1
RM, rhs = outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
fh = RM \ rhs
Effc = -(fh[2*(nx-1)+1])^2 / (sum(fh[2:2:2*(nx-1)] .* ls .* vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
# display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
fwiw, Tullio appears to beat the competition handily, but the ranking of the remaining alternatives is different from yours:
Number of threads = 2
base line:
127.600 ms (2 allocations: 61.04 MiB)
explicit @inbound:
126.731 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
75.504 ms (14 allocations: 61.04 MiB)
@spawn on the outer loop:
125.014 ms (20 allocations: 61.04 MiB)
nested loop:
125.452 ms (2 allocations: 61.04 MiB)
@threads on the triple loops:
75.657 ms (40214 allocations: 62.88 MiB)
@spawn on the triple loops:
264.148 ms (728323 allocations: 99.82 MiB)
@spawn on the nested loops:
125.592 ms (20 allocations: 61.04 MiB)
@tullio
31.222 ms (22 allocations: 61.04 MiB)
@tullio no fast math
30.692 ms (22 allocations: 61.04 MiB)
That’s quite interesting! How can you get a 4x speedup with only 2 threads? On my current Ubuntu with Julia 1.6.1, the only one that is close to linear scaling is the @threads on the outer loop case. Since your test result is roughly a 2x speedup, I guess you were using 2 threads. Can you share the Tullio expression you used for the test?
This gives me an 8X speedup straight away. The number of threads is 8 on my system. Tullio with LoopVectorization is at least 2X faster than @threads on the outer loop for me.
using Tullio, LoopVectorization
function nestedloops1(nx, ny, nz)
state = ones(nx,ny,nz)
@tullio state[i,j,k] *= sin(i*j*k)
end
nx = ny = nz = 200
@btime nestedloops1($nx, $ny, $nz)
20.237 ms (102 allocations: 61.04 MiB)
@henry2004y I used exactly the same code as @Seif_Shebl, except that I forced the number of threads to be two, and I also ran it while turning off fast math via the fastmath=false flag.
Benchmark with @btime and interpolate the input variables ($nx, …). Also, it is better to return the matrix from the function, or the compiler may sometimes realize that nothing needs to be done.
Also, these benchmarks are probably bound by memory access. Putting something more expensive to compute inside may be a better measure for an actual application (if the application's operation is more expensive).
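For concreteness, a minimal sketch of that advice applied to the nestedloops1 kernel quoted above (the explicit return and the $-interpolation are the two points being made; nothing else changes):

using Tullio, LoopVectorization, BenchmarkTools

function nestedloops1(nx, ny, nz)
    state = ones(nx, ny, nz)
    @tullio state[i, j, k] *= sin(i * j * k)
    return state                      # return the result so the work cannot be elided
end

nx = ny = nz = 200
@btime nestedloops1($nx, $ny, $nz);   # interpolate inputs to avoid global-variable overhead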
fwiw:
for nx = ny = nz = 1000 I get, with two threads:
Number of threads = 2
base line:
24.104 s (2 allocations: 7.45 GiB)
explicit @inbound:
23.428 s (2 allocations: 7.45 GiB)
@threads on the outer loop:
12.612 s (14 allocations: 7.45 GiB)
@spawn on the outer loop:
23.658 s (20 allocations: 7.45 GiB)
nested loop:
24.129 s (2 allocations: 7.45 GiB)
@threads on the triple loops:
12.667 s (1001014 allocations: 7.50 GiB)
@spawn on the triple loops:
26.815 s (18022351 allocations: 8.39 GiB)
@spawn on the nested loops:
23.626 s (20 allocations: 7.45 GiB)
@tullio
2.999 s (18 allocations: 7.45 GiB)
@tullio no fast math
2.969 s (18 allocations: 7.45 GiB)
and with 32 threads:
Number of threads = 32
base line:
23.963 s (2 allocations: 7.45 GiB)
explicit @inbound:
23.522 s (2 allocations: 7.45 GiB)
@threads on the outer loop:
2.142 s (168 allocations: 7.45 GiB)
@spawn on the outer loop:
23.598 s (21 allocations: 7.45 GiB)
nested loop:
23.986 s (2 allocations: 7.45 GiB)
@threads on the triple loops:
2.138 s (1001169 allocations: 7.50 GiB)
@spawn on the triple loops:
35.182 s (18448748 allocations: 8.40 GiB)
@spawn on the nested loops:
23.706 s (21 allocations: 7.45 GiB)
@tullio
1.704 s (454 allocations: 7.45 GiB)
@tullio no fast math
1.707 s (456 allocations: 7.45 GiB)
Thanks! In my first attempt with Tullio, I did not import LoopVectorization. In fact, if we use the @avx macro on this micro example, we can immediately get a 4x speedup without any threading. However, that is not what I wanted to test initially (and probably not a fair comparison?). @avx looks like the simd clause in OpenMP, if I interpret it correctly.
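For reference, what I mean by “just @avx, no threading” is something like the following sketch (my own rewrite of the nestedloops1 kernel with LoopVectorization; the loop order is an assumption):

using LoopVectorization

function nestedloops_avx(nx, ny, nz)
    state = ones(nx, ny, nz)
    @avx for k in 1:nz, j in 1:ny, i in 1:nx   # SIMD-vectorized, single-threaded
        state[i, j, k] *= sin(i * j * k)
    end
    return state
end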
Now with 2 threads and no @avx, Tullio gives me a 1.5x speedup compared with the base case, but @threads on the outer loop is slightly faster than that:
Number of threads = 2
base line:
158.373 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
99.353 ms (14 allocations: 61.04 MiB)
@tullio on the nested loops:
109.440 ms (18 allocations: 61.04 MiB)
@tullio avx on the nested loops:
34.556 ms (17 allocations: 61.04 MiB)
Yes, I see the same ratios as you do. Interestingly, avx/avxt does remarkably well on this micro example.
# base case (1.0X) : 159.463 ms ( 2 allocations: 61.04 MiB)
# 2 threads (1.5X) : 102.046 ms ( 18 allocations: 61.04 MiB)
# 4 threads (2.4X) : 66.887 ms ( 46 allocations: 61.04 MiB)
# 8 threads (2.8X) : 56.180 ms (102 allocations: 61.04 MiB)
# avx (4.0X) : 39.161 ms ( 2 allocations: 61.04 MiB)
# tullio (7.5X) : 21.349 ms (102 allocations: 61.04 MiB)
# avxt (8.2X) : 19.382 ms ( 2 allocations: 61.04 MiB)
I realize this is an old post, but it’s still one of the first results on Google for me, and I have some questions on nested Threads.@spawn.
Re the above quote: the original blog post (Announcing composable multi-threaded parallelism in Julia) claims that “potentially millions” of tasks can be spawned and that’s okay. So why do you say it’s a problem that too many tasks get spawned here?
More generally, any suggestions on how to make the nestedloops8 function more efficient?
I am writing something like this right now and could use some tips.
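For anyone landing here via search, a minimal self-contained sketch of nested Threads.@spawn (my own toy, not the thread’s nestedloops8): the key point is that @spawn only schedules a task, so every task must be fetched (or waited on) to guarantee completion, including tasks spawned from inside other tasks.

using Base.Threads: @spawn

function nested_spawn_sum(n, m)
    outer = map(1:n) do i
        @spawn begin
            inner = map(j -> @spawn(i * j), 1:m)   # tasks spawned from inside a task
            sum(fetch, inner)
        end
    end
    return sum(fetch, outer)
end

nested_spawn_sum(4, 3)  # == 60 == sum(i * j for i in 1:4, j in 1:3)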