Note the documentation for @threads:
A barrier is placed at the end of the loop which waits for all tasks to finish execution.
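A minimal illustration of that implicit barrier (my own toy example, not from the docs): by the time the macro returns, every iteration has already run, so no explicit wait is needed.

a = zeros(Int, 100)
Threads.@threads for i in 1:100
    a[i] = i
end
@assert sum(a) == 5050  # all iterations finished before this line runs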
ok, thank you! Therefore it is unnecessary!
Note that, since @threads’ chunking is rather simple-minded (it always uses linear indexing), it is very inefficient with multi-dimensional arrays. On the other hand, FLoops.jl efficiently supports various collections, including multi-dimensional arrays, via the SplittablesBase.jl interface.
julia> function f!(ys, xs)
@assert axes(ys) == axes(xs)
Threads.@threads for i in CartesianIndices(xs)
@inbounds ys[i] = xs[i]
end
ys
end;
julia> using FLoops
julia> function g!(ys, xs)
@assert axes(ys) == axes(xs)
@floop ThreadedEx() for i in CartesianIndices(xs)
@inbounds ys[i] = xs[i]
end
ys
end;
julia> xs = randn(100, 100, 100, 100);
julia> @btime f!($(similar(xs)), $xs);
231.146 ms (42 allocations: 3.70 KiB)
julia> @btime g!($(similar(xs)), $xs);
33.356 ms (104 allocations: 8.67 KiB)
julia> Threads.nthreads()
8
Hello everyone,
I am relatively new to Julia (coming from a Fortran background). I have a similar problem trying to use multithreading in Julia for 3 nested loops: Improving performance of Integral equations (3-Nested loops) (mathematical details in that previous post).
I had help from leandromartinez98, who suggested that I use static arrays for the innermost loops, which improved the memory allocation problem.
I am also using @simd and @inbounds for the innermost loop, which improved the computation time slightly.
Then I wanted to use multithreading, and this is where I got stuck. I have some questions that I would be grateful for help with.
I think this is similar to the problem in the threaded_sum1 function in this tutorial: Multithreading | JuliaAcademy. Since I am populating a matrix rather than a scalar as in that tutorial, I am not entirely sure how to do this correctly.
Here is the correct output without multithreading:
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
184.797 μs (62 allocations: 38.02 KiB)
0.009699171821160191
Here are two incorrect outputs from two different runs when using @threads:
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
112.016 μs (83 allocations: 40.05 KiB)
0.009692427660927668
julia> @btime filamentefficiency(21,10,[1, 0.01, -0.01, 0.001, - 0.001],0)
109.603 μs (83 allocations: 40.05 KiB)
0.008853178975538143
Code below:
using FastGaussQuadrature, Plots, LinearAlgebra, BenchmarkTools, TimerOutputs,
StaticArrays, .Threads
function sgf_3d_fs(x,y,z,x0,y0,z0)
dx = x-x0
dy = y-y0
dz = z-z0
dxx = dx^2
dxy = dx*dy
dxz = dx*dz
dyy = dy^2
dyz = dy*dz
dzz = dz^2
r = sqrt(dxx+dyy+dzz);
r3 = r*r*r
ri = 1.0/r
ri3 = 1.0/r3
g11 = ri + dxx*ri3
g12 = dxy*ri3
g13 = dxz*ri3
g21 = g12
g22 = ri + dyy*ri3
g23 = dyz*ri3
g31 = g13
g32 = g23
g33 = ri + dzz*ri3
GEk = @SMatrix [ g11 g12 g13
g21 g22 g23
g31 g32 g33 ]
return GEk
end
function inner_loop(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(SMatrix{3,3,Float64}) #
GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
@inbounds @simd for k = 1:ng
xi = xg[k]
xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
@inbounds GE += GEk*hl*wg[k]
ds = abs(sc-sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx*tcx)/dss)*hl*wg[k]
Gs12 = ((0.0 + tcx*tcy)/dss)*hl*wg[k]
Gs21 = ((0.0 + tcy*tcx)/dss)*hl*wg[k]
Gs22 = ((1.0 + tcy*tcy)/dss)*hl*wg[k]
@inbounds GEself += @SMatrix [ Gs11 Gs12 0
Gs21 Gs22 0
0 0 0 ]
end
return GE, GEself
end
function outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
# Initialize the matrix and right hand side
RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
rhs = zeros(2*(nx-1) + 1,1)
@threads for j = 1:nx-1 # using Threads
#for j = 1:nx-1 # looping over columns first
j1 = (j-1)*2 + 1
j2 = (j-1)*2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
@inbounds RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
@inbounds RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
@inbounds RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
@inbounds RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π
@inbounds rhs[j1] = 0
@inbounds rhs[j2] = vm[j]
@inbounds RM[2*(nx-1) + 1,j1] = ls[j]
@inbounds RM[j1,2*(nx-1) + 1] = 1
for i = 1:nx-1
i1 = (i-1)*2 + 1
i2 = (i-1)*2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE,GEself = inner_loop(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)
# Non-local operator terms( K [f_h] )
@inbounds RM[i1,j1] += GE[1,1]/8π
@inbounds RM[i1,j2] += GE[1,2]/8π
@inbounds RM[i2,j1] += GE[2,1]/8π
@inbounds RM[i2,j2] += GE[2,2]/8π
@inbounds RM[i1,i1] += -GEself[1,1]/8π
@inbounds RM[i1,i2] += -GEself[1,2]/8π
@inbounds RM[i2,i1] += -GEself[2,1]/8π
@inbounds RM[i2,i2] += -GEself[2,2]/8π
end
end
return RM, rhs
end
function filamentefficiency(nx,ng,A,t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π,π,nx)
nm = 1:1:length(A)
y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5*(x[2:nx] .+ x[1:nx-1])
ym = 0.5*(y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1])./ls
tmy = (y[2:nx] .- y[1:nx-1])./ls
sqrt.(tmx.^2 + tmy.^2)
# Arc length of node points
s = zeros(nx,1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5*(s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
ρ = 0.001
c = 2*log(ρ/s[nx]) + 1
RM, rhs = outer_middle_loops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
fh = RM\rhs
Effc = - (fh[2*(nx-1)+1])^2/(sum(fh[2:2:2*(nx-1)].*ls.*vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
Indeed, lines like RM[i1,i1] += -GEself[1,1]/8π indicate that your code has data races. We can’t expect deterministic behavior.
FYI, there are a couple of examples in the FLoops documentation for writing reductions of mutable objects without data races. See: In-place mutation and Initialization with the @reduce(acc = op(init, x)) syntax. There’s also private variables support.
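To see why such in-place += updates are a race, here is a small self-contained illustration (my own toy, not the original code): several threads read, modify, and write the same memory location without synchronization, so updates get lost.

acc = zeros(1)
Threads.@threads for i in 1:100_000
    acc[1] += 1.0      # unsynchronized read-modify-write, like RM[i1, i1] += ...
end
acc[1]                 # with more than one thread, typically below 100000 and different on every run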
Dear Takafumi,
Thank you for the help. I am trying to learn FLoops. I tried using it directly in my big code but failed, so I am trying to learn its usage through minimal examples. As you can see in my main code, there are multiple reductions happening on the elements of my main matrix RM, so I tried to make a minimal code that mimics this. Why is Floop 3 wrong? Perhaps I don’t understand how the @reduce() do ... end syntax works.
using FLoops
n = 2
# Serial 1
ys01 = zeros(3)
ys02 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
ys01 .+= xs
ys02 .+= 2xs
end
println("Serial 1")
display([ys01 ys02])
# Floop 1
@floop for x in 1:n
xs = [x, 2x, 3x]
@reduce(ys1 .+= 1.0*xs)
@reduce(ys2 .+= 2.0*xs)
end
println("Floop 1")
display([ys1 ys2])
# Serial 2
ys01 = zeros(3)
ys02 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
xs1 = 2*[x, 2x, 3x]
ys01 .+= xs
ys02 .+= xs1
end
println("Serial 2")
display([ys01 ys02])
# Floop 2
@floop for x in 1:n
xs = [x, 2x, 3x]
xs1 = 2*[x, 2x, 3x]
@reduce() do (ys3 = zeros(3); xs)
ys3 .+= xs
end
@reduce() do (ys4 = zeros(3); xs1)
ys4 .+= xs1
end
end
println("Floop 2")
display([ys3 ys4])
# Serial 3
ys5 = zeros(3)
ys6 = zeros(3)
for x in 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
ys5 .+= xs
ys6 .+= 2*xs1
end
println("Serial 3")
display([ys5 ys6])
# Floop 3
@floop for x in 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
@reduce() do (ys5 = zeros(3); xs), (ys6 = zeros(3); xs1)
ys5 .+= xs
ys6 .+= 2*xs1
end
end
println("Floop 3")
display([ys5 ys6])
The outputs are
Serial 1
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 1
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Serial 2
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 2
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Serial 3
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Floop 3
3×2 Matrix{Float64}:
3.0 10.0
6.0 20.0
9.0 30.0
This is a very good question! It produced an unintended result because you do “too much” inside @reduce. A correct implementation is:
julia> @floop for x = 1:n
xs = [x, 2x, 3x]
xs1 = [x, 2x, 3x]
zs = xs
zs1 = 2 .* xs1
@reduce() do (ys5 = zeros(3); zs), (ys6 = zeros(3); zs1)
ys5 .+= zs
ys6 .+= zs1
end
end
[ys5 ys6]
3×2 Matrix{Float64}:
3.0 6.0
6.0 12.0
9.0 18.0
Notice that 2 .* xs1 is outside @reduce. In general, we need to pull the “per iteration” computation out of @reduce. This is because @reduce is used to generate two pieces of code: (1) the base-case sequential code (i.e., the code you see when you strip off the @reduce() do ... lines and the corresponding end); and (2) a function that merges the results (ys5 and ys6 above) from two tasks into one. In your Floop 3 example, you were doubling the ys6 from (say) Task 2 when merging it into the ys6 of (say) Task 1. However, Task 2 had already doubled the elements before accumulating them into its ys6. This is why ys6 was larger than intended.
(A concise explanation: the @reduce body was not associative.)
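To make those two generated pieces concrete, here is a hand-written sketch (my own illustration of the concept, not FLoops’ actual generated code) for the corrected loop above, pretending there are exactly two tasks handling x = 1 and x = 2:

# Base case: what each task runs on its own chunk; per-iteration work stays outside the reduction body.
function basecase(chunk)
    ys5 = zeros(3); ys6 = zeros(3)
    for x in chunk
        zs  = [x, 2x, 3x]        # per-iteration computation
        zs1 = 2 .* [x, 2x, 3x]
        ys5 .+= zs               # the @reduce body
        ys6 .+= zs1
    end
    return ys5, ys6
end

# Merge: combine two tasks' results into one; it must not redo per-iteration work.
merge2((a5, a6), (b5, b6)) = (a5 .+= b5, a6 .+= b6)

ys5, ys6 = merge2(basecase(1:1), basecase(2:2))
[ys5 ys6]   # matches the serial result
# If the merge doubled b6 again (as the original Floop 3 body effectively did),
# Task 2's already-doubled contribution would be doubled twice.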
Hello Takafumi,
While I understand the toy models, I have been unable to modify my original code to do the reduction on the arrays GE, GEself, and RM, probably because I don’t understand the syntax completely. I wish to use @floop for the j, i, and k loops in my code below. When I tried to implement it like in the toy model, it gave me an error saying j1 is not defined. Can you kindly point me in the right direction?
I will greatly appreciate any help.
using FastGaussQuadrature,
Plots, LinearAlgebra, BenchmarkTools, TimerOutputs, StaticArrays, FLoops
function sgf_3d_fs(x, y, z, x0, y0, z0)
dx = x - x0
dy = y - y0
dz = z - z0
dxx = dx^2
dxy = dx * dy
dxz = dx * dz
dyy = dy^2
dyz = dy * dz
dzz = dz^2
r = sqrt(dxx + dyy + dzz)
r3 = r * r * r
ri = 1.0 / r
ri3 = 1.0 / r3
g11 = ri + dxx * ri3
g12 = dxy * ri3
g13 = dxz * ri3
g21 = g12
g22 = ri + dyy * ri3
g23 = dyz * ri3
g31 = g13
g32 = g23
g33 = ri + dzz * ri3
GEk = @SMatrix [
g11 g12 g13
g21 g22 g23
g31 g32 g33
]
return GEk
end
function inner_loop(xc, yc, tcx, tcy, sc, x1, y1, x2, y2, s1, s2, xg, wg, ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(SMatrix{3,3,Float64})
GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
#@inbounds @simd for k = 1:ng
@floop for k = 1:ng
xi = xg[k]
xint = 0.5 * ((x2 + x1) + (x2 - x1) * xi)
yint = 0.5 * ((y2 + y1) + (y2 - y1) * xi)
sint = 0.5 * ((s2 + s1) + (s2 - s1) * xi)
hl = 0.5 * sqrt((x2 - x1)^2 + (y2 - y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint, yint, 0, xc, yc, 0) # two dimensional problem: z=0
# REDUCE ME
GE += GEk * hl * wg[k]
ds = abs(sc - sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx * tcx) / dss) * hl * wg[k]
Gs12 = ((0.0 + tcx * tcy) / dss) * hl * wg[k]
Gs21 = ((0.0 + tcy * tcx) / dss) * hl * wg[k]
Gs22 = ((1.0 + tcy * tcy) / dss) * hl * wg[k]
# REDUCE ME
GEself += @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
return GE, GEself
end
function outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
# Initialize the matrix and right hand side
RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
rhs = zeros(2 * (nx - 1) + 1, 1)
@floop for j = 1:nx-1 # looping over columns first
j1 = (j - 1) * 2 + 1
j2 = (j - 1) * 2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
# REDUCE ME
RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
rhs[j1] = 0
rhs[j2] = vm[j]
RM[2*(nx-1)+1, j1] = ls[j]
RM[j1, 2*(nx-1)+1] = 1
@floop for i = 1:nx-1
i1 = (i - 1) * 2 + 1
i2 = (i - 1) * 2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE, GEself = inner_loop(
xc,
yc,
tcx,
tcy,
sc,
x[j],
y[j],
x[j+1],
y[j+1],
s[j],
s[j+1],
xg,
wg,
ng,
)
# Non-local operator terms( K [f_h] )
# REDUCE ME
RM[i1, j1] += GE[1, 1] / 8π
RM[i1, j2] += GE[1, 2] / 8π
RM[i2, j1] += GE[2, 1] / 8π
RM[i2, j2] += GE[2, 2] / 8π
RM[i1, i1] += -GEself[1, 1] / 8π
RM[i1, i2] += -GEself[1, 2] / 8π
RM[i2, i1] += -GEself[2, 1] / 8π
RM[i2, i2] += -GEself[2, 2] / 8π
end
end
return RM, rhs
end
function filamentefficiency(nx, ng, A, t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π, π, nx)
nm = 1:1:length(A)
y = (sin.((x .- t) .* (2nm' .- 1))) * A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5 * (x[2:nx] .+ x[1:nx-1])
ym = 0.5 * (y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]) .^ 2 .+ (y[2:nx] .- y[1:nx-1]) .^ 2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1]) ./ ls
tmy = (y[2:nx] .- y[1:nx-1]) ./ ls
sqrt.(tmx .^ 2 + tmy .^ 2)
# Arc length of node points
s = zeros(nx, 1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5 * (s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm' .- 1) .* cos.((xm .- t) .* (2nm' .- 1))) * A
ρ = 0.001
c = 2 * log(ρ / s[nx]) + 1
RM, rhs = outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
fh = RM \ rhs
Effc = -(fh[2*(nx-1)+1])^2 / (sum(fh[2:2:2*(nx-1)] .* ls .* vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
# display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
Are nx = 101 and ng = 20 the typical sizes? If so, I’m not sure it is worth parallelizing inner_loop and maybe the for i = 1:nx-1 loop. It’d be helpful to know the typical problem size and the number of CPUs.
That said, inner_loop is easy to parallelize. You can just slap @reduce in front of GE += ... etc. To improve performance, you might want to configure the initial value explicitly, as in
# No need:
# GE = zeros(SMatrix{3,3,Float64})
# GEself = zeros(SMatrix{3,3,Float64})
...
@floop for k = 1:ng
...
@reduce GE = zeros(SMatrix{3,3,Float64}) + GEk * hl * wg[k]
...
@reduce GEself = zeros(SMatrix{3,3,Float64}) + @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
The outer loops are very tricky. There might be a better way to do this, but I think the easiest approach is to use a separate matrix for each j iteration:
@floop for j = 1:nx-1 # looping over columns first
...
@init RMj = Matrix{Float64}(undef, (2 * (nx - 1) + 1, 2 * (nx - 1) + 1))
RMj .= 0
RMj[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RMj[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RMj[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RMj[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
...
RMj[2*(nx-1)+1, j1] = ls[j]
RMj[j1, 2*(nx-1)+1] = 1
# Since there is no `@reduce` inside, you'd need to specify
# `ThreadedEx()` to use threading
@floop ThreadedEx() for i = 1:nx-1
...
RMj[i1, j1] += GE[1, 1] / 8π
RMj[i1, j2] += GE[1, 2] / 8π
RMj[i2, j1] += GE[2, 1] / 8π
RMj[i2, j2] += GE[2, 2] / 8π
RMj[i1, i1] += -GEself[1, 1] / 8π
RMj[i1, i2] += -GEself[1, 2] / 8π
RMj[i2, i1] += -GEself[2, 1] / 8π
RMj[i2, i2] += -GEself[2, 2] / 8π
end
@reduce() do (RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1); RMj)
RM .+= RMj
end
end
You can also shave off the RM .+= RMj for each j iteration with:
@init RMj = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
# no `RMj .= 0`
...
RMj′ = Some(RMj)
@reduce() do (RM = nothing; RMj′)
if RMj′ isa Some
RM = something(RMj′)
elseif RM === nothing
RM = RMj′
else
RM .+= RMj′
end
end
(It’s a bit ugly so I probably should add a syntax for this in FLoops.jl)
@tkf thank you very much.
The number of cores on my desktop is 48.
Well, the size is determined by the physics. This is a small problem consisting of one filament in a fluid. In this problem, the filament will show kinks, so I would like to go up to nx = 501.
(However, we can generalize this to many filaments interacting in a fluid. Think of many sedimenting spaghetti; in that case N_filaments may be 100 or so, giving nx = 100*501.)
ng is 20 here, but it can be decreased to about 10.
I am getting a HasBoxedVariableError. Did it work for you?
julia> include("filamentstokesserial.jl"); filamentefficiency(2^8,20,[1.00, -0.01, 0.01, 0.005, 0.015, -0.008],0)
ERROR: HasBoxedVariableError: Closure __##reducing_function#400 (defined in Main) has 3 boxed variables: RMj, tcy, tcx
HINT: Consider adding declarations such as `local RMj` at the narrowest possible scope required.
NOTE: This is very likely required for avoiding data races. If boxing the variables is intended, use `Ref{Any}(...)` instead.
Stacktrace:
[1] verify_no_boxes(f::var"#__##reducing_function#400#56"{Int64, Int64, Int64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64})
@ FLoops ~/.julia/packages/FLoops/yli1G/src/checkboxes.jl:5
[2] macro expansion
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:630 [inlined]
[3] macro expansion
@ ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:123 [inlined]
[4] __
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:620 [inlined]
[5] (::InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}})(x::Tuple{FLoops._FLoopInit, FLoops._FLoopInit}, y::Int64)
@ InitialValues ~/.julia/packages/InitialValues/EPz1F/src/InitialValues.jl:305
[6] next
@ ~/.julia/packages/Transducers/oLdU1/src/combinators.jl:261 [inlined]
[7] next
@ ~/.julia/packages/Transducers/oLdU1/src/core.jl:289 [inlined]
[8] macro expansion
@ ~/.julia/packages/Transducers/oLdU1/src/core.jl:181 [inlined]
[9] _foldl_array
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:187 [inlined]
[10] __foldl__
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:182 [inlined]
[11] foldl_nocomplete
@ ~/.julia/packages/Transducers/oLdU1/src/processes.jl:356 [inlined]
[12] _reduce_basecase(rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/threading_utils.jl:56
[13] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:170
[14] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:179
[15] _transduce_assoc_nocomplete
@ ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:159 [inlined]
[16] transduce_assoc(xform::Transducers.IdentityTransducer, step::Transducers.AdHocRF{var"#__##oninit_function#407#52", typeof(identity), var"#__##reducing_function#408#53"{LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Float64, LinRange{Float64}, Vector{Float64}, Vector{Float64}, Vector{Float64}, Int64, Int64, Matrix{Float64}}, typeof(identity), var"#__##combine_function#409#58"{Int64}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, coll0::UnitRange{Int64}; simd::Val{false}, basesize::Nothing, stoppable::Nothing, nestlevel::Nothing)
@ Transducers ~/.julia/packages/Transducers/oLdU1/src/reduce.jl:128
[17] transduce
@ ~/.julia/packages/Transducers/oLdU1/src/executors.jl:50 [inlined]
[18] transduce
@ ~/.julia/packages/Transducers/oLdU1/src/executors.jl:62 [inlined]
[19] _fold
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:653 [inlined]
[20] _fold
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:651 [inlined]
[21] macro expansion
@ ~/.julia/packages/FLoops/yli1G/src/reduce.jl:631 [inlined]
[22] outer_middle_loops(xm::LinRange{Float64}, ym::Vector{Float64}, tmx::Vector{Float64}, tmy::Vector{Float64}, sm::Vector{Float64}, vm::Vector{Float64}, ls::Vector{Float64}, s::Matrix{Float64}, c::Float64, x::LinRange{Float64}, y::Vector{Float64}, xg::Vector{Float64}, wg::Vector{Float64}, ng::Int64, nx::Int64)
@ Main ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:94
[23] filamentefficiency(nx::Int64, ng::Int64, A::Vector{Float64}, t::Int64)
@ Main ~/Desktop/mydocuments/RESEARCH/juliacodes/gausslegendre/filamentstokesserial.jl:214
[24] top-level scope
@ none:1
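Following the HINT in that error, the pattern it asks for is a local declaration at the narrowest scope, so that the generated closure does not capture a shared, boxed binding. A minimal runnable sketch of that pattern on a toy loop (my own example; whether it resolves the error in the full code below is untested):

using FLoops

function local_demo(n)
    @floop ThreadedEx() for j in 1:n
        local t = 2j              # explicitly a fresh, per-iteration binding
        @reduce(acc = 0 + t)      # sums 2j over j = 1:n
    end
    return acc
end

local_demo(10)  # == 110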
using FastGaussQuadrature,
Plots, LinearAlgebra, BenchmarkTools, TimerOutputs, StaticArrays, FLoops
function sgf_3d_fs(x, y, z, x0, y0, z0)
dx = x - x0
dy = y - y0
dz = z - z0
dxx = dx^2
dxy = dx * dy
dxz = dx * dz
dyy = dy^2
dyz = dy * dz
dzz = dz^2
r = sqrt(dxx + dyy + dzz)
r3 = r * r * r
ri = 1.0 / r
ri3 = 1.0 / r3
g11 = ri + dxx * ri3
g12 = dxy * ri3
g13 = dxz * ri3
g21 = g12
g22 = ri + dyy * ri3
g23 = dyz * ri3
g31 = g13
g32 = g23
g33 = ri + dzz * ri3
GEk = @SMatrix [
g11 g12 g13
g21 g22 g23
g31 g32 g33
]
return GEk
end
function inner_loop(xc, yc, tcx, tcy, sc, x1, y1, x2, y2, s1, s2, xg, wg, ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
#GE = zeros(SMatrix{3,3,Float64})
#GEself = zeros(SMatrix{3,3,Float64})
d = 0.001 # regularisation parameter
#@inbounds @simd for k = 1:ng
@floop for k = 1:ng
xi = xg[k]
xint = 0.5 * ((x2 + x1) + (x2 - x1) * xi)
yint = 0.5 * ((y2 + y1) + (y2 - y1) * xi)
sint = 0.5 * ((s2 + s1) + (s2 - s1) * xi)
hl = 0.5 * sqrt((x2 - x1)^2 + (y2 - y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint, yint, 0, xc, yc, 0) # two dimensional problem: z=0
# REDUCE ME
@reduce GE = zeros(SMatrix{3,3,Float64}) + GEk * hl * wg[k]
ds = abs(sc - sint)
dss = sqrt(ds^2 + d^2)
Gs11 = ((1.0 + tcx * tcx) / dss) * hl * wg[k]
Gs12 = ((0.0 + tcx * tcy) / dss) * hl * wg[k]
Gs21 = ((0.0 + tcy * tcx) / dss) * hl * wg[k]
Gs22 = ((1.0 + tcy * tcy) / dss) * hl * wg[k]
# REDUCE ME
@reduce GEself = zeros(SMatrix{3,3,Float64}) + @SMatrix [
Gs11 Gs12 0
Gs21 Gs22 0
0 0 0
]
end
return GE, GEself
end
function outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
# Initialize the matrix and right hand side
#RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
rhs = zeros(2 * (nx - 1) + 1, 1)
@floop for j = 1:nx-1 # looping over columns first
j1 = (j - 1) * 2 + 1
j2 = (j - 1) * 2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
# REDUCE ME
@init RMj = Matrix{Float64}(undef, (2 * (nx - 1) + 1, 2 * (nx - 1) + 1))
RMj .= 0
#RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
#RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
#RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
#RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
#RM[2*(nx-1)+1, j1] = ls[j]
#RM[j1, 2*(nx-1)+1] = 1
RMj[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
RMj[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
RMj[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
RMj[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
RMj[2*(nx-1)+1, j1] = ls[j]
RMj[j1, 2*(nx-1)+1] = 1
rhs[j1] = 0
rhs[j2] = vm[j]
@floop ThreadedEx() for i = 1:nx-1
i1 = (i - 1) * 2 + 1
i2 = (i - 1) * 2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE, GEself = inner_loop(
xc,
yc,
tcx,
tcy,
sc,
x[j],
y[j],
x[j+1],
y[j+1],
s[j],
s[j+1],
xg,
wg,
ng,
)
# Non-local operator terms( K [f_h] )
# REDUCE ME
#RM[i1, j1] += GE[1, 1] / 8π
#RM[i1, j2] += GE[1, 2] / 8π
#RM[i2, j1] += GE[2, 1] / 8π
#RM[i2, j2] += GE[2, 2] / 8π
#RM[i1, i1] += -GEself[1, 1] / 8π
#RM[i1, i2] += -GEself[1, 2] / 8π
#RM[i2, i1] += -GEself[2, 1] / 8π
#RM[i2, i2] += -GEself[2, 2] / 8π
RMj[i1, j1] += GE[1, 1] / 8π
RMj[i1, j2] += GE[1, 2] / 8π
RMj[i2, j1] += GE[2, 1] / 8π
RMj[i2, j2] += GE[2, 2] / 8π
RMj[i1, i1] += -GEself[1, 1] / 8π
RMj[i1, i2] += -GEself[1, 2] / 8π
RMj[i2, i1] += -GEself[2, 1] / 8π
RMj[i2, i2] += -GEself[2, 2] / 8π
end
@reduce() do (RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1); RMj)
RM .+= RMj
end
end
return RM, rhs
end
function filamentefficiency(nx, ng, A, t)
# Empty grad vector
# Specify the number of nodes
# nx = 101
# Note that the number of elements = nx-1
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Specify the odd-numbered modes
# A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
# time
# t = 0
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π, π, nx)
nm = 1:1:length(A)
y = (sin.((x .- t) .* (2nm' .- 1))) * A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5 * (x[2:nx] .+ x[1:nx-1])
ym = 0.5 * (y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]) .^ 2 .+ (y[2:nx] .- y[1:nx-1]) .^ 2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1]) ./ ls
tmy = (y[2:nx] .- y[1:nx-1]) ./ ls
sqrt.(tmx .^ 2 + tmy .^ 2)
# Arc length of node points
s = zeros(nx, 1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5 * (s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm' .- 1) .* cos.((xm .- t) .* (2nm' .- 1))) * A
ρ = 0.001
c = 2 * log(ρ / s[nx]) + 1
RM, rhs = outer_middle_loops(
xm,
ym,
tmx,
tmy,
sm,
vm,
ls,
s,
c,
x,
y,
xg,
wg,
ng,
nx,
)
fh = RM \ rhs
Effc = -(fh[2*(nx-1)+1])^2 / (sum(fh[2:2:2*(nx-1)] .* ls .* vm))
return Effc
# println("The x-velocity is ", fh[2*(nx-1)+1])
# display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
fwiw, Tullio appears to beat the competition handily, but the ranking of the remaining alternatives is different from yours:
Number of threads = 2
base line:
127.600 ms (2 allocations: 61.04 MiB)
explicit @inbound:
126.731 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
75.504 ms (14 allocations: 61.04 MiB)
@spawn on the outer loop:
125.014 ms (20 allocations: 61.04 MiB)
nested loop:
125.452 ms (2 allocations: 61.04 MiB)
@threads on the triple loops:
75.657 ms (40214 allocations: 62.88 MiB)
@spawn on the triple loops:
264.148 ms (728323 allocations: 99.82 MiB)
@spawn on the nested loops:
125.592 ms (20 allocations: 61.04 MiB)
@tullio
31.222 ms (22 allocations: 61.04 MiB)
@tullio no fast math
30.692 ms (22 allocations: 61.04 MiB)
That’s quite interesting! How can you get a 4x speedup with only 2 threads? On my current Ubuntu with Julia 1.6.1, the only one that is close to linear scaling is the @threads on the outer loop case. Since your test result is roughly a 2x speedup, I guess you were using 2 threads. Can you share the Tullio expression you used for the test?
This gives me an 8X speedup straight away. The number of threads is 8 on my system. Tullio with LoopVectorization is at least 2X faster than @threads on the outer loop for me.
using Tullio, LoopVectorization
function nestedloops1(nx, ny, nz)
state = ones(nx,ny,nz)
@tullio state[i,j,k] *= sin(i*j*k)
end
nx = ny = nz = 200
@btime nestedloops1($nx, $ny, $nz)
20.237 ms (102 allocations: 61.04 MiB)
@henry2004y I used exactly the same code as @Seif_Shebl, except that I forced the number of threads to be two, and I also ran it while turning off fast math via the fastmath=false flag.
Benchmark with @btime and interpolate the input variables ($nx, …). Also, it is better to return the matrix from the function, or the compiler may sometimes realize that nothing needs to be done.
Also, these benchmarks are probably bound by memory access. Putting something more expensive to compute inside may be a better measure for an actual application (if the application's operation is more expensive).
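For concreteness, a minimal sketch of that advice applied to the nestedloops1 kernel quoted above (the explicit return and the $-interpolation are the two points being made; nothing else changes):

using Tullio, LoopVectorization, BenchmarkTools

function nestedloops1(nx, ny, nz)
    state = ones(nx, ny, nz)
    @tullio state[i, j, k] *= sin(i * j * k)
    return state                      # return the result so the work cannot be elided
end

nx = ny = nz = 200
@btime nestedloops1($nx, $ny, $nz);   # interpolate inputs to avoid global-variable overhead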
fwiw:
for nx = ny = nz = 1000 I get, with two threads:
Number of threads = 2
base line:
24.104 s (2 allocations: 7.45 GiB)
explicit @inbound:
23.428 s (2 allocations: 7.45 GiB)
@threads on the outer loop:
12.612 s (14 allocations: 7.45 GiB)
@spawn on the outer loop:
23.658 s (20 allocations: 7.45 GiB)
nested loop:
24.129 s (2 allocations: 7.45 GiB)
@threads on the triple loops:
12.667 s (1001014 allocations: 7.50 GiB)
@spawn on the triple loops:
26.815 s (18022351 allocations: 8.39 GiB)
@spawn on the nested loops:
23.626 s (20 allocations: 7.45 GiB)
@tullio
2.999 s (18 allocations: 7.45 GiB)
@tullio no fast math
2.969 s (18 allocations: 7.45 GiB)
and with 32 threads:
Number of threads = 32
base line:
23.963 s (2 allocations: 7.45 GiB)
explicit @inbound:
23.522 s (2 allocations: 7.45 GiB)
@threads on the outer loop:
2.142 s (168 allocations: 7.45 GiB)
@spawn on the outer loop:
23.598 s (21 allocations: 7.45 GiB)
nested loop:
23.986 s (2 allocations: 7.45 GiB)
@threads on the triple loops:
2.138 s (1001169 allocations: 7.50 GiB)
@spawn on the triple loops:
35.182 s (18448748 allocations: 8.40 GiB)
@spawn on the nested loops:
23.706 s (21 allocations: 7.45 GiB)
@tullio
1.704 s (454 allocations: 7.45 GiB)
@tullio no fast math
1.707 s (456 allocations: 7.45 GiB)
Thanks! In my first attempt with Tullio, I did not import LoopVectorization. In fact, if we use the @avx macro on this micro example, we can immediately get a 4x speedup without any threading. However, that is not what I wanted to test initially (and probably not a fair comparison?). @avx looks like the simd clause in OpenMP, if I interpret it correctly.
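For reference, what I mean by “just @avx, no threading” is something like the following sketch (my own rewrite of the nestedloops1 kernel with LoopVectorization; the loop order is an assumption):

using LoopVectorization

function nestedloops_avx(nx, ny, nz)
    state = ones(nx, ny, nz)
    @avx for k in 1:nz, j in 1:ny, i in 1:nx   # SIMD-vectorized, single-threaded
        state[i, j, k] *= sin(i * j * k)
    end
    return state
end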
Now with 2 threads and no @avx, Tullio gives me a 1.5x speedup compared with the base case, but @threads on the outer loop is slightly faster than that:
Number of threads = 2
base line:
158.373 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
99.353 ms (14 allocations: 61.04 MiB)
@tullio on the nested loops:
109.440 ms (18 allocations: 61.04 MiB)
@tullio avx on the nested loops:
34.556 ms (17 allocations: 61.04 MiB)
Yes, I see the same ratios as you do. Interestingly, avx/avxt does remarkably well on this micro example.
# base case (1.0X) : 159.463 ms ( 2 allocations: 61.04 MiB)
# 2 threads (1.5X) : 102.046 ms ( 18 allocations: 61.04 MiB)
# 4 threads (2.4X) : 66.887 ms ( 46 allocations: 61.04 MiB)
# 8 threads (2.8X) : 56.180 ms (102 allocations: 61.04 MiB)
# avx (4.0X) : 39.161 ms ( 2 allocations: 61.04 MiB)
# tullio (7.5X) : 21.349 ms (102 allocations: 61.04 MiB)
# avxt (8.2X) : 19.382 ms ( 2 allocations: 61.04 MiB)
I realize this is an old post, but it’s still one of the first results on Google for me, and I have some questions on nested Threads.@spawn.
Re the above quote: the original blog post (Announcing composable multi-threaded parallelism in Julia) claims that “potentially millions” of tasks can be spawned and that’s okay. So why do you say it’s a problem that too many tasks get spawned here?
More generally, any suggestions on how to make the nestedloops8 function more efficient?
I am writing something like this right now and could use some tips.
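For anyone landing here via search, a minimal self-contained sketch of nested Threads.@spawn (my own toy, not the thread’s nestedloops8): the key point is that @spawn only schedules a task, so every task must be fetched (or waited on) to guarantee completion, including tasks spawned from inside other tasks.

using Base.Threads: @spawn

function nested_spawn_sum(n, m)
    outer = map(1:n) do i
        @spawn begin
            inner = map(j -> @spawn(i * j), 1:m)   # tasks spawned from inside a task
            sum(fetch, inner)
        end
    end
    return sum(fetch, outer)
end

nested_spawn_sum(4, 3)  # == 60 == sum(i * j for i in 1:4, j in 1:3)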