# Improving performance of Integral equations (3-Nested loops)

Dear Julia experts,

This is my first ever code in Julia.

Why I am switching to Julia? I have used Fortran and Matlab before to write boundary element codes previously. Here is an example Active particles powered by Quincke rotation in a bulk fluid - YouTube
To scale up my problems, I have been learning MPI and C++ for the last few months. I am just bogged down by the amount of “boilerplate code” required for this. I decided to give Julia a try and wrote a minimal boundary element code for a waving filament in two-dimensions. The same program in Matlab is much slower, hence this is already an improvement. The physics of the problem is that a waving filament in the y-direction will propel in the x-direction. The governing equations are Stokes equation for low Reynold’s number flows (see file attached).

Main bottleneck The three nested loops where the integration is done over all the collocation points takes the longest time and uses a lot of memory allocation. So this is the bottleneck of my code. The outer loop (i-loop) runs over all the collocation points which are equal to the number of elements = nx-1, the middle loop (j-loop) runs over all nx-1 elements to perform the integration (the K[f_h] part) and the innermost k-loop performs integration over only 1 element running ng times equal to the number of quadrature points. Some things that I have tried (and probably incorrectly) are views, inbounds, FLoops, and Threads. @views and @inbounds didn’t improve performance for me. @FLoops does not work for reducing arrays and only scalars to avoid data race. Threads.@threads does not improve performance and memory alllocation is the same. I have not tried distributed computing yet.

Long term goals: My long term goal would be to move to many particles in three dimensions which can give rise to matrices as big as (30000,30000) invariably requiring distributed computing or GPU.

But first things first. The test I have done below is for 500 elements (=501 nodes). The matrix size is (1000+1, 1000+1).

`@btime filamentstokes(501,20,[1, 0.01, -0.01, 0.001, - 0.001],0)`
1.855 s (20500107 allocations: 3.07 GiB)
Using threads for the outermost loop
1.921 s (20500182 allocations: 3.07 GiB)

Thank you: If I can get any ideas from the community on how to improve the performance of my minimal code, I will be more encouraged to translate my much longer three-dimensional codes to Julia. Thank you for your time. I am grateful for any help I can get.

``````using FastGaussQuadrature, LinearAlgebra, BenchmarkTools, TimerOutputs
``````
``````function filamentstokes(nx,ng,A,t)
# Specify the number of nodes
# nx = 501
# Note that the number of elements = nx-1
# Specify the odd-numbered mo20
# A = [1, 0.01, -0.01, 0.001, - 0.001]
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# time
# t = 0
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π,π,nx)
nm = 1:1:length(A)
y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5*(x[2:nx] .+ x[1:nx-1])
ym = 0.5*(y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1])./ls
tmy = (y[2:nx] .- y[1:nx-1])./ls
# Arc length of node points
s = zeros(nx,1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5*(s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
ρ = 0.001
c = 2*log(ρ/s[nx]) + 1
RM, rhs = threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
fh = RM\rhs
# println("The x-velocity is ", fh[2*(nx-1)+1])
#display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end

function threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
# Initialize the matrix and right hand side
RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
rhs = zeros(2*(nx-1) + 1,1)
for j = 1:nx-1 # looping over columns first
j1 = (j-1)*2 + 1
j2 = (j-1)*2 + 2

tcx = tmx[j]
tcy = tmy[j]

# Local operator terms( \Lambda [f_h] )
RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π

rhs[j1] = 0
rhs[j2] = vm[j]
RM[2*(nx-1) + 1,j1] = ls[j]
RM[j1,2*(nx-1) + 1] = 1

# trying the @view macro
#RMj1 = @view RM[:,j1] # just a vector with j1 column
#RMj2 = @view RM[:,j2] # just a vector with j2 column

for i = 1:nx-1
i1 = (i-1)*2 + 1
i2 = (i-1)*2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE,GEself = elementintegrate(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)

# Non-local operator terms( K [f_h] )

#RMj1[i1] += GE[1,1]/8π
#RMj2[i1] += GE[1,2]/8π
#RMj1[i2] += GE[2,1]/8π
#RMj2[i2] += GE[2,2]/8π

RM[i1,j1] += GE[1,1]/8π
RM[i1,j2] += GE[1,2]/8π
RM[i2,j1] += GE[2,1]/8π
RM[i2,j2] += GE[2,2]/8π

RM[i1,i1] += -GEself[1,1]/8π
RM[i1,i2] += -GEself[1,2]/8π
RM[i2,i1] += -GEself[2,1]/8π
RM[i2,i2] += -GEself[2,2]/8π
end
end
return RM, rhs
end

function elementintegrate(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(3,3) #
GEself = zeros(3,3)
d = 0.001 # regularisation parameter
for k = 1:ng
xi = xg[k]
xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
GE += GEk*hl*wg[k]
ds = abs(sc-sint)
dss = sqrt(ds^2 + d^2)
GEself[1,1] += ((1.0 + tcx*tcx)/dss)*hl*wg[k]
GEself[1,2] += ((0.0 + tcx*tcy)/dss)*hl*wg[k]
GEself[2,1] += ((0.0 + tcy*tcx)/dss)*hl*wg[k]
GEself[2,2] += ((1.0 + tcy*tcy)/dss)*hl*wg[k]
end
return GE, GEself
end

function sgf_3d_fs(x,y,z,x0,y0,z0)
GEk = zeros(3,3)

dx = x-x0
dy = y-y0
dz = z-z0
dxx = dx^2
dxy = dx*dy
dxz = dx*dz
dyy = dy^2
dyz = dy*dz
dzz = dz^2

r  = sqrt(dxx+dyy+dzz);
r3 = r*r*r
ri  = 1.0/r
ri3 = 1.0/r3

GEk[1,1] = ri  + dxx*ri3
GEk[1,2] = 0.0 + dxy*ri3
GEk[1,3] = 0.0 + dxz*ri3

GEk[2,1] = GEk[1,2]
GEk[2,2] = 1.0 + dyy*ri3
GEk[2,3] = 0.0 + dyz*ri3

GEk[3,1] = GEk[1,3]
GEk[3,2] = GEk[2,3]
GEk[3,3] = 1.0 + dzz*ri3
return GEk
end
``````
1 Like

Use static arrays, and take a close look to the allocations in inner loops. Here is an example where I have replaced some of your inner matrices with static matrices:

Result:

``````julia> include("./deb.jl")
0.182349 seconds (60 allocations: 15.426 MiB)

``````

(vs ~1s and 3GB allocs from the original code)

``````

using StaticArrays, Test, Setfield

function filamentstokes(nx,ng,A,t)
# Specify the number of nodes
# nx = 501
# Note that the number of elements = nx-1
# Specify the odd-numbered mo20
# A = [1, 0.01, -0.01, 0.001, - 0.001]
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# time
# t = 0
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π,π,nx)
nm = 1:1:length(A)
y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5*(x[2:nx] .+ x[1:nx-1])
ym = 0.5*(y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1])./ls
tmy = (y[2:nx] .- y[1:nx-1])./ls
# Arc length of node points
s = zeros(nx,1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5*(s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
ρ = 0.001
c = 2*log(ρ/s[nx]) + 1
RM, rhs = threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
fh = RM\rhs
# println("The x-velocity is ", fh[2*(nx-1)+1])
#display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end

function threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
# Initialize the matrix and right hand side
RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
rhs = zeros(2*(nx-1) + 1,1)
for j = 1:nx-1 # looping over columns first
j1 = (j-1)*2 + 1
j2 = (j-1)*2 + 2

tcx = tmx[j]
tcy = tmy[j]

# Local operator terms( \Lambda [f_h] )
RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π

rhs[j1] = 0
rhs[j2] = vm[j]
RM[2*(nx-1) + 1,j1] = ls[j]
RM[j1,2*(nx-1) + 1] = 1

# trying the @view macro
#RMj1 = @view RM[:,j1] # just a vector with j1 column
#RMj2 = @view RM[:,j2] # just a vector with j2 column

for i = 1:nx-1
i1 = (i-1)*2 + 1
i2 = (i-1)*2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE,GEself = elementintegrate(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)

# Non-local operator terms( K [f_h] )

#RMj1[i1] += GE[1,1]/8π
#RMj2[i1] += GE[1,2]/8π
#RMj1[i2] += GE[2,1]/8π
#RMj2[i2] += GE[2,2]/8π

RM[i1,j1] += GE[1,1]/8π
RM[i1,j2] += GE[1,2]/8π
RM[i2,j1] += GE[2,1]/8π
RM[i2,j2] += GE[2,2]/8π

RM[i1,i1] += -GEself[1,1]/8π
RM[i1,i2] += -GEself[1,2]/8π
RM[i2,i1] += -GEself[2,1]/8π
RM[i2,i2] += -GEself[2,2]/8π
end
end
return RM, rhs
end

function elementintegrate(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(SMatrix{3,3,Float64}) #
GEself = zeros(SMatrix{3,3,Float64}) #
d = 0.001 # regularisation parameter
for k = 1:ng
xi = xg[k]
xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
GE += GEk*hl*wg[k]
ds = abs(sc-sint)
dss = sqrt(ds^2 + d^2)
@set! GEself[1,1] += ((1.0 + tcx*tcx)/dss)*hl*wg[k]
@set! GEself[1,2] += ((0.0 + tcx*tcy)/dss)*hl*wg[k]
@set! GEself[2,1] += ((0.0 + tcy*tcx)/dss)*hl*wg[k]
@set! GEself[2,2] += ((1.0 + tcy*tcy)/dss)*hl*wg[k]
end
return GE, GEself
end

function sgf_3d_fs(x,y,z,x0,y0,z0)

dx = x-x0
dy = y-y0
dz = z-z0
dxx = dx^2
dxy = dx*dy
dxz = dx*dz
dyy = dy^2
dyz = dy*dz
dzz = dz^2

r  = sqrt(dxx+dyy+dzz);
r3 = r*r*r
ri  = 1.0/r
ri3 = 1.0/r3

g11 = ri  + dxx*ri3
g12 = 0.0 + dxy*ri3
g13 = 0.0 + dxz*ri3

g21 = g12
g22 = 1.0 + dyy*ri3
g23 = 0.0 + dyz*ri3

g31 = g13
g32 = g23
g33 = 1.0 + dzz*ri3

GEk = @SMatrix [ g11 g12 g13
g21 g22 g23
g31 g32 g33 ]

return GEk
end

# save original result to check
#result = filamentstokes(501,20,[1, 0.01, -0.01, 0.001, - 0.001],0)

@test result ≈ filamentstokes(501,20,[1, 0.01, -0.01, 0.001, - 0.001],0)

@time filamentstokes(501,20,[1, 0.01, -0.01, 0.001, - 0.001],0)
nothing

``````

Having your inner loops completely allocation free is the main thing that you need to pursue to make the code fast. Take a look at these notes on how to track the allocations.

Using static arrays will avoid allocations of those small vectors or matrices in the inner loops (because, being static, the compiler can optimize them to be used only on the stack memory or the processor cache). (these other notes may help as well, maybe, but certainly better texts exist out there).

7 Likes

Thank you very much Leandro. This is very helpful and amazing speed-up by using StaticArrays!!
(I had tried SMatrix but had some syntax error as I wasn’t using it correctly. I also didn’t know where exactly I should use it.)

I wasn’t aware of @set! I guess it is needed for updating the small matrices. In that case, should we use it for this line as well ?

``````GE += GEk*hl*wg[k]
``````
``````@set! GE += GEk*hl*wg[k]
``````

No, because there you are updating the complete matrix.

`@set!` is a syntax sugar to update one of the elements of an immutable (static) matrix (which will just create a new matrix anyway, probably).

It would be more elegant, I think, to replace these lines:

by something like

``````GESelf += @SMatrix [ x11 x12
x21 x22 ]
``````

and then you would not need `Setfield` at all in the example.

1 Like

Perfect. Thank you again

1 Like