Dear Julia experts,
This is my first ever code in Julia.
Why I am switching to Julia? I have used Fortran and Matlab before to write boundary element codes previously. Here is an example https://www.youtube.com/watch?v=AQiAHKOf-Vs&t=2s
To scale up my problems, I have been learning MPI and C++ for the last few months. I am just bogged down by the amount of “boilerplate code” required for this. I decided to give Julia a try and wrote a minimal boundary element code for a waving filament in two-dimensions. The same program in Matlab is much slower, hence this is already an improvement. The physics of the problem is that a waving filament in the y-direction will propel in the x-direction. The governing equations are Stokes equation for low Reynold’s number flows (see file attached).
Main bottleneck The three nested loops where the integration is done over all the collocation points takes the longest time and uses a lot of memory allocation. So this is the bottleneck of my code. The outer loop (i-loop) runs over all the collocation points which are equal to the number of elements = nx-1, the middle loop (j-loop) runs over all nx-1 elements to perform the integration (the K[f_h] part) and the innermost k-loop performs integration over only 1 element running ng times equal to the number of quadrature points. Some things that I have tried (and probably incorrectly) are views, inbounds, FLoops, and Threads. @views and @inbounds didn’t improve performance for me. @FLoops does not work for reducing arrays and only scalars to avoid data race. Threads.@threads does not improve performance and memory alllocation is the same. I have not tried distributed computing yet.
Long term goals: My long term goal would be to move to many particles in three dimensions which can give rise to matrices as big as (30000,30000) invariably requiring distributed computing or GPU.
But first things first. The test I have done below is for 500 elements (=501 nodes). The matrix size is (1000+1, 1000+1).
Without threads
@btime filamentstokes(501,20,[1, 0.01, -0.01, 0.001, - 0.001],0)
1.855 s (20500107 allocations: 3.07 GiB)
Using threads for the outermost loop
1.921 s (20500182 allocations: 3.07 GiB)
Thank you: If I can get any ideas from the community on how to improve the performance of my minimal code, I will be more encouraged to translate my much longer three-dimensional codes to Julia. Thank you for your time. I am grateful for any help I can get.
using FastGaussQuadrature, LinearAlgebra, BenchmarkTools, TimerOutputs
function filamentstokes(nx,ng,A,t)
# Specify the number of nodes
# nx = 501
# Note that the number of elements = nx-1
# Specify the odd-numbered mo20
# A = [1, 0.01, -0.01, 0.001, - 0.001]
# How many points needed for integration using Gauss Quadrature?
# ng = 20
# Points and weights "using FastGaussQuadrature"
xg, wg = gausslegendre(ng)
# time
# t = 0
# Generate a sin curve
# x and y are nodes while x_m and y_m are mid points of an element
x = LinRange(-π,π,nx)
nm = 1:1:length(A)
y = (sin.((x.-t).*(2nm'.-1)))*A #(Only odd numbered modes)
# Mid-points (collocation points) and tangents at the mid points.
xm = 0.5*(x[2:nx] .+ x[1:nx-1])
ym = 0.5*(y[2:nx] .+ y[1:nx-1])
ls = sqrt.((x[2:nx] .- x[1:nx-1]).^2 .+ (y[2:nx] .- y[1:nx-1]).^2) # length of each element
tmx = (x[2:nx] .- x[1:nx-1])./ls
tmy = (y[2:nx] .- y[1:nx-1])./ls
# Arc length of node points
s = zeros(nx,1)
for i = 1:nx-1
s[i+1] = s[i] + ls[i]
end
# Arc length of mid points
sm = 0.5*(s[2:nx] .+ s[1:nx-1])
# Filament velocity at mid points (forcing for the equations)
vm = (-(2nm'.-1).*cos.((xm.-t).*(2nm'.-1)))*A
ρ = 0.001
c = 2*log(ρ/s[nx]) + 1
RM, rhs = threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
fh = RM\rhs
# println("The x-velocity is ", fh[2*(nx-1)+1])
#display(plot(x,y, aspect_ratio=:equal))
#display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end
function threenestedloops(xm,ym,tmx,tmy,sm,vm,ls,s,c,x,y,xg,wg,ng,nx)
# Initialize the matrix and right hand side
RM = zeros(2*(nx-1) + 1,2*(nx-1) + 1)
rhs = zeros(2*(nx-1) + 1,1)
#Threads.@threads for j = 1:nx-1 # use Threads if needed
for j = 1:nx-1 # looping over columns first
j1 = (j-1)*2 + 1
j2 = (j-1)*2 + 2
tcx = tmx[j]
tcy = tmy[j]
# Local operator terms( \Lambda [f_h] )
RM[j1,j1] += -(1*(2-c) - tcx*tcx*(c+2))/8π
RM[j1,j2] += -(0*(2-c) - tcx*tcy*(c+2))/8π
RM[j2,j1] += -(0*(2-c) - tcy*tcx*(c+2))/8π
RM[j2,j2] += -(1*(2-c) - tcy*tcy*(c+2))/8π
rhs[j1] = 0
rhs[j2] = vm[j]
RM[2*(nx-1) + 1,j1] = ls[j]
RM[j1,2*(nx-1) + 1] = 1
# trying the @view macro
#RMj1 = @view RM[:,j1] # just a vector with j1 column
#RMj2 = @view RM[:,j2] # just a vector with j2 column
for i = 1:nx-1
i1 = (i-1)*2 + 1
i2 = (i-1)*2 + 2
xc = xm[i]
yc = ym[i]
tcx = tmx[i]
tcy = tmy[i]
sc = sm[i]
# Single element k-Loop Integration using quadratures
GE,GEself = elementintegrate(xc,yc,tcx,tcy,sc,x[j],y[j],x[j+1],y[j+1],s[j],s[j+1],xg,wg,ng)
# Non-local operator terms( K [f_h] )
#RMj1[i1] += GE[1,1]/8π
#RMj2[i1] += GE[1,2]/8π
#RMj1[i2] += GE[2,1]/8π
#RMj2[i2] += GE[2,2]/8π
RM[i1,j1] += GE[1,1]/8π
RM[i1,j2] += GE[1,2]/8π
RM[i2,j1] += GE[2,1]/8π
RM[i2,j2] += GE[2,2]/8π
RM[i1,i1] += -GEself[1,1]/8π
RM[i1,i2] += -GEself[1,2]/8π
RM[i2,i1] += -GEself[2,1]/8π
RM[i2,i2] += -GEself[2,2]/8π
end
end
return RM, rhs
end
function elementintegrate(xc,yc,tcx,tcy,sc,x1,y1,x2,y2,s1,s2,xg,wg,ng)
# x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
# xc, yc are the coordinates of collocation point
GE = zeros(3,3) #
GEself = zeros(3,3)
d = 0.001 # regularisation parameter
for k = 1:ng
xi = xg[k]
xint= 0.5*( (x2 + x1) + (x2 - x1)*xi)
yint = 0.5*( (y2 + y1) + (y2 - y1)*xi)
sint = 0.5*( (s2 + s1) + (s2 - s1)*xi)
hl = 0.5*sqrt( (x2-x1)^2 + (y2-y1)^2)
# Call Green's function for Stokes equation
GEk = sgf_3d_fs(xint,yint,0,xc,yc,0) # two dimensional problem: z=0
GE += GEk*hl*wg[k]
ds = abs(sc-sint)
dss = sqrt(ds^2 + d^2)
GEself[1,1] += ((1.0 + tcx*tcx)/dss)*hl*wg[k]
GEself[1,2] += ((0.0 + tcx*tcy)/dss)*hl*wg[k]
GEself[2,1] += ((0.0 + tcy*tcx)/dss)*hl*wg[k]
GEself[2,2] += ((1.0 + tcy*tcy)/dss)*hl*wg[k]
end
return GE, GEself
end
function sgf_3d_fs(x,y,z,x0,y0,z0)
GEk = zeros(3,3)
dx = x-x0
dy = y-y0
dz = z-z0
dxx = dx^2
dxy = dx*dy
dxz = dx*dz
dyy = dy^2
dyz = dy*dz
dzz = dz^2
r = sqrt(dxx+dyy+dzz);
r3 = r*r*r
ri = 1.0/r
ri3 = 1.0/r3
GEk[1,1] = ri + dxx*ri3
GEk[1,2] = 0.0 + dxy*ri3
GEk[1,3] = 0.0 + dxz*ri3
GEk[2,1] = GEk[1,2]
GEk[2,2] = 1.0 + dyy*ri3
GEk[2,3] = 0.0 + dyz*ri3
GEk[3,1] = GEk[1,3]
GEk[3,2] = GEk[2,3]
GEk[3,3] = 1.0 + dzz*ri3
return GEk
end