 Parallelizing a force assembly function

Hello, I am writing a physical simulation code that I’m trying to optimize.

The most critical function, called at every integration step, is the force assembly: for every triangle of the mesh, given the nodal positions, do a bit of linear algebra and compute the forces at each node of this triangle. The total force at each node is then the sum of the contributions from all triangles containing it. A minimal working example of this function is as follows:

function distribute_seq!(force, coords, tris, coefs)
    for t_idx in 1:length(tris) # Loop over all triangles
        tri = tris[t_idx] # Contains the 3 nodal indices for the triangle
        # Current position of these 3 points
        pt1 = coords[tri[1]]
        pt2 = coords[tri[2]]
        pt3 = coords[tri[3]]
        # Some dummy operations involving the coordinates
        val = coefs[t_idx]*norm(pt1+pt2+pt3)
        # Nodal forces are distributed for this triangle
        force[tri[1]] += (pt3-pt1)*val
        force[tri[2]] += (pt1-pt2)*val
        force[tri[3]] += (pt2-pt3)*val
    end
end

This function runs very fast (~300 microsec on my machine for 10k nodes, 15k triangles), but I’m trying to parallelize it by using graph coloring techniques to avoid race conditions (see https://paulino.ce.gatech.edu/education_resources/GreedyColoring/GreedyColoring.pdf).

So I’ve used a greedy coloring algorithm to group the triangles that do not share any nodes, and I’ve written this function:

function distribute_par!(force, coords, tris, coefs, colors)
    for color in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in color
            tri = tris[t_idx]
            pt1 = coords[tri[1]]
            pt2 = coords[tri[2]]
            pt3 = coords[tri[3]]
            val = coefs[t_idx]*norm(pt1+pt2+pt3)
            force[tri[1]] += (pt3-pt1)*val
            force[tri[2]] += (pt1-pt2)*val
            force[tri[3]] += (pt2-pt3)*val
        end
    end
end

However, the performance is not on par with the sequential version. With 100 threads, this is about 6 times slower:

sequential version: 334.567 μs (0 allocations: 0 bytes)
threaded version:  1.842 ms (2416 allocations: 233.58 KiB)

I feel like allocations are the issue, as each thread needs to have access to part of the position array and update part of the force array. I’ve attempted to put the used values in buffers specific to each thread but it only made things worse.

Two questions:

• can the sequential version be optimized further?
• is there a chance that the parallel version could outperform the sequential version?

The complete minimal working example is below (note that I’m only interested in optimizing the distribute functions).

Thanks!

using StaticArrays
using LinearAlgebra
using DataStructures
using LightGraphs

using BenchmarkTools

const Vec3D = SVector{3,Float64}
const TriIdxs = SVector{3,Int64}

function greedy_coloring(elements, n_threads)

    # For each node, collect the set of triangles containing it
    cell_containing_node = OrderedDict{Int, Set{Int}}()
    for (idx, nodes) in enumerate(elements)
        for v in nodes
            push!(get!(cell_containing_node, v, Set{Int}()), idx)
        end
    end

    # Two triangles conflict if they share a node
    G = SimpleGraph(length(elements))
    for (node, cells) in cell_containing_node
        for cell1 in cells # All these cells have a neighboring node
            for cell2 in cells
                if cell1 != cell2
                    add_edge!(G, cell1, cell2)
                end
            end
        end
    end

    v = nv(G)
    colors = OrderedDict{Int, Vector{Int}}()
    colors[1] = [1]
    elcolor = zeros(Int, v)
    elcolor[1] = 1
    available = trues(v)
    colorcount = zeros(Int, v)
    colorcount[1] = 1
    for i = 2:v
        # Mark the colors already used by conflicting triangles
        for j in inneighbors(G, i)
            if elcolor[j] != 0
                available[elcolor[j]] = false
            end
        end
        # Assign the first available color that is not full
        for cr = 1:v
            if available[cr] && colorcount[cr] < n_threads
                push!(get!(colors, cr, Int[]), i)
                elcolor[i] = cr
                colorcount[cr] += 1
                break
            end
        end
        fill!(available, true)
    end
    return [els for (cridx, els) in colors]

end

function distribute_seq!(force, coords, tris, coefs)
    for t_idx in 1:length(tris) # Loop over all triangles
        tri = tris[t_idx] # Contains the 3 nodal indices of the triangle
        # Current position of these 3 points
        pt1 = coords[tri[1]]
        pt2 = coords[tri[2]]
        pt3 = coords[tri[3]]
        # Some operations involving the coordinates
        val = coefs[t_idx]*norm(pt1+pt2+pt3)
        # Nodal forces are distributed for this triangle
        force[tri[1]] += (pt3-pt1)*val
        force[tri[2]] += (pt1-pt2)*val
        force[tri[3]] += (pt2-pt3)*val
    end
end

function distribute_par!(force, coords, tris, coefs, colors)
    for color in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in color
            tri = tris[t_idx]
            pt1 = coords[tri[1]]
            pt2 = coords[tri[2]]
            pt3 = coords[tri[3]]
            val = coefs[t_idx]*norm(pt1+pt2+pt3)
            force[tri[1]] += (pt3-pt1)*val
            force[tri[2]] += (pt1-pt2)*val
            force[tri[3]] += (pt2-pt3)*val
        end
    end
end

Nn = 10000 # Number of nodes
Nels = 15000 # Number of triangles

coords = [Vec3D(rand(3)) for _ in 1:Nn]
tris = [TriIdxs(rand(1:Nn,3)) for _ in 1:Nels] # Connectivity (node indices for each triangle)

force = [Vec3D(zeros(3)) for _ in 1:Nn]
coefs = [rand() for _ in 1:Nels]

n_threads = 100 # Maximum number of triangles per color (the name is misleading, see below)
colors = greedy_coloring(tris, n_threads) # Each color contains triangle indices that do not share nodes

@btime distribute_seq!($force, $coords, $tris, $coefs)
@btime distribute_par!($force, $coords, $tris, $coefs, $colors)

Do you have 100 cores on your machine?


No, only 2 on my laptop; however, with 2 threads performance is even worse. Strangely enough, I get this behaviour:

julia> for th in [1,2,4,10,100,1000]
           colors = greedy_coloring(tris, th)
           @btime distribute_par!($force, $coords, $tris, $coefs, $colors)
       end
305.755 ms (240136 allocations: 22.66 MiB)
110.130 ms (120012 allocations: 11.33 MiB)
59.405 ms (60000 allocations: 5.66 MiB)
24.423 ms (24000 allocations: 2.27 MiB)
2.608 ms (2416 allocations: 233.58 KiB)
797.967 μs (320 allocations: 30.94 KiB)

It seems that you’re always running with 2 threads and only the coloring is affected by `th`? In that case, just increasing `th` to an even bigger number may help, judging from your benchmark.

Yes, sorry it wasn’t clear: the variable `n_threads` (or `th`) in my code is not the number of threads but rather the maximum allowable number of triangles in each color.

I can set a very large value for this number, however there is a limit after which the number of colors can’t increase due to the connectivity of the mesh. In this example, I can’t have less than ~15 colors, which means that the outer loop will run 15 times, with the threaded inner loop running on ~1k triangles. Still, performance is ~2x worse than the sequential version.
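That floor on the number of colors is easy to check: the coloring can never use fewer colors than the largest number of triangles meeting at a single node, since all of them conflict pairwise. A small sketch (this helper is mine, not part of the code above):

```julia
# Hypothetical helper: every triangle touching a given node must get a
# distinct color, so the maximum node valence bounds the color count from below.
function min_colors_lower_bound(tris)
    count = Dict{Int,Int}()
    for tri in tris, v in tri
        count[v] = get(count, v, 0) + 1
    end
    return maximum(values(count))
end
```

On a typical triangle mesh this valence is around 6, but with the random connectivity in the MWE it can easily reach the ~15 observed here.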

What I don’t understand is why this threaded pseudo-code:

for all 15 colors
for ~1000 triangles in this color, running in parallel
compute and distribute the force on the triangle nodes
end
end

is slower than:

for 15000 triangles
compute and distribute the force on the triangle nodes
end

But you do want as few colors as possible. Anyway, the takeaway message here is that you are doing too little work per “color” for the threading overhead to be worth it. You can see this in your own benchmark: reducing the number of colors greatly improved performance.
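The overhead is easy to measure in isolation: an empty `Threads.@threads` region still pays the fork/join cost, and the colored assembly pays it once per color (so ~15 times per step here). A minimal sketch, with numbers that will of course vary by machine and Julia version:

```julia
# An empty threaded region: all that is measured is the fork/join overhead
# of spawning and synchronizing the thread team.
function empty_region()
    Threads.@threads for i in 1:1
    end
end

empty_region() # warm up / compile
t = @elapsed for _ in 1:1000
    empty_region()
end
println("per-region overhead ≈ ", 1e6*t/1000, " μs")
```

If this prints on the order of microseconds, then 15 colors already cost a nontrivial fraction of the ~300 μs sequential assembly before any useful work is done.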

Just to give each thread a bit more work I inserted

val = coefs[t_idx]*norm(pt1+pt2+pt3)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)
val += sin(val)

into the force calculation and now I get (with 15 colors):

3.051 ms (0 allocations: 0 bytes)
635.864 μs (870 allocations: 90.23 KiB) # 8 threads

A force assembly like this is so extremely simple that it is unlikely to bottleneck anything. In real applications, you typically have a quadrature loop, local shape functions getting evaluated, etc., which push up the amount of work in the force function enough to make threading worth it.


Also, this example of threaded assembly using greedy coloring might be interesting:

Thanks, this is extremely useful!

You’re right that in my case the threaded version is only useful when there is more work to do, but as I use an explicit integrator on linear triangle elements, there is actually not so much to do in the force assembly, even in my more complete version.

Following your example, do you think having scratch values would benefit the threaded version in my case, or will it not change much given the simplicity of each assembly?

Also, regarding the memory layout of my arrays for the threaded version, is it recommended to reorder the elements of each color so that they are contiguous in memory?

I wonder if having an assembly buffer separate for each thread would reduce memory contention?

So I’ve run a few more tests on this with the following changes:

• modified the force distribution functions so that they perform the same operations as in my explicit FE code
• always use the minimum number of colors (or the maximum number of elements in each color)
• added another threaded distribution function with buffers for each thread (did I do this right?)
• used a machine with 12 cores

I get the following, running with `JULIA_NUM_THREADS` set to 1, 2, 4, 8, and 12:

Timing sequential distribute:   550.281 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.090 ms (126 allocations: 12.03 KiB)
Timing threaded distribute with buffers:   1.227 ms (126 allocations: 11.16 KiB)

Timing sequential distribute:   570.101 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.259 ms (240 allocations: 23.91 KiB)
Timing threaded distribute with buffers:   714.997 μs (242 allocations: 23.00 KiB)

Timing sequential distribute:   561.593 μs (0 allocations: 0 bytes)
Timing threaded distribute:   792.371 μs (450 allocations: 46.17 KiB)
Timing threaded distribute with buffers:   510.388 μs (450 allocations: 45.23 KiB)

Timing sequential distribute:   490.096 μs (0 allocations: 0 bytes)
Timing threaded distribute:   632.889 μs (812 allocations: 84.66 KiB)
Timing threaded distribute with buffers:   502.181 μs (812 allocations: 83.78 KiB)

Timing sequential distribute:   496.732 μs (0 allocations: 0 bytes)
Timing threaded distribute:   690.010 μs (1204 allocations: 126.22 KiB)
Timing threaded distribute with buffers:   603.200 μs (1204 allocations: 125.34 KiB)

Having buffers for the threads helps a bit but the performance is still barely on par with the sequential version. Is it really hopeless to parallelize this kind of function when the work is not that high?

Code for reference:

using StaticArrays
using LinearAlgebra
using DataStructures
using LightGraphs

using BenchmarkTools

const Vec3D = SVector{3,Float64}
const Mat33 = SArray{Tuple{3,3},Float64,2,9}
const Idxs = SVector{3,Int64}

function greedy_coloring(elements)

    # For each node, collect the set of triangles containing it
    cell_containing_node = OrderedDict{Int, Set{Int}}()
    for (idx, nodes) in enumerate(elements)
        for v in nodes
            push!(get!(cell_containing_node, v, Set{Int}()), idx)
        end
    end

    # Two triangles conflict if they share a node
    G = SimpleGraph(length(elements))
    for (node, cells) in cell_containing_node
        for cell1 in cells # All these cells have a neighboring node
            for cell2 in cells
                if cell1 != cell2
                    add_edge!(G, cell1, cell2)
                end
            end
        end
    end

    v = nv(G)
    colors = OrderedDict{Int, Vector{Int}}()
    colors[1] = [1]
    elcolor = zeros(Int, v)
    elcolor[1] = 1
    available = trues(v)
    for i = 2:v
        # Mark the colors already used by conflicting triangles
        for j in inneighbors(G, i)
            if elcolor[j] != 0
                available[elcolor[j]] = false
            end
        end
        # Assign the first available color
        for cr = 1:v
            if available[cr]
                push!(get!(colors, cr, Int[]), i)
                elcolor[i] = cr
                break
            end
        end
        fill!(available, true)
    end
    return [els for (cridx, els) in colors]

end

function distribute_seq!(forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs)

    for t_idx in 1:length(node_idxs)

        ni = node_idxs[t_idx]

        pos = positions[ni]
        vel = velocities[ni]

        p1 = pw1[t_idx]
        p2 = pw2[t_idx]

        U = p1[1]*pos[1] + p1[2]*pos[2] + p1[3]*pos[3]
        V = p2[1]*pos[1] + p2[2]*pos[2] + p2[3]*pos[3]

        Ur = p1[1]*vel[1] + p1[2]*vel[2] + p1[3]*vel[3]
        Vr = p2[1]*vel[1] + p2[2]*vel[2] + p2[3]*vel[3]

        sr11 = dot(U,Ur)
        sr22 = dot(V,Vr)

        bvpress = visco*fac[t_idx]*(sr11+sr22)

        eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
        sig = stiffness[t_idx]*eps

        for i in 1:3
            forces[ni[i]] -= area[t_idx]*((sig[1]+bvpress)*p1[i]*U + (sig[2]+bvpress)*p2[i]*V + sig[3]*(p1[i]*V + p2[i]*U))
        end

    end
end

function distribute_par!(forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs, colors)

    for color in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in color

            ni = node_idxs[t_idx]

            pos = positions[ni]
            vel = velocities[ni]

            p1 = pw1[t_idx]
            p2 = pw2[t_idx]

            U = p1[1]*pos[1] + p1[2]*pos[2] + p1[3]*pos[3]
            V = p2[1]*pos[1] + p2[2]*pos[2] + p2[3]*pos[3]

            Ur = p1[1]*vel[1] + p1[2]*vel[2] + p1[3]*vel[3]
            Vr = p2[1]*vel[1] + p2[2]*vel[2] + p2[3]*vel[3]

            sr11 = dot(U,Ur)
            sr22 = dot(V,Vr)

            bvpress = visco*fac[t_idx]*(sr11+sr22)

            eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
            sig = stiffness[t_idx]*eps

            for i in 1:3
                forces[ni[i]] -= area[t_idx]*((sig[1]+bvpress)*p1[i]*U + (sig[2]+bvpress)*p2[i]*V + sig[3]*(p1[i]*V + p2[i]*U))
            end

        end
    end
end

struct ScratchValues
forces::Vector{Vec3D}
positions::Vector{Vec3D}
velocities::Vector{Vec3D}
pw1::Vector{Vec3D}
pw2::Vector{Vec3D}
visco::Float64
fac::Vector{Float64}
stiffness::Vector{Mat33}
area::Vector{Float64}
end

function create_scratchvalues(positions, velocities, pw1, pw2, visco, fac, stiffness, area)
    nthreads = Threads.nthreads()
    nn = length(positions)
    forces = [[zeros(Vec3D) for i in 1:nn] for i in 1:nthreads]
    positions = [positions for i in 1:nthreads]
    velocities = [velocities for i in 1:nthreads]
    pw1 = [pw1 for i in 1:nthreads]
    pw2 = [pw2 for i in 1:nthreads]
    visco = [visco for i in 1:nthreads]
    fac = [fac for i in 1:nthreads]
    stiffness = [stiffness for i in 1:nthreads]
    area = [area for i in 1:nthreads]
    return [ScratchValues(forces[i], positions[i], velocities[i], pw1[i], pw2[i], visco[i], fac[i], stiffness[i], area[i]) for i in 1:nthreads]
end

function distribute_tri!(scratch, ni, t_idx)

    f, pos, vel, pw1, pw2, visco, fac, stiffness, area = scratch.forces, scratch.positions, scratch.velocities, scratch.pw1, scratch.pw2, scratch.visco, scratch.fac, scratch.stiffness, scratch.area

    p = pos[ni]
    v = vel[ni]
    p1 = pw1[t_idx]
    p2 = pw2[t_idx]

    U = p1[1]*p[1] + p1[2]*p[2] + p1[3]*p[3]
    V = p2[1]*p[1] + p2[2]*p[2] + p2[3]*p[3]

    Ur = p1[1]*v[1] + p1[2]*v[2] + p1[3]*v[3]
    Vr = p2[1]*v[1] + p2[2]*v[2] + p2[3]*v[3]

    sr11 = dot(U,Ur)
    sr22 = dot(V,Vr)

    bvpress = visco*fac[t_idx]*(sr11+sr22)

    eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
    sig = stiffness[t_idx]*eps

    for i in 1:3
        f[ni[i]] -= area[t_idx]*((sig[1]+bvpress)*p1[i]*U + (sig[2]+bvpress)*p2[i]*V + sig[3]*(p1[i]*V + p2[i]*U))
    end

end

function distribute_par_buf!(colors, scratches, node_idxs)
    for color in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in color
            distribute_tri!(scratches[Threads.threadid()], node_idxs[t_idx], t_idx)
        end
    end
end

Nn = 10000 # Number of nodes
Nels = 15000 # Number of triangles

# Node fields
positions = [Vec3D(rand(3)) for _ in 1:Nn]
velocities = [Vec3D(rand(3)) for _ in 1:Nn]
forces = [Vec3D(zeros(3)) for _ in 1:Nn]

# Connectivity (node indices for each triangle)
node_idxs = [Idxs(rand(1:Nn,3)) for _ in 1:Nels]

# Triangle fields
stiffness = [Mat33(rand(3,3)) for _ in 1:Nels]
pw1 = [Vec3D(rand(3)) for _ in 1:Nels]
pw2 = [Vec3D(rand(3)) for _ in 1:Nels]
visco = 0.1
area = [rand() for _ in 1:Nels]
fac = [rand() for _ in 1:Nels]

# Each color contains triangle indices that do not share nodes
colors = greedy_coloring(node_idxs)

# Buffer values for each thread
scratches = create_scratchvalues(positions, velocities, pw1, pw2, visco, fac, stiffness, area)

@btime distribute_seq!($forces, $positions, $velocities, $pw1, $pw2, $visco, $fac, $area, $stiffness, $node_idxs)
@btime distribute_par!($forces, $positions, $velocities, $pw1, $pw2, $visco, $fac, $area, $stiffness, $node_idxs, $colors)
@btime distribute_par_buf!($colors, $scratches, $node_idxs)

When you have a per-thread assembly buffer, you don’t care whether the elements write to the “same” locations, so the coloring is not needed. The price to pay is that the buffers must be summed together afterwards, but that again can be parallelized with threads.
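A minimal sketch of that idea on the simple MWE from the top of the thread (the function name and buffer layout here are mine, not from the posts above): each thread accumulates into its own full-length copy of the force array, and the reduction over buffers is itself threaded over nodes, so each node is touched by exactly one thread.

```julia
using StaticArrays
using LinearAlgebra

const Vec3D = SVector{3,Float64}

# Per-thread force buffers: one full-length force array per thread.
# Each thread only ever writes to its own buffer, so no coloring is needed.
function distribute_buf!(force, buffers, coords, tris, coefs)
    Threads.@threads for t_idx in 1:length(tris)
        buf = buffers[Threads.threadid()]
        tri = tris[t_idx]
        pt1 = coords[tri[1]]
        pt2 = coords[tri[2]]
        pt3 = coords[tri[3]]
        val = coefs[t_idx]*norm(pt1+pt2+pt3)
        buf[tri[1]] += (pt3-pt1)*val
        buf[tri[2]] += (pt1-pt2)*val
        buf[tri[3]] += (pt2-pt3)*val
    end
    # Threaded reduction: each node index is handled by exactly one thread,
    # so summing the buffers into `force` is also race-free.
    Threads.@threads for i in 1:length(force)
        for buf in buffers
            force[i] += buf[i]
            buf[i] = zero(Vec3D) # reset for the next call
        end
    end
    return force
end
```

The buffers should be allocated once, outside the timing loop, e.g. `buffers = [[zero(Vec3D) for _ in 1:Nn] for _ in 1:Threads.nthreads()]`; the cost is O(nthreads × Nn) extra memory and one sweep over the nodes per assembly.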

Hi, thanks for the feedback!
You’re right, I managed to get a nice speedup by using buffers without the colors, even with the cost of adding the buffers together. I noticed that there is no benefit to putting arrays that are only read by the function into the buffers (pw1, pw2, visco, fac, area, stiffness, …), but there is a huge benefit to putting the force array in them, because the threads actively write to it.

So by having force buffers as:

scratches = [[zeros(Vec3D) for i in 1:Nn] for i in 1:Threads.nthreads()]

And the buffered assembly as:

function distribute_par_buf!(scratches, forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs)

    # Distribute forces into per-thread buffers
    Threads.@threads for t_idx in 1:length(node_idxs)
        ni = node_idxs[t_idx]
        distribute_tri!(scratches[Threads.threadid()], ni, positions[ni], velocities[ni], pw1[t_idx], pw2[t_idx], visco, fac[t_idx], area[t_idx], stiffness[t_idx])
    end

    # Sum the forces from all scratches
    Threads.@threads for i in 1:length(forces)
        for j in 1:Threads.nthreads()
            forces[i] += scratches[j][i]
        end
    end

end

I now have these timings:

Timing sequential distribute:  545.207 μs (0 allocations: 0 bytes)
Timing threaded distribute:    1.094 ms (126 allocations: 12.03 KiB)
Timing threaded distribute with buffers:    584.261 μs (18 allocations: 1.69 KiB)

Timing sequential distribute:  497.919 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.237 ms (240 allocations: 23.91 KiB)
Timing threaded distribute with buffers:  434.119 μs (32 allocations: 3.16 KiB)

Timing sequential distribute:    504.540 μs (0 allocations: 0 bytes)
Timing threaded distribute:   765.982 μs (420 allocations: 43.09 KiB)
Timing threaded distribute with buffers:   388.860 μs (60 allocations: 6.13 KiB)

Timing sequential distribute:   496.583 μs (0 allocations: 0 bytes)
Timing threaded distribute:   667.276 μs (870 allocations: 90.70 KiB)
Timing threaded distribute with buffers:  307.065 μs (118 allocations: 12.09 KiB)