Hello, I am writing a physical simulation code that I’m trying to optimize.

The most critical function, called at every integration step, is the force assembly: for every triangle of the mesh, given the nodal positions, do a bit of linear algebra and compute the forces at each node of this triangle. The total nodal forces are then the contributions of all the forces of all triangles. A minimal working example of this function is as follows:

```
function distribute_seq!(force, coords, tris, coefs)
for t_idx in 1:length(tris) # Loop for all triangles
tri = tris[t_idx] # Contains the 3 nodal indices for the triangle
# Current position of these 3 points
pt1 = coords[tri[1]]
pt2 = coords[tri[2]]
pt3 = coords[tri[3]]
# Some dummy operations involving the coordinates
val = coefs[t_idx]*norm(pt1+pt2+pt3)
# Nodal forces are distributed for this triangle
force[tri[1]] += (pt3-pt1)*val
force[tri[2]] += (pt1-pt2)*val
force[tri[3]] += (pt2-pt3)*val
end
end
```

This functions runs very fast (~300 microsec on my machine for 10k nodes, 15k triangles), but I’m trying to parallelize it by using graph coloring techniques to avoid race conditions (see https://paulino.ce.gatech.edu/education_resources/GreedyColoring/GreedyColoring.pdf).

So I’ve used a greedy coloring algorithm to group the triangles that do not share any nodes, and I’ve written this function:

```
function distribute_par!(force, coords, tris, coefs, colors)
for col_tri in colors
# Each color is safe to assemble threaded
@Threads.threads for t_idx in col_tri
tri = tris[t_idx]
pt1 = coords[tri[1]]
pt2 = coords[tri[2]]
pt3 = coords[tri[3]]
val = coefs[t_idx]*norm(pt1+pt2+pt3)
force[tri[1]] += (pt3-pt1)*val
force[tri[2]] += (pt1-pt2)*val
force[tri[3]] += (pt2-pt3)*val
end
end
end
```

However the performance is not on par with the sequential version. With 100 threads, this is about 6 times slower:

```
sequential version: 334.567 μs (0 allocations: 0 bytes)
threaded version: 1.842 ms (2416 allocations: 233.58 KiB)
```

I feel like allocations are the issue, as each thread needs to have access to part of the position array and update part of the force array. I’ve attempted to put the used values in buffers specific to each thread but it only made things worse.

Two questions:

- can the sequential version be optimized further?
- is there a chance that the parallel version could outperform the sequential version?

The complete minimal working example is below (note that I’m only interested in optimizing the distribute functions.

Thanks!

```
using StaticArrays
using LinearAlgebra
using DataStructures
using LightGraphs
using BenchmarkTools
const Vec3D = SVector{3,Float64}
const TriIdxs = SVector{3,Int64}
function greedy_coloring(elements, n_threads::Int64)
cell_containing_node = OrderedDict{Int, Set{Int}}()
let idx = 1
for nodes in elements
for v in nodes
if !haskey(cell_containing_node, v)
cell_containing_node[v] = Set{Int}()
end
push!(cell_containing_node[v], idx)
end
idx+=1
end
end
G = SimpleGraph(length(elements))
for (node, cells) in cell_containing_node
for cell1 in cells # All these cells have a neighboring node
for cell2 in cells
if cell1 != cell2
add_edge!(G, cell1, cell2)
end
end
end
end
v = nv(G)
colors = OrderedDict{Int, Vector{Int}}()
colors[1] = [1]
elcolor = zeros(Int, v)
elcolor[1] = 1
available = BitArray(undef, v)
colorcount = zeros(Int, v)
colorcount[1] = 1
for i = 2:v
for j in inneighbors(G, i)
if elcolor[j] != 0
available[elcolor[j]] = false
end
end
for cr = 1:v
if (available[cr] == true) & (colorcount[cr] < n_threads)
if !haskey(colors, cr)
colors[cr] = []
end
push!(colors[cr], i)
elcolor[i] = cr
colorcount[cr] += 1
break
end
end
fill!(available, true)
end
return [els for (cridx, els) in colors]
end
function distribute_seq!(force, coords, tris, coefs)
for t_idx in 1:length(tris) # Loop for all triangles
tri = tris[t_idx] # Contains the 3 nodal indices of the triangle
# Current position of these 3 points
pt1 = coords[tri[1]]
pt2 = coords[tri[2]]
pt3 = coords[tri[3]]
# Some operations involving the coordinates
val = coefs[t_idx]*norm(pt1+pt2+pt3)
# Nodal forces are distributed for this triangle
force[tri[1]] += (pt3-pt1)*val
force[tri[2]] += (pt1-pt2)*val
force[tri[3]] += (pt2-pt3)*val
end
end
function distribute_par!(force, coords, tris, coefs, colors)
for col_tri in colors
# Each color is safe to assemble threaded
@Threads.threads for t_idx in col_tri
tri = tris[t_idx]
pt1 = coords[tri[1]]
pt2 = coords[tri[2]]
pt3 = coords[tri[3]]
val = coefs[t_idx]*norm(pt1+pt2+pt3)
force[tri[1]] += (pt3-pt1)*val
force[tri[2]] += (pt1-pt2)*val
force[tri[3]] += (pt2-pt3)*val
end
end
end
Nn = 10000 # Number of nodes
Nels = 15000 # Number of triangles
n_threads = 100
coords = [Vec3D(rand(3)) for _ in 1:Nn]
tris = [TriIdxs(rand(1:Nn,3)) for _ in 1:Nels] # Connectivity (node indices for each triangle)
force = [Vec3D(zeros(3)) for _ in 1:Nn]
coefs = [rand() for _ in 1:Nels]
colors = greedy_coloring(tris, n_threads) # Each color contains triangle indices that do not share nodes
@btime distribute_seq!($force, $coords, $tris, $coefs)
@btime distribute_par!($force, $coords, $tris, $coefs, $colors)
```