Parallelizing a force assembly function

Hello, I am writing a physical simulation code that I’m trying to optimize.

The most critical function, called at every integration step, is the force assembly: for every triangle of the mesh, given the nodal positions, do a bit of linear algebra and compute the forces at each node of this triangle. The total nodal forces are then the contributions of all the forces of all triangles. A minimal working example of this function is as follows:

function distribute_seq!(force, coords, tris, coefs)
    for t_idx in 1:length(tris) # Loop over all triangles
        tri = tris[t_idx] # Contains the 3 nodal indices for the triangle
        # Current position of these 3 points
        pt1 = coords[tri[1]]
        pt2 = coords[tri[2]]
        pt3 = coords[tri[3]]
        # Some dummy operations involving the coordinates
        val = coefs[t_idx]*norm(pt1+pt2+pt3)
        # Nodal forces are distributed for this triangle
        force[tri[1]] += (pt3-pt1)*val
        force[tri[2]] += (pt1-pt2)*val
        force[tri[3]] += (pt2-pt3)*val
    end
end

This function runs very fast (~300 μs on my machine for 10k nodes, 15k triangles), but I’m trying to parallelize it using graph coloring techniques to avoid race conditions (see https://paulino.ce.gatech.edu/education_resources/GreedyColoring/GreedyColoring.pdf).

So I’ve used a greedy coloring algorithm to group the triangles that do not share any nodes, and I’ve written this function:

function distribute_par!(force, coords, tris, coefs, colors)
    for col_tri in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in col_tri
            tri = tris[t_idx]
            pt1 = coords[tri[1]]
            pt2 = coords[tri[2]]
            pt3 = coords[tri[3]]
            val = coefs[t_idx]*norm(pt1+pt2+pt3)
            force[tri[1]] += (pt3-pt1)*val
            force[tri[2]] += (pt1-pt2)*val
            force[tri[3]] += (pt2-pt3)*val
        end
    end
end

However, the performance is not on par with the sequential version. With 100 threads, it is about 6 times slower:

sequential version: 334.567 μs (0 allocations: 0 bytes)
threaded version:  1.842 ms (2416 allocations: 233.58 KiB)

I feel like allocations are the issue, as each thread needs to have access to part of the position array and update part of the force array. I’ve attempted to put the used values in buffers specific to each thread but it only made things worse.

Two questions:

  • can the sequential version be optimized further?
  • is there a chance that the parallel version could outperform the sequential version?

The complete minimal working example is below (note that I’m only interested in optimizing the distribute functions).

Thanks!

using StaticArrays
using LinearAlgebra
using DataStructures
using LightGraphs

using BenchmarkTools


const Vec3D = SVector{3,Float64}
const TriIdxs = SVector{3,Int64}


function greedy_coloring(elements, n_threads::Int64)

    # For each node, collect the cells (triangles) that contain it
    cell_containing_node = OrderedDict{Int, Set{Int}}()
    for (idx, nodes) in enumerate(elements)
        for v in nodes
            if !haskey(cell_containing_node, v)
                cell_containing_node[v] = Set{Int}()
            end
            push!(cell_containing_node[v], idx)
        end
    end

    # Build the conflict graph: two cells are connected if they share a node
    G = SimpleGraph(length(elements))
    for (node, cells) in cell_containing_node
        for cell1 in cells # All these cells share this node
            for cell2 in cells
                if cell1 != cell2
                    add_edge!(G, cell1, cell2)
                end
            end
        end
    end

    # Greedy coloring: give each cell the lowest color not used by its neighbors,
    # capping the number of cells per color at n_threads
    v = nv(G)
    colors = OrderedDict{Int, Vector{Int}}()
    colors[1] = [1]
    elcolor = zeros(Int, v)
    elcolor[1] = 1
    available = trues(v) # all colors initially available
    colorcount = zeros(Int, v)
    colorcount[1] = 1
    for i = 2:v
        for j in inneighbors(G, i)
            if elcolor[j] != 0
                available[elcolor[j]] = false
            end
        end
        for cr = 1:v
            if available[cr] && colorcount[cr] < n_threads
                if !haskey(colors, cr)
                    colors[cr] = Int[]
                end
                push!(colors[cr], i)
                elcolor[i] = cr
                colorcount[cr] += 1
                break
            end
        end
        fill!(available, true)
    end
    return [els for (cridx, els) in colors]

end



function distribute_seq!(force, coords, tris, coefs)
    for t_idx in 1:length(tris) # Loop over all triangles
        tri = tris[t_idx] # Contains the 3 nodal indices of the triangle
        # Current position of these 3 points
        pt1 = coords[tri[1]]
        pt2 = coords[tri[2]]
        pt3 = coords[tri[3]]
        # Some operations involving the coordinates
        val = coefs[t_idx]*norm(pt1+pt2+pt3)
        # Nodal forces are distributed for this triangle
        force[tri[1]] += (pt3-pt1)*val
        force[tri[2]] += (pt1-pt2)*val
        force[tri[3]] += (pt2-pt3)*val
    end
end


function distribute_par!(force, coords, tris, coefs, colors)
    for col_tri in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in col_tri
            tri = tris[t_idx]
            pt1 = coords[tri[1]]
            pt2 = coords[tri[2]]
            pt3 = coords[tri[3]]
            val = coefs[t_idx]*norm(pt1+pt2+pt3)
            force[tri[1]] += (pt3-pt1)*val
            force[tri[2]] += (pt1-pt2)*val
            force[tri[3]] += (pt2-pt3)*val
        end
    end
end


Nn = 10000 # Number of nodes
Nels = 15000 # Number of triangles

n_threads = 100

coords = [Vec3D(rand(3)) for _ in 1:Nn]
tris = [TriIdxs(rand(1:Nn,3)) for _ in 1:Nels] # Connectivity (node indices for each triangle)

force = [Vec3D(zeros(3)) for _ in 1:Nn]
coefs = [rand() for _ in 1:Nels]

colors = greedy_coloring(tris, n_threads) # Each color contains triangle indices that do not share nodes

@btime distribute_seq!($force, $coords, $tris, $coefs)
@btime distribute_par!($force, $coords, $tris, $coefs, $colors)

Do you have 100 cores on your machine?


No, only 2 on my laptop; however, with 2 threads the performance is even worse. Strangely enough, I get this behaviour:

julia> Threads.nthreads()
2

julia> for th in [1,2,4,10,100,1000]
       colors = greedy_coloring(tris, th)
       @btime distribute_par!($force, $coords, $tris, $coefs, $colors)
       end
  305.755 ms (240136 allocations: 22.66 MiB)
  110.130 ms (120012 allocations: 11.33 MiB)
  59.405 ms (60000 allocations: 5.66 MiB)
  24.423 ms (24000 allocations: 2.27 MiB)
  2.608 ms (2416 allocations: 233.58 KiB)
  797.967 μs (320 allocations: 30.94 KiB)

It seems that you’re always running with 2 threads and only the coloring is affected by th? In that case, just increasing th to an even bigger number may help, judging from your benchmark.

Yes, sorry, it wasn’t clear: the variable n_threads (or th) in my code is not the number of threads but rather the maximum allowable number of triangles in each color.

I can set a very large value for this number; however, there is a limit below which the number of colors can’t decrease, due to the connectivity of the mesh. In this example, I can’t have fewer than ~15 colors, which means that the outer loop will run 15 times, with the threaded inner loop running on ~1k triangles each. Still, performance is ~2x worse than the sequential version.

What I don’t understand is why this threaded pseudo-code:

for all 15 colors
    for ~1000 triangles in this color, running in parallel
        compute and distribute the force on the triangle nodes
    end
end

is slower than:

for 15000 triangles
    compute and distribute the force on the triangle nodes
end

But you do want as few colors as possible. Anyway, the takeaway message here is that you are doing too little work per “color” for the overhead in threading to be worth it. You can see this from your benchmark: performance greatly improved as you reduced the number of colors.

Just to give each thread a bit more work I inserted

    val = coefs[t_idx]*norm(pt1+pt2+pt3)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)
    val += sin(val)

into the force calculation and now I get (with 15 colors):

  3.051 ms (0 allocations: 0 bytes)
  635.864 μs (870 allocations: 90.23 KiB) # 8 threads

A force assembly like this is so extremely simple that it is unlikely to bottleneck anything. In real applications, you typically have a quadrature loop, local shape functions being evaluated, etc., which pushes up the amount of work in the force function enough to make threading worth it.
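
For illustration only, a per-element routine with a quadrature loop might look roughly like the sketch below. None of this is from your code: quad_weights and shape_grads are made-up placeholders and the math is dummy work; the point is just that the inner loop gives each element enough arithmetic to amortize the threading overhead.

# Hypothetical sketch (reusing the Vec3D alias from the MWE above): a per-element
# force routine with a quadrature loop, just to illustrate "more work per element".
function element_force(pts::NTuple{3,Vec3D}, coef, quad_weights, shape_grads)
    f1 = f2 = f3 = zero(Vec3D)
    for q in eachindex(quad_weights)            # loop over quadrature points
        g = shape_grads[q]                      # 3 shape-function weights (SVector{3,Float64})
        x = g[1]*pts[1] + g[2]*pts[2] + g[3]*pts[3]
        val = coef * quad_weights[q] * norm(x)  # dummy per-quadrature-point work
        f1 += val * g[1] * x
        f2 += val * g[2] * x
        f3 += val * g[3] * x
    end
    return f1, f2, f3
end

The nodal contributions returned from such a routine would then be scattered into the global force vector exactly as in your distribute functions.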


Also, this example of threaded assembly using greedy coloring might be interesting:

http://kristofferc.github.io/JuAFEM.jl/dev/examples/threaded_assembly/

Thanks, this is extremely useful!

You’re right that in my case the threaded version is only useful when there is more work to do, but as I use an explicit integrator on linear triangle elements, there is actually not so much to do in the force assembly, even in my more complete version.

Following your example, do you think having scratch values in my case would be beneficial to the threaded version, or would it not change much due to the simplicity of each assembly?

Also, regarding the memory layout of my arrays for the threaded version: is it recommended to reorder the elements of each color so that they are contiguous in memory?
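
To make that question concrete, here is roughly what I mean by reordering (only a sketch, using tris, coefs and colors from my first MWE; whether it actually helps is exactly what I’m asking):

# Sketch: permute the per-triangle arrays so that each color occupies a
# contiguous range of indices (tris, coefs, colors are from the first MWE).
perm = reduce(vcat, colors)                    # triangle indices grouped by color
tris_sorted  = tris[perm]
coefs_sorted = coefs[perm]
# After the permutation, color k corresponds to a simple index range:
stops  = cumsum(length.(colors))
starts = [1; stops[1:end-1] .+ 1]
color_ranges = [starts[k]:stops[k] for k in 1:length(colors)]
# The threaded loop would then iterate over color_ranges[k] (a UnitRange)
# instead of a Vector{Int}, touching tris_sorted/coefs_sorted contiguously.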

I wonder if having an assembly buffer separate for each thread would reduce memory contention?

So I’ve run a few more tests on this with the following changes:

  • modified the force distribution functions so that they perform the same operations as in my explicit FE code
  • always use the minimum number of colors (or the maximum number of elements in each color)
  • added another threaded distribution function with buffers for each thread (did I do this right?)
  • used a machine with 12 cores

I get the following:

for n in 1 2 4 8 12
do
    JULIA_NUM_THREADS=$n julia mwe.jl
done

1 Thread:
Timing sequential distribute:   550.281 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.090 ms (126 allocations: 12.03 KiB)
Timing threaded distribute with buffers:   1.227 ms (126 allocations: 11.16 KiB)

2 Threads:
Timing sequential distribute:   570.101 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.259 ms (240 allocations: 23.91 KiB)
Timing threaded distribute with buffers:   714.997 μs (242 allocations: 23.00 KiB)

4 Threads:
Timing sequential distribute:   561.593 μs (0 allocations: 0 bytes)
Timing threaded distribute:   792.371 μs (450 allocations: 46.17 KiB)
Timing threaded distribute with buffers:   510.388 μs (450 allocations: 45.23 KiB)

8 Threads:
Timing sequential distribute:   490.096 μs (0 allocations: 0 bytes)
Timing threaded distribute:   632.889 μs (812 allocations: 84.66 KiB)
Timing threaded distribute with buffers:   502.181 μs (812 allocations: 83.78 KiB)

12 Threads:
Timing sequential distribute:   496.732 μs (0 allocations: 0 bytes)
Timing threaded distribute:   690.010 μs (1204 allocations: 126.22 KiB)
Timing threaded distribute with buffers:   603.200 μs (1204 allocations: 125.34 KiB)

Having buffers for the threads helps a bit, but the performance is still barely on par with the sequential version. Is it really hopeless to parallelize this kind of function when there is so little work per element?

Code for reference:

using StaticArrays
using LinearAlgebra
using DataStructures
using LightGraphs

using BenchmarkTools


const Vec3D = SVector{3,Float64}
const Mat33 = SArray{Tuple{3,3},Float64,2,9}
const Idxs = SVector{3,Int64}

function greedy_coloring(elements)

    # For each node, collect the cells (triangles) that contain it
    cell_containing_node = OrderedDict{Int, Set{Int}}()
    for (idx, nodes) in enumerate(elements)
        for v in nodes
            if !haskey(cell_containing_node, v)
                cell_containing_node[v] = Set{Int}()
            end
            push!(cell_containing_node[v], idx)
        end
    end

    # Build the conflict graph: two cells are connected if they share a node
    G = SimpleGraph(length(elements))
    for (node, cells) in cell_containing_node
        for cell1 in cells # All these cells share this node
            for cell2 in cells
                if cell1 != cell2
                    add_edge!(G, cell1, cell2)
                end
            end
        end
    end

    # Greedy coloring: give each cell the lowest color not used by its neighbors
    v = nv(G)
    colors = OrderedDict{Int, Vector{Int}}()
    colors[1] = [1]
    elcolor = zeros(Int, v)
    elcolor[1] = 1
    available = trues(v) # all colors initially available
    colorcount = zeros(Int, v)
    colorcount[1] = 1
    for i = 2:v
        for j in inneighbors(G, i)
            if elcolor[j] != 0
                available[elcolor[j]] = false
            end
        end
        for cr = 1:v
            if available[cr]
                if !haskey(colors, cr)
                    colors[cr] = Int[]
                end
                push!(colors[cr], i)
                elcolor[i] = cr
                colorcount[cr] += 1
                break
            end
        end
        fill!(available, true)
    end
    return [els for (cridx, els) in colors]

end



function distribute_seq!(forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs)

    for t_idx in 1:length(node_idxs)

        ni = node_idxs[t_idx]

        pos = positions[ni]
        vel = velocities[ni]

        p1 = pw1[t_idx]
        p2 = pw2[t_idx]

        U = p1[1]*pos[1] + p1[2]*pos[2] + p1[3]*pos[3]
        V = p2[1]*pos[1] + p2[2]*pos[2] + p2[3]*pos[3]

        Ur = p1[1]*vel[1] + p1[2]*vel[2] + p1[3]*vel[3]
        Vr = p2[1]*vel[1] + p2[2]*vel[2] + p2[3]*vel[3]

        sr11 = dot(U,Ur)
        sr22 = dot(V,Vr)

        bvpress = visco*fac[t_idx]*(sr11+sr22)

        eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
        sig = stiffness[t_idx]*eps

        for i in 1:3
            forces[ni[i]] -= area[t_idx]*(sig[1]+bvpress)*p1[i]*U + (sig[2]+bvpress)*p2[i]*V + sig[3]*(p1[i]*V + p2[i]*U)
        end

    end
end


function distribute_par!(forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs, colors)

    for col_tri in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in col_tri

            ni = node_idxs[t_idx]

            pos = positions[ni]
            vel = velocities[ni]

            p1 = pw1[t_idx]
            p2 = pw2[t_idx]

            U = p1[1]*pos[1] + p1[2]*pos[2] + p1[3]*pos[3]
            V = p2[1]*pos[1] + p2[2]*pos[2] + p2[3]*pos[3]

            Ur = p1[1]*vel[1] + p1[2]*vel[2] + p1[3]*vel[3]
            Vr = p2[1]*vel[1] + p2[2]*vel[2] + p2[3]*vel[3]

            sr11 = dot(U,Ur)
            sr22 = dot(V,Vr)

            bvpress = visco*fac[t_idx]*(sr11+sr22)

            eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
            sig = stiffness[t_idx]*eps

            for i in 1:3
                forces[ni[i]] -= area[t_idx]*(sig[1]+bvpress)*p1[i]*U + (sig[2]+bvpress)*p2[i]*V + sig[3]*(p1[i]*V + p2[i]*U)
            end

        end
    end
end


struct ScratchValues
    forces::Vector{Vec3D}
    positions::Vector{Vec3D}
    velocities::Vector{Vec3D}
    pw1::Vector{Vec3D}
    pw2::Vector{Vec3D}
    visco::Float64
    fac::Vector{Float64}
    stiffness::Vector{Mat33}
    area::Vector{Float64}
end


function create_scratchvalues(positions, velocities, pw1, pw2, visco, fac, stiffness, area)
    nthreads = Threads.nthreads()
    nn = length(positions)
    forces = [[zeros(Vec3D) for i in 1:nn] for i in 1:nthreads]
    positions = [positions for i in 1:nthreads]
    velocities = [velocities for i in 1:nthreads]
    pw1 = [pw1 for i in 1:nthreads]
    pw2 = [pw2 for i in 1:nthreads]
    visco = [visco for i in 1:nthreads]
    fac = [fac for i in 1:nthreads]
    stiffness = [stiffness for i in 1:nthreads]
    area = [area for i in 1:nthreads]
    return [ScratchValues(forces[i], positions[i], velocities[i], pw1[i], pw2[i], visco[i], fac[i], stiffness[i], area[i]) for i in 1:nthreads]
end


function distribute_tri!(scratch, ni, t_idx)

    f, pos, vel, pw1, pw2, visco, fac, stiffness, area = scratch.forces, scratch.positions, scratch.velocities, scratch.pw1, scratch.pw2, scratch.visco, scratch.fac, scratch.stiffness, scratch.area

    U = pw1[t_idx][1]*pos[ni[1]] + pw1[t_idx][2]*pos[ni[2]] + pw1[t_idx][3]*pos[ni[3]]
    V = pw2[t_idx][1]*pos[ni[1]] + pw2[t_idx][2]*pos[ni[2]] + pw2[t_idx][3]*pos[ni[3]]

    Ur = pw1[t_idx][1]*vel[ni[1]] + pw1[t_idx][2]*vel[ni[2]] + pw1[t_idx][3]*vel[ni[3]]
    Vr = pw2[t_idx][1]*vel[ni[1]] + pw2[t_idx][2]*vel[ni[2]] + pw2[t_idx][3]*vel[ni[3]]

    sr11 = dot(U,Ur)
    sr22 = dot(V,Vr)

    bvpress = visco*fac[t_idx]*(sr11+sr22)

    eps = SVector{3,Float64}(0.5*(dot(U,U)-1.0), 0.5*(dot(V,V)-1.0), dot(U,V))
    sig = stiffness[t_idx]*eps

    for i in 1:3
        f[ni[i]] -= area[t_idx]*(sig[1]+bvpress)*pw1[t_idx][i]*U + (sig[2]+bvpress)*pw2[t_idx][i]*V + sig[3]*(pw1[t_idx][i]*V + pw2[t_idx][i]*U)
    end

end

function distribute_par_buf!(colors, scratches, node_idxs)
    for col_tri in colors
        # Each color is safe to assemble threaded
        Threads.@threads for t_idx in col_tri
            distribute_tri!(scratches[Threads.threadid()], node_idxs[t_idx], t_idx)
        end
    end
end



Nn = 10000 # Number of nodes
Nels = 15000 # Number of triangles

# Node fields
positions = [Vec3D(rand(3)) for _ in 1:Nn]
velocities = [Vec3D(rand(3)) for _ in 1:Nn]
forces = [Vec3D(zeros(3)) for _ in 1:Nn]


# Connectivity (node indices for each triangle)
node_idxs = [Idxs(rand(1:Nn,3)) for _ in 1:Nels]

# Triangle fields
stiffness = [Mat33(rand(3,3)) for _ in 1:Nels]
pw1 = [Vec3D(rand(3)) for _ in 1:Nels]
pw2 = [Vec3D(rand(3)) for _ in 1:Nels]
visco = 0.1
area = [rand() for _ in 1:Nels]
fac = [rand() for _ in 1:Nels]


# Each color contains triangle indices that do not share nodes
colors = greedy_coloring(node_idxs)

# Buffer values for each thread
scratches = create_scratchvalues(positions, velocities, pw1, pw2, visco, fac, stiffness, area)

@btime distribute_seq!($forces, $positions, $velocities, $pw1, $pw2, $visco, $fac, $area, $stiffness, $node_idxs)
@btime distribute_par!($forces, $positions, $velocities, $pw1, $pw2, $visco, $fac, $area, $stiffness, $node_idxs, $colors)
@btime distribute_par_buf!($colors, $scratches, $node_idxs)

When you have a per-thread assembly buffer, you don’t care whether the elements write to the “same” locations, so the coloring is not needed. The price to pay is that you need to add those buffers together at the end, but that reduction can again be parallelized with threads.

Hi, thanks for the feedback!
You’re right: I managed to get a nice speedup by using buffers without the colors, even with the cost of adding the buffers together. I noticed that there is no benefit in putting arrays that are only read by the function into the buffers (pw1, pw2, visco, fac, area, stiffness, …), but there is a huge benefit in putting the force array there, because the threads actively write to it.

So by having force buffers as:

scratches = [[zeros(Vec3D) for i in 1:Nn] for i in 1:Threads.nthreads()]

And the buffered assembly as:

function distribute_par_buf!(scratches, forces, positions, velocities, pw1, pw2, visco, fac, area, stiffness, node_idxs)

    # Distribute forces into the per-thread buffers
    @inbounds Threads.@threads for t_idx in 1:length(pw1)
        ni = node_idxs[t_idx]
        distribute_tri!(scratches[Threads.threadid()], ni, positions[ni], velocities[ni], pw1[t_idx], pw2[t_idx], visco, fac[t_idx], area[t_idx], stiffness[t_idx])
    end

    # Sum forces from all scratches
    @inbounds Threads.@threads for i in 1:length(forces)
        @inbounds for j in 1:Threads.nthreads()
            forces[i] += scratches[j][i]
        end
    end

end

I now have these timings:

1 Thread:
Timing sequential distribute:  545.207 μs (0 allocations: 0 bytes)
Timing threaded distribute:    1.094 ms (126 allocations: 12.03 KiB)
Timing threaded distribute with buffers:    584.261 μs (18 allocations: 1.69 KiB)

2 Threads:
Timing sequential distribute:  497.919 μs (0 allocations: 0 bytes)
Timing threaded distribute:   1.237 ms (240 allocations: 23.91 KiB)
Timing threaded distribute with buffers:  434.119 μs (32 allocations: 3.16 KiB)

4 Threads:
Timing sequential distribute:    504.540 μs (0 allocations: 0 bytes)
Timing threaded distribute:   765.982 μs (420 allocations: 43.09 KiB)
Timing threaded distribute with buffers:   388.860 μs (60 allocations: 6.13 KiB)

8 Threads:
Timing sequential distribute:   496.583 μs (0 allocations: 0 bytes)
Timing threaded distribute:   667.276 μs (870 allocations: 90.70 KiB)
Timing threaded distribute with buffers:  307.065 μs (118 allocations: 12.09 KiB)

12 Threads:
Timing sequential distribute:  492.321 μs (0 allocations: 0 bytes)
Timing threaded distribute:  673.497 μs (1118 allocations: 117.20 KiB)
Timing threaded distribute with buffers:  294.560 μs (172 allocations: 18.00 KiB)

Coloring techniques might be more suited to a much higher number of threads, e.g. on a GPU.

One last question regarding threading: let’s say I have two functions that internally use @threads loops, funwithtreads1 and funwithtreads2. As these two functions do not share any inputs or outputs, I would like to execute them asynchronously, for example with the @spawn macro. Are @spawn and @threads compatible in Julia v1.3? Is there a way to limit the number of threads used by the @threads macro?
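
To make the question concrete, the pattern I have in mind is roughly the sketch below (funwithtreads1/funwithtreads2 and their inputs are just placeholders for my two independent functions; whether the @threads loops inside them play well with @spawn on 1.3 is exactly what I’m unsure about):

# Sketch of the intended pattern: run the two independent, internally threaded
# functions concurrently and wait for both to finish.
t1 = Threads.@spawn funwithtreads1(inputs1)
t2 = Threads.@spawn funwithtreads2(inputs2)
wait(t1)
wait(t2)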

Thanks!