Parallel assembly of a finite element sparse matrix

There was a thread recently (Behavior of threads) concerning a problem of interest to me: how to teach a serial finite element code to perform the assembly of a global sparse matrix in parallel (on shared memory computers).

There were several good ideas (many thanks!), which in the end served as the basis of the implementation demonstrated here on a heat conduction problem.

Some salient observations:

  1. The serial code required no modifications.
  2. The implementation is based on tasks. There was no need to know which task was running the code.
  3. A handful of serial pieces of computation remain, which of course limits the available parallel speed up. In particular, the construction of the sparse matrix from a triplet of vectors does not run in parallel.
  4. There is also a puzzle: sometimes a few of the tasks take quite a bit longer than the others. For instance, running with sixteen tasks, 5 tasks take ~11 seconds each, while the other 11 tasks take over 16 seconds each. This was on a machine with 64 cores and plenty of memory, and with, of course, a perfectly balanced load.

Any ideas on how to address the serial parts of the algorithm would be appreciated, and collaboration to implement the most promising leads would be welcome.
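To make the approach concrete, here is a minimal sketch of the idea (my own illustration, not the FinEtools code; `assemble_parallel`, `elem_dofs`, and `elem_matrix` are hypothetical names): each task fills its own disjoint slice of the COO triplet, so there are no data races, and the final `sparse()` call is the remaining serial step.

```julia
using SparseArrays, Base.Threads

# Sketch of partitioned COO assembly: each task writes its own slice of
# the triplet (I, J, V), so no synchronization is needed in the loop.
function assemble_parallel(nelems, ndofs, elem_dofs, elem_matrix)
    npe = size(elem_matrix, 1)            # dofs per element
    nnz_per_elem = npe^2
    I = Vector{Int}(undef, nelems * nnz_per_elem)
    J = similar(I)
    V = Vector{Float64}(undef, length(I))
    chunks = collect(Iterators.partition(1:nelems, cld(nelems, nthreads())))
    @sync for chunk in chunks
        Threads.@spawn for e in chunk
            dofs = elem_dofs(e)           # global dof numbers of element e
            base = (e - 1) * nnz_per_elem
            k = 0
            for j in 1:npe, i in 1:npe
                k += 1
                I[base+k] = dofs[i]
                J[base+k] = dofs[j]
                V[base+k] = elem_matrix[i, j]
            end
        end
    end
    return sparse(I, J, V, ndofs, ndofs)  # serial COO -> CSC conversion
end
```

For a chain of 1D two-node elements, `assemble_parallel(4, 5, e -> (e, e + 1), [1.0 -1.0; -1.0 1.0])` assembles the usual tridiagonal stiffness matrix.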


One general thing: Did you pin your Julia threads (i.e. use something like JULIA_EXCLUSIVE=1 or ThreadPinning.jl)? If not, does it make a difference?
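For reference, pinning would look something like this (a sketch, assuming ThreadPinning.jl is installed; alternatively, start Julia with the environment variable JULIA_EXCLUSIVE=1 set to pin threads at startup):

```julia
using ThreadPinning

pinthreads(:cores)   # pin each Julia thread to its own physical core
```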

What CPU(s) is it exactly?

Btw, a more minimal example would be helpful. A number of the “range splitting” functions are not defined in the linked file, afaics.
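For anyone wanting to reproduce the partitioning, a stand-in for the missing range-splitting helpers could look like this (my own sketch and naming, not the functions from the linked file): split `1:n` into `p` contiguous, nearly equal chunks, one per task.

```julia
# Split 1:n into p contiguous, nearly equal ranges (chunk lengths differ
# by at most one element).
function chunkranges(n::Integer, p::Integer)
    starts = round.(Int, range(1, n + 1; length = p + 1))
    return [starts[i]:starts[i+1]-1 for i in 1:p]
end
```

For example, `chunkranges(10, 3)` yields three ranges that together cover `1:10`.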

The complete example could be run as

using Pkg   # needed for Pkg.activate and Pkg.instantiate below

if !isdir("FinEtoolsHeatDiff.jl")
    run(`git clone https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl.git`)
end
cd("FinEtoolsHeatDiff.jl/examples")
println("Current folder: $(pwd())")
Pkg.activate(".")
Pkg.instantiate()
@show Threads.nthreads()
include(joinpath(pwd(), "steady_state/3-d/Poisson_examples.jl"))
using .Main.Poisson_examples; Main.Poisson_examples.Poisson_FE_H20_parass_example()
using .Main.Poisson_examples; Main.Poisson_examples.Poisson_FE_H20_parass_example() # second run excludes compilation time

No, I did not.

There must be something wrong with that particular hardware. It was

“4 socket(s), 64 cores, AMD Opteron™ Processor 6380”,
“RAM = 252 GB, Cache: L3 = 6 MB L2 = 2048kiB L1 = 16 KB”

The computation scaled much better on this machine:

2x Intel Xeon E5 2670
Cores 8
Code Name Sandy Bridge-EP/EX
Package Socket 2011 LGA
Technology 32nm
Specification Intel Xeon CPU E5-2670 0 @ 2.60GHz
L1 Data Cache Size 8 x 32 KBytes
L1 Instructions Cache Size 8 x 32 KBytes
L2 Unified Cache Size 8 x 256 KBytes
L3 Unified Cache Size 20480 KBytes
255 GB DDR3

All tasks ran for the same wall-clock time on the second machine (which ran WSL2, btw). The part that ran in the tasks took 52.2 s on 1 task, 26.4 s on 2 tasks, 13.2 s on 4 tasks, and 7.8 s on 8 tasks.

Alas, I spoke too soon. The other machine displays similar random delays of individual tasks. Here is an example: the computation was done twice with 4 tasks. The first time, all tasks ran in the same time; the second time, one task (#3) was significantly delayed.

Threads.nthreads() = 4
[ Info: All examples may be executed with
using .Main.Poisson_examples; Main.Poisson_examples.allrun()

First run:
Mesh generation
1000000 elements
Searching nodes  for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.12100005149841
[ Info: 1: Before conductivity 2.9690001010894775
[ Info: 3: Before conductivity 2.9690001010894775
[ Info: 4: Before conductivity 2.9850001335144043
[ Info: 2: Before conductivity 3.0
[ Info: 2: After conductivity 15.765000104904175
[ Info: 1: After conductivity 15.781000137329102
[ Info: 3: After conductivity 15.812000036239624
[ Info: 4: After conductivity 15.858999967575073
[ Info: After sync 15.858999967575073
[ Info: All done 38.342000007629395

Second run:
Mesh generation
1000000 elements
Searching nodes  for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.49599981307983
[ Info: 2: Before conductivity 2.8430001735687256
[ Info: 1: Before conductivity 2.8430001735687256
[ Info: 4: Before conductivity 2.8430001735687256
[ Info: 3: Before conductivity 2.8430001735687256
[ Info: 4: After conductivity 15.28000020980835
[ Info: 2: After conductivity 15.343000173568726
[ Info: 1: After conductivity 15.343000173568726
[ Info: 3: After conductivity 28.076000213623047
[ Info: After sync 28.076000213623047
[ Info: All done 51.075000047683716

What is happening? Perhaps the GC kicks in there for some reason? Or is the scheduler not working as well as it should?

It looks like the GC is not at fault here: I turned it off before the loop over the tasks
(and back on after the loop): the hanging task is still there, showing up randomly.
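For completeness, the disabling was along these lines (a sketch of the general pattern, not the exact code):

```julia
# Disable the GC around the parallel section, restore it afterwards.
GC.enable(false)           # returns the previous GC state
try
    # ... spawn the assembly tasks and wait for them here ...
finally
    GC.enable(true)
    GC.gc()                # optional: collect the garbage that piled up
end
```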

Again, what happens if you do pin the threads? Are the timings more stable?

Sorry, it wasn’t clear that that was a request to try it. Will do.

Worse. Now I haven’t seen equal times for the tasks even once: There are always two groups of tasks, the slow ones and the fast ones.

Just as a reference, Threaded Assembly · Ferrite.jl has an example of threaded assembly using Ferrite.jl. It uses a coloring algorithm (where no elements of the same color share dofs) so that it can run the assembly in parallel as well.

Thank you, Kristoffer. I am aware.

We approached the problem from different viewpoints. In your case the serial part is front-loaded (the creation of the sparsity pattern); in my case the computation is partitioned with no data races, but the triplet then needs to be converted into a CSC matrix (the serial part).
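The convenience of the triplet route is that `sparse` sums duplicate (i, j) entries, which is exactly the assembly semantics; a tiny illustration:

```julia
using SparseArrays

# Duplicate (i, j) pairs in the triplet are summed by sparse(), which is
# precisely what finite element assembly requires.
I = [1, 1, 2, 2]
J = [1, 1, 2, 2]
V = [1.0, 2.0, 3.0, 4.0]
A = sparse(I, J, V, 2, 2)   # A[1,1] == 3.0, A[2,2] == 7.0
```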

I noticed that you get a similar sort of parallel scaling to what I do, which is good. But I wonder how you measured it, and whether you also see the irregularities in the wall-clock time consumed by individual threads?

In this issue are a lot of timings for threaded assembly with Ferrite.jl: Threaded Assembly Performance Degradation · Issue #526 · Ferrite-FEM/Ferrite.jl · GitHub Take a look especially here: Threaded Assembly Performance Degradation · Issue #526 · Ferrite-FEM/Ferrite.jl · GitHub


Below is the information on how to run the example. I would be very much interested in measurements performed on another machine(s).

Alas, LinuxPerf.jl appears to be broken. Edit: Looks like it is not available for WSL2.

Interesting! I implemented an alternative way based on threads. This works quite well, at least the irregularities (slow tasks mixed in with fast tasks) are gone: all threads run in the same time.
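The pattern is schematically as follows (a toy reduction for illustration, not the actual conductivity code): `@threads :static` gives each thread one fixed contiguous chunk and a private accumulator, so there is no dynamic scheduling to go wrong.

```julia
using Base.Threads

# Static thread-based partitioning: each thread owns one contiguous chunk
# and one slot of `partials`, so there are no races and no task migration.
function partial_sums(v::Vector{Float64})
    p = nthreads()
    partials = zeros(p)
    @threads :static for tid in 1:p
        lo = (tid - 1) * length(v) ÷ p + 1
        hi = tid * length(v) ÷ p
        s = 0.0
        for i in lo:hi
            s += v[i]
        end
        partials[tid] = s      # each thread writes only its own slot
    end
    return sum(partials)
end
```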

Here is the parallel speed-up for the thread implementation:

[Figure: parallel speed-up vs. number of threads]
Note well that this covers only the computation of the conductivity matrix and its assembly into the COO format.
Because the conversion into the CSC format is not parallelized, the overall speed-up is quite a bit worse.

Machine used in the above graph:

2x Intel Xeon E5 2670
Cores 8
Code Name Sandy Bridge-EP/EX
Package Socket 2011 LGA
Technology 32nm
Specification Intel Xeon CPU E5-2670 0 @ 2.60GHz
L1 Data Cache Size 8 x 32 KBytes
L1 Instructions Cache Size 8 x 32 KBytes
L2 Unified Cache Size 8 x 256 KBytes
L3 Unified Cache Size 20480 KBytes
255 GB DDR3

This machine was running WSL2 under Windows 10.

Linear heat conduction problem. 343000 serendipity quadratic elements with 3x3x3 Gauss quadrature.

Open question: what is wrong with the task-based implementation? Why are some tasks much slower than others in the same batch?

References:
The task loop: https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl/blob/d041cd06035547e7bdb1422a94daf006594f1393/examples/steady_state/3-d/Poisson_examples.jl#L336
The thread loop: https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl/blob/d041cd06035547e7bdb1422a94daf006594f1393/examples/steady_state/3-d/Poisson_examples.jl#L479


Further data for the same problem. This machine:
4 socket(s), 64 cores, AMD Opteron™ Processor 6380,
RAM = 252 GB, Cache: L3 = 6 MB L2 = 2048kiB L1 = 16 KB

Again, the thread-based loop.

Edit: Extended to more threads.

Am I reading the results correctly, i.e. is the Xeon giving worse multi-threaded performance?