There was a thread recently (Behavior of threads) concerning a problem of interest to me: how to teach a serial finite element code to perform the assembly of a global sparse matrix in parallel (on shared-memory computers).
There were several good ideas (many thanks!), which in the end served as the basis of the implementation demonstrated here on a heat conduction problem.
Some salient observations:
The serial code required no modifications.
The implementation is based on tasks. There was no need to know which task was running the code.
A handful of serial pieces of computation remain, which of course limits the achievable parallel speed-up. In particular, the construction of the sparse matrix from the triplet of vectors does not run in parallel (see the sketch below).
There is also a puzzle: sometimes a few of the tasks take quite a bit longer than the others. For instance, running with sixteen tasks, 5 tasks take ~11 seconds each, while the other 11 tasks take over 16 seconds each. This is on a machine with 64 cores and plenty of memory, and, of course, with a perfectly balanced load.
Any ideas on how to address the serial parts of the algorithm would be appreciated, and collaboration to implement the most promising leads would be welcome.
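To make the structure concrete, here is a minimal sketch of this kind of task-based assembly, under the assumption that the elements have been split into chunks up front. The names `assemble_chunk!` and `element_chunks` are hypothetical placeholders for the unmodified serial assembly routine and for the partitioning; only the final `sparse()` call, which builds the CSC matrix from the merged triplets, is serial.

```julia
using SparseArrays

function parallel_assembly(element_chunks, ndofs)
    # Spawn one task per chunk; each task fills its own private COO buffers,
    # so no locking is needed during the element loop.
    tasks = map(eachindex(element_chunks)) do t
        Threads.@spawn begin
            I, J, V = Int[], Int[], Float64[]
            assemble_chunk!(I, J, V, element_chunks[t])  # unmodified serial FE code (placeholder)
            (I, J, V)
        end
    end
    triplets = fetch.(tasks)
    # Serial part: merge the per-task triplets and build the CSC matrix.
    I = reduce(vcat, (trip[1] for trip in triplets))
    J = reduce(vcat, (trip[2] for trip in triplets))
    V = reduce(vcat, (trip[3] for trip in triplets))
    return sparse(I, J, V, ndofs, ndofs)
end
```

With this structure the serial element routine is reused verbatim inside each task; the remaining serial bottleneck is the merge plus the `sparse()` call.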
The computation scaled much better on this machine:
2x Intel Xeon E5-2670
Cores 8
Code Name Sandy Bridge-EP/EX
Package Socket 2011 LGA
Technology 32nm
Specification Intel Xeon CPU E5-2670 0 @ 2.60GHz
L1 Data Cache Size 8 x 32 KBytes
L1 Instructions Cache Size 8 x 32 KBytes
L2 Unified Cache Size 8 x 256 KBytes
L3 Unified Cache Size 20480 KBytes
255 GB DDR3
All tasks ran for the same wall-clock time on the second machine (which ran under WSL2, by the way). The part that ran in the tasks took 52.2 s on 1 task, 26.4 s on 2 tasks, 13.2 s on 4 tasks, and 7.8 s on 8 tasks.
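(For reference, per-task timings like the ones shown in the logs below can be collected with a pattern along these lines; this is only a sketch, and `assemble_chunk!`, `buffers`, and `element_chunks` are placeholders, not the actual code.)

```julia
function timed_assembly(buffers, element_chunks)
    start = time()
    tasks = map(eachindex(element_chunks)) do t
        Threads.@spawn begin
            @info "$t: Before conductivity $(time() - start)"
            assemble_chunk!(buffers[t], element_chunks[t])  # placeholder for the chunk assembly
            @info "$t: After conductivity $(time() - start)"
        end
    end
    foreach(wait, tasks)
    @info "After sync $(time() - start)"
end
```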
Alas, I spoke too soon. The other machine displays similar random delays of individual tasks. Here is an example: the computation was run twice with 4 tasks. The first time all tasks finished in about the same time; the second time one task (#3) was significantly delayed.
Threads.nthreads() = 4
[ Info: All examples may be executed with
using .Main.Poisson_examples; Main.Poisson_examples.allrun()
First run:
Mesh generation
1000000 elements
Searching nodes for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.12100005149841
[ Info: 1: Before conductivity 2.9690001010894775
[ Info: 3: Before conductivity 2.9690001010894775
[ Info: 4: Before conductivity 2.9850001335144043
[ Info: 2: Before conductivity 3.0
[ Info: 2: After conductivity 15.765000104904175
[ Info: 1: After conductivity 15.781000137329102
[ Info: 3: After conductivity 15.812000036239624
[ Info: 4: After conductivity 15.858999967575073
[ Info: After sync 15.858999967575073
[ Info: All done 38.342000007629395
Second run:
Mesh generation
1000000 elements
Searching nodes for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.49599981307983
[ Info: 2: Before conductivity 2.8430001735687256
[ Info: 1: Before conductivity 2.8430001735687256
[ Info: 4: Before conductivity 2.8430001735687256
[ Info: 3: Before conductivity 2.8430001735687256
[ Info: 4: After conductivity 15.28000020980835
[ Info: 2: After conductivity 15.343000173568726
[ Info: 1: After conductivity 15.343000173568726
[ Info: 3: After conductivity 28.076000213623047
[ Info: After sync 28.076000213623047
[ Info: All done 51.075000047683716
What is happening? Perhaps the GC kicks in there for some reason? Or is the scheduler not working as well as it should?
It looks like the GC is not at fault here: I turned it off before the loop over the tasks (and back on after the loop), and the hanging task is still there, showing up randomly.
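(A sketch of that experiment, with placeholder names for the buffers, chunks, and the assembly routine; `GC.enable` simply toggles the collector.)

```julia
GC.enable(false)   # suspend garbage collection for the duration of the task loop
try
    tasks = [Threads.@spawn assemble_chunk!(buffers[t], element_chunks[t])
             for t in eachindex(element_chunks)]
    foreach(wait, tasks)
finally
    GC.enable(true) # restore garbage collection
end
```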
Just as a reference, Threaded Assembly · Ferrite.jl has an example of threaded assembly using Ferrite.jl. It uses a coloring algorithm (no elements of the same color share dofs), so that the assembly can run in parallel as well.
We approached the problem from different viewpoints: in your case the serial part is front-loaded (the creation of the sparsity pattern), while in my case the computation is partitioned with no data races, but the triplet then needs to be converted into a CSC matrix (the serial part).
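For contrast, here is a generic sketch of the coloring-based approach (not Ferrite's actual API): the sparsity pattern of `K` is created up front, which is the front-loaded serial part, and because elements of one color share no dofs, each color can be assembled directly into the global matrix in parallel without races. `element_matrix` is a hypothetical element routine.

```julia
using SparseArrays

function assemble_colored!(K::SparseMatrixCSC, colors, elements)
    # Assumes the sparsity pattern of K already contains every (gi, gj) entry,
    # so `+=` only updates stored values and never restructures the matrix.
    for color in colors                    # colors are processed one after another
        Threads.@threads for e in color    # elements of one color share no dofs
            Ke, dofs = element_matrix(elements[e])  # hypothetical element routine
            for (j, gj) in enumerate(dofs), (i, gi) in enumerate(dofs)
                K[gi, gj] += Ke[i, j]      # threads touch disjoint stored entries
            end
        end
    end
    return K
end
```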
I noticed that you get a similar sort of parallel scaling to what I do, which is good. But I wonder how you measured it, and whether you also see the irregularities in the wall-clock time consumed by individual threads?
Interesting! I implemented an alternative approach based on threads. This works quite well; at least the irregularities (slow tasks mixed in with fast tasks) are gone: all threads run in the same time.
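A minimal sketch of what such a thread-based variant can look like, assuming a static split of the elements over the threads; `assemble_chunk!` and the per-thread COO `buffers` are again placeholders.

```julia
function threaded_assembly!(buffers, elements)
    nchunks = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(elements), cld(length(elements), nchunks)))
    Threads.@threads :static for c in eachindex(chunks)
        # Each thread owns one chunk and one private COO buffer: no sharing, no races.
        assemble_chunk!(buffers[c], view(elements, chunks[c]))
    end
    return buffers
end
```

The `:static` schedule pins each iteration to a fixed thread for the duration of the loop, so there is no task migration while the chunks are being assembled.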
Here is the parallel speed-up for the thread implementation:
Note well that this is only the computation of the conductivity matrix and its assembly into the COO format.
Because the conversion into the CSC format is not parallelized, the overall speed-up is quite a bit worse.
Machine used in the above graph:
Same 2x Intel Xeon E5-2670 machine as listed earlier in the thread (Sandy Bridge-EP, 255 GB DDR3).
This machine was running WSL2 under Windows 10.
Linear heat conduction problem. 343000 serendipity quadratic elements with 3x3x3 Gauss quadrature.
Open question: what is wrong with the task-based implementation? Why are some tasks much slower than others in the same batch?