There was a thread recently (Behavior of threads) concerning a problem of interest to me: how to teach a serial finite element code to perform the assembly of a global sparse matrix in parallel (on shared-memory computers).
There were several good ideas (many thanks!), which in the end served as the basis of the implementation demonstrated here on a heat conduction problem.
Some salient observations:
The serial code required no modifications.
The implementation is based on tasks. There was no need to know which task was running the code.
A handful of serial pieces of computation remain, which of course limits the achievable parallel speed-up. In particular, the construction of the sparse matrix from the triplet of vectors does not run in parallel (see the sketch below).
There is also a puzzle: sometimes a few of the tasks take quite a bit longer than the others. For instance, running with sixteen tasks, 5 tasks take ~11 seconds each, while the other 11 tasks take over 16 seconds each. This is on a machine with 64 cores and plenty of memory, and, of course, with a perfectly balanced load.
Any ideas on how to address the serial parts of the algorithm would be appreciated, and collaboration to implement the most promising leads would be welcome.
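To make the structure concrete, here is a minimal sketch of this kind of task-based assembly, under the assumption that the elements have been split into chunks up front. The names `assemble_chunk!` and `element_chunks` are hypothetical placeholders for the unmodified serial assembly routine and for the partitioning; only the final `sparse()` call, which builds the CSC matrix from the merged triplets, is serial.

```julia
using SparseArrays

function parallel_assembly(element_chunks, ndofs)
    # Spawn one task per chunk; each task fills its own private COO buffers,
    # so no locking is needed during the element loop.
    tasks = map(eachindex(element_chunks)) do t
        Threads.@spawn begin
            I, J, V = Int[], Int[], Float64[]
            assemble_chunk!(I, J, V, element_chunks[t])  # unmodified serial FE code (placeholder)
            (I, J, V)
        end
    end
    triplets = fetch.(tasks)
    # Serial part: merge the per-task triplets and build the CSC matrix.
    I = reduce(vcat, (trip[1] for trip in triplets))
    J = reduce(vcat, (trip[2] for trip in triplets))
    V = reduce(vcat, (trip[3] for trip in triplets))
    return sparse(I, J, V, ndofs, ndofs)
end
```

With this structure the serial element routine is reused verbatim inside each task; the remaining serial bottleneck is the merge plus the `sparse()` call.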
The computation scaled much better on this machine:
2x Intel Xeon E5-2670
Cores 8
Code Name Sandy Bridge-EP/EX
Package Socket 2011 LGA
Technology 32nm
Specification Intel Xeon CPU E5-2670 0 @ 2.60GHz
L1 Data Cache Size 8 x 32 KBytes
L1 Instructions Cache Size 8 x 32 KBytes
L2 Unified Cache Size 8 x 256 KBytes
L3 Unified Cache Size 20480 KBytes
255 GB DDR3
All tasks ran for the same wall-clock time on the second machine (which ran under WSL2, by the way). The part that ran in the tasks took 52.2 s on 1 task, 26.4 s on 2 tasks, 13.2 s on 4 tasks, and 7.8 s on 8 tasks.
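(For reference, per-task timings like the ones shown in the logs below can be collected with a pattern along these lines; this is only a sketch, and `assemble_chunk!`, `buffers`, and `element_chunks` are placeholders, not the actual code.)

```julia
function timed_assembly(buffers, element_chunks)
    start = time()
    tasks = map(eachindex(element_chunks)) do t
        Threads.@spawn begin
            @info "$t: Before conductivity $(time() - start)"
            assemble_chunk!(buffers[t], element_chunks[t])  # placeholder for the chunk assembly
            @info "$t: After conductivity $(time() - start)"
        end
    end
    foreach(wait, tasks)
    @info "After sync $(time() - start)"
end
```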
Alas, I spoke too soon. The other machine displays similar random delays of individual tasks. Here is an example: the computation was run twice with 4 tasks. The first time all tasks finished in about the same time; the second time one task (#3) was significantly delayed.
Threads.nthreads() = 4
[ Info: All examples may be executed with
using .Main.Poisson_examples; Main.Poisson_examples.allrun()
First run:
Mesh generation
1000000 elements
Searching nodes for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.12100005149841
[ Info: 1: Before conductivity 2.9690001010894775
[ Info: 3: Before conductivity 2.9690001010894775
[ Info: 4: Before conductivity 2.9850001335144043
[ Info: 2: Before conductivity 3.0
[ Info: 2: After conductivity 15.765000104904175
[ Info: 1: After conductivity 15.781000137329102
[ Info: 3: After conductivity 15.812000036239624
[ Info: 4: After conductivity 15.858999967575073
[ Info: After sync 15.858999967575073
[ Info: All done 38.342000007629395
Second run:
Mesh generation
1000000 elements
Searching nodes for BC
Number of free degrees of freedom: 3910599
Conductivity
LinearAlgebra.BLAS.get_num_threads() = 1
[ Info: All done serial 73.49599981307983
[ Info: 2: Before conductivity 2.8430001735687256
[ Info: 1: Before conductivity 2.8430001735687256
[ Info: 4: Before conductivity 2.8430001735687256
[ Info: 3: Before conductivity 2.8430001735687256
[ Info: 4: After conductivity 15.28000020980835
[ Info: 2: After conductivity 15.343000173568726
[ Info: 1: After conductivity 15.343000173568726
[ Info: 3: After conductivity 28.076000213623047
[ Info: After sync 28.076000213623047
[ Info: All done 51.075000047683716
What is happening? Perhaps the GC kicks in there for some reason? Or is the scheduler not working as well as it should?
It looks like the GC is not at fault here: I turned it off before the loop over the tasks (and back on after the loop), and the hanging task is still there, showing up randomly.
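(A sketch of that experiment, with placeholder names for the buffers, chunks, and the assembly routine; `GC.enable` simply toggles the collector.)

```julia
GC.enable(false)   # suspend garbage collection for the duration of the task loop
try
    tasks = [Threads.@spawn assemble_chunk!(buffers[t], element_chunks[t])
             for t in eachindex(element_chunks)]
    foreach(wait, tasks)
finally
    GC.enable(true) # restore garbage collection
end
```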
Just as a reference, Threaded Assembly · Ferrite.jl has an example of threaded assembly using Ferrite.jl. It uses a coloring algorithm (no elements of the same color share dofs), so that the assembly can run in parallel as well.
We approached the problem from different viewpoints: in your case the serial part is front-loaded (the creation of the sparsity pattern), while in my case the computation is partitioned with no data races, but the triplet then needs to be converted into a CSC matrix (the serial part).
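For contrast, here is a generic sketch of the coloring-based approach (not Ferrite's actual API): the sparsity pattern of `K` is created up front, which is the front-loaded serial part, and because elements of one color share no dofs, each color can be assembled directly into the global matrix in parallel without races. `element_matrix` is a hypothetical element routine.

```julia
using SparseArrays

function assemble_colored!(K::SparseMatrixCSC, colors, elements)
    # Assumes the sparsity pattern of K already contains every (gi, gj) entry,
    # so `+=` only updates stored values and never restructures the matrix.
    for color in colors                    # colors are processed one after another
        Threads.@threads for e in color    # elements of one color share no dofs
            Ke, dofs = element_matrix(elements[e])  # hypothetical element routine
            for (j, gj) in enumerate(dofs), (i, gi) in enumerate(dofs)
                K[gi, gj] += Ke[i, j]      # threads touch disjoint stored entries
            end
        end
    end
    return K
end
```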
I noticed that you get a similar sort of parallel scaling to what I do, which is good. But I wonder how you measured it, and whether you also see the irregularities in the wall-clock time consumed by individual threads?
Interesting! I implemented an alternative approach based on threads. This works quite well; at least the irregularities (slow tasks mixed in with fast tasks) are gone: all threads run in the same time.
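A minimal sketch of what such a thread-based variant can look like, assuming a static split of the elements over the threads; `assemble_chunk!` and the per-thread COO `buffers` are again placeholders.

```julia
function threaded_assembly!(buffers, elements)
    nchunks = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(elements), cld(length(elements), nchunks)))
    Threads.@threads :static for c in eachindex(chunks)
        # Each thread owns one chunk and one private COO buffer: no sharing, no races.
        assemble_chunk!(buffers[c], view(elements, chunks[c]))
    end
    return buffers
end
```

The `:static` schedule pins each iteration to a fixed thread for the duration of the loop, so there is no task migration while the chunks are being assembled.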
Here is the parallel speed-up for the thread implementation:
Note well that this is only the computation of the conductivity matrix and its assembly into the COO format.
Because the conversion into the CSC format is not parallelized, the overall speed-up is quite a bit worse.
Machine used in the above graph:
Same 2x Intel Xeon E5-2670 machine as listed earlier in the thread (Sandy Bridge-EP, 255 GB DDR3).
This machine was running WSL2 under Windows 10.
Linear heat conduction problem. 343000 serendipity quadratic elements with 3x3x3 Gauss quadrature.
Open question: what is wrong with the task-based implementation? Why are some tasks much slower than others in the same batch?