Interesting! I implemented an alternative based on threads. This works quite well; at least the irregularities (slow tasks mixed in with fast tasks) are gone: all threads take about the same time.
Here is the parallel speedup for the thread implementation:
Note well that this covers only the computation of the conductivity matrix and its assembly into the COO format.
Because the conversion to the CSC format is not parallelized, the overall speedup is quite a bit worse.
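To make the split between the two phases concrete, here is a minimal sketch. It is not the FinEtools code: `elementmatrix` and `assemble_coo` are hypothetical names, and a trivial 1-D two-node element stands in for the real conductivity computation. The COO triplets are filled in parallel, one contiguous chunk of elements per loop iteration, and the conversion to CSC is then a single serial call to `sparse`.

```julia
using SparseArrays

# Hypothetical stand-in for the element conductivity matrix:
# a 1-D two-node element with unit conductivity and unit length.
elementmatrix(e) = [1.0 -1.0; -1.0 1.0]

# Fill the COO triplets in parallel. Each element owns a fixed slice of
# 4 entries in the preallocated buffers, so chunks never write to the
# same location and no locking is needed.
function assemble_coo(nelems::Int, nchunks::Int = Threads.nthreads())
    rows = Vector{Int}(undef, 4 * nelems)
    cols = Vector{Int}(undef, 4 * nelems)
    vals = Vector{Float64}(undef, 4 * nelems)
    chunksize = cld(nelems, nchunks)
    Threads.@threads for c in 1:nchunks
        first_e = (c - 1) * chunksize + 1
        last_e = min(c * chunksize, nelems)
        for e in first_e:last_e
            Ke = elementmatrix(e)
            dofs = (e, e + 1)      # global degrees of freedom of element e
            p = 4 * (e - 1)        # offset of this element's slice
            for b in 1:2, a in 1:2
                p += 1
                rows[p] = dofs[a]
                cols[p] = dofs[b]
                vals[p] = Ke[a, b]
            end
        end
    end
    return rows, cols, vals
end

nelems = 1000
rows, cols, vals = assemble_coo(nelems)

# Serial step: build the CSC matrix; duplicate (row, col) entries are summed.
K = sparse(rows, cols, vals, nelems + 1, nelems + 1)
```

In this sketch only `assemble_coo` benefits from more threads. If a fraction s of the total run time is spent in the serial `sparse` call, the overall speedup on p threads is bounded by 1 / (s + (1 - s)/p) (Amdahl's law), which is why the end-to-end number looks worse than the assembly-only one.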
Machine used in the above graph:
CPU: 2x Intel Xeon E5 2670
Cores: 8
Code Name: Sandy Bridge-EP/EX
Package: Socket 2011 LGA
Technology: 32nm
Specification: Intel Xeon CPU E5-2670 0 @ 2.60GHz
L1 Data Cache Size: 8 x 32 KBytes
L1 Instructions Cache Size: 8 x 32 KBytes
L2 Unified Cache Size: 8 x 256 KBytes
L3 Unified Cache Size: 20480 KBytes
Memory: 255 GB DDR3
This machine was running WSL2 under Windows 10.
The problem solved is linear heat conduction: 343000 quadratic serendipity elements with 3x3x3 Gauss quadrature.
Open question: what is wrong with the task-based implementation? Why are some tasks much slower than others in the same batch?
References:
The task loop: https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl/blob/d041cd06035547e7bdb1422a94daf006594f1393/examples/steady_state/3-d/Poisson_examples.jl#L336
The thread loop: https://github.com/PetrKryslUCSD/FinEtoolsHeatDiff.jl/blob/d041cd06035547e7bdb1422a94daf006594f1393/examples/steady_state/3-d/Poisson_examples.jl#L479
