Parallel assembly of a finite element sparse matrix

Yes, in this case. It could be the number of cores…

Well, the Xeon does have 8 cores, so … you'd think it would be good up to 7 :wink:

Really interesting data points that you have provided. I really like to see these kinds of metrics on the Julia forum!

Edit: math is hard. 2x Xeon = 16 cores. So 1 for Windows and 1 for the virus scanner in Windows leaves 14.


Actually, each of the sockets has 8 cores. 16 in total then. I don’t have an explanation for the poor performance.

Thanks for investigating this issue further, Prof. Krysl. Sorry, I could not find time to come back to the issue linked by Kristoffer yet (but I still follow the thread). From a performance perspective, the assembly on a single thread is more or less compute bound. Parallelizing assembly with a low number of threads should not be a big issue (given sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns, plus a memory bus that cannot keep up with the CPUs' read/write accesses) and frequency boosting (i.e. at lower total load each core runs at a higher frequency). For some discussion I highly recommend the WorkStream paper, because we basically reproduce Figure 4 from it. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.

Another relevant thread is "How to achieve perfect scaling with Threads (Julia 1.7.1)" (#10 by carstenbauer), which discusses some of the mentioned problems in more detail.
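One way to probe the bandwidth hypothesis, independent of the finite element code, might be a STREAM-like triad run with an increasing number of tasks. This is only a sketch: the function name `triad!` and the chunking scheme are made up for illustration and are not taken from the assembly code discussed here.

```julia
using Base.Threads

# Hypothetical helper: computes a = b + s*c, with the index range split
# into `ntasks` contiguous chunks, one spawned task per chunk.
function triad!(a, b, c, s, ntasks)
    n = length(a)
    chunk = cld(n, ntasks)              # chunk length, rounded up
    @sync for t in 1:ntasks
        lo = (t - 1) * chunk + 1
        hi = min(t * chunk, n)          # an empty range if lo > hi
        Threads.@spawn begin
            @inbounds for i in lo:hi
                a[i] = b[i] + s * c[i]
            end
        end
    end
    return a
end

# Rough bandwidth estimate in GB/s: three arrays of Float64 are touched per call.
function triad_bandwidth(n, ntasks)
    a = zeros(n); b = ones(n); c = fill(2.0, n)
    triad!(a, b, c, 3.0, ntasks)        # warm-up call, so compilation is not timed
    t = @elapsed triad!(a, b, c, 3.0, ntasks)
    return 3 * 8 * n / t / 1e9
end
```

Calling `triad_bandwidth(10^8, k)` for `k = 1, 2, 4, 8, 16` and checking whether the number stops growing well before the core count is reached would be one indication that the memory bus, rather than the compute, is the limit. The array size is an assumption and should be chosen much larger than the caches.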


The machine Firenze with Windows 10 and Julia 1.8.5.
Better speedups than with the WSL2 setup.

Also, I think I solved the problem with the tasks: I think I need to start (N+1) threads in order to have N tasks. Apparently, one thread needs to be the main one that carries the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still a bit of variation, on the order of 10%, but nothing like the huge differences between fast and slow tasks before).

This is the speedup on Firenze/Windows 10/with tasks.

Edit: The tasks take about 0.03 seconds to start and another 0.03 seconds to spawn. So that can add up.
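To see how much of that is the bare cost of spawning, as opposed to one-time costs such as compilation, one could time empty tasks in isolation. A minimal sketch, where `spawn_overhead` is a hypothetical helper name and not part of any package:

```julia
using Base.Threads

# Hypothetical helper: average time in seconds to spawn one empty task
# and wait for it to finish.
function spawn_overhead(nreps)
    fetch(Threads.@spawn nothing)       # warm-up, so compilation is not timed
    t0 = time_ns()
    for _ in 1:nreps
        wait(Threads.@spawn nothing)
    end
    return (time_ns() - t0) / nreps / 1e9
end
```

Comparing `spawn_overhead(1000)` with the delays seen in the assembly run may show whether the 0.03 seconds comes from the scheduler or from something else in the task body.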

Alas, I spoke too soon. No, the problem of the slow tasks is still not solved. On the Horntail, even when starting N+1 (or even N+8) threads for N tasks, some of the tasks are still delayed by a significant factor (100-200%) relative to others.

How do I debug this?
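A possible first step would be to record, for each task, when it actually started and how long it ran, so that tasks that start late can be told apart from tasks that run slowly. The following is a sketch only: `timed_tasks` and the `work` argument are hypothetical stand-ins for the real per-task assembly function.

```julia
using Base.Threads

# Hypothetical helper: runs `work(t)` in `ntasks` spawned tasks and returns,
# for each task, the thread id, the delay before it started, and its run time.
function timed_tasks(work, ntasks)
    t0 = time_ns()
    tasks = map(1:ntasks) do t
        Threads.@spawn begin
            tstart = time_ns()
            work(t)
            tstop = time_ns()
            # The thread id is read at the end of the task; a task may have
            # migrated between threads while it ran.
            (task = t, thread = Threads.threadid(),
             start_delay = (tstart - t0) / 1e9,
             run_time = (tstop - tstart) / 1e9)
        end
    end
    return fetch.(tasks)
end

# Usage with a dummy workload (an assumption, in place of the assembly):
# foreach(println, timed_tasks(t -> sum(sqrt, 1:10^7), 4))
```

If the slow tasks show a large `start_delay` but a normal `run_time`, the scheduler would be the suspect. If `run_time` itself differs by 100-200%, that points more toward memory or cache effects, or toward several tasks sharing one thread.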