Parallel assembly of a finite element sparse matrix

Thanks for investigating this issue further Prof. Krysl. Sorry, I could not find time to come back to the issue linked by Kristoffer yet (but I still follow the thread). From a performance perspective the assembly on a single thread is more or less compute bound. Parallelizing assembly with a low number of threads should not be a big issue (with sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns+memory bus cannot keep up with the CPUs read/write access) and frequency boosting (i.e. at lower total load each core has higher frequency). Fore some discussion I highly recommend the WorkStream paper (dx.doi.org/10.1145/2851488), because we basically reproduce Figure 4 from this paper. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.

Another relevant thread is How to achieve perfect scaling with Threads (Julia 1.7.1) - #10 by carstenbauer which discusses some of the mentioned problems in more detail.

1 Like