Parallel assembly of a finite element sparse matrix

Yes, in this case. It could be the number of cores…

Well the XEON does have 8 cores, so … you think it would be good up to 7 :wink:

Really interesting data points that you have provided. I really like to see these kinds of metrics in the Julia forum!

edit: math is hard. 2x xeon = 16. so 1 for windows, and 1 for the virus scanner in windows leaves 14


Actually, each of the sockets has 8 cores. 16 in total then. I don’t have an explanation for the poor performance.

Thanks for investigating this issue further, Prof. Krysl. Sorry, I have not yet found time to come back to the issue linked by Kristoffer (but I still follow the thread). From a performance perspective, the assembly on a single thread is more or less compute bound. Parallelizing assembly with a low number of threads should not be a big issue (with sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns, plus a memory bus that cannot keep up with the CPUs' read/write accesses) and frequency boosting (i.e. at lower total load each core runs at a higher frequency). For some discussion I highly recommend the WorkStream paper, because we basically reproduce Figure 4 from it. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.
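To make the compute-bound/bandwidth-bound discussion concrete, here is a minimal sketch of lock-free parallel COO assembly (a toy stand-in, not the actual code under discussion): each task fills its own triplet buffers, so no locking is needed, and `sparse` merges duplicate entries at the end. The connectivity and the 2×2 element matrix are dummies.

```julia
using SparseArrays, Base.Threads

# Sketch of parallel COO assembly: each task fills its own triplet
# buffers (no shared mutable state), then sparse() sums duplicates.
# The connectivity and element "stiffness" are placeholders for a
# real finite element routine.
function parallel_assemble(nelems::Int, nnodes::Int, ntasks::Int)
    chunks = collect(Iterators.partition(1:nelems, cld(nelems, ntasks)))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            I = Int[]; J = Int[]; V = Float64[]   # task-local buffers
            for e in chunk
                n1 = mod1(e, nnodes); n2 = mod1(e + 1, nnodes)  # fake connectivity
                ke = [1.0 -1.0; -1.0 1.0]                        # dummy element matrix
                for (a, ga) in enumerate((n1, n2)), (b, gb) in enumerate((n1, n2))
                    push!(I, ga); push!(J, gb); push!(V, ke[a, b])
                end
            end
            (I, J, V)
        end
    end
    parts = fetch.(tasks)
    sparse(vcat(first.(parts)...), vcat(getindex.(parts, 2)...),
           vcat(last.(parts)...), nnodes, nnodes)
end

K = parallel_assemble(10_000, 1_000, 4)
```

Because every task writes only into its own buffers, this pattern stresses memory bandwidth during the final `vcat`/`sparse` merge rather than during element evaluation, which is where the scaling questions above come in.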

Another relevant thread is How to achieve perfect scaling with Threads (Julia 1.7.1) - #10 by carstenbauer which discusses some of the mentioned problems in more detail.


The machine Firenze with Windows 10 and Julia 1.8.5.
Better speed-ups than the WSL2 setup.

Also, I think I solved the problem with the tasks: I need to start (N+1) threads in order to run N tasks. Apparently, one thread needs to remain the main thread that carries the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still some variation, on the order of 10%, but nothing like the huge differences between fast and slow tasks before).

This is the speed up of Firenze/Windows 10/with tasks.

Edit: The tasks take about 0.03 sec to start and another 0.03 seconds to spawn. So that can add up.
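For anyone who wants to reproduce this kind of per-task measurement, here is a sketch in which each spawned task times itself, so uneven finishing times show up directly (the workload is a dummy floating-point loop, not real assembly):

```julia
using Base.Threads

# Sketch: spawn N tasks with identical dummy workloads and record each
# task's own elapsed wall time, to see whether equal workloads finish
# in roughly equal time.
function task_times(ntasks::Int, work::Int = 10^7)
    tasks = [Threads.@spawn begin
                 t0 = time_ns()
                 s = 0.0
                 for i in 1:work
                     s += sin(i)  # placeholder workload
                 end
                 (s, (time_ns() - t0) / 1e9)   # keep s so the loop is not elided
             end for _ in 1:ntasks]
    [fetch(t)[2] for t in tasks]
end

times = task_times(4)
```

Comparing `maximum(times) / minimum(times)` across runs gives a quick measure of the fast-task/slow-task spread described above.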

Alas, I spoke too soon. No, the problem of the slow tasks is still not solved. On the Horntail, even starting N+1 (or even N+8) threads for N tasks, some of the tasks are still delayed by a significant factor (100 to 200%) relative to others.

How do I debug this?

I see similar irregularities in the task time consumed (some tasks, a random number of them, will take much longer than others, often double the time) on Apple Mac M2 Ultra.

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 × Apple M2 Ultra
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores

I don’t know what it is, but I suspect task scheduling. An issue has been filed.


Greetings. I like your work very much. Is the sparse matrix assembly still running in parallel here?

Just to be clear here and to avoid potential confusion, your @threads based implementation also creates tasks under the hood. Roughly speaking @threads is the same as dividing the iteration interval into nthreads() parts and then @spawning a task per chunk (i.e. one thread runs one task corresponding to one chunk).

AFAICS, you’re trying to do the same in your “task variant”, i.e. chunking with chunks and @spawning the tasks manually.
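For readers following along, here is a rough sketch of the equivalence just described (a simplification, not Julia's actual `@threads` implementation): manually chunking and `@spawn`ing behaves much like `@threads :static`, where each thread runs one task covering one chunk.

```julia
using Base.Threads

# Rough sketch of what @threads :static does under the hood:
# split the range into nthreads() chunks and run one task per chunk.
function sum_spawned(xs)
    n = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), n)))
    tasks = [Threads.@spawn sum(@view xs[c]) for c in chunks]
    sum(fetch.(tasks))
end

function sum_threaded(xs)
    partials = zeros(eltype(xs), Threads.nthreads())
    Threads.@threads :static for i in eachindex(xs)
        # :static pins one chunk per thread, so indexing by threadid()
        # is safe here (it is NOT safe with the default scheduler)
        partials[Threads.threadid()] += xs[i]
    end
    sum(partials)
end

xs = collect(1.0:1000.0)
```

Both variants should return the same result; the point is that the "thread variant" and the "task variant" differ in bookkeeping, not in the underlying task machinery.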

Am I understanding correctly that the “thread variant” behaves correctly but the “task variant” does not? Given the above, if you see (vastly) different behavior between the two variants, you should check what you are doing differently in each case. For example, you seem to be running (i.e. allocating)

_a = _task_local_assembler(a, b)

on each task in the “task variant” but outside of the @threads loop in the “threads variant”.
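To illustrate the distinction, here is a small sketch of the task-local-buffer pattern (the scratch buffer stands in for whatever `_task_local_assembler` returns; this is an illustration, not the code under discussion). Allocating the buffer once outside the parallel region and sharing it across tasks would be a data race; allocating it inside each task is safe but adds per-task allocation cost.

```julia
using Base.Threads

# Sketch: each spawned task allocates its own scratch buffer, so tasks
# never share mutable state. Task k fills its buffer with k*i and
# returns the sum.
function fill_with_task_local_buffers(n, ntasks)
    tasks = map(1:ntasks) do k
        Threads.@spawn begin
            buf = zeros(n)          # task-local scratch, one per task
            for i in 1:n
                buf[i] = k * i
            end
            sum(buf)
        end
    end
    fetch.(tasks)
end

sums = fill_with_task_local_buffers(100, 4)
```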

Yes, indeed. There are two implementations here: task-based and threads-based.

At the risk of coming off as nitpicky (but to be more precise for beginners who might be reading this): Both implementations are creating tasks which then run on multiple threads. Hence, they are both “task-based” and “threads-based”.
What is (likely) meant here is that one implementation uses @spawn whereas the other uses @threads.

Thank you for honing my description (nitpicky is good!). It is as you describe.
In the “spawn” case I create the tasks explicitly, whereas “threads” does whatever it does under the hood to process its chunks.

It would seem I found a possible reason why the tasks are sometimes delayed, and also a possible solution: Tasks with the same workloads sometimes finish in much longer time than others · Issue #53269 · JuliaLang/julia · GitHub

The parallel speedups on the Mac M2 Ultra of the assembly with 2, 4, 8, and 16 assembly tasks for an 8-million-element finite element mesh:
2: 1.9, 4: 3.49, 8: 6.53, 16: 11.08.
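Converting those speedups into parallel efficiency (speedup divided by task count) makes the scaling trend easier to read; this tiny snippet just does that arithmetic on the numbers reported above.

```julia
# Parallel efficiency (speedup / ntasks) for the Mac M2 Ultra
# measurements reported above.
ntasks     = [2, 4, 8, 16]
speedup    = [1.9, 3.49, 6.53, 11.08]
efficiency = speedup ./ ntasks
# roughly 95%, 87%, 82%, 69% -- efficiency falls off as more tasks
# contend for the shared memory subsystem, consistent with the
# bandwidth discussion earlier in the thread.
```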