Thanks for investigating this issue further, Prof. Krysl. Sorry, I have not found time to come back to the issue linked by Kristoffer yet (but I am still following the thread). From a performance perspective, assembly on a single thread is more or less compute bound. Parallelizing assembly with a small number of threads should not be a big issue (given sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns, plus a memory bus that cannot keep up with the CPU's read/write accesses) and frequency boosting (i.e. at lower total load each core runs at a higher frequency). For some discussion I highly recommend the WorkStream paper (dx.doi.org/10.1145/2851488), because we basically reproduce Figure 4 from that paper. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.
Also, I think I solved the problem with the tasks: I need to start (N+1) threads in order to run N tasks. Apparently, one thread needs to remain the main one to carry the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still some variation, on the order of 10%, but nothing like the huge differences between fast and slow tasks before).
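To make the setup concrete, here is a minimal sketch of what I mean (assemble_chunk! and buffers are hypothetical placeholders, not my actual code): start Julia with N+1 threads, e.g. julia --threads=5 for N = 4 assembly tasks, and then

using Base.Threads
nelements = 8_000_000            # hypothetical problem size
N = nthreads() - 1               # reserve one thread for the main task
chunks = collect(Iterators.partition(1:nelements, cld(nelements, N)))
# assemble_chunk! and buffers stand in for the per-task assembly work:
tasks = map(1:N) do i
    @spawn assemble_chunk!(buffers[i], chunks[i])
end
foreach(wait, tasks)             # the main task only coordinates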
This is the speed-up on Firenze (Windows 10) with tasks.
Alas, I spoke too soon: the problem of the slow tasks is still not solved. On Horntail, even when starting N+1 (or even N+8) threads for N tasks, some of the tasks are still delayed by a significant factor (100-200%) relative to others.
I see similar irregularities in the time consumed per task (a random subset of the tasks takes much longer than the others, often double the time) on an Apple Mac M2 Ultra. Here is the versioninfo() output:
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 24 × Apple M2 Ultra
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 on 16 virtual cores
I don’t know what it is, but I suspect task scheduling. An issue has been filed.
Just to be clear here, and to avoid potential confusion: your @threads-based implementation also creates tasks under the hood. Roughly speaking, @threads is the same as dividing the iteration interval into nthreads() parts and then @spawning a task per chunk (i.e. one thread runs one task corresponding to one chunk).
AFAICS, you’re trying to do the same in your “task variant”, i.e. chunking with chunks and @spawning the tasks manually.
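In other words, a rough sketch of the equivalence (simplified; the actual @threads :dynamic implementation differs in the details):

using Base.Threads
# Split the range into nthreads() chunks and spawn one task per chunk,
# which is approximately what @threads does under the hood:
function threads_like(f, range)
    chunksize = cld(length(range), nthreads())
    tasks = map(Iterators.partition(range, chunksize)) do chunk
        @spawn foreach(f, chunk)
    end
    foreach(wait, tasks)
end
# threads_like(i -> work(i), 1:n) then corresponds roughly to
# @threads for i in 1:n; work(i); end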
Am I understanding correctly that the “thread variant” behaves correctly but the “task variant” does not? Given the above, if you see (vastly) different behavior between the two variants, you should check what you’re doing differently in the two cases. For example, you seem to be running (i.e. allocating)
_a = _task_local_assembler(a, b)
on each task in the “task variant” but outside of the @threads loop in the “threads variant”.
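Schematically, the difference looks like this (only the _task_local_assembler call is from your code; chunks, a, b, and assemble! are placeholders):

# “threads variant”: assembler allocated once, outside the loop
_a = _task_local_assembler(a, b)
Threads.@threads for chunk in chunks
    assemble!(_a, chunk)
end

# “task variant”: each spawned task allocates its own assembler
tasks = map(chunks) do chunk
    Threads.@spawn begin
        _a = _task_local_assembler(a, b)   # per-task allocation
        assemble!(_a, chunk)
    end
end
foreach(wait, tasks)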
At the risk of coming off as nitpicky (but to be more precise for beginners who might be reading this): Both implementations are creating tasks which then run on multiple threads. Hence, they are both “task-based” and “threads-based”.
What is (likely) meant here is that one implementation uses @spawn whereas the other uses @threads.
Thank you for honing my description (nitpicky is good!). It is as you describe.
In the “spawn” case I create the tasks explicitly, whereas the “threads” variant does whatever it does under the hood to process its chunks.
The parallel speedups of the assembly on the Mac M2 Ultra with 2, 4, 8, and 16 assembly tasks, for a mesh of 8 million finite elements:
2: 1.9, 4: 3.49, 8: 6.53, 16: 11.08.
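Dividing the speedups by the task counts, the parallel efficiency drops from roughly 95% at 2 tasks to about 69% at 16:

ntasks     = [2, 4, 8, 16]
speedup    = [1.9, 3.49, 6.53, 11.08]
efficiency = speedup ./ ntasks     # ≈ [0.95, 0.87, 0.82, 0.69]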