Parallel assembly of a finite element sparse matrix

It seems I have found a possible reason why the tasks are sometimes delayed, and also a possible solution. See the GitHub issue "Tasks with the same workloads sometimes finish in much longer time than others" (Issue #53269, JuliaLang/julia).

Parallel speedups of the assembly on the Mac M2 Ultra with 2, 4, 8, and 16 assembly tasks, for a mesh of 8 million finite elements (number of tasks: speedup):
2: 1.9, 4: 3.49, 8: 6.53, 16: 11.08.
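
To put these numbers in perspective, here is a minimal sketch that converts each speedup into a parallel efficiency (speedup divided by the number of tasks). It is written in Python only for brevity. The function name `parallel_efficiency` is made up for this illustration and is not part of the assembly code. The efficiencies come out at roughly 95%, 87%, 82%, and 69%.

```python
# Speedups reported above, keyed by the number of assembly tasks.
speedups = {2: 1.9, 4: 3.49, 8: 6.53, 16: 11.08}


def parallel_efficiency(speedups):
    """Return speedup / task count for each entry (1.0 would be ideal scaling)."""
    return {ntasks: s / ntasks for ntasks, s in speedups.items()}


efficiency = parallel_efficiency(speedups)
for ntasks, eff in efficiency.items():
    print(f"{ntasks:2d} tasks: speedup {speedups[ntasks]:5.2f}, efficiency {eff:.0%}")
```

So the scaling stays above 80% efficiency up to 8 tasks and drops to about 69% at 16 tasks.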