To be more specific: I didn’t change anything relevant in the implementation. The thing is that all allocations and GC occur in the first part, which is relatively fast for a small number of threads.
However, with many cores, the first part becomes limiting, because the slow one scales really well.
What I did now is to split the first part into two: one where allocations and GC take place, and a second that assumes the buffers are allocated.
This clearly identifies the scaling problems with the allocations and GC of the first step, which makes your very first hints very accurate… and that clearly gives me a path to potentially improve the code.
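
For anyone following along, the split described above can be sketched roughly as below. This is only an illustration and not my actual code: it is written in Python, it is single-threaded, and the names `allocate_buffers`, `fill_buffers`, `timed` and `run_split` are all made up for the example. The point is just that timing the allocation phase separately from the phase that reuses the buffers shows which one stops scaling.

```python
import time


def allocate_buffers(n_tasks, size):
    """Phase 1a (hypothetical): create every buffer up front.

    All container allocations, and so most of the GC pressure, live here.
    """
    return [[0.0] * size for _ in range(n_tasks)]


def fill_buffers(buffers):
    """Phase 1b (hypothetical): assumes the buffers already exist.

    Writes into the existing lists in place and creates no new buffers.
    """
    for i, buf in enumerate(buffers):
        for j in range(len(buf)):
            buf[j] = float(i + j)
    return buffers


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_split(n_tasks=4, size=1000):
    """Time the two phases separately so each can be inspected on its own."""
    buffers, t_alloc = timed(allocate_buffers, n_tasks, size)
    _, t_fill = timed(fill_buffers, buffers)
    return buffers, {"alloc": t_alloc, "fill": t_fill}


if __name__ == "__main__":
    _, timings = run_split()
    for phase, seconds in timings.items():
        print(f"{phase}: {seconds:.6f} s")
```

Repeating the measurement for different thread counts, in whatever threading setup the real code uses, should then show whether the "alloc" phase is the one that stops improving.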