FYI, with Transducers.jl (see "Thread- and process-based parallelisms in Transducers.jl (+ some news)"), it's reduce(+, Map(identity), x; basesize=length(x) ÷ nChunks).
I find this kind of approach limiting, as it's impossible to write a parallel version of sum(f, xs) this way without relying on compiler internals (aka Core.Compiler.return_type).
I also think this pattern would cause false sharing, which could be bad for performance.
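
To make the sum(f, xs) point concrete, here is a minimal sketch using only Base, assuming the pattern under discussion pre-allocates one accumulator slot per thread (which is what forces you to know the result type of f up front). The name chunked_sum and the nchunks keyword are hypothetical, not part of Base or Transducers.jl. Each task reduces its own chunk into a local value and returns it, so no accumulator array has to be typed in advance and no neighbouring slots are written repeatedly from different threads.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper for illustration only; not an API of Base or Transducers.jl.
function chunked_sum(f, xs::AbstractVector; nchunks::Integer = nthreads())
    # An empty input would need an explicit initial value, which is exactly
    # the "what is the result type?" problem, so this sketch rejects it.
    isempty(xs) && throw(ArgumentError("chunked_sum needs a non-empty input"))
    chunksize = cld(length(xs), nchunks)
    # One task per contiguous chunk of indices; each task accumulates locally.
    tasks = map(Iterators.partition(eachindex(xs), chunksize)) do idxs
        @spawn sum(f, view(xs, idxs))
    end
    # Combine the per-chunk results; their type is discovered at run time.
    return mapreduce(fetch, +, tasks)
end

chunked_sum(abs2, [1.0, 2.0, 3.0]; nchunks = 2)
```

The trade-off is that the final combine step is dynamically typed, but it only runs over nchunks values, so that should rarely matter in practice.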