What's the most efficient way to sum the results of multithreading?


I need to sum the outputs of each thread. Is there a standard/efficient way to do this?
for example:

X = Matrix{Float64}(undef, d, N)
Threads.@threads for k = 1:N
    X[:, k] .= some_calculations()
end
sum!(zeros(d), X)  # row sums

Preferably I'd like to avoid initialising X at all, since I only need the row sums.
Sorry if this is a stupid question. Any help would be much appreciated!
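One standard pattern is to give each task its own accumulator and combine the partial sums at the end, which avoids both the d×N matrix and data races. A minimal sketch (the helper name threaded_rowsum and the placeholder f are mine, not from any package):

```julia
# Each task accumulates a private length-d vector over its chunk of 1:N;
# the partial sums are combined with reduce(+, ...) at the end,
# so no d×N matrix is ever materialised.
function threaded_rowsum(f, d, N; ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:N, cld(N, ntasks))
    tasks = map(chunks) do ks
        Threads.@spawn begin
            acc = zeros(d)
            for k in ks
                acc .+= f(k)   # f(k) plays the role of some_calculations()
            end
            acc
        end
    end
    return reduce(+, map(fetch, tasks))
end
```

Calling threaded_rowsum(k -> some_calculations(), d, N) would then give the row sums directly.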

I think the easiest way is to use ThreadsX.mapreduce: GitHub - tkf/ThreadsX.jl: Parallelized Base functions
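Applied to the original question, that could look like the following sketch (some_calculations, d, and N here are stand-ins for the real per-column work and sizes):

```julia
using ThreadsX

d, N = 4, 100
some_calculations(k) = fill(Float64(k), d)  # placeholder for the real work

# Sum the per-iteration vectors in parallel; no d×N matrix is materialised.
rowsums = ThreadsX.mapreduce(some_calculations, +, 1:N)
# rowsums == fill(5050.0, d)
```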


ThreadsX.jl and the other JuliaFolds-related packages do not have an out-of-the-box multi-dimensional reduction API. This is mainly because Julia already has a rich set of DSL packages (Tullio.jl, TensorOperations.jl, LoopVectorization.jl, etc.) to support it.
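For example, with Tullio.jl a threaded row sum over an existing matrix is a one-liner (just a sketch; X here is arbitrary random data):

```julia
using Tullio

X = randn(4, 8)
@tullio s[i] := X[i, k]       # k appears only on the right, so it is summed over
s ≈ vec(sum(X, dims = 2))     # true
```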

Having said that, you can use the Broadcasting transducer to construct a custom multi-dimensional reduction:

julia> using Folds, Transducers, Statistics

julia> randn(2, 3) |> eachcol |> Map(x -> x ./ mean(x)) |> Broadcasting() |> Folds.sum
2-element Vector{Float64}:

or simply loop over the input the other way around:

julia> Folds.collect(sum(xs) for xs in eachcol(randn(2, 3)))
3-element Vector{Float64}:

Hi, thank you for your reply. LoopVectorization works for me.
I also tried the Broadcasting transducer, but it was not as fast as ThreadsX, even though it allocated less. Could you shed some light on this?

julia> @btime ThreadsX.sum(x -> (sleep(0.5); x .* transpose(x)), eachcol(randn(1000, 50)))  
  2.106 s (524 allocations: 755.72 MiB)

julia> @btime randn(1000, 50) |> eachcol |> Map(x -> (sleep(0.5); x .* transpose(x))) |> Broadcasting() |> Folds.sum
  12.717 s (358 allocations: 397.12 MiB)

Thank you very much!