I need to sum the outputs of each thread. Is there a standard/efficient way to do this?
For example:
X = Matrix{Float64}(undef, d, N)
Threads.@threads for k = 1:N
    X[:, k] .= some_calculations()   # each column is filled by one thread
end
sum!(zeros(d), X)                    # row sums, i.e. sum over the columns
Preferably I would avoid initialising X at all, because I only need the row sum.
Sorry if this is a stupid question. Any help would be much appreciated!
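For concreteness, here is a minimal sketch of the kind of reduction I am after, written with ThreadsX.sum so that X is never materialised (d, N, and some_calculations below are just stand-ins for my real problem):

using ThreadsX

d, N = 4, 100

# Stand-in for the real per-iteration work; returns a length-d vector.
some_calculations(k) = fill(Float64(k), d)

# The per-iteration vectors are computed on multiple threads and summed,
# so the d-by-N matrix X is never allocated.
row_sum = ThreadsX.sum(some_calculations, 1:N)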
Neither ThreadsX.jl nor the other JuliaFolds-related packages have an out-of-the-box multi-dimensional reduction API. This is mainly because Julia already has a rich set of DSL packages, such as Tullio.jl, TensorOperations.jl, and LoopVectorization.jl, to support it.
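For example, the row sum from your snippet can be written as a one-line Tullio reduction (a sketch with made-up sizes; Tullio multi-threads the loop automatically once the arrays are large enough):

using Tullio

X = rand(4, 100)         # d-by-N matrix of per-iteration results

@tullio S[i] := X[i, k]  # S[i] = sum over k of X[i, k], i.e. the row sums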
Having said that, you can use the Broadcasting transducer to construct a custom multi-dimensional reduction:
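Roughly like this, applied to the matrix-valued reduction benchmarked below (a sketch; f is a stand-in for your real per-column computation):

using Transducers, Folds   # Map and Broadcasting come from Transducers

f(x) = x .* transpose(x)   # stand-in per-column computation

X = randn(1000, 50)

# Map computes f for each column; Broadcasting() applies the reduction
# elementwise to the resulting arrays, and Folds.sum runs it in parallel.
S = X |> eachcol |> Map(f) |> Broadcasting() |> Folds.sum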
Hi, thank you for your reply. LoopVectorization works for me (rough sketch at the end of this post).
I also tried the Broadcasting transducer, but it was not as fast as ThreadsX, even though it allocated less. Could you shed some light on this?
julia> @btime ThreadsX.sum(x -> (sleep(0.5); x .* transpose(x)), eachcol(randn(1000, 50)))
2.106 s (524 allocations: 755.72 MiB)
julia> @btime randn(1000, 50) |> eachcol |> Map(x -> (sleep(0.5); x .* transpose(x))) |> Broadcasting() |> Folds.sum
12.717 s (358 allocations: 397.12 MiB)
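For reference, the LoopVectorization version I mentioned looks roughly like this (a sketch with placeholder sizes, using @tturbo, the threaded variant of @turbo):

using LoopVectorization

d, N = 4, 100
X = rand(d, N)

# Threaded, SIMD row sum over the columns of X.
S = zeros(d)
@tturbo for k in 1:N, i in 1:d
    S[i] += X[i, k]
end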