The threaded version is still 50x slower than the single threaded due to the atomic. You want something like:
function singlethread(arr)
N, M = size(arr)
cumsum = 0.0
@inbounds for j in 1:M
for i in 1:N
cumsum += arr[i, j]
end
end
return cumsum
end
function multithread(arr)
N, M = size(arr)
partial_sums = zeros(Float64, Threads.nthreads())
Threads.@threads for j in 1:M
t = Threads.threadid()
@inbounds for i in 1:N
partial_sums[t] += arr[i, j]
end
end
return sum(partial_sums)
end
julia> @btime singlethread(arr)
84.051 ms (1 allocation: 16 bytes)
4.9994246012104645e7
julia> @btime multithread(arr)
26.634 ms (3 allocations: 176 bytes)
4.9994246012085415e7
Note that I also changed the order of i
and j
in your single thread loop to have it go columnwise through the elements like how Julias Arrays are stored,