Threads.@threads memory leak

The threaded version is still 50x slower than the single threaded due to the atomic. You want something like:

function singlethread(arr)
    N, M = size(arr)
    cumsum = 0.0
    @inbounds for j in 1:M
        for i in 1:N
            cumsum += arr[i, j]
        end
    end
    return cumsum
end

function multithread(arr)
    N, M = size(arr)

    partial_sums = zeros(Float64, Threads.nthreads())
    Threads.@threads for j in 1:M
        t = Threads.threadid()
        @inbounds for i in 1:N
            partial_sums[t] += arr[i, j]
        end
    end
    return sum(partial_sums)
end
julia> @btime singlethread(arr)
  84.051 ms (1 allocation: 16 bytes)
4.9994246012104645e7

julia> @btime multithread(arr)
  26.634 ms (3 allocations: 176 bytes)
4.9994246012085415e7

Note that I also changed the order of i and j in your single thread loop to have it go columnwise through the elements like how Julias Arrays are stored,

4 Likes