Just in order to demonstrate that the performance issue is simd-related:
function mysum(r)
s = zero(eltype(r))
for i in eachindex(r)
s += r[i]
end
return s
end
function mysum_simd(r)
s = zero(eltype(r))
@simd for i in eachindex(r)
s += r[i]
end
return s
end
Benchmark:
julia> r = rand(100,100);
julia> @btime sum($r)
854.386 ns (0 allocations: 0 bytes)
5025.939581928298
julia> @btime mysum($r)
8.900 μs (0 allocations: 0 bytes)
5025.939581928294
julia> @btime mysum_simd($r)
568.108 ns (0 allocations: 0 bytes)
5025.939581928298
Note that the result is also more accurate for the simd-sum, in this case at least (after some more tests, I see this varies a bit, but the simd version is mostly more accurate):
julia> sum(big, r) - mysum(r)
4.055866753560621873475611209869384765625e-12
julia> sum(big, r) - mysum_simd(r)
-4.91606755304019316099584102630615234375e-13