Well, this is a bit embarrassing… I’m writing up a set of slides to present to colleagues about Julia, and I wanted to show an example of how explicit looping and so on is fine and dandy and as fast as any other language construct.

However! - I actually can’t get my mysum() function to match Julia’s sum(A, dims=1) function (summing a 2D array along the first axis):

arr = rand(1000, 100_000)
@btime sum($arr, dims=1);
# ~ 56 ms

function mysum(arr)
    result = zeros(eltype(arr), size(arr, 2))
    for j in axes(arr, 2)
        for i in axes(arr, 1)
            result[j] += arr[i, j]
        end
    end
    return result
end

@btime mysum($arr);
# ~ 80 ms

I’ve checked the obvious: the iteration is column-major, there are no allocations (except the initial result array), and it’s type stable. I’ve also tried initialising an undef result array, and summing with a temporary variable for each column (in case it was the memory accesses on the heap that were slowing me down), also with no effect.
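Since those checks are easy to get wrong, here is a minimal, self-contained way to script them (the array is smaller here than in the benchmark above, just for illustration):

```julia
function mysum(arr)
    result = zeros(eltype(arr), size(arr, 2))
    for j in axes(arr, 2)
        for i in axes(arr, 1)
            result[j] += arr[i, j]
        end
    end
    return result
end

arr = rand(1000, 1000)
mysum(arr)  # warm up once so compilation doesn't count as allocation

# The result should match Base's sum, and the only sizeable allocation
# should be the result vector (one Float64 per column, plus some slack).
@assert mysum(arr) ≈ vec(sum(arr, dims=1))
@assert (@allocated mysum(arr)) < 10 * sizeof(Float64) * size(arr, 2)
# For type stability, inspect with: @code_warntype mysum(arr)
```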

julia> using LoopVectorization

julia> function turbosum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @turbo for j in axes(arr, 2)
               for i in axes(arr, 1)
                   result[j] += arr[i, j]
               end
           end
           return result
       end
turbosum (generic function with 1 method)

julia> @btime turbosum($arr);
  31.189 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr; dims=1);
  42.052 ms (2 allocations: 781.30 KiB)

Right, so on 1.8.1, @turbo indeed gives a big improvement. But this is no longer the teaching moment I was hoping for, since ‘sprinkle with @turbo’ doesn’t really help people understand what’s going on here.

Also, I wonder how Base is doing this, without access to LoopVectorization magic…

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  165.281 ms (2 allocations: 781.30 KiB)

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               @simd for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  53.112 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr, dims=1);
  52.268 ms (2 allocations: 781.30 KiB)

It’s turtles all the way down, but I eventually got to the all-important Base.mapreduce_impl(), which does a little magic with block sizes and @simd.
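The blocking idea can be imitated without LoopVectorization. Here is a rough sketch in that spirit, with a hypothetical blockedsum that accumulates each column in fixed-size chunks with @simd (this is not Base’s actual implementation, which also does pairwise splitting for accuracy, and the blocksize here is an arbitrary choice):

```julia
# Hypothetical blocked column sum: each column is summed in chunks of
# `blocksize` elements, each chunk with @simd, then the chunk sums are
# added together.
function blockedsum(arr; blocksize = 1024)
    result = zeros(eltype(arr), size(arr, 2))
    n = size(arr, 1)
    @inbounds for j in axes(arr, 2)
        t = zero(eltype(arr))
        i = 1
        while i <= n
            stop = min(i + blocksize - 1, n)
            s = zero(eltype(arr))
            @simd for k in i:stop
                s += arr[k, j]
            end
            t += s
            i = stop + 1
        end
        result[j] = t
    end
    return result
end

arr = rand(1000, 1000)
@assert blockedsum(arr) ≈ vec(sum(arr, dims=1))
```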

You’re right. I need both @simd and a temporary loop variable for it to work.

I’m a little confused here though, since I didn’t think @simd worked with reduction variables, and now it seems to. Am I guaranteed that the sum won’t suffer from race conditions?

I don’t think race conditions are an issue, since IIUC, we’re not accessing the same memory location from multiple tasks. What @simd does is reorder associative operations, so it may evaluate a + (b+c) instead of (a+b) + c. This means that if we use @simd in reductions, it will likely change the result due to floating-point rounding errors. We may check that mysum(arr) differs from vec(sum(arr, dims=1)), although they’re approximately equal.
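That reordering effect is easy to demonstrate; a small self-contained check (exact equality of the two sums will typically fail, while approximate equality holds):

```julia
# @simd column sum from the thread above, renamed to mysimdsum here.
function mysimdsum(arr)
    result = zeros(eltype(arr), size(arr, 2))
    @inbounds for j in axes(arr, 2)
        t = zero(eltype(arr))
        @simd for i in axes(arr, 1)
            t += arr[i, j]
        end
        result[j] = t
    end
    return result
end

arr = rand(1000, 1000)
a = mysimdsum(arr)
b = vec(sum(arr, dims=1))

# Equal up to floating-point rounding, because @simd (and Base's own
# reduction) reassociate the additions, e.g. a + (b + c) vs (a + b) + c.
@assert a ≈ b
# a == b may well be false; the two reassociation orders need not agree.
```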