# How to match performance of sum(A, dims=1)?

Well, this is a bit embarrassing… I’m writing up a set of slides to present to colleagues about Julia, and I wanted to show an example of how explicit looping and so on is fine and dandy, and as fast as any other language construct.

However! - I actually can’t get my `mysum()` function to match Julia’s `sum(A, dims=1)` function (summing a 2D array along the first axis):

```julia
using BenchmarkTools  # provides @btime

arr = rand(1000, 100_000)

@btime sum($arr, dims=1);
# ~ 56 ms

function mysum(arr)
    result = zeros(eltype(arr), size(arr, 2))

    for j in axes(arr, 2)
        for i in axes(arr, 1)
            result[j] += arr[i, j]
        end
    end
    return result
end

@btime mysum($arr);
# ~ 80 ms
```

I’ve checked the obvious: it’s column-major iteration, there are no allocations (except the initial `result` array), and it’s type stable. I’ve also tried initialising `result` as an `undef` array, and summing into a temporary variable for each column (in case heap memory accesses were slowing me down), also with no effect.
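For reference, the last of those variants looked roughly like this (a sketch; `mysum_tmp` is just an illustrative name):

```julia
function mysum_tmp(arr)
    # undef-initialised output instead of zeros
    result = Vector{eltype(arr)}(undef, size(arr, 2))
    for j in axes(arr, 2)
        # accumulate each column into a local variable rather than result[j]
        t = zero(eltype(arr))
        for i in axes(arr, 1)
            t += arr[i, j]
        end
        result[j] = t
    end
    return result
end
```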

How do I speed this up?


Did you try `@inbounds`?

Yes, indeed. No effect.

Just upgraded from 1.8.0 to 1.8.1, and my version is now twice as slow. That’s a really curious performance degradation.

Have you tried `@turbo`?

@Oscar_Smith I have, yes, but I wouldn’t expect it to have an effect, since this is a reduction and SIMD doesn’t easily apply.

I think SIMD applies. I recall finding this out while preparing a lecture for students last year. But `@simd` was sufficient.

Here is the link to the source of our lecture.
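To build intuition for why a reduction can still vectorise, here is a hand-rolled sketch (illustrative only, not what `@simd` literally emits; `mysum_unrolled` is a made-up name and the code assumes a plain 1-based `Matrix`): once the compiler is allowed to reassociate the additions, the sum can be split across independent accumulators, which map naturally onto SIMD lanes.

```julia
function mysum_unrolled(arr)
    result = zeros(eltype(arr), size(arr, 2))
    n = size(arr, 1)
    @inbounds for j in axes(arr, 2)
        # four independent partial sums per column
        t1 = t2 = t3 = t4 = zero(eltype(arr))
        i = 1
        while i + 3 <= n
            t1 += arr[i, j]
            t2 += arr[i + 1, j]
            t3 += arr[i + 2, j]
            t4 += arr[i + 3, j]
            i += 4
        end
        # handle any leftover rows
        while i <= n
            t1 += arr[i, j]
            i += 1
        end
        result[j] = (t1 + t2) + (t3 + t4)
    end
    return result
end
```

`@simd` on the inner loop grants exactly that permission to reassociate, so the compiler can apply this kind of transformation (and wider) for you.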

`@turbo` works for me.

```julia
julia> using LoopVectorization  # provides @turbo

julia> function turbosum(arr)
           result = zeros(eltype(arr), size(arr, 2))

           @turbo for j in axes(arr, 2)
               for i in axes(arr, 1)
                   result[j] += arr[i, j]
               end
           end
           return result
       end
turbosum (generic function with 1 method)

julia> @btime turbosum($arr);
  31.189 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr; dims=1);
  42.052 ms (2 allocations: 781.30 KiB)
```

That sounds pretty bad. I can confirm that the results are quite bad for the manual version on v1.8.1, but I get equally bad timings on v1.7.3.

Right, so with 1.8.1, `@turbo` indeed gives a big improvement. But this is no longer the teaching moment I was hoping for, since ‘sprinkle with `@turbo`’ doesn’t really help people understand what’s going on here.

Also, I wonder how Base is doing this, without access to LoopVectorization magic…


You could just look at the base code with `@edit`?
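If you want to dig in, something along these lines (the standard `InteractiveUtils` macros, loaded by default in the REPL) will take you to the relevant method:

```julia
using InteractiveUtils  # loaded automatically in the REPL

@which sum(rand(2, 2), dims=1)   # show which method handles this call
@edit  sum(rand(2, 2), dims=1)   # open that method's source in your editor
```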

The base code is created by macros and not easy to read.

The difference seems entirely due to `@simd`:

```julia
julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  165.281 ms (2 allocations: 781.30 KiB)

julia> function mysum(arr)
           result = zeros(eltype(arr), size(arr, 2))
           @inbounds for j in axes(arr, 2)
               t = zero(eltype(arr))
               @simd for i in axes(arr, 1)
                   t += arr[i, j]
               end
               result[j] = t
           end
           return result
       end
mysum (generic function with 1 method)

julia> @btime mysum($arr);
  53.112 ms (2 allocations: 781.30 KiB)

julia> @btime sum($arr, dims=1);
  52.268 ms (2 allocations: 781.30 KiB)
```

It’s turtles all the way down, but I eventually got to the all-important `Base.mapreduce_impl()`, which does a little magic with block sizes and `@simd`.
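For the slides, the rough idea could be sketched as a blocked reduction: sum short runs with `@simd`, then combine the partial sums pairwise so rounding error stays small even for long columns. This is my own sketch, not the actual Base implementation, and `blocksum`/`colsums` are illustrative names:

```julia
# Sum v[lo:hi]: short ranges are summed directly with @simd,
# longer ranges are split in half and combined pairwise.
function blocksum(v::AbstractVector{T}, lo::Int, hi::Int, blocksize::Int=1024) where {T}
    if hi - lo < blocksize
        s = zero(T)
        @inbounds @simd for i in lo:hi
            s += v[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1
        return blocksum(v, lo, mid, blocksize) + blocksum(v, mid + 1, hi, blocksize)
    end
end

# Per-column sums, i.e. the moral equivalent of vec(sum(arr, dims=1))
colsums(arr) = [blocksum(view(arr, :, j), 1, size(arr, 1)) for j in axes(arr, 2)]
```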

You’re right. I need both `@simd` and a temporary loop variable for it to work.

I’m a little confused here though, since I didn’t think `@simd` worked with reduction variables, and now it seems to. Am I guaranteed that the sum won’t suffer from race conditions?


I don’t think race conditions are an issue, since IIUC we’re not accessing the same memory location from multiple tasks. What `@simd` does is allow reordering of associative operations, so it may evaluate `a + (b + c)` instead of `(a + b) + c`. This means that using `@simd` in a reduction will likely change the result because of floating-point rounding: you can check that `mysum(arr)` differs from `vec(sum(arr, dims=1))`, although the two are approximately equal.
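A quick illustrative check, reusing the `@simd` version of `mysum` from above:

```julia
arr = rand(1000, 100_000)

mysum(arr) == vec(sum(arr, dims=1))   # typically false: the additions happen in a different order
mysum(arr) ≈ vec(sum(arr, dims=1))    # true: the results differ only at the level of rounding error
```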
