OK, I’ve been fiddling with this on and off for a month now, but I’m curious if anyone can do better. Here’s the challenge: write a single generic implementation of sum(A) such that it’s fast for both A = rand(10000) and B = @view A[begin:2:end]. Currently — even though B has half the elements — B is more than 4x slower on my M1 Mac!
julia> using BenchmarkTools
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime sum($A);
946.640 ns (0 allocations: 0 bytes)
julia> @btime sum($B);
4.149 μs (0 allocations: 0 bytes)
Super bonus points if you don’t use @simd (really!), you definitely cannot use @fastmath, and no threads — but you can use @inbounds, and you can assume the array has at least some small number of elements without checking. For inspiration, here’s my current best effort; it gets us halfway between the two: I get a nice 4x speedup on the view, but a 2x regression on the array:
implementation of sum_mb_1
julia> using Base.Cartesian
julia> function sum_mb_1(A)
           op = +
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           # seed 4 independent accumulators from the first 4 elements
           # (0.0 here hardcodes Float64; later versions seed with zero(eltype(A)))
           @nexprs 4 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 4 N->v_N = 0.0 + a_N
           # main loop: 4 elements per batch, one into each accumulator
           for batch in 1:(n>>2)-1
               i = i1 + batch*4
               @nexprs 4 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 4 N->fa_N = f(a_N)
               @nexprs 4 N->v_N = op(v_N, fa_N)
           end
           v = op(op(v_1, v_2), op(v_3, v_4))
           # sequential cleanup of the remaining n % 4 elements
           i = i1 + (n>>2)*4 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
sum_mb_1 (generic function with 1 method)
The context here, of course, is WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub. If we could hardcode an effective @simd pattern that’s close to optimal for most architectures, it would solve issues like the fact that mean(A, dims=1) is significantly different from mean(A', dims=2). Obviously this will be highly architecture-sensitive, but the goal is to unify those two behaviors within a single architecture first, and then we can experiment with whether that can be made to work on the most common systems.
Thanks to the rubber duck phenomenon (I don’t know why I was so convinced my M1 had 256-bit vector widths here…), I noticed a trivial improvement here for Float64 on my M1: with 128-bit NEON registers, 8 Float64 bins is four vector accumulators. I’m curious how this extends to other architectures and datatypes (of course we probably need 16 bins for optimal Float32).
sum_mb_2
julia> function sum_mb_2(A)
           op = +
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           @nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 8 N->v_N = 0.0 + a_N
           for batch in 1:(n>>3)-1
               i = i1 + batch*8
               @nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 8 N->fa_N = f(a_N)
               @nexprs 8 N->v_N = op(v_N, fa_N)
           end
           v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
           i = i1 + (n>>3)*8 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
sum_mb_2 (generic function with 1 method)
So, essentially reversed (B is 2x slower than A). Maybe your chip can gather better? Also notice that neither is tremendously improved over a plain sum baseline (EDIT: for me, anyway).
Have you looked at the code_native? Sometimes, if you can spot something dumb in the inner loop, it can give an insight into how to improve the source. Although for B the code_native I’m seeing isn’t trivial to follow…
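For example, with the standard InteractiveUtils macros (already loaded in the REPL; the output will obviously differ by machine):

julia> @code_llvm debuginfo=:none sum_mb_2(B)    # is the inner loop getting vectorized at all?

julia> @code_native debuginfo=:none sum_mb_2(B)  # what actually runs on the chip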
Honestly, that’s great news! I’m quite happy to simply preserve the status-quo performance. It’s fascinating that the current sum on x86 is far less pessimized in the strided-view case… but the actual use case where I’m targeting significant improvements is more complicated arrays (e.g., dimensional sums) and arbitrary iterators. I’m also not surprised that this varies so much by architecture.
And, yeah, I had been looking extensively at both LLVM and native assembly. It was in trying to emulate the code_llvm of the naive @simd loop that I became convinced I should be targeting 4 bins. Only by stepping away from it did I realize I should try 8 bins.
Whoa, that’s fun. I still don’t like how @simd can reach inside an unrelated function and fuse an unrelated multiplication with the reduction into an fma. That behavior seems not great for a generic reduction framework — there’s no way that re-associating (or even re-ordering) this summation could give you a nonzero value.
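To make the contraction issue concrete with a toy stand-in (not that exact summation, just the same failure mode): every term of x*y - x*y is exactly 0.0, so no re-association or re-ordering of the sum can make it nonzero, but fusing one of the multiplies into an fma with the subtraction leaves the rounding error of each product behind, and those errors accumulate:

julia> term_exact(x, y) = x*y - x*y;               # exactly 0.0 for every x, y

julia> term_contracted(x, y) = fma(x, y, -(x*y));  # the rounding error of x*y instead

julia> xs, ys = rand(1000), rand(1000);

julia> sum(term_exact(x, y) for (x, y) in zip(xs, ys))
0.0

julia> sum(term_contracted(x, y) for (x, y) in zip(xs, ys));  # small but (almost surely) nonzero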
I believe the difference is that it creates a pairwise summation, which has a dedicated instruction on aarch64 (faddp). I’m not sure why it isn’t doing that for the normal version, but with some tweaking you might be able to get it there (the semantics are all very odd here, so it might not recognize that you’re doing a proper pairwise reduction).
Ahhhh, yes, it’s the combination of the explicit re-ordering and a reassoc annotation on the fadd that does the trick!
julia> add_reassoc(x::Float64, y::Float64) = Base.llvmcall("""
           %3 = fadd reassoc double %0, %1
           ret double %3""", Float64, Tuple{Float64, Float64}, x, y)
add_reassoc (generic function with 1 method)
julia> function reduce_mb_2(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           @nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 8 N->v_N = op(zero(eltype(A)), a_N)
           for batch in 1:(n>>3)-1
               i = i1 + batch*8
               @nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 8 N->fa_N = f(a_N)
               @nexprs 8 N->v_N = op(v_N, fa_N)
           end
           v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
           i = i1 + (n>>3)*8 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_2 (generic function with 1 method)
julia> @btime reduce_mb_2(+, $A);
1.150 μs (0 allocations: 0 bytes)
julia> @btime reduce_mb_2(add_reassoc, $A);
782.048 ns (0 allocations: 0 bytes)
julia> @btime sum_mb_2_simd($A); # implemented as you'd expect
782.226 ns (0 allocations: 0 bytes)
julia> @btime foldl(add_reassoc, $A); # the naive loop doesn't benefit from reassoc alone
1.142 μs (0 allocations: 0 bytes)
Unfortunately, this doesn’t magically solve the not-enough-bins problem for the smaller datatypes.
Edit: It’s also interesting that this gets the exact same performance as the @simd version, but — unlike the @simd version — it’s not using faddp, even with the reassoc annotation.
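One quick way to check for faddp (just plain InteractiveUtils, nothing fancy) is to dump the native code to a string and search it:

julia> io = IOBuffer();

julia> code_native(io, reduce_mb_2, (typeof(add_reassoc), Vector{Float64}); debuginfo=:none);

julia> occursin("faddp", String(take!(io)))  # false for me, matching the above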
Here we go, this rearrangement seems to be doing even better without the need for a :reassoc flag, for both my M1 and an Icelake server. It’s faster than master for 64-bit, and slightly slower for 32-bit. Might be a workable approach.
julia> function reduce_mb_3(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           # seed 4 bins from the first 16 elements; bin N takes elements N, N+4, N+8, N+12
           @nexprs 16 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 4 N->v_N = op(zero(eltype(A)), op(op(a_N, a_{N+4}), op(a_{N+8}, a_{N+12})))
           # main loop: 16 elements per batch, interleaved into the 4 bins
           # (this re-ordering assumes op is commutative)
           for batch in 1:(n>>4)-1
               i = i1 + batch*16
               @nexprs 16 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 16 N->fa_N = f(a_N)
               @nexprs 4 N->v_N = op(v_N, op(op(fa_N, fa_{N+4}), op(fa_{N+8}, fa_{N+12})))
           end
           v = op(op(v_1, v_2), op(v_3, v_4))
           # sequential cleanup of the remaining n % 16 elements
           i = i1 + (n>>4)*16 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_3 (generic function with 1 method)
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime reduce(+, $A)
943.320 ns (0 allocations: 0 bytes)
4990.955387266911
julia> @btime reduce_mb_3(+, $A)
580.357 ns (0 allocations: 0 bytes)
4990.955387266911
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_3(+, $A)
568.919 ns (0 allocations: 0 bytes)
4958.589f0
julia> @btime reduce(+, $A)
486.749 ns (0 allocations: 0 bytes)
4958.5903f0
Note that your construction uses commutativity, not just associativity. So it’s not suitable for general map-reduce. Even restricting to <:Number, this will produce wrong results for e.g. the product of quaternions (which are <: Number).
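You can see the re-ordering it relies on with an associative-but-not-commutative op like string concatenation, mimicking the 16-element bin grouping of reduce_mb_3 by hand:

julia> s = string.('a':'p');   # 16 distinct "elements"

julia> bins = [(s[N] * s[N+4]) * (s[N+8] * s[N+12]) for N in 1:4];

julia> (bins[1] * bins[2]) * (bins[3] * bins[4])   # a permutation of the inputs
"aeimbfjncgkodhlp"

julia> join(s)   # what any re-association of the in-order reduction would give
"abcdefghijklmnop"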
In view of that restriction, is there still a reason you want a single piece of code, instead of optimizing the most pertinent cases separately?
Right, this certainly can’t be used for products or for arbitrary reduction operators. But we do have more operators than + for which we assume commutativity — min/max/extrema and bitwise arithmetic — and this would only apply there. We should probably document that assumption more prominently for those functions if we do this.
The prod case is unfortunate though — that’s currently getting the benefit of @simd but we can’t similarly pre-emptively dispatch to the re-ordered algorithm without also hardcoding eltypes and a few mapping functions.
I’m not sure how important it is to squeeze out maximum speed for prod, especially for prod of ordinary Number types. For large arrays, it tends to overflow/underflow, so you rarely use it in that case. And for products of very small collections, you will often want to use it with tuples so that it unrolls. For collections of strings, it is better to use join.
Is there a halfway-comprehensible explanation for why this is the case even with a very simple @simd loop? I.e.:
julia> function simple_sum(A; init=zero(eltype(A)))
           acc = init
           @simd for x ∈ A
               acc += x
           end
           acc
       end;
julia> let A = rand(10_000)
           B = @view A[1:2:end]
           @btime simple_sum($A)
           @btime simple_sum($B)
       end;
500.896 ns (0 allocations: 0 bytes)
11.873 μs (0 allocations: 0 bytes)
Is this a problem in the LLVM vectorizer, or are we missing some information to pass on to LLVM, or something?
I would love to better diagnose the remainder, however! That said, this is still a relatively special case — the CPU may be able to directly load from memory into the SIMD registers with a begin:2:end sort of stride. That’s not possible for a general @view A[1, :] sort of access (think a reduction with dims=2) or even a @view A[shuffle(begin:end)], but I suspect there would still be performance advantages to a SIMD-like reordering.
It’s really interesting to me that we can get better performance than the naive @simd loop without a parallel multiple-bins type of re-ordering (for Float64, on my arch). It’s still not as fast as it could be… and using multiple bins (if op is commutative, of course) gives us yet better performance and greater numerical stability at the same time.
reduce_mb_4
julia> function reduce_mb_4(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           v = zero(eltype(A))
           for batch in 0:(n>>4)-1
               i = i1 + batch*16
               @nexprs 16 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 16 N->fa_N = f(a_N)
               # two in-order left folds of 8 elements each: only the association
               # changes relative to a plain left fold; the element order is preserved
               @nexprs 2 N->v_N = op(op(op(op(op(op(op(fa_{(N-1)*8+1}, fa_{(N-1)*8+2}), fa_{(N-1)*8+3}), fa_{(N-1)*8+4}), fa_{(N-1)*8+5}), fa_{(N-1)*8+6}), fa_{(N-1)*8+7}), fa_{(N-1)*8+8})
               v = op(v, op(v_1, v_2))
           end
           # sequential cleanup of the remaining n % 16 elements
           i = i1 + (n>>4)*16 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_4 (generic function with 1 method)
julia> function sum_simd(A)
           r = zero(eltype(A))
           @simd for i in eachindex(A)
               r += @inbounds A[i]
           end
           r
       end
sum_simd (generic function with 1 method)
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_4(+, $A)
733.008 ns (0 allocations: 0 bytes)
4977.12180940769
julia> @btime sum_simd($A)
1.142 μs (0 allocations: 0 bytes)
4977.121809407697
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_4(+, $A)
924.029 ns (0 allocations: 0 bytes)
5002.0093f0
julia> @btime sum_simd($A)
561.935 ns (0 allocations: 0 bytes)
5002.0117f0
Here’s my overarching thought: foldl and foldr have well-defined associativity; we get to choose the associativity of reduce. Right now, we sometimes use a pairwise algorithm with a relatively large block size (1024) over an @simd loop (which may itself further split into parallel accumulators, depending on architecture, element type, bounds checking, type stability, indexability, and data layout).
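For concreteness, the two folds pin down opposite associations, while reduce is documented to be free to pick either (or something else entirely):

julia> foldl(-, 1:4)   # ((1 - 2) - 3) - 4
-8

julia> foldr(-, 1:4)   # 1 - (2 - (3 - 4))
-2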
Why leave all that up to chance, especially if we can simultaneously improve both performance and accuracy? If we explicitly encode an associativity (or perhaps an additional re-ordering for commutative operators) that performs well for these core numeric types and leads to a more consistent (and perhaps smaller) base block size, then my hope is that it will also perform generically well for non-simdy datatypes. Many of the same fundamentals that enable SIMD also tend to improve non-SIMD code on modern pipelined CPUs (e.g., by reducing inter- and intra-loop data dependencies). Or that’s my hope and hypothesis (hope-othesis?), anyhow. If it turns out not to help, foldl is right there for you.
I don’t know if anyone has ever published an analysis of something like reduce_mb_4 as the base case for a pairwise splitting algorithm — it preserves ordering and has an inner reduction of 8 elements at a time.
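Roughly, that shape would be something like this (just a sketch, not the PR’s implementation; blocksize=256 is an arbitrary placeholder, and it leans on reduce_mb_4 handling short or empty ranges via its cleanup loop):

julia> function pairwise_sketch(op, A, i1=firstindex(A), iN=lastindex(A); blocksize=256)
           if iN - i1 + 1 <= blocksize
               # base case: the fixed-order, 16-at-a-time kernel from above
               return reduce_mb_4(op, view(A, i1:iN))
           else
               # pairwise split; the "combine" op only runs O(n / blocksize) times
               mid = (i1 + iN) >> 1
               return op(pairwise_sketch(op, A, i1, mid; blocksize=blocksize),
                         pairwise_sketch(op, A, mid+1, iN; blocksize=blocksize))
           end
       end;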
Some ideas I have found are:
A reduce-shift bottom-up SIMD-y fully pairwise implementation (Dalton, Wang & Blainey, 2014). This is very cool, but it requires a 64-element intermediate stack-allocated mutable array.
The pairwise “combine” operation could additionally be compensated, since it happens less frequently (Blanchard, Higham & Mary, 2020). Allowing a distinct combine operation could be quite powerful in general; it’d enable a whole other class of transducers.
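For what it’s worth, a compensated combine could be as simple as carrying a (sum, error) pair per block and paying for a TwoSum only at the (rare) combine step. A rough sketch of that idea, not taken from the paper:

julia> struct SumErr{T}
           hi::T   # running sum
           lo::T   # accumulated rounding error
       end

julia> function combine(a::SumErr, b::SumErr)
           s = a.hi + b.hi
           t = s - a.hi
           e = (a.hi - (s - t)) + (b.hi - t)   # Knuth's TwoSum error term
           SumErr(s, a.lo + b.lo + e)
       end;

julia> total(c::SumErr) = c.hi + c.lo;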