Performance challenge: can you write a faster sum?

Results are of course platform-dependent; zen5 seems to prefer @turbo:

julia> @btime reduce_mb_3(+, $A); @btime reduce_mb_3(+, $B)
  440.970 ns (0 allocations: 0 bytes)
  1.053 μs (0 allocations: 0 bytes)
2475.6003f0

julia> @btime sum_turbo($A); @btime sum_turbo($B)
  58.546 ns (0 allocations: 0 bytes)
  748.756 ns (0 allocations: 0 bytes)
2475.6f0
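
(For reference, sum_turbo is presumably just a @turbo loop from LoopVectorization.jl; a minimal sketch, not necessarily the exact definition benchmarked above:)

using LoopVectorization

function sum_turbo(A)
    s = zero(eltype(A))
    @turbo for i in eachindex(A)  # @turbo picks the vector width and accumulator layout
        s += A[i]
    end
    return s
end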

julia> versioninfo()
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
2 Likes

LV seems to use the exact same order of operations here, at least at this size:

julia> sum(reshape(A,(100,100)),dims=2) == vreduce(+,reshape(A,(100,100)),dims=2)
true

julia> @benchmark sum(reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 8993 samples with 198 evaluations per sample.
 Range (min … max):  443.495 ns …  10.497 μs  ┊ GC (min … max):  0.00% … 94.60%
 Time  (median):     482.333 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   557.728 ns ± 736.524 ns  ┊ GC (mean ± σ):  12.81% ±  9.23%

     ▁█▂                                                         
  ▂▃▃███▆▆▆▅▅▅▅▅▅▅▅▆▅▅▅▅▄▅▅▄▄▄▄▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▂▁▁▁▂ ▃
  443 ns           Histogram: frequency by time          619 ns <

 Memory estimate: 544 bytes, allocs estimate: 3.

julia> @benchmark vreduce(+,reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 6451 samples with 767 evaluations per sample.
 Range (min … max):  127.382 ns …   3.704 μs  ┊ GC (min … max):  0.00% … 95.52%
 Time  (median):     141.150 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   200.908 ns ± 375.622 ns  ┊ GC (mean ± σ):  25.97% ± 13.02%

  █                                                              
  █▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  127 ns           Histogram: frequency by time         2.81 μs <

 Memory estimate: 544 bytes, allocs estimate: 3.

Note that for dims=2, if your array is column-major, you should SIMD across the rows.
LV should just be using the naive summation order above, which is definitely not what you want for accuracy.
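
Concretely, "SIMD across the rows" means keeping a whole column of accumulators so that the inner loop walks contiguous memory. A rough sketch (illustrative only, not what LV or Base actually emit):

function sum_dims2_colmajor(M::AbstractMatrix{T}) where {T}
    acc = zeros(T, size(M, 1))              # one accumulator per row
    for j in axes(M, 2)
        @simd for i in axes(M, 1)
            @inbounds acc[i] += M[i, j]     # contiguous loads down each column
        end
    end
    return reshape(acc, size(M, 1), 1)      # match the shape of sum(M; dims=2)
end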

2 Likes

Nice! I should’ve thought to check LV’s code. I can see that the generic vreduce((x,y)->x+y, A) uses a well-chosen N-accumulators re-ordering strategy, but it’s not as clear to me what the ordering and associativity is for the ::typeof(+) specialization — those are defined with the @turbo macros over the loop and I’ve not tried to detangle that macro (or the generated code) further.

On my architecture for a Float64 dense array, though, LV is behaving similarly (albeit not bit-exact) to my reduce_mb_3, which uses an interleaved 4x4 pattern — that is, it uses 4 accumulators, each summing four values at a time before reducing with the accumulator. That also seems to be doing quite well in general across multiple datatypes and layouts — including other iterators that are outside the scope of LV.
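
For reference, the interleaved 4x4 shape described above looks roughly like this (a hypothetical simplification of reduce_mb_3, hard-coded to + with a zero init and a serial remainder loop):

function sum_4x4(A::AbstractVector{T}) where {T}
    v_1 = v_2 = v_3 = v_4 = zero(T)
    i = firstindex(A)
    @inbounds while i + 15 <= lastindex(A)
        # each accumulator folds in a 4-element tree, strided by 4
        v_1 += (A[i]   + A[i+4]) + (A[i+8]  + A[i+12])
        v_2 += (A[i+1] + A[i+5]) + (A[i+9]  + A[i+13])
        v_3 += (A[i+2] + A[i+6]) + (A[i+10] + A[i+14])
        v_4 += (A[i+3] + A[i+7]) + (A[i+11] + A[i+15])
        i += 16
    end
    s = (v_1 + v_2) + (v_3 + v_4)
    @inbounds while i <= lastindex(A)       # serial remainder
        s += A[i]
        i += 1
    end
    return s
end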

2 Likes

I‘ve been wondering if pairwise reduction intent could be encoded via an iterator or a transducer so it could be wrapped around other iterators. But maybe that would make it harder for the compiler to optimize…

It’s normal for sum over B to be slower than over A. B isn’t contiguous in memory: when iterating over it, we have to jump from element i to element i+2 instead of i+1, which makes it slower and disables advanced optimizations like @simd. But if you find a way to make that cost negligible, I’d be glad.

Good question! I tried this with TransducersNext.jl and it worked pretty great:

using Base.Cartesian: @nexprs
using TransducersNext
using TransducersNext: Executor, complete, next, combine, @unroll
struct BlockedCommutativeEx <: Executor end

function TransducersNext.__fold__(rf::RF, init::T, A, ::BlockedCommutativeEx) where {RF, T}
    op(x, y) = next(rf, x, y)
    inds = eachindex(A)
    i1, iN = firstindex(inds), lastindex(inds)
    n = length(inds)
    v = init
    # statically peel off the first iteration for type stability reasons if we don't have a numeric `init`
    @unroll 1 for batch in 0:(n>>4)-1
        i = i1 + batch*16
        @nexprs 16 N-> a_N = @inbounds A[inds[i+(N-1)]]
        @nexprs 2 N-> v_N = op(
            op(op(op(op(op(op(op(init,
                                 a_{(N-1)*8+1}),
                              a_{(N-1)*8+2}),
                           a_{(N-1)*8+3}),
                        a_{(N-1)*8+4}),
                     a_{(N-1)*8+5}),
                  a_{(N-1)*8+6}),
               a_{(N-1)*8+7}),
            a_{(N-1)*8+8})
        v = combine(rf, v, combine(rf, v_1, v_2))
    end
    i = i1 + (n>>4)*16 - 1
    i == iN && return complete(rf, v)  # finalize on this path too, matching the return below
    for i in i+1:iN
        ai = @inbounds A[inds[i]]
        v = op(v, ai)
    end
    return complete(rf, v)
end
julia> @btime fold(+, $A; executor=BlockedCommutativeEx())
  1.146 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime fold(+, $B; executor=BlockedCommutativeEx())
  767.919 ns (0 allocations: 0 bytes)
2509.5895307786136

Basically the same timings as with @mbauman’s function for A, but surprisingly it does better than Matt’s version for B:

julia> @btime reduce_mb_4(+, $A)
  1.144 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime reduce_mb_4(+, $B)
  1.110 μs (0 allocations: 0 bytes)
2509.5895307786136

What’s especially cool about doing this with TransducersNext is that you can do stuff like

executor = ChunkedEx(BlockedCommutativeEx(); chunksize=1024)

or whatever to do further outer-chunking of the algorithm in a composable way, with minimal performance loss:

julia> @btime fold(+, $B; executor=ChunkedEx(BlockedCommutativeEx(); chunksize=1024), init=0)
  869.175 ns (0 allocations: 0 bytes)
2507.819648819429
8 Likes

That sure is a funny way of spelling reduce 🙂

I think of transducers as changing the reduction operator, not the “executor.” The executor is a well-structured reduction loop — exactly what I’m trying to define — and can equivalently be swapped with the function call itself. Transducers themselves aren’t (easily) able to rearrange their execution. I wouldn’t expect any difference of behavior, unless you have infrastructure to unwrap SubArrays (with a transducer!) or some such in some indirection before you get to __fold__. Also: are you missing an element there? Or did you use different data?

I do really like how you expose the ability to do something like ChunkedEx(BlockedCommutativeEx(); chunksize=1024), that’s very cool, and I can immediately imagine exactly how that works.

I really think Transducers have a bit of the Monad problem. It’s a great and straightforward idea, wrapped up in an abstraction and language that makes it seem more complicated than it is… and then once you start shifting compute around your potential design space becomes intractably big.

Transducers fundamentally rely upon the exact particulars of the executor; that’s something we’ve not well-defined for Base.reduce and is at the core of what I’m trying to do.

1 Like

reduce was already taken 🙂

Yeah, this is one of the main things that TransducersNext.jl is trying to tackle (and maybe it needs a rename). Yes, a transducer is only about the reduction operator. However, we also have a concept of “executors” which express how the data itself is partitioned and fed into the reducing operator, and I want to bring this more into the foreground.

The idea is that I want to make it so that people writing Executors don’t need to worry about re-implementing transducers, and people writing custom transducers shouldn’t have to worry about re-implementing all the executors.

So in TransducersNext.jl, I have

  • SequentialEx (this basically does foldl)
  • SIMDEx for making a simple @simd loop
  • ChunkedEx for chunking
  • ThreadEx for multithreaded partitioning
  • DistributedEx for distributed partitioning
  • KernelAbstractionsEx (under development)
  • Maybe soon a BlockedCommutativeEx

All but sequential, SIMD, and BlockedCommutative take an inner executor so you can compose them.
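
For example, combining only the constructors already shown in this thread:

# Chunk the input, then run the blocked commutative reduction within each chunk.
fold(+, Map(sin), B; executor=ChunkedEx(BlockedCommutativeEx(); chunksize=1024))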

Yes, there was a bug. Our messages crossed and I updated my comment right before you sent yours, sorry about that.

Totally agreed. I really want to find a way of simplifying this stuff down and making it less scary. I’ve had some success, but there’s a LONG way to go and it’ll require some very hard design and pedagogical decisions.

Actually I disagree with this, they should be agnostic about the particulars of the executor other than the usual stuff (like certain executors requiring associativity or commutativity).

For instance, here’s how the BlockedCommutativeEx works totally fine (and quick!) with the Map transducer (and is faster than mapreduce!):

julia> @btime fold(+, Map(sin), $B; executor=BlockedCommutativeEx())
  12.533 μs (0 allocations: 0 bytes)
2306.4133570501153

julia> @btime mapreduce(sin, +, $B)
  21.440 μs (0 allocations: 0 bytes)
2306.413357050116

Here’s using it with Map and Filter (though with no perf benefit here vs traditional executors):

julia> C = rand(1:10, 10000);

julia> @btime fold(+, Filter(iseven) ⨟ Map(sin), $C; executor=SIMDEx())
  31.288 μs (0 allocations: 0 bytes)
274.85264475617697

julia> @btime mapreduce(sin, +, Iterators.filter(iseven, $C))
  31.268 μs (0 allocations: 0 bytes)
274.85264475617697
3 Likes

Just to expand on this, if one is working with a transducer, it’s very important to properly handle when you use the innermost reducing operator (e.g. +) vs when you’re using the transformed operator (e.g. Map(sin)'(+)). One needs to account for this whenever doing re-associations, and that’s the source of the differences between my version and yours. Otherwise you end up applying sin multiple times when combining sub-reductions!

So there are subtle differences, but the general shape of the algorithm is the same.
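
To make that pitfall concrete (plain closures, none of the transducer machinery; the target value is sum(sin, xs)):

# rf is the "transformed" operator: it applies sin to the incoming *element*.
rf(acc, x) = acc + sin(x)

xs = [0.1, 0.2, 0.3, 0.4]

# Correct: combine the two sub-accumulators with the *inner* operator, +.
left  = rf(rf(0.0, xs[1]), xs[2])
right = rf(rf(0.0, xs[3]), xs[4])
good  = left + right        # == sum(sin, xs)

# Wrong: combining with rf treats the right accumulator as an element,
# so sin gets applied to it a second time.
bad = rf(left, right)       # == left + sin(right)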

I’m actually not thinking about associativity here, but rather the things like:

  • How init is used (or not!)
  • How the empty- and one-element cases behave
  • If op always receives the accumulator on the LHS
  • If op might receive two elements directly, or if it always takes an accumulator (or init) and an element
  • If op might receive two accumulators (or if it’s a potentially separate combine operation)
  • If op might be speculatively executed or memoized
  • … and yes, exactly as you just wrote “it’s very important to properly handle when you use the innermost reducing operator”

You can’t get any of those assumptions wrong or leave them to chance as you design your transformed op.

1 Like

Ah, I see what you mean. Yeah, because Transducers / TransducersNext run into these issues way more ‘catastrophically’ than regular reduce, we impose a much more rigid set of rules that all executors must follow or chaos ensues:

  • init is always used on every (sub)reduction.
  • The one-element case is done with init; the zero-element case is an error if it hits the default init.
  • rf (the transformed reducing operator, e.g. Map(sin)'(+)) always takes the accumulator on the LHS. The innermost reducing operator (e.g. +) can take two accumulators.**
  • We don’t currently have a policy on speculative execution or memoization, but we probably should.

Edit: I should mention that we have an overloadable interface for how two accumulators are combined: combine(rf, l, r). By default, this strips all the transducers out of rf and goes down to the innermost reducing operator, but each transducer in the chain and the innermost reducing operator are both allowed to intervene and change what combine means for them.

On the other hand, it’s next(rf, acc, elem) that always gets an accumulator on the left and an element of the iterator on the right.
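
As a hypothetical illustration of that split (not the package’s implementation; next, combine, and complete as used upthread): a fold cut into two chunks feeds elements in with next and merges the two accumulators with combine, never with next.

function two_chunk_fold(rf, init, A)
    mid = length(A) ÷ 2
    accl = foldl((acc, x) -> next(rf, acc, x), view(A, 1:mid); init=init)
    accr = foldl((acc, x) -> next(rf, acc, x), view(A, mid+1:length(A)); init=init)
    return complete(rf, combine(rf, accl, accr))
end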

So, yeah, to bring this back to my original intent here: my overarching goal is to bring some sort of a “better” generic associativity (and maybe a reordering for some ops if necessary) and rules to Base.reduce in a way that can move the WIP pairwise reduction refactor (JuliaLang/julia PR #58418) forward.

My biggest question is if there exists a decent-enough generic “SIMD-oblivious algorithm” (which isn’t really a thing to my knowledge) that explicitly encodes an associativity that allows for sufficient use of some SIMD instructions (if possible, on most platforms) to avoid major performance regressions from the status-quo @simd loops. There are many subtle reasons that have all added up to my distaste for @simd for something as fundamental as reduce. Things that can change its results include (but are not limited to): --bounds-check=yes, inlineability, type instabilities, views, transposes (and different dims thereof), generator vs. comprehension, iterator style, fma fusion, etc., etc., etc., etc.

The wonderful part is that batched pairwise and inner SIMD-like associativities tend to improve performance on modern CPUs even if they don’t use SIMD instructions thanks to their reduced data dependencies. If op itself is much more expensive than a single instruction or two, then that should dominate any sort of overhead we incur in making this more complicated than the straight loop. And by explicitly batching up the extraction (getindex/iterate) of N elements at a time, N applications of f, and then the applications of op, we can often recover the use of SIMD instructions in cases that wouldn’t have used them with @simd in the first place.
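
A minimal sketch of that explicit batching for a batch size of 4 (not the PR’s code, just the shape of the idea):

using Base.Cartesian: @nexprs

function mapreduce_batched4(f, op, A::AbstractVector)        # assumes length(A) >= 4
    inds = eachindex(A)
    i, iN = firstindex(inds), lastindex(inds)
    @nexprs 4 k -> a_k = f(@inbounds A[inds[i + (k - 1)]])   # seed from the first batch
    v = op(op(a_1, a_2), op(a_3, a_4))
    i += 4
    while i + 3 <= iN
        @nexprs 4 k -> a_k = @inbounds A[inds[i + (k - 1)]]  # 4 extractions
        @nexprs 4 k -> b_k = f(a_k)                          # 4 applications of f
        v = op(v, op(op(b_1, b_2), op(b_3, b_4)))            # fixed op tree into the accumulator
        i += 4
    end
    while i <= iN                                            # serial remainder
        v = op(v, f(@inbounds A[inds[i]]))
        i += 1
    end
    return v
end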

If we can get the pairwise reduction refactor in, it’s going to cause some churn (because folks depend on its internals, on bad usages of init, on exact values, etc.). So all the more reason to make sure its new general behaviors work as well as they possibly can. Maybe it even makes sense to explicitly document its associativity — numpy documents theirs.

7 Likes

Hello, I just tried your sum from the PR, and I’ve generalized your ideas using @generated functions: multiple accumulators (as in your Base.jl and reduce_mb_4 versions) and tree reduction within the loop itself (as in reduce_mb_3).

Here is my attempt:

macro tree_reduce(op, prefix, N, M, stride)
    prefix_str = string(prefix)

    function build_tree(indices)
        count = length(indices)
        if count == 1
            return Symbol(prefix_str, :_, indices[1])
        elseif count == 2
            return :($op($(Symbol(prefix_str, :_, indices[1])),
                $(Symbol(prefix_str, :_, indices[2]))))
        else
            mid = count ÷ 2
            left = build_tree(indices[1:mid])
            right = build_tree(indices[mid+1:end])
            return :($op($left, $right))
        end
    end

    # Generate strided indices: N, N+stride, N+2*stride, ..., N+(M-1)*stride
    indices = [N + i * stride for i in 0:M-1]

    return esc(build_tree(indices))
end



using Base.Cartesian: @nexprs  # needed for the @nexprs calls in the generated body

@generated function reduce_abstracted(op, A, ::Val{Ntot}, ::Val{Nacc}) where {Ntot,Nacc}
    Pacc = trailing_zeros(Nacc)
    Ptot = trailing_zeros(Ntot)
    quote
        f = identity
        inds = eachindex(A)
        i1, iN = firstindex(inds), lastindex(inds)
        n = length(inds)
        @nexprs $Ntot N -> a_N = @inbounds A[inds[i1+(N-1)]]
        @nexprs $Nacc N -> v_N = @tree_reduce(op, a, N, $(cld(Ntot, Nacc)), $Nacc)

        for batch in 1:(n>>$Ptot)-1
            i = 1 + (batch << $Ptot)
            @nexprs $Ntot N -> begin
                a_N = @inbounds A[inds[i+(N-1)]]
            end
            @nexprs $Nacc N -> v_N = op(v_N, @tree_reduce(op, a, N, $(cld(Ntot, Nacc)), $Nacc))

        end
        v = @tree_reduce(op, v, 1, $Nacc, 1)
        i = i1 + (n >> $Ptot) * $Ntot - 1
        i == iN && return v
        for i in i+1:iN
            ai = @inbounds A[inds[i]]
            v = op(v, f(ai))
        end
        return v
    end
end

The Base.sum from your PR corresponds to Nacc=8 (8 accumulators) and Ntot=8 (no inner tree reduction). Your mb_3 function corresponds to Ntot=16 and Nacc=4.
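
In other words, with the reduce_abstracted above:

reduce_abstracted(+, A, Val(8), Val(8))    # roughly the PR's Base.sum strategy
reduce_abstracted(+, A, Val(16), Val(4))   # roughly reduce_mb_3's interleaved 4x4 strategy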

Here are my benchmarks:

n = 2^15
a = rand(Float64, n)
for Ntot in (4, 8, 16, 32)
    for Nacc in (1, 2, 4, 8)
        if Nacc <= Ntot
            println("n=$n, a[1:n], view(a, 1:2:n) (above, below) Ntot=$Ntot, Nacc=$Nacc")
            @btime reduce_abstracted(+, $a, $(Val(Ntot)), $(Val(Nacc)))
            @btime reduce_abstracted(+, $(view(a, 1:2:n)), $(Val(Ntot)), $(Val(Nacc)))
        end
    end
end
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=1
  3.294 μs (0 allocations: 0 bytes)
  2.475 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=2
  3.405 μs (0 allocations: 0 bytes)
  2.486 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=4
  3.649 μs (0 allocations: 0 bytes)
  2.449 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=1
  4.808 μs (0 allocations: 0 bytes)
  2.130 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=2
  2.396 μs (0 allocations: 0 bytes)
  2.181 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=4
  2.147 μs (0 allocations: 0 bytes)
  4.372 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=8
  3.427 μs (0 allocations: 0 bytes)
  2.225 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=1
  4.323 μs (0 allocations: 0 bytes)
  2.227 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=2
  2.061 μs (0 allocations: 0 bytes)
  2.123 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=4
  1.739 μs (0 allocations: 0 bytes)
  4.345 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=8
  2.468 μs (0 allocations: 0 bytes)
  4.318 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=1
  4.236 μs (0 allocations: 0 bytes)
  2.214 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=2
  2.270 μs (0 allocations: 0 bytes)
  2.369 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=4
  1.348 μs (0 allocations: 0 bytes)
  3.978 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=8
  1.651 μs (0 allocations: 0 bytes)
  3.959 μs (0 allocations: 0 bytes)

For comparison, the Base.sum from your PR on Float64:

@btime Base.sum($a)
@btime Base.sum($(view(a, 1:2:n)))
3.459 μs (0 allocations: 0 bytes)
16295.798570780385

  4.040 μs (0 allocations: 0 bytes)
8128.148967556634

This differs from what I got, I guess because of the additional pairwise reduction.
Here is what the current Base.sum does on Float64:

2.488 μs (0 allocations: 0 bytes)
16410.273204764926

 6.539 μs (0 allocations: 0 bytes)
8170.225306474282

and just for completeness, the results from @simd:

a=rand(Float32, 2^15)
function simdsum(a::AbstractArray{T}) where T
    s = T(0)
    @simd for x in a
        s += x
    end
    return s
end
@btime simdsum($a)
@btime simdsum($(view(a, 1:2:2^15)))
1.058 μs (0 allocations: 0 bytes)
16345.918f0

6.549 μs (0 allocations: 0 bytes)
8140.3384f0

On my Dell machine, the current sum implementation underperforms for non-view floating-point arrays. I’m curious whether you observe similar benchmark trends on your system:

  • Val(16), Val(2): Good all-around performance across array types
  • Val(32), Val(4): Optimal performance specifically for contiguous bitstype arrays

I’ve found that @simd performs poorly on views and larger element types, but on my machine it’s unbeatable for contiguous bitstype arrays—with one exception: very small arrays benefit more from vectorized loads followed by reduction.

For context, I started a discussion on this topic a few days ago (before discovering this thread).

I also documented my initial attempt to outperform mapreduce on contiguous bitstype arrays using @simd:
https://epilliat.github.io/coding/notebooks/mapreduce_cpu_perf.html

A key finding: alignment is critical for strong @simd performance, at least on my hardware. The current Base.sum implementation doesn’t properly initialize accumulators for 1024-element blocks, which significantly degrades performance.

3 Likes