Performance challenge: can you write a faster sum?

Results are of course platform-dependent; Zen 5 seems to prefer @turbo:

julia> @btime reduce_mb_3(+, $A); @btime reduce_mb_3(+, $B)
  440.970 ns (0 allocations: 0 bytes)
  1.053 μs (0 allocations: 0 bytes)
2475.6003f0

julia> @btime sum_turbo($A); @btime sum_turbo($B)
  58.546 ns (0 allocations: 0 bytes)
  748.756 ns (0 allocations: 0 bytes)
2475.6f0

julia> versioninfo()
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor

LV seems to use the exact same order of operations here, at least at this size:

julia> sum(reshape(A,(100,100)),dims=2) == vreduce(+,reshape(A,(100,100)),dims=2)
true

julia> @benchmark sum(reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 8993 samples with 198 evaluations per sample.
 Range (min … max):  443.495 ns …  10.497 μs  ┊ GC (min … max):  0.00% … 94.60%
 Time  (median):     482.333 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   557.728 ns ± 736.524 ns  ┊ GC (mean ± σ):  12.81% ±  9.23%

     ▁█▂                                                         
  ▂▃▃███▆▆▆▅▅▅▅▅▅▅▅▆▅▅▅▅▄▅▅▄▄▄▄▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▂▁▁▁▂ ▃
  443 ns           Histogram: frequency by time          619 ns <

 Memory estimate: 544 bytes, allocs estimate: 3.

julia> @benchmark vreduce(+,reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 6451 samples with 767 evaluations per sample.
 Range (min … max):  127.382 ns …   3.704 μs  ┊ GC (min … max):  0.00% … 95.52%
 Time  (median):     141.150 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   200.908 ns ± 375.622 ns  ┊ GC (mean ± σ):  25.97% ± 13.02%

  █                                                              
  █▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  127 ns           Histogram: frequency by time         2.81 μs <

 Memory estimate: 544 bytes, allocs estimate: 3.

Note that for dims=2, if your array is column major, you should SIMD across the rows.
LV should just be using the naive summation order above, which is definitely not what you want for accuracy.
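
For concreteness, here is a minimal sketch of what "SIMD across the rows" means for a column-major matrix (rowsum is just an illustrative name, not what LV generates, and it assumes a numeric element type): the inner loop runs down contiguous columns while a whole row of accumulators is updated, so each row's total is still built up in plain left-to-right order across the columns, i.e. the naive order mentioned above.

function rowsum(M::AbstractMatrix{T}) where {T}
    acc = zeros(T, size(M, 1))        # one accumulator per row
    @inbounds for j in axes(M, 2)
        @simd for i in axes(M, 1)
            acc[i] += M[i, j]         # inner loop walks down a contiguous column
        end
    end
    return acc                        # sum(M; dims=2) returns an n×1 matrix instead
end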


Nice! I should’ve thought to check LV’s code. I can see that the generic vreduce((x,y)->x+y, A) uses a well-chosen N-accumulator re-ordering strategy, but it’s not as clear to me what the ordering and associativity are for the ::typeof(+) specialization — those are defined with the @turbo macros over the loop, and I haven’t tried to untangle that macro (or the generated code) further.

On my architecture for a Float64 dense array, though, LV is behaving similarly (albeit not bit-exact) to my reduce_mb_3, which uses an interleaved 4x4 pattern — that is, it uses 4 accumulators, each summing four values at a time before reducing with the accumulator. That also seems to be doing quite well in general across multiple datatypes and layouts — including other iterators that are outside the scope of LV.
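
For reference, here is one way to read that pattern (a sketch, not Matt’s actual reduce_mb_3; it is specialized to + and assumes an element type with a zero): four accumulators, each absorbing a four-element partial sum per 16-element block, with the elements interleaved across the accumulators.

function sum_4x4(A)
    inds = eachindex(A)
    i1, n = firstindex(inds), length(inds)
    a1 = a2 = a3 = a4 = zero(eltype(A))
    @inbounds for b in 0:(n >> 4) - 1
        i = i1 + 16b
        # each accumulator takes every 4th element of the 16-element block,
        # summed pairwise before being folded into the accumulator
        a1 += (A[inds[i]]   + A[inds[i+4]]) + (A[inds[i+8]]  + A[inds[i+12]])
        a2 += (A[inds[i+1]] + A[inds[i+5]]) + (A[inds[i+9]]  + A[inds[i+13]])
        a3 += (A[inds[i+2]] + A[inds[i+6]]) + (A[inds[i+10]] + A[inds[i+14]])
        a4 += (A[inds[i+3]] + A[inds[i+7]]) + (A[inds[i+11]] + A[inds[i+15]])
    end
    v = (a1 + a2) + (a3 + a4)
    @inbounds for i in i1 + 16*(n >> 4) : i1 + n - 1   # sequential tail
        v += A[inds[i]]
    end
    return v
end

The four independent dependency chains are what help even without explicit SIMD instructions.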


I‘ve been wondering if pairwise reduction intent could be encoded via an iterator or a transducer so it could be wrapped around other iterators. But maybe that would make it harder for the compiler to optimize…

It’s normal for sum of B to be slower than sum of A. B isn’t contiguous in memory: when iterating over it, we have to jump from element i to element i+2 instead of i+1, which makes it slower and disables advanced optimizations like @simd. But if you find a way to make that cost negligible, I would be glad.
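
For concreteness (the thread’s actual B may be constructed differently), something like a stride-2 view reproduces that access pattern:

A = rand(10_000)
B = @view A[1:2:end]   # every other element: the loads are no longer contiguous,
                       # so the compiler can’t use packed SIMD loads directly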

Good question! I tried this with TransducersNext.jl and it worked pretty great:

using Base.Cartesian: @nexprs
using TransducersNext
using TransducersNext: Executor, complete, next, combine, @unroll
struct BlockedCommutativeEx <: Executor end

function TransducersNext.__fold__(rf::RF, init::T, A, ::BlockedCommutativeEx) where {RF, T}
    op(x, y) = next(rf, x, y)
    inds = eachindex(A)
    i1, iN = firstindex(inds), lastindex(inds)
    n = length(inds)
    v = init
    # statically peel off the first iteration for type stability reasons if we don't have a numeric `init`
    @unroll 1 for batch in 0:(n>>4)-1
        # each iteration consumes a block of 16 elements
        i = i1 + batch*16
        @nexprs 16 N-> a_N = @inbounds A[inds[i+(N-1)]]
        # two independent 8-element `op` chains, each seeded from `init`
        # (so `init` participates in every sub-reduction)
        @nexprs 2 N-> v_N = op(
            op(op(op(op(op(op(op(init,
                                 a_{(N-1)*8+1}),
                              a_{(N-1)*8+2}),
                           a_{(N-1)*8+3}),
                        a_{(N-1)*8+4}),
                     a_{(N-1)*8+5}),
                  a_{(N-1)*8+6}),
               a_{(N-1)*8+7}),
            a_{(N-1)*8+8})
        # merge the two chains, then merge the block result into the running accumulator
        v = combine(rf, v, combine(rf, v_1, v_2))
    end
    # index of the last element handled by the blocked loop
    i = i1 + (n>>4)*16 - 1
    i == iN && return complete(rf, v)
    # sequential tail for the remaining n % 16 elements
    for i in i+1:iN
        ai = @inbounds A[inds[i]]
        v = op(v, ai)
    end
    return complete(rf, v)
end

julia> @btime fold(+, $A; executor=BlockedCommutativeEx())
  1.146 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime fold(+, $B; executor=BlockedCommutativeEx())
  767.919 ns (0 allocations: 0 bytes)
2509.5895307786136

Basically the same timings as with @mbauman’s function for A, but surprisingly it does better than Matt’s version for B:

julia> @btime reduce_mb_4(+, $A)
  1.144 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime reduce_mb_4(+, $B)
  1.110 μs (0 allocations: 0 bytes)
2509.5895307786136

What’s especially cool about doing this with TransducersNext is that you can do stuff like

executor = ChunkedEx(BlockedCommutativeEx(); chunksize=1024)

or whatever to do further outer-chunking of the algorithm in a composable way, with minimal performance loss:

julia> @btime fold(+, $B; executor=ChunkedEx(BlockedCommutativeEx(); chunksize=1024), init=0)
  869.175 ns (0 allocations: 0 bytes)
2507.819648819429

That sure is a funny way of spelling reduce 🙂

I think of transducers as changing the reduction operator, not the “executor.” The executor is a well-structured reduction loop — exactly what I’m trying to define — and can equivalently be swapped with the function call itself. Transducers themselves aren’t (easily) able to rearrange their execution. I wouldn’t expect any difference in behavior, unless you have infrastructure to unwrap SubArrays (with a transducer!) or some such in some indirection before you get to __fold__. Also: are you missing an element there? Or did you use different data?
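
To sketch the first point above (this is not the Transducers.jl API, just the idea): a transducer rewrites the reducing function, while the executor is whatever loop drives it.

mapping(f) = rf -> (acc, x) -> rf(acc, f(x))   # hypothetical helper, not a real API
rf = mapping(sin)(+)                           # rf(acc, x) == acc + sin(x)
foldl(rf, [1.0, 2.0, 3.0]; init = 0.0)         # ≈ sum(sin, [1.0, 2.0, 3.0])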

I do really like how you expose the ability to do something like ChunkedEx(BlockedCommutativeEx(); chunksize=1024), that’s very cool, and I can immediately imagine exactly how that works.

I really think Transducers have a bit of the Monad problem. It’s a great and straightforward idea, wrapped up in an abstraction and language that makes it seem more complicated than it is… and then once you start shifting compute around your potential design space becomes intractably big.

Transducers fundamentally rely upon the exact particulars of the executor; that’s something we’ve not well-defined for Base.reduce and is at the core of what I’m trying to do.


reduce was already taken 🙂

Yeah, this is one of the main things that TransducersNext.jl is trying to tackle (and maybe it needs a rename). Yes, a transducer is only about the reduction operator. However, we also have a concept of “executors” which express how the data itself is partitioned and fed into the reducing operator, and I want to bring this more into the foreground.

The idea is that I want to make it so that people writing Executors don’t need to worry about re-implementing transducers, and people writing custom transducers shouldn’t have to worry about re-implementing all the executors.

So in TransducersNext.jl, I have

  • SequentialEx (this basically does foldl)
  • SIMDEx for making a simple @simd loop
  • ChunkedEx for chunking
  • ThreadEx for multithreading partitioning
  • DistributedEx for distributed partitioning
  • KernelAbstractionsEx (under development)
  • Maybe soon a BlockedCommutativeEx

All but sequential, SIMD, and BlockedCommutative take an inner executor so you can compose them.
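
For example (assuming SIMDEx can serve as the inner executor here, which is not shown elsewhere in this thread), an outer-chunked SIMD reduction would look something like:

fold(+, Map(sin), A; executor = ChunkedEx(SIMDEx(); chunksize = 1024))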

Yes, there was a bug. Our messages crossed and I updated my comment right before you sent yours, sorry about that.

Totally agreed. I really want to find a way of simplifying this stuff down and making it less scary. I’ve had some success, but there’s a LONG way to go and it’ll require some very hard design and pedagogical decisions.

Actually, I disagree with this; they should be agnostic about the particulars of the executor, other than the usual stuff (like certain executors requiring associativity or commutativity).

For instance, here’s how the BlockedCommutativeEx works totally fine (and quick!) with the Map transducer (and is faster than mapreduce!):

julia> @btime fold(+, Map(sin), $B; executor=BlockedCommutativeEx())
  12.533 μs (0 allocations: 0 bytes)
2306.4133570501153

julia> @btime mapreduce(sin, +, $B)
  21.440 μs (0 allocations: 0 bytes)
2306.413357050116

Here’s using it with Map and Filter (though with no perf benefit here vs traditional executors):

julia> C = rand(1:10, 10000);

julia> @btime fold(+, Filter(iseven) ⨟ Map(sin), $C; executor=SIMDEx())
  31.288 μs (0 allocations: 0 bytes)
274.85264475617697

julia> @btime mapreduce(sin, +, Iterators.filter(iseven, $C))
  31.268 μs (0 allocations: 0 bytes)
274.85264475617697

Just to expand on this, if one is working with a transducer, it’s very important to properly handle when you use the innermost reducing operator (e.g. +) vs when you’re using the transformed operator (e.g. Map(sin)'(+)). One needs to account for this whenever doing re-associations, and that’s the source of the differences between my version and yours. Otherwise you end up applying sin multiple times when combining sub-reductions!
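
To spell out that pitfall with a stand-in for Map(sin)'(+):

rf(acc, x) = acc + sin(x)   # stand-in for the transformed operator Map(sin)'(+)
# rf(left_acc, right_acc) == left_acc + sin(right_acc)   # wrong: sin gets applied again
# left_acc + right_acc                                   # right: merge sub-reductions with the innermost +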

So there are subtle differences, but the general shape of the algorithm is the same.

I’m actually not thinking about associativity here, but rather the things like:

  • How init is used (or not!)
  • How the empty- and one-element cases behave
  • If op always receives the accumulator on the LHS
  • If op might receive two elements directly, or if it always takes an accumulator (or init) and an element
  • If op might receive two accumulators (or if it’s a potentially separate combine operation)
  • If op might be speculatively executed or memoized
  • … and yes, exactly as you just wrote “it’s very important to properly handle when you use the innermost reducing operator”

You can’t get any of those assumptions wrong or leave them to chance as you design your transformed op.
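
For instance, the first bullet already bites with Base’s current reduce: the docstring leaves it unspecified whether init is used at all for non-empty collections, so a non-identity init gives implementation-defined answers.

reduce(+, ones(10))                 # 10.0
reduce(+, ones(10); init = 100.0)   # 110.0 with today’s implementation, but a blocked
                                    # reduction could legally fold the init in once per block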


Ah I see what you mean. Yeah, because Transducers / TransducersNext run into these issues way more ‘catastrophically’ than regular reduce, we take a much more rigid set of rules here that all executors must follow or chaos ensues:

  • init is always used on every (sub)reduction.
  • The one-element case is done with init; the zero-element case is an error if it hits the default init.
  • rf (the transformed reducing operator, e.g. Map(sin)'(+)) always takes the accumulator on the LHS. The innermost reducing operator (e.g. +) can take two accumulators (see the edit below).
  • We don’t currently have a policy on speculative execution or memoization, but we probably should.

Edit: I should mention that we have an overloadable interface for how two accumulators are combined: combine(rf, l, r). By default, this strips all the transducers out of rf and goes down to the innermost reducing operator, but each transducer in the chain and the innermost reducing operator are both allowed to intervene and change what combine means for them.

On the other hand, it’s next(rf, acc, elem) that always gets an accumulator on the left and an element of the iterator on the right.
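
So, to summarize the entry points as they’re used in the BlockedCommutativeEx example above:

next(rf, acc, x)           # fold one element of the input into an accumulator
combine(rf, left, right)   # merge two accumulators (overloadable, per the edit above)
complete(rf, acc)          # finalization step at the end of the fold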

So, yeah, to bring this back to my original intent here: my overarching goal is to bring some sort of a “better” generic associativity (and maybe a reordering for some ops if necessary) and rules to Base.reduce in a way that can move the WIP “great pairwise reduction refactor” (JuliaLang/julia#58418) forward.

My biggest question is whether there exists a decent-enough generic “SIMD-oblivious algorithm” (which isn’t really a thing, to my knowledge) that explicitly encodes an associativity allowing sufficient use of SIMD instructions (where possible, on most platforms) to avoid major performance regressions from the status-quo @simd loops. There are many subtle reasons that have all added up to my distaste for @simd for something as fundamental as reduce. Things that can change its results include (but are not limited to): --bounds-check=yes, inlineability, type instabilities, views, transposes (and different dims thereof), generator vs. comprehension, iterator style, fma fusion, and so on.

The wonderful part is that batched pairwise and inner SIMD-like associativities tend to improve performance on modern CPUs even if they don’t use SIMD instructions thanks to their reduced data dependencies. If op itself is much more expensive than a single instruction or two, then that should dominate any sort of overhead we incur in making this more complicated than the straight loop. And by explicitly batching up the extraction (getindex/iterate) of N elements at a time, N applications of f, and then the applications of op, we can often recover the use of SIMD instructions in cases that wouldn’t have used them with @simd in the first place.
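
A sketch of that batching shape (a hypothetical helper, not the PR’s actual code), with blocks of eight and a sequential tail:

# extract 8 elements, apply f to each, reduce them pairwise with op,
# and only then touch the running accumulator
function batched_mapreduce8(f, op, A)
    n = length(A)               # assumes length(A) >= 8
    nblocks = n >> 3
    i1 = firstindex(A)
    block(i) = op(op(op(f(A[i]),   f(A[i+1])), op(f(A[i+2]), f(A[i+3]))),
                  op(op(f(A[i+4]), f(A[i+5])), op(f(A[i+6]), f(A[i+7]))))
    v = block(i1)
    for b in 1:nblocks-1
        v = op(v, block(i1 + 8b))
    end
    for i in i1 + 8nblocks : i1 + n - 1    # sequential tail for the last n % 8 elements
        v = op(v, f(A[i]))
    end
    return v
end

The point isn’t this particular shape; it’s that the re-association is spelled out explicitly rather than left to whatever @simd happens to do.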

If we can get the pairwise reduction refactor in, it’s going to cause some churn (because folks depend on its internals, bad usages of init, exact values, etc). So all the more reason to make sure its new general behaviors are working as best as they can be. Maybe it makes sense to even explicitly document its associativity — numpy documents theirs.
