Performance challenge: can you write a faster sum?

Results are of course platform-dependent; zen5 seems to prefer @turbo:

julia> @btime reduce_mb_3(+, $A); @btime reduce_mb_3(+, $B)
  440.970 ns (0 allocations: 0 bytes)
  1.053 μs (0 allocations: 0 bytes)
2475.6003f0

julia> @btime sum_turbo($A); @btime sum_turbo($B)
  58.546 ns (0 allocations: 0 bytes)
  748.756 ns (0 allocations: 0 bytes)
2475.6f0
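
(For reference, sum_turbo is presumably just a @turbo loop from LoopVectorization.jl; a minimal sketch, not necessarily the exact definition benchmarked above:)

using LoopVectorization

function sum_turbo(A)
    s = zero(eltype(A))
    @turbo for i in eachindex(A)  # @turbo picks the vector width and accumulator layout
        s += A[i]
    end
    return s
end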

julia> versioninfo()
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
2 Likes

LV seems to use the exact same order of operations here, at least at this size:

julia> sum(reshape(A,(100,100)),dims=2) == vreduce(+,reshape(A,(100,100)),dims=2)
true

julia> @benchmark sum(reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 8993 samples with 198 evaluations per sample.
 Range (min … max):  443.495 ns …  10.497 μs  ┊ GC (min … max):  0.00% … 94.60%
 Time  (median):     482.333 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   557.728 ns ± 736.524 ns  ┊ GC (mean ± σ):  12.81% ±  9.23%

     ▁█▂                                                         
  ▂▃▃███▆▆▆▅▅▅▅▅▅▅▅▆▅▅▅▅▄▅▅▄▄▄▄▅▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▂▁▁▁▂ ▃
  443 ns           Histogram: frequency by time          619 ns <

 Memory estimate: 544 bytes, allocs estimate: 3.

julia> @benchmark vreduce(+,reshape($A,(100,100)),dims=2)
BenchmarkTools.Trial: 6451 samples with 767 evaluations per sample.
 Range (min … max):  127.382 ns …   3.704 μs  ┊ GC (min … max):  0.00% … 95.52%
 Time  (median):     141.150 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   200.908 ns ± 375.622 ns  ┊ GC (mean ± σ):  25.97% ± 13.02%

  █                                                              
  █▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▂
  127 ns           Histogram: frequency by time         2.81 μs <

 Memory estimate: 544 bytes, allocs estimate: 3.

Note that for dims=2, if your array is column-major, you should SIMD across the rows.
LV should just be using the naive summation order above, which is definitely not what you want for accuracy.
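
Concretely, "SIMD across the rows" means keeping a whole column of accumulators so that the inner loop walks contiguous memory. A rough sketch (illustrative only, not what LV or Base actually emit):

function sum_dims2_colmajor(M::AbstractMatrix{T}) where {T}
    acc = zeros(T, size(M, 1))              # one accumulator per row
    for j in axes(M, 2)
        @simd for i in axes(M, 1)
            @inbounds acc[i] += M[i, j]     # contiguous loads down each column
        end
    end
    return reshape(acc, size(M, 1), 1)      # match the shape of sum(M; dims=2)
end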

2 Likes

Nice! I should’ve thought to check LV’s code. I can see that the generic vreduce((x,y)->x+y, A) uses a well-chosen N-accumulators re-ordering strategy, but it’s not as clear to me what the ordering and associativity is for the ::typeof(+) specialization — those are defined with the @turbo macros over the loop and I’ve not tried to detangle that macro (or the generated code) further.

On my architecture for a Float64 dense array, though, LV is behaving similarly (albeit not bit-exact) to my reduce_mb_3, which uses an interleaved 4x4 pattern — that is, it uses 4 accumulators, each summing four values at a time before reducing with the accumulator. That also seems to be doing quite well in general across multiple datatypes and layouts — including other iterators that are outside the scope of LV.
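
For reference, the interleaved 4x4 shape described above looks roughly like this (a hypothetical simplification of reduce_mb_3, hard-coded to + with a zero init and a serial remainder loop):

function sum_4x4(A::AbstractVector{T}) where {T}
    v_1 = v_2 = v_3 = v_4 = zero(T)
    i = firstindex(A)
    @inbounds while i + 15 <= lastindex(A)
        # each accumulator folds in a 4-element tree, strided by 4
        v_1 += (A[i]   + A[i+4]) + (A[i+8]  + A[i+12])
        v_2 += (A[i+1] + A[i+5]) + (A[i+9]  + A[i+13])
        v_3 += (A[i+2] + A[i+6]) + (A[i+10] + A[i+14])
        v_4 += (A[i+3] + A[i+7]) + (A[i+11] + A[i+15])
        i += 16
    end
    s = (v_1 + v_2) + (v_3 + v_4)
    @inbounds while i <= lastindex(A)       # serial remainder
        s += A[i]
        i += 1
    end
    return s
end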

2 Likes

I‘ve been wondering if pairwise reduction intent could be encoded via an iterator or a transducer so it could be wrapped around other iterators. But maybe that would make it harder for the compiler to optimize…

It’s normal for sum over B to be slower than over A. B isn’t contiguous in memory: when iterating over it, we have to jump from element i to element i+2 instead of i+1, which makes it slower and disables advanced optimizations like @simd. But if you find a way to make that cost negligible, I’d be glad.

Good question! I tried this with TransducersNext.jl and it worked pretty great:

using Base.Cartesian: @nexprs
using TransducersNext
using TransducersNext: Executor, complete, next, combine, @unroll
struct BlockedCommutativeEx <: Executor end

function TransducersNext.__fold__(rf::RF, init::T, A, ::BlockedCommutativeEx) where {RF, T}
    op(x, y) = next(rf, x, y)
    inds = eachindex(A)
    i1, iN = firstindex(inds), lastindex(inds)
    n = length(inds)
    v = init
    # statically peel off the first iteration for type stability reasons if we don't have a numeric `init`
    @unroll 1 for batch in 0:(n>>4)-1
        i = i1 + batch*16
        @nexprs 16 N-> a_N = @inbounds A[inds[i+(N-1)]]
        @nexprs 2 N-> v_N = op(
            op(op(op(op(op(op(op(init,
                                 a_{(N-1)*8+1}),
                              a_{(N-1)*8+2}),
                           a_{(N-1)*8+3}),
                        a_{(N-1)*8+4}),
                     a_{(N-1)*8+5}),
                  a_{(N-1)*8+6}),
               a_{(N-1)*8+7}),
            a_{(N-1)*8+8})
        v = combine(rf, v, combine(rf, v_1, v_2))
    end
    i = i1 + (n>>4)*16 - 1
    i == iN && return complete(rf, v)  # finalize on this path too, matching the return below
    for i in i+1:iN
        ai = @inbounds A[inds[i]]
        v = op(v, ai)
    end
    return complete(rf, v)
end
julia> @btime fold(+, $A; executor=BlockedCommutativeEx())
  1.146 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime fold(+, $B; executor=BlockedCommutativeEx())
  767.919 ns (0 allocations: 0 bytes)
2509.5895307786136

Basically the same timings as with @mbauman’s function for A, but surprisingly it does better than Matt’s version for B:

julia> @btime reduce_mb_4(+, $A)
  1.144 μs (0 allocations: 0 bytes)
5003.618592521285

julia> @btime reduce_mb_4(+, $B)
  1.110 μs (0 allocations: 0 bytes)
2509.5895307786136

What’s especially cool about doing this with TransducersNext is that you can do stuff like

executor = ChunkedEx(BlockedCommutativeEx(); chunksize=1024)

or whatever to do further outer-chunking of the algorithm in a composable way, with minimal performance loss:

julia> @btime fold(+, $B; executor=ChunkedEx(BlockedCommutativeEx(); chunksize=1024), init=0)
  869.175 ns (0 allocations: 0 bytes)
2507.819648819429
8 Likes

That sure is a funny way of spelling reduce 🙂

I think of transducers as changing the reduction operator, not the “executor.” The executor is a well-structured reduction loop — exactly what I’m trying to define — and can equivalently be swapped with the function call itself. Transducers themselves aren’t (easily) able to rearrange their execution. I wouldn’t expect any difference of behavior, unless you have infrastructure to unwrap SubArrays (with a transducer!) or some such in some indirection before you get to __fold__. Also: are you missing an element there? Or did you use different data?

I do really like how you expose the ability to do something like ChunkedEx(BlockedCommutativeEx(); chunksize=1024), that’s very cool, and I can immediately imagine exactly how that works.

I really think Transducers have a bit of the Monad problem. It’s a great and straightforward idea, wrapped up in an abstraction and language that makes it seem more complicated than it is… and then once you start shifting compute around your potential design space becomes intractably big.

Transducers fundamentally rely upon the exact particulars of the executor; that’s something we’ve not well-defined for Base.reduce and is at the core of what I’m trying to do.

1 Like

reduce was already taken 🙂

Yeah, this is one of the main things that TransducersNext.jl is trying to tackle (and maybe it needs a rename). Yes, a transducer is only about the reduction operator. However, we also have a concept of “executors” which express how the data itself is partitioned and fed into the reducing operator, and I want to bring this more into the foreground.

The idea is that I want to make it so that people writing Executors don’t need to worry about re-implementing transducers, and people writing custom transducers shouldn’t have to worry about re-implementing all the executors.

So in TransducersNext.jl, I have

  • SequentialEx (this basically does foldl)
  • SIMDEx for making a simple @simd loop
  • ChunkedEx for chunking
  • ThreadEx for multithreaded partitioning
  • DistributedEx for distributed partitioning
  • KernelAbstractionsEx (under development)
  • Maybe soon a BlockedCommutativeEx

All but sequential, SIMD, and BlockedCommutative take an inner executor so you can compose them.
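
For example, combining only the constructors already shown in this thread:

# Chunk the input, then run the blocked commutative reduction within each chunk.
fold(+, Map(sin), B; executor=ChunkedEx(BlockedCommutativeEx(); chunksize=1024))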

Yes, there was a bug. Our messages crossed and I updated my comment right before you sent yours, sorry about that.

Totally agreed. I really want to find a way of simplifying this stuff down and making it less scary. I’ve had some success, but there’s a LONG way to go and it’ll require some very hard design and pedagogical decisions.

Actually I disagree with this, they should be agnostic about the particulars of the executor other than the usual stuff (like certain executors requiring associativity or commutativity).

For instance, here’s how the BlockedCommutativeEx works totally fine (and quick!) with the Map transducer (and is faster than mapreduce!):

julia> @btime fold(+, Map(sin), $B; executor=BlockedCommutativeEx())
  12.533 μs (0 allocations: 0 bytes)
2306.4133570501153

julia> @btime mapreduce(sin, +, $B)
  21.440 μs (0 allocations: 0 bytes)
2306.413357050116

Here’s using it with Map and Filter (though with no perf benefit here vs traditional executors):

julia> C = rand(1:10, 10000);

julia> @btime fold(+, Filter(iseven) ⨟ Map(sin), $C; executor=SIMDEx())
  31.288 μs (0 allocations: 0 bytes)
274.85264475617697

julia> @btime mapreduce(sin, +, Iterators.filter(iseven, $C))
  31.268 μs (0 allocations: 0 bytes)
274.85264475617697
3 Likes

Just to expand on this, if one is working with a transducer, it’s very important to properly handle when you use the innermost reducing operator (e.g. +) vs when you’re using the transformed operator (e.g. Map(sin)'(+)). One needs to account for this whenever doing re-associations, and that’s the source of the differences between my version and yours. Otherwise you end up applying sin multiple times when combining sub-reductions!

So there are subtle differences, but the general shape of the algorithm is the same.
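
To make that pitfall concrete (plain closures, none of the transducer machinery; the target value is sum(sin, xs)):

# rf is the "transformed" operator: it applies sin to the incoming *element*.
rf(acc, x) = acc + sin(x)

xs = [0.1, 0.2, 0.3, 0.4]

# Correct: combine the two sub-accumulators with the *inner* operator, +.
left  = rf(rf(0.0, xs[1]), xs[2])
right = rf(rf(0.0, xs[3]), xs[4])
good  = left + right        # == sum(sin, xs)

# Wrong: combining with rf treats the right accumulator as an element,
# so sin gets applied to it a second time.
bad = rf(left, right)       # == left + sin(right)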

I’m actually not thinking about associativity here, but rather the things like:

  • How init is used (or not!)
  • How the empty- and one-element cases behave
  • If op always receives the accumulator on the LHS
  • If op might receive two elements directly, or if it always takes an accumulator (or init) and an element
  • If op might receive two accumulators (or if it’s a potentially separate combine operation)
  • If op might be speculatively executed or memoized
  • … and yes, exactly as you just wrote “it’s very important to properly handle when you use the innermost reducing operator”

You can’t get any of those assumptions wrong or leave them to chance as you design your transformed op.

1 Like

Ah, I see what you mean. Yeah, because Transducers / TransducersNext run into these issues way more ‘catastrophically’ than regular reduce, we impose a much more rigid set of rules that all executors must follow or chaos ensues:

  • init is always used on every (sub)reduction.
  • The one-element case is done with init; the zero-element case is an error if it hits the default init.
  • rf (the transformed reducing operator, e.g. Map(sin)'(+)) always takes the accumulator on the LHS. The innermost reducing operator (e.g. +) can take two accumulators.**
  • We don’t currently have a policy on speculative execution or memoization, but we probably should.

Edit: I should mention that we have an overloadable interface for how two accumulators are combined: combine(rf, l, r). By default, this strips all the transducers out of rf and goes down to the innermost reducing operator, but each transducer in the chain and the innermost reducing operator are both allowed to intervene and change what combine means for them.

On the other hand, it’s next(rf, acc, elem) that always gets an accumulator on the left and an element of the iterator on the right.
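
As a hypothetical illustration of that split (not the package’s implementation; next, combine, and complete as used upthread): a fold cut into two chunks feeds elements in with next and merges the two accumulators with combine, never with next.

function two_chunk_fold(rf, init, A)
    mid = length(A) ÷ 2
    accl = foldl((acc, x) -> next(rf, acc, x), view(A, 1:mid); init=init)
    accr = foldl((acc, x) -> next(rf, acc, x), view(A, mid+1:length(A)); init=init)
    return complete(rf, combine(rf, accl, accr))
end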

So, yeah, to bring this back to my original intent here: my overarching goal is to bring some sort of a “better” generic associativity (and maybe a reordering for some ops if necessary) and rules to Base.reduce in a way that can move the WIP pairwise reduction refactor (JuliaLang/julia PR #58418) forward.

My biggest question is if there exists a decent-enough generic “SIMD-oblivious algorithm” (which isn’t really a thing to my knowledge) that explicitly encodes an associativity that allows for sufficient use of some SIMD instructions (if possible, on most platforms) to avoid major performance regressions from the status-quo @simd loops. There are many subtle reasons that have all added up to my distaste for @simd for something as fundamental as reduce. Things that can change its results include (but are not limited to): --bounds-check=yes, inlineability, type instabilities, views, transposes (and different dims thereof), generator vs. comprehension, iterator style, fma fusion, etc., etc., etc., etc.

The wonderful part is that batched pairwise and inner SIMD-like associativities tend to improve performance on modern CPUs even if they don’t use SIMD instructions thanks to their reduced data dependencies. If op itself is much more expensive than a single instruction or two, then that should dominate any sort of overhead we incur in making this more complicated than the straight loop. And by explicitly batching up the extraction (getindex/iterate) of N elements at a time, N applications of f, and then the applications of op, we can often recover the use of SIMD instructions in cases that wouldn’t have used them with @simd in the first place.
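
A minimal sketch of that explicit batching for a batch size of 4 (not the PR’s code, just the shape of the idea):

using Base.Cartesian: @nexprs

function mapreduce_batched4(f, op, A::AbstractVector)        # assumes length(A) >= 4
    inds = eachindex(A)
    i, iN = firstindex(inds), lastindex(inds)
    @nexprs 4 k -> a_k = f(@inbounds A[inds[i + (k - 1)]])   # seed from the first batch
    v = op(op(a_1, a_2), op(a_3, a_4))
    i += 4
    while i + 3 <= iN
        @nexprs 4 k -> a_k = @inbounds A[inds[i + (k - 1)]]  # 4 extractions
        @nexprs 4 k -> b_k = f(a_k)                          # 4 applications of f
        v = op(v, op(op(b_1, b_2), op(b_3, b_4)))            # fixed op tree into the accumulator
        i += 4
    end
    while i <= iN                                            # serial remainder
        v = op(v, f(@inbounds A[inds[i]]))
        i += 1
    end
    return v
end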

If we can get the pairwise reduction refactor in, it’s going to cause some churn (because folks depend on its internals, on bad usages of init, on exact values, etc.). So all the more reason to make sure its new general behaviors work as well as they possibly can. Maybe it even makes sense to explicitly document its associativity — numpy documents theirs.

7 Likes

Hello, I just tried your sum from the PR, and I’ve generalized your ideas using @generated functions: multiple accumulators (as in your Base.jl and reduce_mb_4 versions) and tree reduction within the loop itself (as in reduce_mb_3).

Here is my attempt:

macro tree_reduce(op, prefix, N, M, stride)
    prefix_str = string(prefix)

    function build_tree(indices)
        count = length(indices)
        if count == 1
            return Symbol(prefix_str, :_, indices[1])
        elseif count == 2
            return :($op($(Symbol(prefix_str, :_, indices[1])),
                $(Symbol(prefix_str, :_, indices[2]))))
        else
            mid = count ÷ 2
            left = build_tree(indices[1:mid])
            right = build_tree(indices[mid+1:end])
            return :($op($left, $right))
        end
    end

    # Generate strided indices: N, N+stride, N+2*stride, ..., N+(M-1)*stride
    indices = [N + i * stride for i in 0:M-1]

    return esc(build_tree(indices))
end



using Base.Cartesian: @nexprs  # needed for the @nexprs calls in the generated body

@generated function reduce_abstracted(op, A, ::Val{Ntot}, ::Val{Nacc}) where {Ntot,Nacc}
    Pacc = trailing_zeros(Nacc)
    Ptot = trailing_zeros(Ntot)
    quote
        f = identity
        inds = eachindex(A)
        i1, iN = firstindex(inds), lastindex(inds)
        n = length(inds)
        @nexprs $Ntot N -> a_N = @inbounds A[inds[i1+(N-1)]]
        @nexprs $Nacc N -> v_N = @tree_reduce(op, a, N, $(cld(Ntot, Nacc)), $Nacc)

        for batch in 1:(n>>$Ptot)-1
            i = 1 + (batch << $Ptot)
            @nexprs $Ntot N -> begin
                a_N = @inbounds A[inds[i+(N-1)]]
            end
            @nexprs $Nacc N -> v_N = op(v_N, @tree_reduce(op, a, N, $(cld(Ntot, Nacc)), $Nacc))

        end
        v = @tree_reduce(op, v, 1, $Nacc, 1)
        i = i1 + (n >> $Ptot) * $Ntot - 1
        i == iN && return v
        for i in i+1:iN
            ai = @inbounds A[inds[i]]
            v = op(v, f(ai))
        end
        return v
    end
end

The Base.sum from your PR corresponds to Nacc=8 (8 accumulators) and Ntot=8 (no inner tree reduction). Your mb_3 function corresponds to Ntot=16 and Nacc=4.
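
In other words, with the reduce_abstracted above:

reduce_abstracted(+, A, Val(8), Val(8))    # roughly the PR's Base.sum strategy
reduce_abstracted(+, A, Val(16), Val(4))   # roughly reduce_mb_3's interleaved 4x4 strategy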

Here are my benchmarks:

n = 2^15
a = rand(Float64, n)
for Ntot in (4, 8, 16, 32)
    for Nacc in (1, 2, 4, 8)
        if Nacc <= Ntot
            println("n=$n, a[1:n], view(a, 1:2:n) (above, below) Ntot=$Ntot, Nacc=$Nacc")
            @btime reduce_abstracted(+, $a, $(Val(Ntot)), $(Val(Nacc)))
            @btime reduce_abstracted(+, $(view(a, 1:2:n)), $(Val(Ntot)), $(Val(Nacc)))
        end
    end
end
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=1
  3.294 μs (0 allocations: 0 bytes)
  2.475 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=2
  3.405 μs (0 allocations: 0 bytes)
  2.486 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=4, Nacc=4
  3.649 μs (0 allocations: 0 bytes)
  2.449 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=1
  4.808 μs (0 allocations: 0 bytes)
  2.130 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=2
  2.396 μs (0 allocations: 0 bytes)
  2.181 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=4
  2.147 μs (0 allocations: 0 bytes)
  4.372 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=8, Nacc=8
  3.427 μs (0 allocations: 0 bytes)
  2.225 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=1
  4.323 μs (0 allocations: 0 bytes)
  2.227 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=2
  2.061 μs (0 allocations: 0 bytes)
  2.123 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=4
  1.739 μs (0 allocations: 0 bytes)
  4.345 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=16, Nacc=8
  2.468 μs (0 allocations: 0 bytes)
  4.318 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=1
  4.236 μs (0 allocations: 0 bytes)
  2.214 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=2
  2.270 μs (0 allocations: 0 bytes)
  2.369 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=4
  1.348 μs (0 allocations: 0 bytes)
  3.978 μs (0 allocations: 0 bytes)
n=32768, a[1:n], view(a, 1:2:n) (above, below) Ntot=32, Nacc=8
  1.651 μs (0 allocations: 0 bytes)
  3.959 μs (0 allocations: 0 bytes)

For comparison, the Base.sum from your PR on Float64:

@btime Base.sum($a)
@btime Base.sum($(view(a, 1:2:n)))
3.459 μs (0 allocations: 0 bytes)
16295.798570780385

  4.040 μs (0 allocations: 0 bytes)
8128.148967556634

This differs from what I got, I guess because of the additional pairwise reduction.
Here is what the current Base.sum does on Float64:

2.488 μs (0 allocations: 0 bytes)
16410.273204764926

 6.539 μs (0 allocations: 0 bytes)
8170.225306474282

and just for completeness, the results from @simd:

a=rand(Float32, 2^15)
function simdsum(a::AbstractArray{T}) where T
    s = T(0)
    @simd for x in a
        s += x
    end
    return s
end
@btime simdsum($a)
@btime simdsum($(view(a, 1:2:2^15)))
1.058 μs (0 allocations: 0 bytes)
16345.918f0

6.549 μs (0 allocations: 0 bytes)
8140.3384f0

On my Dell machine, the current sum implementation underperforms for non-view floating-point arrays. I’m curious whether you observe similar benchmark trends on your system:

  • Val(16), Val(2): Good all-around performance across array types
  • Val(32), Val(4): Optimal performance specifically for contiguous bitstype arrays

I’ve found that @simd performs poorly on views and larger element types, but on my machine it’s unbeatable for contiguous bitstype arrays—with one exception: very small arrays benefit more from vectorized loads followed by reduction.

For context, I started a discussion on this topic a few days ago (before discovering this thread).

I also documented my initial attempt to outperform mapreduce on contiguous bitstype arrays using @simd:
https://epilliat.github.io/coding/notebooks/mapreduce_cpu_perf.html

A key finding: alignment is critical for strong @simd performance, at least on my hardware. The current Base.sum implementation doesn’t properly initialize accumulators for 1024-element blocks, which significantly degrades performance.

3 Likes