I’ve written two reductions which both do the same thing: compute the sum and the squared norm of a vector. One does this in a single pass through the data, accumulating two values as it goes. The second does two passes through the data, accumulating a single scalar value each time. The second approach, as implemented below, is currently more than 3x faster on my machine (and only for vector sizes that fit in cache). This feels like it shouldn’t be the case: the single-pass version should be faster.
using BenchmarkTools, StaticArrays
x_xx(x) = SVector(x, x * x)
xx(x) = x * x
sum1(vec::AbstractVector{T}) where T = sum(x_xx, vec; init=SVector(T(0), T(0)))
sum2(vec) = SVector(sum(vec), sum(xx, vec))
function time()
    A = randn(2^13)
    @btime sum1($A) # A single, combined vector reduction
    @btime sum2($A) # Two vector reductions
    return
end
Is there any way to improve my implementation of the single reduction to make it faster? It’s frustrating at present, because once the vector no longer fits in the cache, the single-pass approach becomes faster, so there is some point at which it is worth switching algorithms.
It’s usually that each of the N passes can use SIMD, and the SIMD speedup factor is greater than N.
A few things are unfortunately like this in Julia:
julia> using Chairmarks

julia> const a = rand(10^6);
julia> @be minimum(a)
Benchmark: 342 samples with 1 evaluation
min 264.215 μs
median 268.488 μs
mean 277.420 μs
max 698.978 μs
julia> @be extrema(a)
Benchmark: 39 samples with 1 evaluation
min 2.560 ms
median 2.569 ms
mean 2.571 ms
max 2.597 ms
Here extrema is slower than taking minimum & maximum separately.
Another example is that findmin can be slower than minimum followed by findfirst(==(min_ele), a).
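For example, on the vector a above, the two-call pattern would look like this (just showing the shape of it; I haven’t benchmarked it here):

julia> min_ele = minimum(a);          # one SIMD-friendly pass over the data

julia> findfirst(==(min_ele), a)      # a second pass just to recover the index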
From what you’re saying, I understand that there is currently no better way of doing a parallel reduction in Julia. However, I’m sure there could be: each SIMD chunk could be loaded, then each reduction applied serially, in the same way the scalar reduction is done. This way you save the extra loads, and the potential cache misses, and the rest of the code stays the same. I can see that this might require a different interface to the reduction wrapper, though.
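Something like this sketch, say with SIMD.jl (the chunk width, the handling of the tail, and the name fused_sum are all my own choices, not an existing interface):

using SIMD

# Sketch only: load each SIMD chunk once and feed it to both accumulators.
function fused_sum(A::Vector{Float64})
    N = 8
    s  = Vec{N,Float64}(0.0)            # accumulator for sum(x)
    s2 = Vec{N,Float64}(0.0)            # accumulator for sum(x^2)
    lane = VecRange{N}(0)
    i = 1
    @inbounds while i + N - 1 <= length(A)
        v = A[lane + i]                  # one vector load per chunk
        s  += v                          # first reduction uses the loaded chunk...
        s2 += v * v                      # ...and the second one reuses it
        i += N
    end
    r, r2 = sum(s), sum(s2)              # horizontal reductions
    @inbounds for j in i:length(A)       # scalar tail for leftover elements
        r  += A[j]
        r2 += A[j]^2
    end
    return (r, r2)
end

The point is just that each chunk is loaded once and then handed to every accumulator in turn.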
I’m not very surprised by this. If memory bandwidth is the limiting factor, it will typically pay to do more operations for each memory access. When everything is in cache, it’s not that big a deal.
sum2 does two passes, each with a single value being accumulated. That means it’s easy to vectorize the operations (a.k.a. SIMD), so 4 or 8 or more of the additions can be done in parallel. Also, it might be that the sum function has some specializations for such simple vectors (or mapreduce does, which sum typically calls for such vectors).
The reduction machinery that underpins sum is complicated and has lots of possible paths depending on all sorts of things, some optimizing performance, some optimizing numerical accuracy, some doing both with varying levels of success. It’s not hard to end up in a less-than-optimal path, depending on your criteria of optimal. It can definitely be improved, especially if you have strong opinions on what “optimal” means to you.
In this case, this isn’t a fundamental limitation for my processor — but it definitely might be if the “fused” operations are complicated enough.
julia> @btime sum1($A);
7.646 μs (0 allocations: 0 bytes)
julia> @btime sum2($A);
3.385 μs (0 allocations: 0 bytes)
julia> function sum3(A)
           r = zero(eltype(A))
           r2 = r*r
           @simd for elt in A
               r += elt
               r2 += elt*elt
           end
           return (r, r2)
       end
sum3 (generic function with 1 method)
julia> @btime sum3($A);
2.472 μs (0 allocations: 0 bytes)
Just a note on terminology here — parallel reduction and pairwise reduction are both fairly well established terms of art that are about the ordering of elements in the reduction and not about the operation itself. I’d call this a “fused” operation, but I’m not sure if there’s more standard terminology.
Stripping away sum also makes the problem much more obvious:
julia> function sum4(A)
           r = SVector(zero(eltype(A)), zero(eltype(A)))
           @simd for elt in A
               r += SVector(elt, elt*elt)
           end
           return r
       end
sum4 (generic function with 1 method)
julia> @btime sum4($A);
7.646 μs (0 allocations: 0 bytes)
This ends up looking very similar to sum3 in the IR, but the compiler needs to be able to identify reduction variables and handle them specially. I don’t know whether, or how, we could support such composite reduction variables right now.
I was thinking about having a separate scalar function for each reduction, such as:
f1(x) = x
f2(x) = x * x
sum(f1, f2, vec)
But if your functions need to recompute shared intermediate values, it’s not optimal (e.g. f1(x) = sin(x) and f2(x) = sin(x)^2 would both compute sin(x)), unless they’re inlined and the compiler can eliminate the redundancy. Anyway, it maybe overcomplicates things, so it’s likely not worth it.
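As a rough sketch of what that interface could do internally (msum is a made-up name, and I’m assuming for simplicity that both functions return values of the element type):

function msum(f1, f2, vec::AbstractVector{T}) where T
    r1 = zero(T)   # accumulator for f1; assumes f1 returns the element type
    r2 = zero(T)   # accumulator for f2
    @simd for x in vec
        r1 += f1(x)
        r2 += f2(x)
    end
    return (r1, r2)
end
# usage would mirror the proposed call: msum(f1, f2, vec)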
Aside from the issues discussed in this thread, I don’t see any reason to use SVector rather than Tuple. You’re not using any linear algebra functionality, so SVector is just a redundant wrapper around Tuple.
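For what it’s worth, a Tuple-based version of sum4 could use broadcasting for the accumulation (sum5 is just my name for it, and I’d expect it to hit the same vectorization issue as the SVector version):

function sum5(A)
    r = (zero(eltype(A)), zero(eltype(A)))
    @simd for elt in A
        r = r .+ (elt, elt * elt)   # tuples support broadcasted arithmetic
    end
    return r
end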
When doing a reduction, the operation is assumed to be associative. This allows the known parallel implementations. But here the crucial bit is using SIMD better, and to do that commutativity is important (it would allow reordering the summations across the SIMD lanes).
Without checking the code, I would assume the specialized sum for this case does use commutative SIMD, but generic reduce or @simd for may fail to know about and use the commutativity (and shouldn’t assume it).
Perhaps this is an example where traits can come in handy.
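For instance, a minimal sketch of the usual Holy-traits pattern applied here (all names invented for illustration; nothing like this exists in Base):

abstract type Commutativity end
struct Commutative    <: Commutativity end
struct NotCommutative <: Commutativity end

commutativity(op) = NotCommutative()          # conservative default
commutativity(::typeof(+)) = Commutative()    # opt in for known operations
commutativity(::typeof(*)) = Commutative()

# A reduction wrapper could then dispatch on the trait:
fused_reduce(op, A) = fused_reduce(commutativity(op), op, A)
fused_reduce(::Commutative, op, A)    = nothing   # free to reorder across SIMD lanes
fused_reduce(::NotCommutative, op, A) = nothing   # must preserve the given order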