OK, I’ve been fiddling with this on and off for a month now, but I’m curious if anyone can do better. Here’s the challenge: write a single generic implementation of sum(A) such that it’s fast for both A = rand(10000) and B = @view A[begin:2:end]. Currently — even though B has half the elements — B is more than 4x slower on my M1 Mac!
julia> using BenchmarkTools
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime sum($A);
946.640 ns (0 allocations: 0 bytes)
julia> @btime sum($B);
4.149 μs (0 allocations: 0 bytes)
Super bonus points if you don’t use @simd (really!), you definitely cannot use @fastmath, and no threads — but you can use @inbounds, and you can assume the array has at least some small number of elements without checking. For inspiration, here’s my current best effort; it gets us halfway between the two: I get a nice 4x speedup on the view, but a 2x regression on the array:
implementation of sum_mb_1
julia> using Base.Cartesian
julia> function sum_mb_1(A)
           op = +
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           # seed 4 independent accumulators from the first 4 elements
           # (0.0 here hardcodes Float64; later versions seed with zero(eltype(A)))
           @nexprs 4 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 4 N->v_N = 0.0 + a_N
           # main loop: 4 elements per batch, one into each accumulator
           for batch in 1:(n>>2)-1
               i = i1 + batch*4
               @nexprs 4 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 4 N->fa_N = f(a_N)
               @nexprs 4 N->v_N = op(v_N, fa_N)
           end
           v = op(op(v_1, v_2), op(v_3, v_4))
           # sequential cleanup of the remaining n % 4 elements
           i = i1 + (n>>2)*4 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
sum_mb_1 (generic function with 1 method)
The context here, of course, is WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub. If we could hardcode an effective @simd pattern that’s close to optimal for most architectures, it would solve issues like the fact that mean(A, dims=1) is significantly different from mean(A', dims=2). Obviously this will be highly architecture-sensitive, but the goal is to unify those two behaviors within a single architecture first, and then we can experiment with whether that can be made to work on the most common systems.
Thanks to the rubber duck phenomenon (I don’t know why I was so convinced my M1 had 256-bit vector widths here…), I noticed a trivial improvement here for Float64 on my M1: with 128-bit NEON registers, 8 Float64 bins is four vector accumulators. I’m curious how this extends to other architectures and datatypes (of course we probably need 16 bins for optimal Float32).
sum_mb_2
julia> function sum_mb_2(A)
           op = +
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           @nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 8 N->v_N = 0.0 + a_N
           for batch in 1:(n>>3)-1
               i = i1 + batch*8
               @nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 8 N->fa_N = f(a_N)
               @nexprs 8 N->v_N = op(v_N, fa_N)
           end
           v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
           i = i1 + (n>>3)*8 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
sum_mb_2 (generic function with 1 method)
So, essentially reversed (B is 2x slower than A). Maybe your chip can gather better? Also notice that neither is tremendously improved over a plain sum baseline (EDIT: for me, anyway).
Have you looked at the code_native? Sometimes, if you can spot something dumb in the inner loop, it can give an insight into how to improve the source. Although for B the code_native I’m seeing isn’t trivial to follow…
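For example, with the standard InteractiveUtils macros (already loaded in the REPL; the output will obviously differ by machine):

julia> @code_llvm debuginfo=:none sum_mb_2(B)    # is the inner loop getting vectorized at all?

julia> @code_native debuginfo=:none sum_mb_2(B)  # what actually runs on the chip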
Honestly, that’s great news! I’m quite happy to simply preserve the status-quo performance. It’s fascinating that the current sum on x86 is far less pessimized in the strided-view case… but the actual use case where I’m targeting significant improvements is more complicated arrays (e.g., dimensional sums) and arbitrary iterators. I’m also not surprised that this varies so much by architecture.
And, yeah, I had been looking extensively at both LLVM and native assembly. It was in trying to emulate the code_llvm of the naive @simd loop that I became convinced I should be targeting 4 bins. Only by stepping away from it did I realize I should try 8 bins.
Whoa, that’s fun. I still don’t like how @simd can reach inside an unrelated function and fuse an unrelated multiplication with the reduction into an fma. That behavior seems not great for a generic reduction framework — there’s no way that re-associating (or even re-ordering) this summation could give you a nonzero value.
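To make the contraction issue concrete with a toy stand-in (not that exact summation, just the same failure mode): every term of x*y - x*y is exactly 0.0, so no re-association or re-ordering of the sum can make it nonzero, but fusing one of the multiplies into an fma with the subtraction leaves the rounding error of each product behind, and those errors accumulate:

julia> term_exact(x, y) = x*y - x*y;               # exactly 0.0 for every x, y

julia> term_contracted(x, y) = fma(x, y, -(x*y));  # the rounding error of x*y instead

julia> xs, ys = rand(1000), rand(1000);

julia> sum(term_exact(x, y) for (x, y) in zip(xs, ys))
0.0

julia> sum(term_contracted(x, y) for (x, y) in zip(xs, ys));  # small but (almost surely) nonzero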
I believe the difference is that it creates a pairwise summation, which has a dedicated instruction on aarch64 (faddp). I’m not sure why it isn’t doing that for the normal version, but with some tweaking you might be able to get it there (the semantics are all very odd here, so it might not recognize that you’re doing a proper pairwise reduction).
Ahhhh, yes, it’s the combination of the explicit re-ordering and a reassoc annotation on the fadd that does the trick!
julia> add_reassoc(x::Float64, y::Float64) = Base.llvmcall("""
           %3 = fadd reassoc double %0, %1
           ret double %3""", Float64, Tuple{Float64, Float64}, x, y)
add_reassoc (generic function with 1 method)
julia> function reduce_mb_2(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           @nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 8 N->v_N = op(zero(eltype(A)), a_N)
           for batch in 1:(n>>3)-1
               i = i1 + batch*8
               @nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 8 N->fa_N = f(a_N)
               @nexprs 8 N->v_N = op(v_N, fa_N)
           end
           v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
           i = i1 + (n>>3)*8 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_2 (generic function with 1 method)
julia> @btime reduce_mb_2(+, $A);
1.150 μs (0 allocations: 0 bytes)
julia> @btime reduce_mb_2(add_reassoc, $A);
782.048 ns (0 allocations: 0 bytes)
julia> @btime sum_mb_2_simd($A); # implemented as you'd expect
782.226 ns (0 allocations: 0 bytes)
julia> @btime foldl(add_reassoc, $A); # the naive loop doesn't benefit from reassoc alone
1.142 μs (0 allocations: 0 bytes)
Unfortunately, this doesn’t magically solve the not-enough-bins problem for the smaller datatypes.
Edit: It’s also interesting that this gets the exact same performance as the @simd version, but — unlike the @simd version — it’s not using faddp, even with the reassoc annotation.
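One quick way to check for faddp (just plain InteractiveUtils, nothing fancy) is to dump the native code to a string and search it:

julia> io = IOBuffer();

julia> code_native(io, reduce_mb_2, (typeof(add_reassoc), Vector{Float64}); debuginfo=:none);

julia> occursin("faddp", String(take!(io)))  # false for me, matching the above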
Here we go, this rearrangement seems to be doing even better without the need for a :reassoc flag, for both my M1 and an Icelake server. It’s faster than master for 64-bit, and slightly slower for 32-bit. Might be a workable approach.
julia> function reduce_mb_3(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           # seed 4 bins from the first 16 elements; bin N takes elements N, N+4, N+8, N+12
           @nexprs 16 N->a_N = @inbounds A[inds[i1+(N-1)]]
           @nexprs 4 N->v_N = op(zero(eltype(A)), op(op(a_N, a_{N+4}), op(a_{N+8}, a_{N+12})))
           # main loop: 16 elements per batch, interleaved into the 4 bins
           # (this re-ordering assumes op is commutative)
           for batch in 1:(n>>4)-1
               i = i1 + batch*16
               @nexprs 16 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 16 N->fa_N = f(a_N)
               @nexprs 4 N->v_N = op(v_N, op(op(fa_N, fa_{N+4}), op(fa_{N+8}, fa_{N+12})))
           end
           v = op(op(v_1, v_2), op(v_3, v_4))
           # sequential cleanup of the remaining n % 16 elements
           i = i1 + (n>>4)*16 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_3 (generic function with 1 method)
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime reduce(+, $A)
943.320 ns (0 allocations: 0 bytes)
4990.955387266911
julia> @btime reduce_mb_3(+, $A)
580.357 ns (0 allocations: 0 bytes)
4990.955387266911
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_3(+, $A)
568.919 ns (0 allocations: 0 bytes)
4958.589f0
julia> @btime reduce(+, $A)
486.749 ns (0 allocations: 0 bytes)
4958.5903f0
Note that your construction uses commutativity, not just associativity. So it’s not suitable for general map-reduce. Even restricting to <:Number, this will produce wrong results for e.g. the product of quaternions (which are <: Number).
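You can see the re-ordering it relies on with an associative-but-not-commutative op like string concatenation, mimicking the 16-element bin grouping of reduce_mb_3 by hand:

julia> s = string.('a':'p');   # 16 distinct "elements"

julia> bins = [(s[N] * s[N+4]) * (s[N+8] * s[N+12]) for N in 1:4];

julia> (bins[1] * bins[2]) * (bins[3] * bins[4])   # a permutation of the inputs
"aeimbfjncgkodhlp"

julia> join(s)   # what any re-association of the in-order reduction would give
"abcdefghijklmnop"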
In view of that restriction, is there still a reason you want a single piece of code, instead of optimizing the most pertinent cases separately?
Right, this certainly can’t be used for products or for arbitrary reduction operators. But we do have more operators than + for which we assume commutativity — min/max/extrema and bitwise arithmetic — and this would only apply there. We should probably document that assumption more prominently for those functions if we do this.
The prod case is unfortunate though — that’s currently getting the benefit of @simd but we can’t similarly pre-emptively dispatch to the re-ordered algorithm without also hardcoding eltypes and a few mapping functions.
I’m not sure how important it is to squeeze out maximum speed for prod, especially for prod of ordinary Number types. For large arrays, it tends to overflow/underflow, so you rarely use it in that case. And for products of very small collections, you will often want to use it with tuples so that it unrolls. For collections of strings, it is better to use join.
Is there a halfway-comprehensible explanation for why this is the case even with a very simple @simd loop? I.e.:
julia> function simple_sum(A; init=zero(eltype(A)))
           acc = init
           @simd for x ∈ A
               acc += x
           end
           acc
       end;
julia> let A = rand(10_000)
           B = @view A[1:2:end]
           @btime simple_sum($A)
           @btime simple_sum($B)
       end;
500.896 ns (0 allocations: 0 bytes)
11.873 μs (0 allocations: 0 bytes)
Is this a problem in the LLVM vectorizer, or are we missing some information to pass on to LLVM, or something?
I would love to better diagnose the remainder, however! That said, this is still a relatively special case — the CPU may be able to directly load from memory into the SIMD registers with a begin:2:end sort of stride. That’s not possible for a general @view A[1, :] sort of access (think a reduction with dims=2) or even a @view A[shuffle(begin:end)], but I suspect there would still be performance advantages to a SIMD-like reordering.
It’s really interesting to me that we can get better performance than the naive @simd loop without a parallel multiple-bins type of re-ordering (for Float64, on my arch). It’s still not as fast as it could be… and using multiple bins (if op is commutative, of course) gives us yet better performance and greater numerical stability at the same time.
reduce_mb_4
julia> function reduce_mb_4(op, A)
           f = identity
           inds = eachindex(A)
           i1, iN = firstindex(inds), lastindex(inds)
           n = length(inds)
           v = zero(eltype(A))
           for batch in 0:(n>>4)-1
               i = i1 + batch*16
               @nexprs 16 N->a_N = @inbounds A[inds[i+(N-1)]]
               @nexprs 16 N->fa_N = f(a_N)
               # two in-order left folds of 8 elements each: only the association
               # changes relative to a plain left fold; the element order is preserved
               @nexprs 2 N->v_N = op(op(op(op(op(op(op(fa_{(N-1)*8+1}, fa_{(N-1)*8+2}), fa_{(N-1)*8+3}), fa_{(N-1)*8+4}), fa_{(N-1)*8+5}), fa_{(N-1)*8+6}), fa_{(N-1)*8+7}), fa_{(N-1)*8+8})
               v = op(v, op(v_1, v_2))
           end
           # sequential cleanup of the remaining n % 16 elements
           i = i1 + (n>>4)*16 - 1
           i == iN && return v
           for i in i+1:iN
               ai = @inbounds A[inds[i]]
               v = op(v, f(ai))
           end
           return v
       end
reduce_mb_4 (generic function with 1 method)
julia> function sum_simd(A)
           r = zero(eltype(A))
           @simd for i in eachindex(A)
               r += @inbounds A[i]
           end
           r
       end
sum_simd (generic function with 1 method)
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_4(+, $A)
733.008 ns (0 allocations: 0 bytes)
4977.12180940769
julia> @btime sum_simd($A)
1.142 μs (0 allocations: 0 bytes)
4977.121809407697
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_4(+, $A)
924.029 ns (0 allocations: 0 bytes)
5002.0093f0
julia> @btime sum_simd($A)
561.935 ns (0 allocations: 0 bytes)
5002.0117f0
Here’s my overarching thought: foldl and foldr have well-defined associativity; we get to choose the associativity of reduce. Right now, we sometimes use a pairwise algorithm with a relatively large block size (1024) over an @simd loop (which may itself further split into parallel accumulators, depending on architecture, element type, bounds checking, type stability, indexability, and data layout).
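For concreteness, the two folds pin down opposite associations, while reduce is documented to be free to pick either (or something else entirely):

julia> foldl(-, 1:4)   # ((1 - 2) - 3) - 4
-8

julia> foldr(-, 1:4)   # 1 - (2 - (3 - 4))
-2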
Why leave all that up to chance, especially if we can simultaneously improve both performance and accuracy? If we explicitly encode an associativity (or perhaps an additional re-ordering for commutative operators) that performs well for these core numeric types and leads to a more consistent (and perhaps smaller) base block size, then my hope is that it will also perform generically well for non-simdy datatypes. Many of the same fundamentals that enable SIMD also tend to improve non-SIMD code on modern pipelined CPUs (e.g., by reducing inter- and intra-loop data dependencies). Or that’s my hope and hypothesis (hope-othesis?), anyhow. If it turns out not to help, foldl is right there for you.
I don’t know if anyone has ever published an analysis of something like reduce_mb_4 as the base case for a pairwise splitting algorithm — it preserves ordering and has an inner reduction of 8 elements at a time.
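Roughly, that shape would be something like this (just a sketch, not the PR’s implementation; blocksize=256 is an arbitrary placeholder, and it leans on reduce_mb_4 handling short or empty ranges via its cleanup loop):

julia> function pairwise_sketch(op, A, i1=firstindex(A), iN=lastindex(A); blocksize=256)
           if iN - i1 + 1 <= blocksize
               # base case: the fixed-order, 16-at-a-time kernel from above
               return reduce_mb_4(op, view(A, i1:iN))
           else
               # pairwise split; the "combine" op only runs O(n / blocksize) times
               mid = (i1 + iN) >> 1
               return op(pairwise_sketch(op, A, i1, mid; blocksize=blocksize),
                         pairwise_sketch(op, A, mid+1, iN; blocksize=blocksize))
           end
       end;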
Some ideas I have found are:
A reduce-shift bottom-up SIMD-y fully pairwise implementation (Dalton, Wang & Blainey, 2014). This is very cool, but it requires a 64-element intermediate stack-allocated mutable array.
The pairwise “combine” operation could additionally be compensated, since it happens less frequently (Blanchard, Higham & Mary, 2020). Allowing a distinct combine operation could be quite powerful in general; it’d enable a whole other class of transducers.
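For what it’s worth, a compensated combine could be as simple as carrying a (sum, error) pair per block and paying for a TwoSum only at the (rare) combine step. A rough sketch of that idea, not taken from the paper:

julia> struct SumErr{T}
           hi::T   # running sum
           lo::T   # accumulated rounding error
       end

julia> function combine(a::SumErr, b::SumErr)
           s = a.hi + b.hi
           t = s - a.hi
           e = (a.hi - (s - t)) + (b.hi - t)   # Knuth's TwoSum error term
           SumErr(s, a.lo + b.lo + e)
       end;

julia> total(c::SumErr) = c.hi + c.lo;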