OK, I’ve been fiddling with this on and off for a month now, but I’m curious if anyone can do better. Here’s the challenge: write a single generic implementation of sum(A) such that it’s fast for both A = rand(10000) and B = @view A[begin:2:end]. Currently, even though B has half the elements, B is more than 4x slower on my M1 Mac!
julia> using BenchmarkTools
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime sum($A);
946.640 ns (0 allocations: 0 bytes)
julia> @btime sum($B);
4.149 μs (0 allocations: 0 bytes)
Super bonus points if you don’t use @simd (really!); you definitely cannot use @fastmath, and no threads, but you can use @inbounds, and you can assume the array has at least some small number of elements without checking. For inspiration, here’s my current best effort; it gets us halfway between the two: a nice 4x speedup on the view, but a 2x regression on the array:
implementation of sum_mb_1
julia> using Base.Cartesian
julia> function sum_mb_1(A)
op = +
f = identity
inds = eachindex(A)
i1, iN = firstindex(inds), lastindex(inds)
n = length(inds)
@nexprs 4 N->a_N = @inbounds A[inds[i1+(N-1)]]
@nexprs 4 N->v_N = 0.0 + a_N
for batch in 1:(n>>2)-1
i = i1 + batch*4
@nexprs 4 N->a_N = @inbounds A[inds[i+(N-1)]]
@nexprs 4 N->fa_N = f(a_N)
@nexprs 4 N->v_N = op(v_N, fa_N)
end
v = op(op(v_1, v_2), op(v_3, v_4))
i = i1 + (n>>2)*4 - 1
i == iN && return v
for i in i+1:iN
ai = @inbounds A[inds[i]]
v = op(v, f(ai))
end
return v
end
sum_mb_1 (generic function with 1 method)
julia> @btime sum_mb_1($A);
2.259 μs (0 allocations: 0 bytes)
julia> @btime sum_mb_1($B);
1.142 μs (0 allocations: 0 bytes)
The context here, of course, is WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub. If we could hardcode an effective @simd pattern that’s close to optimal for most architectures, it would solve issues like mean(A, dims=1) behaving significantly differently from mean(A', dims=2). Obviously this will be highly architecture-sensitive, but the goal is to unify these two behaviors within a single architecture first, and then we can experiment with whether it’s possible to make that work on the most common systems.
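For a concrete, if illustrative, example of the kind of comparison I mean (the names and sizes here are just for demonstration), the two calls below compute the same column means but currently go down quite different reduction paths:
using Statistics, BenchmarkTools
M = rand(1000, 1000)
@btime mean($M, dims=1)     # reduce down each contiguous column of M
@btime mean($(M'), dims=2)  # the same means, computed through the lazy-transposed layout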
13 Likes
I feel like we should have started nerdsniping Discourse into making Base Julia (rather than random user code) faster years ago!
12 Likes
Thanks to the rubber-duck phenomenon (I don’t know why I was so convinced my M1 had 256-bit vector widths here…), I noticed a trivial improvement here for Float64 on my M1: use 8 accumulators instead of 4. I’m curious how this extends to other architectures and datatypes (of course we probably need 16 bins for optimal Float32).
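As a back-of-the-envelope check (assuming 128-bit NEON vectors on the M1, which is my working assumption here), the bin counts line up with four registers’ worth of accumulators:
lanes(T) = 128 ÷ (8 * sizeof(T))   # elements per 128-bit vector register
lanes(Float64), lanes(Float32)     # (2, 4), so four registers give 8 Float64 or 16 Float32 bins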
sum_mb_2
julia> function sum_mb_2(A)
op = +
f = identity
inds = eachindex(A)
i1, iN = firstindex(inds), lastindex(inds)
n = length(inds)
@nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
@nexprs 8 N->v_N = 0.0 + a_N
for batch in 1:(n>>3)-1
i = i1 + batch*8
@nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
@nexprs 8 N->fa_N = f(a_N)
@nexprs 8 N->v_N = op(v_N, fa_N)
end
v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
i = i1 + (n>>3)*8 - 1
i == iN && return v
for i in i+1:iN
ai = @inbounds A[inds[i]]
v = op(v, f(ai))
end
return v
end
sum_mb_2 (generic function with 1 method)
julia> @btime sum_mb_2($A);
1.137 μs (0 allocations: 0 bytes)
julia> @btime sum_mb_2($B);
615.000 ns (0 allocations: 0 bytes)
1 Like
Are you sure your sum_mb_2 benchmark is correct? Relative to yours, my Intel x86 gives
julia> @btime sum_mb_2($A); @btime sum_mb_2($B);
493.093 ns (0 allocations: 0 bytes)
1.184 μs (0 allocations: 0 bytes)
julia> @btime sum($A); @btime sum($B);
540.043 ns (0 allocations: 0 bytes)
1.479 μs (0 allocations: 0 bytes)
so essentially reversed (B is about 2x slower than A). Maybe your chip can gather better? Also notice that neither is tremendously improved over a plain sum baseline. EDIT: for me, anyway.
Have you looked at the code_native? Sometimes, if you can see something dumb in the inner loop, it can give an insight into how to improve the source. Although for B the code_native I’m seeing isn’t trivial to follow…
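Concretely, I mean something along these lines (dropping debug info makes the loop body easier to scan):
using InteractiveUtils  # provides @code_native (already loaded in the REPL)
@code_native debuginfo=:none sum_mb_2(B)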
1 Like
Are you using
f = identity
...
fa_N = f(a_N)
for performance reasons, or to make the code more generic so that it could be turned into a general mapreduce function in the future?
Honestly, that’s great news! I’m quite happy to simply preserve the status-quo performance. It’s fascinating that the current sum on x86 is far less pessimized in the strided-view case… but the actual use case I’m targeting for significant improvements is more complicated arrays (e.g., dimensional sums) and arbitrary iterators. I’m also not surprised that this varies so much by architecture.
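For concreteness, the kinds of calls I have in mind look like these (purely illustrative):
A3 = rand(100, 100, 100)
sum(A3; dims=2)               # a dimensional sum: strided, non-contiguous access per output element
sum(x^2 for x in 1:10_000)    # an arbitrary iterator with no indexing at all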
And, yeah, I had been looking extensively at both the LLVM IR and the native assembly. It was in trying to emulate the code_llvm of the naive @simd loop that I became convinced I should be targeting 4 bins. Only by stepping away from it did I realize I should try 8 bins.
1 Like
It’s the latter, a no-perf-impact simplification that allowed for an easy extraction from WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub.
The SIMD annotations do help, though. With @simd, your version is faster than sum, which uses @simd inside.
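For reference, the kind of “@simd inside” loop being compared against is essentially this textbook pattern (a sketch, not Base’s exact implementation):
function sum_simd(A)
    v = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        v += A[i]
    end
    return v
end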
2 Likes
Whoa, that’s fun. I still don’t like how @simd can reach inside an unrelated function and fuse an unrelated multiplication with the reduction into an fma. This current behavior seems not great for a generic reduction framework; there’s no way that re-associating (or even re-ordering) this summation could give you a nonzero value:
julia> sum(identity, [0.0; -.1; .1; zeros(13)])
0.0
julia> sum(x->x*0.5, [0.0; -.1; .1; zeros(13)])
0.0
julia> sum(x->x*0.1, [0.0; -.1; .1; zeros(13)])
-8.326672684688674e-19
julia> sum(@noinline(x->x*0.1), [0.0; -.1; .1; zeros(13)])
0.0
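The mechanism is visible in isolation, too: an fma feeds the unrounded product into the accumulation, so the ±0.1 contributions no longer cancel exactly. A standalone illustration (mine, not taken from the reduction code):
(-0.1 * 0.1) + (0.1 * 0.1)          # exactly 0.0: each product is rounded before the add, so they cancel
fma(0.1, 0.1, fma(-0.1, 0.1, 0.0))  # tiny nonzero residual: the second product enters the add unrounded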
I do wonder which of the extra liberties from the simdloop annotation are making the difference in the @simd-annotated sum_mb_2.
I believe the difference is that it generates the pairwise summation, which has a dedicated instruction on AArch64 (faddp). I’m not sure why it’s not doing that for the normal version, but with some tweaking you might be able to get it there (the semantics are all very odd here, so it might not recognize that you’re doing a proper pairwise reduction).
1 Like
Ahhhh, yes, it’s the combination of the explicit re-ordering and a reassoc annotation on the fadd that does the trick!
julia> add_reassoc(x::Float64,y::Float64) = Base.llvmcall("""
%3 = fadd reassoc double %0, %1
ret double %3""", Float64, Tuple{Float64, Float64}, x, y)
add_reassoc (generic function with 1 method)
julia> function reduce_mb_2(op, A)
f = identity
inds = eachindex(A)
i1, iN = firstindex(inds), lastindex(inds)
n = length(inds)
@nexprs 8 N->a_N = @inbounds A[inds[i1+(N-1)]]
@nexprs 8 N->v_N = op(zero(eltype(A)), a_N)
for batch in 1:(n>>3)-1
i = i1 + batch*8
@nexprs 8 N->a_N = @inbounds A[inds[i+(N-1)]]
@nexprs 8 N->fa_N = f(a_N)
@nexprs 8 N->v_N = op(v_N, fa_N)
end
v = op(op(op(v_1, v_2), op(v_3, v_4)), op(op(v_5, v_6), op(v_7, v_8)))
i = i1 + (n>>3)*8 - 1
i == iN && return v
for i in i+1:iN
ai = @inbounds A[inds[i]]
v = op(v, f(ai))
end
return v
end
reduce_mb_2 (generic function with 1 method)
julia> @btime reduce_mb_2(+, $A);
1.150 μs (0 allocations: 0 bytes)
julia> @btime reduce_mb_2(add_reassoc, $A);
782.048 ns (0 allocations: 0 bytes)
julia> @btime sum_mb_2_simd($A); # implemented as you'd expect
782.226 ns (0 allocations: 0 bytes)
julia> @btime foldl(add_reassoc, $A); # the naive loop doesn't benefit from reassoc alone
1.142 μs (0 allocations: 0 bytes)
Unfortunately, this doesn’t magically solve the not-enough-bins problem for the smaller datatypes.
Edit: It’s also interesting that this gets exactly the same performance as the @simd version, yet (unlike the @simd version) it’s not using addp, even with the reassoc annotation.
1 Like
Here we go: this rearrangement seems to do even better without the need for a reassoc flag, on both my M1 and an Ice Lake server. It’s faster than master for 64-bit floats and slightly slower for 32-bit. Might be a workable approach.
julia> function reduce_mb_3(op, A)
f = identity
inds = eachindex(A)
i1, iN = firstindex(inds), lastindex(inds)
n = length(inds)
@nexprs 16 N->a_N = @inbounds A[inds[i1+(N-1)]]
@nexprs 4 N->v_N = op(zero(eltype(A)), op(op(a_N, a_{N+4}), op(a_{N+8}, a_{N+12})))
for batch in 1:(n>>4)-1
i = i1 + batch*16
@nexprs 16 N->a_N = @inbounds A[inds[i+(N-1)]]
@nexprs 16 N->fa_N = f(a_N)
@nexprs 4 N->v_N = op(v_N, op(op(fa_N, fa_{N+4}), op(fa_{N+8}, fa_{N+12})))
end
v = op(op(v_1, v_2), op(v_3, v_4))
i = i1 + (n>>4)*16 - 1
i == iN && return v
for i in i+1:iN
ai = @inbounds A[inds[i]]
v = op(v, f(ai))
end
return v
end
reduce_mb_3 (generic function with 1 method)
julia> A = rand(10000); B = @view A[begin:2:end];
julia> @btime reduce(+, $A)
943.320 ns (0 allocations: 0 bytes)
4990.955387266911
julia> @btime reduce_mb_3(+, $A)
580.357 ns (0 allocations: 0 bytes)
4990.955387266911
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_3(+, $A)
568.919 ns (0 allocations: 0 bytes)
4958.589f0
julia> @btime reduce(+, $A)
486.749 ns (0 allocations: 0 bytes)
4958.5903f0
5 Likes
Very nice stuff, @mbauman. It seems even faster than LoopVectorization for strided arrays, which is nice:
julia> A = rand(Float32, 10000); B = @view A[begin:2:end];
julia> @btime reduce_mb_3(+, $A); @btime reduce_mb_3(+, $B)
578.093 ns (0 allocations: 0 bytes)
618.254 ns (0 allocations: 0 bytes)
2489.102f0
julia> @btime sum_turbo($A); @btime sum_turbo($B)
296.456 ns (0 allocations: 0 bytes)
1.188 μs (0 allocations: 0 bytes)
2489.1023f0
where sum_turbo is defined as:
julia> using LoopVectorization
julia> function sum_turbo(A)
v = zero(eltype(A))
@turbo for i in eachindex(A)
v += A[i]
end
v
end
sum_turbo (generic function with 1 method)
Note that your construction uses commutativity, not just associativity: each bin collects interleaved elements (1, 5, 9, 13, … in the first bin), so combining the bins at the end reorders the operands. That makes it unsuitable for a general map-reduce. Even restricting to <:Number, this will produce wrong results for, e.g., the product of quaternions (which are <:Number).
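For instance (assuming the Quaternions.jl package is available), even swapping two factors already changes a quaternion product, so any binning that permutes the operands can change the result:
using Quaternions            # Quaternion <: Number, but * is not commutative
q1 = Quaternion(0, 1, 0, 0)  # the quaternion i
q2 = Quaternion(0, 0, 1, 0)  # the quaternion j
q1 * q2 == q2 * q1           # false: i*j == k, but j*i == -k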
In view of that restriction, is there still a reason you want a single piece of code, instead of optimizing the most pertinent cases separately?
1 Like