The foldl call sums the numbers strictly from left to right whereas sum uses pairwise summation, which is much more accurate and also faster. Doing a left-to-right SIMD summation would be even faster but not as accurate (although slightly more accurate than strict left-to-right). Floating-point summation is an unexpectedly deep and tricky problem.
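For reference, pairwise summation is roughly the following divide-and-conquer scheme. This is a minimal sketch, not Base's actual implementation (which uses a larger blocked base case and SIMD), and the cutoff of 128 is just an illustrative choice:

# Minimal sketch of pairwise summation, not Base's actual code.
# Small ranges are summed with a plain left-to-right loop; larger
# ranges are split in half and summed recursively, so worst-case
# rounding error grows like O(log n) instead of O(n).
function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a))
    if hi - lo < 128                 # base case: left-to-right loop
        s = zero(eltype(a))
        for i in lo:hi
            s += a[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1            # midpoint without overflow
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid + 1, hi)
end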
Pairwise summation generally isn’t faster than left-to-right summation; they are about the same speed if the base case of the pairwise algorithm (which performs left-to-right summation!) is large enough. I don’t think the timing differences here are significant (they might be due to timing fluctuations or compiler hiccups).
(Sometimes divide-and-conquer versions of algorithms gain a speed advantage from improved cache locality, but that isn’t the case for summation since each element is used exactly once and the left-to-right algorithm already maximizes spatial locality.)
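To see the accuracy side concretely, one way is to compare each strategy against a higher-precision reference. This is a sketch; the exact error magnitudes depend on the data:

a = rand(Float32, 10^7)
ref = sum(Float64.(a))               # accumulate in Float64 as a reference
err_foldl = abs(foldl(+, a) - ref)   # strict left-to-right accumulation
err_sum   = abs(sum(a) - ref)        # pairwise summation, typically far smaller error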
Note that similar timings appeared in the original post; I would not have guessed this. It makes sense, though: strict left-to-right summation cannot use SIMD, since vectorizing requires reassociating the sum. I haven’t looked into it deeply, but my quick take is that:
foldl(+, a) is slowest because strict left-to-right association prevents SIMD
sum(a) allows SIMD, which gives a big boost, but the pairwise recursion has some overhead compared to a simpler algorithm, so you net only about a 2x speedup over foldl(+, a)
mysum(a) is fastest: simple left-to-right summation, but allowing just enough reassociation to use SIMD. For example:
function naivesum(a)
    s = zero(eltype(a))   # accumulator matching the element type
    @simd for x in a      # @simd lets the compiler reassociate the sum
        s += x
    end
    return s
end
which is mostly left-to-right summation but is not strictly left-to-right at a fine-grained level because @simd allows some reassociation. This achieves basically the same performance as sum:
julia> @btime sum($a);
5.047 ms (0 allocations: 0 bytes)
julia> @btime naivesum($a);
5.166 ms (0 allocations: 0 bytes)
What’s fun is that @simd typically improves accuracy, too, as it’ll use 2-8 intermediate “buckets” instead of just one. Not as good or as robust as pairwise, but generally better than the non-simd naive thing since, assuming all values are positive, each bucket will have 1-3 more bits of precision.
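To make the “buckets” idea concrete, here’s a scalar caricature of what the vectorized loop effectively computes, with four independent accumulators standing in for SIMD lanes (a sketch; it assumes length(a) is a multiple of 4):

# Scalar caricature of SIMD "buckets": four independent accumulators,
# combined at the end. Assumes length(a) is a multiple of 4.
function bucketsum4(a)
    s1 = s2 = s3 = s4 = zero(eltype(a))
    for i in 1:4:length(a)
        s1 += a[i]      # each "bucket" sees only every 4th element,
        s2 += a[i+1]    # so each partial sum stays smaller and
        s3 += a[i+2]    # accumulates less rounding error
        s4 += a[i+3]
    end
    return (s1 + s2) + (s3 + s4)  # combine the buckets pairwise
end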
Oh, sure enough! I just assumed that 512÷64 == 8 would mean 8 buckets for Float64 on AVX512, but LLVM indeed is loading and adding four <8 x double> vectors in each inner loop iteration. I suppose this is just a small loop-unrolling micro-optimization since it has the registers available to do so? That’s cool.
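In case anyone wants to check this themselves, you can dump the LLVM IR for the loop and look for the <8 x double> loads and fadds (using the naivesum defined above; the exact IR will depend on your CPU and Julia version):

julia> a = rand(Float64, 10^4);

julia> @code_llvm debuginfo=:none naivesum(a)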
As I understand it, the reason to use more than 8 buckets is that with 32 you only touch each register every 4 additions, which matters because the latency of the add instruction is worse than its throughput.
Exactly. If you don’t unroll, each add instruction has to wait for the previous one to finish before it can start executing, so you can’t take advantage of superscalar parallelism.
If you have a theoretical throughput of 2 add instructions / clock cycle and a latency of 4 clock cycles, you could have 8 add instructions happening at a time.
Meaning unrolling up to 8x could increase performance in that example.
But a sum is unlikely to be hitting theoretical peak performance anyway, so 4 is a good compromise (especially if you handle the remainder with a scalar loop).
LoopVectorization lets you use an unroll argument to manually specify the unroll factor (on top of SIMD):
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(512);
julia> @generated function mysumavx(a, ::Val{U}) where {U}
           quote
               s = zero(eltype(a))
               @avx unroll=$U for i ∈ eachindex(a)
                   s += a[i]
               end
               s
           end
       end
mysumavx (generic function with 2 methods)
julia> for U ∈ 1:8
           @btime mysumavx($x, $(Val(U)))
       end
38.540 ns (0 allocations: 0 bytes)
24.304 ns (0 allocations: 0 bytes)
19.972 ns (0 allocations: 0 bytes)
17.227 ns (0 allocations: 0 bytes)
17.768 ns (0 allocations: 0 bytes)
17.100 ns (0 allocations: 0 bytes)
15.146 ns (0 allocations: 0 bytes)
15.217 ns (0 allocations: 0 bytes)
[Currently it requires an integer literal, but I should change that for the sake of making examples like the above not require a generated function.]
Unrolling by 4 was over twice as fast as not unrolling, but returns quickly diminished.