Reduce vs. foldl: performance and precision

Hi guys,

Can you please explain the difference in performance and precision between reduce and foldl?

julia> using BenchmarkTools

julia> a = rand(10^7)
10000000-element Array{Float64,1}:
 0.8488959437260439
 0.7665194419056716
 0.5489381803548694
 0.1692203002559569
 0.48751467346409405
 0.6575900969521085
 ⋮
 0.1928387601015309
 0.9422496727361025
 0.25160718216629463
 0.15838788242124457
 0.35871647538214857

julia> @btime sum(a)
  4.845 ms (1 allocation: 16 bytes)
4.9994262716435e6

julia> @btime reduce(+, a)
  4.856 ms (1 allocation: 16 bytes)
4.9994262716435e6

julia> @btime foldl(+, a)
  9.917 ms (1 allocation: 16 bytes)
4.999426271643946e6

Thank you.

The foldl call sums the numbers strictly from left to right whereas sum uses pairwise summation, which is much more accurate and also faster. Doing a left-to-right SIMD summation would be even faster but not as accurate (although slightly more accurate than strict left-to-right). Floating-point summation is an unexpectedly deep and tricky problem.
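
For intuition, pairwise summation can be sketched in a few lines (the function name pairwise_sum and the cutoff of 128 here are illustrative; Base's actual implementation differs in its details):

function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a))
    n = hi - lo + 1
    if n <= 128   # base case: plain left-to-right summation
        s = zero(eltype(a))
        for i in lo:hi
            s += a[i]
        end
        return s
    else          # split in half and sum each half recursively;
                  # this keeps the partial sums balanced in magnitude,
                  # so worst-case roundoff grows like O(log n) instead of O(n)
        mid = lo + n ÷ 2
        return pairwise_sum(a, lo, mid - 1) + pairwise_sum(a, mid, hi)
    end
end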

6 Likes

Thank you. And what about reduce? Does it behave similarly to sum?

Yes, reduce does not guarantee anything about association order, so by calling it you’re effectively saying “do whatever seems best/fastest”.
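
For example, with a non-associative operator like subtraction the association matters (a toy illustration, not something you’d do in practice):

foldl(-, 1:4)   # ((1 - 2) - 3) - 4 == -8

reduce(-, 1:4)  # association is unspecified, so the result is
                # implementation-dependent; only pass reduce operators
                # that are (at least approximately) associative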

Pairwise summation generally isn’t faster than left-to-right summation; they are about the same speed if the base case of the pairwise algorithm (which performs left-to-right summation!) is large enough. I don’t think the timing differences here are significant (they might be due to timing fluctuations or compiler hiccups).

(Sometimes divide-and-conquer versions of algorithms gain a speed advantage from improved cache locality, but that isn’t the case for summation since each element is used exactly once and the left-to-right algorithm already maximizes spatial locality.)

2 Likes

That’s not what benchmarks show:

julia> @btime sum($a)
  4.460 ms (0 allocations: 0 bytes)
4.998527508812167e6

julia> @btime foldl($+, $a)
  10.614 ms (0 allocations: 0 bytes)
4.998527508812602e6

Note that similar timings appeared in the original post. I would not have guessed this, but it makes sense: strict left-to-right summation cannot use SIMD, because vectorizing requires reassociating the sum. I haven’t looked into it deeply, but my quick take is that:

  1. foldl(+, a) is slowest because strict left-to-right association prevents SIMD.
  2. sum(a) allows SIMD, which gives a big boost, but pairwise summation has some overhead compared to a simpler algorithm, so it nets out at about 2x faster than foldl(+, a).
  3. mysum(a) is fastest: simple left-to-right summation, but allowing just enough reassociation to use SIMD.
4 Likes

Oh, right. I was thinking of

function naivesum(a)
    s = zero(eltype(a))
    @simd for x in a
        s += x
    end
    return s
end

which is mostly left-to-right summation but is not strictly left-to-right at a fine-grained level because @simd allows some reassociation. This achieves basically the same performance as sum:

julia> @btime sum($a);
  5.047 ms (0 allocations: 0 bytes)

julia> @btime naivesum($a);
  5.166 ms (0 allocations: 0 bytes)
3 Likes

What’s fun is that @simd typically improves accuracy, too, as it’ll use 2-8 intermediate “buckets” instead of just one. That’s not as good or as robust as pairwise summation, but it’s generally better than the non-SIMD naive version since, assuming all values are positive, each bucket carries 1-3 more bits of precision.
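
The bucket idea can be written out by hand. Here is a rough sketch of what the compiler generates (the name bucketsum4 and the fixed factor of 4 are mine; @simd would use vector registers rather than scalars):

function bucketsum4(a)
    # four independent accumulators ("buckets"); each partial sum grows
    # ~4x more slowly, so roundoff also accumulates more slowly
    s1 = s2 = s3 = s4 = zero(eltype(a))
    i = firstindex(a)
    while i <= lastindex(a) - 3
        s1 += a[i]; s2 += a[i+1]; s3 += a[i+2]; s4 += a[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)   # combine the buckets
    while i <= lastindex(a)     # scalar loop for the remainder
        s += a[i]
        i += 1
    end
    return s
end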

2 Likes

With Float64, it’ll use 32 buckets with AVX512 on LLVM <= 9, and 16 buckets with AVX2.

The tested vectors are large enough to be memory bound. Picking something small like 512 shows the naive sum has an advantage:

julia> x = rand(512);

julia> @btime sum($x)
  48.728 ns (0 allocations: 0 bytes)
259.44736738237776

julia> @btime mysum($x)
  17.526 ns (0 allocations: 0 bytes)
259.4473673823778

julia> @btime mysumavx($x)
  16.294 ns (0 allocations: 0 bytes)
259.44736738237776
2 Likes

Oh, sure enough! I just assumed that 512 ÷ 64 == 8 would mean 8 buckets for Float64 on AVX512, but LLVM is indeed loading and adding four <8 x double> vectors in each inner-loop iteration. I suppose this is just a small loop-unrolling micro-optimization, since it has the registers available to do so? That’s cool.

As I understand it, the reason to use more than 8 buckets is that with 32 you only touch each register every 4 operations, which matters since the latency of the add instruction is longer than its reciprocal throughput.

4 Likes

Exactly. If you don’t unroll, each add instruction has to wait for the previous one to finish before it can start executing, so you can’t take advantage of superscalar parallelism.
If you have a theoretical throughput of 2 add instructions per clock cycle and a latency of 4 clock cycles, you could have 8 add instructions in flight at a time.
That means unrolling up to 8x could increase performance in that example.

But a sum is unlikely to be hitting theoretical peak performance anyway, so 4 is a good compromise (especially if you handle the remainder with a scalar loop).

LoopVectorization lets you use an unroll argument to manually specify the unroll factor (on top of SIMD):

julia> using LoopVectorization, BenchmarkTools

julia> x = rand(512);

julia> @generated function mysumavx(a, ::Val{U}) where {U}
           quote
               s = zero(eltype(a))
               @avx unroll=$U for i ∈ eachindex(a)
                   s += a[i]
               end
               s
           end
       end
mysumavx (generic function with 2 methods)

julia> for U ∈ 1:8
           @btime mysumavx($x, $(Val(U)))
       end
  38.540 ns (0 allocations: 0 bytes)
  24.304 ns (0 allocations: 0 bytes)
  19.972 ns (0 allocations: 0 bytes)
  17.227 ns (0 allocations: 0 bytes)
  17.768 ns (0 allocations: 0 bytes)
  17.100 ns (0 allocations: 0 bytes)
  15.146 ns (0 allocations: 0 bytes)
  15.217 ns (0 allocations: 0 bytes)

[Currently it requires an integer literal, but I should change that for the sake of making examples like the above not require a generated function.]
Unrolling by 4 was over twice as fast as not unrolling, but returns quickly diminished.

7 Likes