The foldl call sums the numbers strictly from left to right whereas sum uses pairwise summation, which is much more accurate and also faster. Doing a left-to-right SIMD summation would be even faster but not as accurate (although slightly more accurate than strict left-to-right). Floating-point summation is an unexpectedly deep and tricky problem.
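For reference, pairwise summation is roughly the following divide-and-conquer scheme. This is a minimal sketch, not Base's actual implementation (which uses a larger blocked base case and SIMD), and the cutoff of 128 is just an illustrative choice:

# Minimal sketch of pairwise summation, not Base's actual code.
# Small ranges are summed with a plain left-to-right loop; larger
# ranges are split in half and summed recursively, so worst-case
# rounding error grows like O(log n) instead of O(n).
function pairwise_sum(a, lo=firstindex(a), hi=lastindex(a))
    if hi - lo < 128                 # base case: left-to-right loop
        s = zero(eltype(a))
        for i in lo:hi
            s += a[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1            # midpoint without overflow
    return pairwise_sum(a, lo, mid) + pairwise_sum(a, mid + 1, hi)
end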
Pairwise summation generally isn’t faster than left-to-right summation; they are about the same speed if the base case of the pairwise algorithm (which performs left-to-right summation!) is large enough. I don’t think the timing differences here are significant (they might be due to timing fluctuations or compiler hiccups).
(Sometimes divide-and-conquer versions of algorithms gain a speed advantage from improved cache locality, but that isn’t the case for summation since each element is used exactly once and the left-to-right algorithm already maximizes spatial locality.)
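To see the accuracy side concretely, one way is to compare each strategy against a higher-precision reference. This is a sketch; the exact error magnitudes depend on the data:

a = rand(Float32, 10^7)
ref = sum(Float64.(a))               # accumulate in Float64 as a reference
err_foldl = abs(foldl(+, a) - ref)   # strict left-to-right accumulation
err_sum   = abs(sum(a) - ref)        # pairwise summation, typically far smaller error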
Note that similar timings appeared in the original post; I would not have guessed this. It makes sense, though: strict left-to-right summation cannot use SIMD, since vectorizing requires reassociating the sum. I haven’t looked into it deeply, but my quick take is that:
foldl(+, a) is slowest because strict left-to-right association prevents SIMD
sum(a) allows SIMD, which gives a big boost, but the pairwise recursion has some overhead compared to a simpler algorithm, so you net only about a 2x speedup over foldl(+, a)
mysum(a) is fastest: simple left-to-right summation, but allowing just enough reassociation to use SIMD. For example:
function naivesum(a)
    s = zero(eltype(a))   # accumulator matching the element type
    @simd for x in a      # @simd lets the compiler reassociate the sum
        s += x
    end
    return s
end
which is mostly left-to-right summation but is not strictly left-to-right at a fine-grained level because @simd allows some reassociation. This achieves basically the same performance as sum:
julia> @btime sum($a);
5.047 ms (0 allocations: 0 bytes)
julia> @btime naivesum($a);
5.166 ms (0 allocations: 0 bytes)
What’s fun is that @simd typically improves accuracy, too, as it’ll use 2-8 intermediate “buckets” instead of just one. Not as good or as robust as pairwise, but generally better than the non-simd naive thing since, assuming all values are positive, each bucket will have 1-3 more bits of precision.
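To make the “buckets” idea concrete, here’s a scalar caricature of what the vectorized loop effectively computes, with four independent accumulators standing in for SIMD lanes (a sketch; it assumes length(a) is a multiple of 4):

# Scalar caricature of SIMD "buckets": four independent accumulators,
# combined at the end. Assumes length(a) is a multiple of 4.
function bucketsum4(a)
    s1 = s2 = s3 = s4 = zero(eltype(a))
    for i in 1:4:length(a)
        s1 += a[i]      # each "bucket" sees only every 4th element,
        s2 += a[i+1]    # so each partial sum stays smaller and
        s3 += a[i+2]    # accumulates less rounding error
        s4 += a[i+3]
    end
    return (s1 + s2) + (s3 + s4)  # combine the buckets pairwise
end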
Oh, sure enough! I just assumed that 512÷64 == 8 would mean 8 buckets for Float64 on AVX512, but LLVM indeed is loading and adding four <8 x double> vectors in each inner loop iteration. I suppose this is just a small loop-unrolling micro-optimization since it has the registers available to do so? That’s cool.
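In case anyone wants to check this themselves, you can dump the LLVM IR for the loop and look for the <8 x double> loads and fadds (using the naivesum defined above; the exact IR will depend on your CPU and Julia version):

julia> a = rand(Float64, 10^4);

julia> @code_llvm debuginfo=:none naivesum(a)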
As I understand it, the reason to use more than 8 buckets is that with 32 you only touch each register every 4 additions, which matters because the latency of the add instruction is worse than its throughput.
Exactly. If you don’t unroll, each add instruction has to wait for the previous one to finish before it can start executing, so you can’t take advantage of superscalar parallelism.
If you have a theoretical throughput of 2 add instructions / clock cycle and a latency of 4 clock cycles, you could have 8 add instructions happening at a time.
Meaning unrolling up to 8x could increase performance in that example.
But a sum is unlikely to be hitting theoretical peak performance anyway, so 4 is a good compromise (especially if you handle the remainder with a scalar loop).
LoopVectorization lets you use an unroll argument to manually specify the unroll factor (on top of SIMD):
julia> using LoopVectorization, BenchmarkTools
julia> x = rand(512);
julia> @generated function mysumavx(a, ::Val{U}) where {U}
           quote
               s = zero(eltype(a))
               @avx unroll=$U for i ∈ eachindex(a)
                   s += a[i]
               end
               s
           end
       end
mysumavx (generic function with 2 methods)
julia> for U ∈ 1:8
           @btime mysumavx($x, $(Val(U)))
       end
38.540 ns (0 allocations: 0 bytes)
24.304 ns (0 allocations: 0 bytes)
19.972 ns (0 allocations: 0 bytes)
17.227 ns (0 allocations: 0 bytes)
17.768 ns (0 allocations: 0 bytes)
17.100 ns (0 allocations: 0 bytes)
15.146 ns (0 allocations: 0 bytes)
15.217 ns (0 allocations: 0 bytes)
[Currently it requires an integer literal, but I should change that for the sake of making examples like the above not require a generated function.]
Unrolling by 4 was over twice as fast as not unrolling, but returns quickly diminished.