Why does using `+=` instead of `=` generate slow code?

I have two functions whose only difference is that one uses += and the other uses =, yet they show a > 4x performance difference. The + is much cheaper than sin and cos, so I would expect the two functions below to perform about the same. I tried both versions in Fortran and confirmed that they run at roughly the same speed there. So why does the Julia compiler generate slow code?

julia> using BenchmarkTools

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(1000, 1000)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.902 ms (0.00% GC)
  median time:      5.913 ms (0.00% GC)
  mean time:        5.943 ms (0.00% GC)
  maximum time:     6.341 ms (0.00% GC)
  --------------
  samples:          169
  evals/sample:     1

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(1000, 1000)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.104 ms (0.00% GC)
  median time:      25.257 ms (0.00% GC)
  mean time:        25.393 ms (0.00% GC)
  maximum time:     28.555 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1

I tried Julia 1.5, 1.6 and master branch.


This is really weird.

Although it doesn’t really answer the original question, using @avx from LoopVectorization.jl for the inner loop helps quite a lot.

julia> using BenchmarkTools, LoopVectorization

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> function pyramid1!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i=1:size(v!,1)-1
               for j=1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid1! (generic function with 1 method)

julia> function pyramid0_avx!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           for i in 1:size(v!,1)-1
               @avx for j in 1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0_avx! (generic function with 1 method)

julia> function pyramid1_avx!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j=1:length(x)
               v![1,j] = x[j]
           end
           for i in 1:size(v!,1)-1
               @avx for j in 1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid1_avx! (generic function with 1 method)

julia> n = 1000; @benchmark pyramid0!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.948 ms (0.00% GC)
  median time:      2.997 ms (0.00% GC)
  mean time:        3.008 ms (0.00% GC)
  maximum time:     3.475 ms (0.00% GC)
  --------------
  samples:          257
  evals/sample:     1

julia> n = 1000; @benchmark pyramid1!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.381 ms (0.00% GC)
  median time:      3.960 ms (0.00% GC)
  mean time:        3.885 ms (0.00% GC)
  maximum time:     4.473 ms (0.00% GC)
  --------------
  samples:          209
  evals/sample:     1

julia> n = 1000; @benchmark pyramid0_avx!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.086 ms (0.00% GC)
  median time:      2.203 ms (0.00% GC)
  mean time:        2.210 ms (0.00% GC)
  maximum time:     2.734 ms (0.00% GC)
  --------------
  samples:          318
  evals/sample:     1

julia> n = 1000; @benchmark pyramid1_avx!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.455 ms (0.00% GC)
  median time:      2.829 ms (0.00% GC)
  mean time:        2.867 ms (0.00% GC)
  maximum time:     3.500 ms (0.00% GC)
  --------------
  samples:          260
  evals/sample:     1

The version with += instead of = needs to load more data, so it’s expected to be a bit slower.


Note that you didn’t really benchmark the same thing in your original code: if you keep updating v via +=, the values at which sin and cos are evaluated keep changing and growing from sample to sample, because v is never reset between benchmark samples. I think this might be one of the problems you observe. Using the setup keyword argument of @benchmark reduces this effect.
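
For example, you can watch the entries of v change from call to call with something like this (a quick, untested sketch, not from the original benchmark):

# Untested sketch: calling the += version repeatedly on the same array
# (as happens across benchmark samples without `setup`) keeps changing
# the entries of v, so sin/cos see different arguments each sample.
x = collect(Float64, 1:1000)
v = zeros(1000, 1000)
for k in 1:5
    pyramid1!(v, x)                            # the += version from above
    @show k maximum(abs, view(v, 2:1000, :))   # row 1 just holds x, skip it
end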


I think what @ranocha said is correct. Thank you!
I should not have assumed sin(x) takes the same amount of time for every x.

julia> @benchmark sin(x) setup=(x=0.5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.901 ns (0.00% GC)
  median time:      5.916 ns (0.00% GC)
  mean time:        5.929 ns (0.00% GC)
  maximum time:     24.249 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark sin(x) setup=(x=50.5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.536 ns (0.00% GC)
  median time:      9.727 ns (0.00% GC)
  mean time:        9.806 ns (0.00% GC)
  maximum time:     22.641 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

You can also fill! the matrix inside your benchmark expression, e.g.

julia> @benchmark pyramid0!(fill!($v,0), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.799 ms (0.00% GC)
  median time:      3.815 ms (0.00% GC)
  mean time:        3.816 ms (0.00% GC)
  maximum time:     4.062 ms (0.00% GC)
  --------------
  samples:          1310
  evals/sample:     1

julia> @benchmark pyramid1!(fill!($v,0), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.822 ms (0.00% GC)
  median time:      3.863 ms (0.00% GC)
  mean time:        3.865 ms (0.00% GC)
  maximum time:     5.302 ms (0.00% GC)
  --------------
  samples:          1293
  evals/sample:     1

julia> @benchmark pyramid0_avx!(fill!($v,0), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.605 ms (0.00% GC)
  median time:      1.689 ms (0.00% GC)
  mean time:        1.689 ms (0.00% GC)
  maximum time:     2.071 ms (0.00% GC)
  --------------
  samples:          2955
  evals/sample:     1

julia> @benchmark pyramid1_avx!(fill!($v,0), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.553 ms (0.00% GC)
  median time:      1.599 ms (0.00% GC)
  mean time:        1.600 ms (0.00% GC)
  maximum time:     2.435 ms (0.00% GC)
  --------------
  samples:          3118
  evals/sample:     1

Note that transposing v gives you much better performance:

julia> @benchmark pyramid0_avx!(fill!($v,0)', $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     966.062 μs (0.00% GC)
  median time:      1.008 ms (0.00% GC)
  mean time:        1.006 ms (0.00% GC)
  maximum time:     1.240 ms (0.00% GC)
  --------------
  samples:          4955
  evals/sample:     1

julia> @benchmark pyramid1_avx!(fill!($v,0)', $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     963.066 μs (0.00% GC)
  median time:      1.015 ms (0.00% GC)
  mean time:        1.013 ms (0.00% GC)
  maximum time:     1.367 ms (0.00% GC)
  --------------
  samples:          4921
  evals/sample:     1

So if your other code doesn’t lose performance when using a transposed v, that’s worth considering.
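
The reason is Julia's column-major memory layout: for a fixed i, the inner loop over j writes v![i+1, j], which strides through memory in steps of size(v!, 1) elements, whereas on the transposed array those accesses are contiguous. If carrying a transpose around is inconvenient, an index-swapped variant should give the same access pattern (untested sketch, hypothetical name; assumes LoopVectorization is loaded):

# Untested sketch: store the pyramid "level" in the second index so the
# inner loop over j walks down a column (contiguous in column-major order).
function pyramid1_swapped!(v!, x::AbstractVector{T}) where T
    @assert size(v!,1) == size(v!,2) == length(x)
    for j = 1:length(x)
        v![j, 1] = x[j]
    end
    for i in 1:size(v!,2)-1
        @avx for j in 1:size(v!,1)-i
            v![j, i+1] += cos(v![j+1, i]) * sin(v![j, i])
        end
    end
end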

To understand the problem, look at the source of the trig functions in Base. The code branches, taking a fast path that skips Base.Math.rem_pio2_kernel if the absolute value of the argument is less than π/4:

julia> @btime sin($(Ref(π/4))[])
  6.557 ns (0 allocations: 0 bytes)
0.7071067811865475

julia> @btime sin($(Ref(prevfloat(π/4)))[])
  3.507 ns (0 allocations: 0 bytes)
0.7071067811865475

julia> @btime Base.Math.rem_pio2_kernel($(Ref(π/4))[])
  3.071 ns (0 allocations: 0 bytes)
(0, Base.Math.DoubleFloat64(0.7853981633974483, 0.0))
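
To relate this back to the pyramid functions, you can count how many of the stored values end up outside the |x| < π/4 fast-path range after running each version (untested sketch with a hypothetical helper; row 1 is excluded because it holds the large x values in both versions):

# Untested sketch: fraction of entries outside the fast-path range.
frac_slow(v) = count(y -> abs(y) >= pi/4, view(v, 2:size(v,1), :)) /
               length(view(v, 2:size(v,1), :))

x = collect(Float64, 1:1000)
v0 = zeros(1000, 1000); pyramid0!(v0, x)                      # = version, one call
v1 = zeros(1000, 1000); for _ in 1:10; pyramid1!(v1, x); end  # += version, buffer reused across calls
@show frac_slow(v0) frac_slow(v1)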

Note that the SIMD versions LoopVectorization uses cannot take branches like this.
That means the SIMD functions are often slower per call – they always have to take a code path capable of handling all cases – but they make up for it by evaluating many values at a time.
As a consequence, LoopVectorization’s performance won’t change based on the input values, as long as you avoid denormals:

julia> @benchmark pyramid1_avx!($v', $x) # faster than above because no `fill!`
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     705.587 μs (0.00% GC)
  median time:      723.322 μs (0.00% GC)
  mean time:        720.143 μs (0.00% GC)
  maximum time:     914.075 μs (0.00% GC)
  --------------
  samples:          6912
  evals/sample:     1

julia> @benchmark pyramid1!($v', $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.081 ms (0.00% GC)
  median time:      11.151 ms (0.00% GC)
  mean time:        11.157 ms (0.00% GC)
  maximum time:     13.536 ms (0.00% GC)
  --------------
  samples:          448
  evals/sample:     1
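
If you want to check the value-(in)dependence directly, something along these lines should work (untested sketch): the scalar reduction should get noticeably slower for the larger inputs, while the @avx loop should take roughly the same time for both.

# Untested sketch: scalar sin cost depends on the argument range
# (fast path vs. argument reduction); the @avx loop should not.
using BenchmarkTools, LoopVectorization

function sum_sin_avx(xs)
    s = zero(eltype(xs))
    @avx for i in eachindex(xs)
        s += sin(xs[i])
    end
    return s
end

small = rand(10^5) .* (π/4)   # all arguments on the fast path
large = rand(10^5) .* 100.0   # many arguments need argument reduction

@btime sum(sin, $small)
@btime sum(sin, $large)
@btime sum_sin_avx($small)
@btime sum_sin_avx($large)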