# Why does using `+=` instead of `=` generate slow code?

I have two functions whose only difference is that one uses `+=` and the other uses `=`, yet they show more than a 4x performance difference. `+` is much cheaper than `sin` and `cos`, so I would not expect these two functions to differ this much. I tried the same two versions in Fortran and confirmed that expectation. So why does the Julia compiler generate slow code?

```julia
julia> using BenchmarkTools

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i = 1:size(v!,1)-1
               for j = 1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(1000, 1000)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     5.902 ms (0.00% GC)
median time:      5.913 ms (0.00% GC)
mean time:        5.943 ms (0.00% GC)
maximum time:     6.341 ms (0.00% GC)
--------------
samples:          169
evals/sample:     1

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i = 1:size(v!,1)-1
               for j = 1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> let
           n = 1000
           x = collect(Float64, 1:n)
           v = zeros(1000, 1000)
           @benchmark pyramid0!($v, $x) seconds=1
       end
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     25.104 ms (0.00% GC)
median time:      25.257 ms (0.00% GC)
mean time:        25.393 ms (0.00% GC)
maximum time:     28.555 ms (0.00% GC)
--------------
samples:          40
evals/sample:     1
```

I tried Julia 1.5, 1.6 and master branch.


This is really weird.

Although it doesn’t really answer the original question, using `@avx` from LoopVectorization.jl for the inner loop helps quite a lot.

```julia
julia> using BenchmarkTools, LoopVectorization

julia> function pyramid0!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i = 1:size(v!,1)-1
               for j = 1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0! (generic function with 1 method)

julia> function pyramid1!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           @inbounds for i = 1:size(v!,1)-1
               for j = 1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid1! (generic function with 1 method)

julia> function pyramid0_avx!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           for i in 1:size(v!,1)-1
               @avx for j in 1:size(v!,2)-i
                   v![i+1,j] = cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid0_avx! (generic function with 1 method)

julia> function pyramid1_avx!(v!, x::AbstractVector{T}) where T
           @assert size(v!,2) == size(v!,1) == length(x)
           for j = 1:length(x)
               v![1,j] = x[j]
           end
           for i in 1:size(v!,1)-1
               @avx for j in 1:size(v!,2)-i
                   v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
               end
           end
       end
pyramid1_avx! (generic function with 1 method)

julia> n = 1000; @benchmark pyramid0!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.948 ms (0.00% GC)
median time:      2.997 ms (0.00% GC)
mean time:        3.008 ms (0.00% GC)
maximum time:     3.475 ms (0.00% GC)
--------------
samples:          257
evals/sample:     1

julia> n = 1000; @benchmark pyramid1!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.381 ms (0.00% GC)
median time:      3.960 ms (0.00% GC)
mean time:        3.885 ms (0.00% GC)
maximum time:     4.473 ms (0.00% GC)
--------------
samples:          209
evals/sample:     1

julia> n = 1000; @benchmark pyramid0_avx!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.086 ms (0.00% GC)
median time:      2.203 ms (0.00% GC)
mean time:        2.210 ms (0.00% GC)
maximum time:     2.734 ms (0.00% GC)
--------------
samples:          318
evals/sample:     1

julia> n = 1000; @benchmark pyramid1_avx!(v, x) seconds=1 setup=(v = zeros(n, n); x = collect(Float64, 1:n))
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.455 ms (0.00% GC)
median time:      2.829 ms (0.00% GC)
mean time:        2.867 ms (0.00% GC)
maximum time:     3.500 ms (0.00% GC)
--------------
samples:          260
evals/sample:     1
```

The version with `+=` instead of `=` needs to load more data, so it’s expected to be a bit slower.
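For intuition: in Julia, `a[i] += b` is syntactic sugar for `a[i] = a[i] + b`, so the `+=` version reads `v![i+1,j]` before writing it, one extra load per iteration. A minimal sketch of the expansion:

```julia
v = zeros(3)

# `v[1] += 2.0` expands to the read-modify-write on the line above it,
# so it performs an extra load of v[1] compared with a plain store.
v[1] = v[1] + 2.0
v[1] += 2.0

println(v[1])  # 4.0
```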


Note that you didn’t really benchmark the same thing in your original code: if you update `v` via `+=`, the values at which `sin` and `cos` are evaluated grow with every benchmark sample. I think this might be one problem you observe. It is avoided by using the `setup` keyword argument of `@benchmark` to reset the inputs before each sample.
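To see the state carried between calls, here is a small sketch (the `+=` variant from the thread, run twice at n = 4; the variable names are just for illustration):

```julia
# The `+=` variant from the thread, at a small size.
function pyramid1!(v!, x::AbstractVector{T}) where T
    @assert size(v!,2) == size(v!,1) == length(x)
    for j = 1:length(x)
        v![1,j] = x[j]
    end
    @inbounds for i = 1:size(v!,1)-1
        for j = 1:size(v!,2)-i
            v![i+1,j] += cos(v![i,j+1]) * sin(v![i,j])
        end
    end
end

v = zeros(4, 4)
x = collect(Float64, 1:4)

pyramid1!(v, x)
after_first = copy(v)
pyramid1!(v, x)

# The second call starts from the accumulated values, so sin/cos are
# evaluated at different (growing) arguments on every benchmark sample.
@assert v != after_first
```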


I think what @ranocha said is correct. Thank you!
I should not have assumed that `sin(x)` takes the same amount of time for every input.

```julia
julia> @benchmark sin(x) setup=(x=0.5)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     5.901 ns (0.00% GC)
median time:      5.916 ns (0.00% GC)
mean time:        5.929 ns (0.00% GC)
maximum time:     24.249 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

julia> @benchmark sin(x) setup=(x=50.5)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     9.536 ns (0.00% GC)
median time:      9.727 ns (0.00% GC)
mean time:        9.806 ns (0.00% GC)
maximum time:     22.641 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     999
```

You can also reset `v` with `fill!` inside the benchmark, e.g.

```julia
julia> @benchmark pyramid0!(fill!($v,0), $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.799 ms (0.00% GC)
median time:      3.815 ms (0.00% GC)
mean time:        3.816 ms (0.00% GC)
maximum time:     4.062 ms (0.00% GC)
--------------
samples:          1310
evals/sample:     1

julia> @benchmark pyramid1!(fill!($v,0), $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.822 ms (0.00% GC)
median time:      3.863 ms (0.00% GC)
mean time:        3.865 ms (0.00% GC)
maximum time:     5.302 ms (0.00% GC)
--------------
samples:          1293
evals/sample:     1

julia> @benchmark pyramid0_avx!(fill!($v,0), $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.605 ms (0.00% GC)
median time:      1.689 ms (0.00% GC)
mean time:        1.689 ms (0.00% GC)
maximum time:     2.071 ms (0.00% GC)
--------------
samples:          2955
evals/sample:     1

julia> @benchmark pyramid1_avx!(fill!($v,0), $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.553 ms (0.00% GC)
median time:      1.599 ms (0.00% GC)
mean time:        1.600 ms (0.00% GC)
maximum time:     2.435 ms (0.00% GC)
--------------
samples:          3118
evals/sample:     1
```

Note that transposing `v` gives you much better performance:

```julia
julia> @benchmark pyramid0_avx!(fill!($v,0)', $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     966.062 μs (0.00% GC)
median time:      1.008 ms (0.00% GC)
mean time:        1.006 ms (0.00% GC)
maximum time:     1.240 ms (0.00% GC)
--------------
samples:          4955
evals/sample:     1

julia> @benchmark pyramid1_avx!(fill!($v,0)', $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     963.066 μs (0.00% GC)
median time:      1.015 ms (0.00% GC)
mean time:        1.013 ms (0.00% GC)
maximum time:     1.367 ms (0.00% GC)
--------------
samples:          4921
evals/sample:     1
```

So if your other code doesn’t lose performance when using a transposed `v`, that’s worth considering.
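The likely reason (an assumption about memory layout, not something profiled in this thread): Julia arrays are column-major, and in the pyramid loops the inner index `j` is the *second* index, so consecutive inner iterations are a whole column apart in memory; on the transposed view the same accesses become contiguous. A minimal sketch of the effect, with hypothetical helper names:

```julia
# Column-major layout: A[i, j] and A[i+1, j] are adjacent in memory,
# while A[i, j] and A[i, j+1] are a full column (stride n) apart.

function inner_over_rows(A)   # inner loop varies the first index: stride-1
    s = 0.0
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

function inner_over_cols(A)   # inner loop varies the second index: stride-n
    s = 0.0
    @inbounds for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    s
end

A = rand(1000, 1000)
@assert inner_over_rows(A) ≈ inner_over_cols(A)  # same sum, different access order
```

Timing the two with `@btime` on a large matrix typically shows the stride-1 version being substantially faster, which is the same effect the transposed `v` exploits.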

To understand the problem, look at the source for the trig functions in Base (`base/special/trig.jl`). The code branches, taking a fast path that skips `Base.Math.rem_pio2_kernel` when the absolute value of the argument is less than `π/4`:

```julia
julia> @btime sin($(Ref(π/4))[])
6.557 ns (0 allocations: 0 bytes)
0.7071067811865475

julia> @btime sin($(Ref(prevfloat(π/4)))[])
3.507 ns (0 allocations: 0 bytes)
0.7071067811865475

julia> @btime Base.Math.rem_pio2_kernel($(Ref(π/4))[])
3.071 ns (0 allocations: 0 bytes)
(0, Base.Math.DoubleFloat64(0.7853981633974483, 0.0))
```

Note that the SIMD versions LoopVectorization uses cannot take branches like this.
That means the functions are often slower in isolation, since they always have to take a code path capable of handling all cases, but they make up for it by evaluating many inputs at a time.
As a result, LoopVectorization’s performance won’t change based on the input values, as long as you avoid denormals:

```julia
julia> @benchmark pyramid1_avx!($v', $x) # faster than above because no `fill!`
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     705.587 μs (0.00% GC)
median time:      723.322 μs (0.00% GC)
mean time:        720.143 μs (0.00% GC)
maximum time:     914.075 μs (0.00% GC)
--------------
samples:          6912
evals/sample:     1

julia> @benchmark pyramid1!($v', $x)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     11.081 ms (0.00% GC)
median time:      11.151 ms (0.00% GC)
mean time:        11.157 ms (0.00% GC)
maximum time:     13.536 ms (0.00% GC)
--------------
samples:          448
evals/sample:     1
```