# Why is a manual in-place addition so much faster than += and .+= on range-indexed arrays?

On Julia v0.5 and v0.6, the code below produces the following timings:

```julia
function add_loops!(A, I, J, subA)
    checkbounds(A, I.stop, J.stop)
    checkbounds(subA, length(I), length(J))
    k = 1
    @inbounds for j in J
        l = 1
        @simd for i in I
            A[i, j] += subA[l, k]
            l += 1
        end
        k += 1
    end
    return nothing
end

function add_indexed!(A, I, J, subA)
    A[I,J] += subA
    return nothing
end

function add_broadcast!(A, I, J, subA)
    A[I,J] .+= subA
    return nothing
end

function test()
    A = rand(10000, 10000)
    I = 1:100
    J = 1:200
    subA = rand(100, 200)
    info("A[I,J]+=subA (loops over I,J)")
    @time for i = 1:10000
        add_loops!(A, I, J, subA)
    end
    info("A[I,J] += subA")
    @time for i = 1:10000
        add_indexed!(A, I, J, subA)
    end
    info("A[I,J] .+= subA")
    @time for i = 1:10000
        add_broadcast!(A, I, J, subA)
    end
end

test()
```

Version 0.5.0:

```
julia> test()
INFO: A[I,J]+=subA (loops over I,J)
0.120754 seconds
INFO: A[I,J] += subA
2.149707 seconds (100.00 k allocations: 2.984 GB, 18.07% gc time)
INFO: A[I,J] .+= subA
1.985320 seconds (200.01 k allocations: 2.987 GB, 11.53% gc time)
```

Version 0.6.0-pre.beta.217:

```
julia> test()
INFO: A[I,J]+=subA (loops over I,J)
0.123705 seconds
INFO: A[I,J] += subA
2.085398 seconds (80.00 k allocations: 2.983 GiB, 21.45% gc time)
INFO: A[I,J] .+= subA
1.010136 seconds (40.04 k allocations: 1.491 GiB, 13.06% gc time)
```

Is there a reason why the range-indexing expressions are so much slower and allocate so much more memory than the loop-based version? And why does the allocation count double on 0.5 going from `+=` to `.+=`, while the allocated memory halves on 0.6?


Try `@views A[I,J] .+= subA`.

In Julia 0.5, `A[I,J] .+= subA` is equivalent to `A[I,J] = A[I,J] .+ subA`. This first makes a copy of `A[I,J]` on the right-hand side, since slicing makes a copy in Julia, then allocates a new array for `A[I,J] .+ subA`, then assigns the result to `A[I,J]`.
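Spelled out step by step on a toy-sized problem (the names `tmp1`/`tmp2` are illustrative, not the compiler's actual temporaries), the 0.5 lowering looks like this:

```julia
# What A[I,J] += subA amounted to on Julia 0.5, one step at a time:
A = zeros(10, 10)
subA = ones(3, 4)
I, J = 1:3, 1:4
tmp1 = A[I, J]        # slicing copies: allocates a fresh 3×4 array
tmp2 = tmp1 + subA    # + allocates a second 3×4 array for the sum
A[I, J] = tmp2        # setindex! copies the result back into A
```

Two temporary arrays per call, repeated 10000 times, is where the gigabytes of allocation in the benchmark come from.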

In Julia 0.6, `A[I,J] .+= subA` is equivalent to `A[I,J] .= A[I,J] .+ subA`. The slice `A[I,J]` still allocates a new array for a copy of the slice. Because `.=` and `.+` are fusing operations in Julia 0.6, however, no new array is allocated for the result of `.+`, and instead it is written in place into `A[I,J]`.
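A toy-sized sketch of the 0.6 lowering, which still pays for one copy (the right-hand slice) but fuses the addition and assignment into a single in-place loop:

```julia
# What A[I,J] .+= subA expands to on Julia 0.6:
A = zeros(10, 10)
subA = ones(3, 4)
I, J = 1:3, 1:4
A[I, J] .= A[I, J] .+ subA   # RHS slice A[I,J] is still a copy (one allocation),
                             # but .= fuses with .+ and writes straight into A
```

That single remaining copy per call is consistent with the benchmark: roughly half the memory of the 0.5 version.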

If you use `@views A[I,J] .+= subA`, then the slice `A[I,J]` on the right-hand side will instead produce a view, so the entire operation will occur in a single in-place loop with no temporary arrays allocated. See the "Consider using views for slices" section of the performance tips.
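A minimal self-contained example of the `@views` version (toy sizes, assuming Julia 0.6 or later):

```julia
A = zeros(10, 10)
subA = ones(3, 4)
I, J = 1:3, 1:4
@views A[I, J] .+= subA   # lowers to view(A, I, J) .+= subA:
                          # the RHS slice is a view, so the whole update
                          # is one fused in-place loop, no temporaries
```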

That being said, the "manual" loops will still be a bit faster, even with `@views`. First, they avoid the overhead of allocating the small view object. More importantly, your manual loop can take advantage of the fact that the `A[i,j]` on the left- and right-hand sides of the assignment is the same location, and can probably share more work in the compiled indexing operations. The array accesses may also be slightly more efficient than going through a view object.


Thanks for your detailed answer, now it makes sense! Is there a reason why there are so many distinctions for an operation like `+=`, which I would intuitively expect to be an element-wise in-place addition? I can't really think of a reason why a temporary slice `A[I,J]` or a matrix `A[I,J] + subA` would be desirable, but perhaps I'm missing some limitations or use cases.

Short answer: because `a += b` is syntactic sugar for `a = a + b`.
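A small demonstration that `+=` rebinds the name rather than mutating the array (the aliased name `b` is just for illustration):

```julia
a = [1, 2, 3]
b = a              # b is another name for the same array
a += [10, 20, 30]  # sugar for a = a + [10, 20, 30]: binds a to a NEW array
# a is now [11, 22, 33], but b still holds the original, untouched [1, 2, 3]
```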

Longer answer: Python defines += to be a separate operator (__iadd__) that is sometimes mutating and sometimes not (e.g. if the left-hand-side is an immutable object). We decided not to do this for several reasons. One is the inconsistency of += being only sometimes mutating. Another is that the behavior isnâ€™t actually that useful: in real applications, you are usually doing more complicated operations than simply adding two vectors, so in-place updating operations are very unsatisfying until you have fusion with other operations. Thatâ€™s why we didnâ€™t get in-place updates in Julia until â€śdot fusionâ€ť was implemented, so that a .+= 3 .* sqrt.(b) .- a.^2 (or @. a += 3sqrt(b) - a^2) is fused into a single in-place loop.