Why is a manual in-place addition so much faster than += and .+= on range-indexed arrays?


#1

On Julia v0.5 and v0.6 we observe the following timings:

function add1!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   checkbounds(A, I.stop, J.stop)
   checkbounds(subA, length(I), length(J))
   k = 1
   @inbounds for j in J
       l = 1
       @simd for i in I
           A[i,j] += subA[l,k]
           l += 1
       end
       k += 1
   end
   return nothing
end

function add2!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   A[I,J] += subA
   return nothing
end

function add3!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   A[I,J] .+= subA
   return nothing
end

function test()
   A = rand(10000,10000)
   I = 1:100
   J = 1:200
   subA = rand(100,200)
   ctr = 1
   info("A[I,J]+=subA (loops over I,J)")
   @time begin
       for i = 1:10000
           add1!(A,subA,I,J)
           ctr += length(A)
       end
   end
   info("A[I,J] += subA")
   ctr = 1
   @time begin
       for i = 1:10000
           add2!(A,subA,I,J)
           ctr += length(A)
       end
   end
   info("A[I,J] .+= subA")
   ctr = 1
   @time begin
       for i = 1:10000
           add3!(A,subA,I,J)
           ctr += length(A)
       end
   end
end

test()

version 0.5.0:

julia> test()
INFO: A[I,J]+=subA (loops over I,J)
  0.120754 seconds
INFO: A[I,J] += subA
  2.149707 seconds (100.00 k allocations: 2.984 GB, 18.07% gc time)
INFO: A[I,J] .+= subA
  1.985320 seconds (200.01 k allocations: 2.987 GB, 11.53% gc time)

version 0.6.0-pre.beta.217:

julia> test()
INFO: A[I,J]+=subA (loops over I,J)
  0.123705 seconds
INFO: A[I,J] += subA
  2.085398 seconds (80.00 k allocations: 2.983 GiB, 21.45% gc time)
INFO: A[I,J] .+= subA
  1.010136 seconds (40.04 k allocations: 1.491 GiB, 13.06% gc time)

Is there a reason why the range-indexing expressions are so much slower and allocate so much more memory than the loop-based version? And why does .+= allocate twice as much as += on 0.5, but only half as much on 0.6?


#2

Try @views A[I,J] .+= subA.

In Julia 0.5, A[I,J] .+= subA is equivalent to A[I,J] = A[I,J] .+ subA. This first makes a copy of A[I,J] on the right hand side, since slicing makes a copy in Julia, then allocates a new array for A[I,J] .+ subA, then assigns the result to A[I,J].

In Julia 0.6, A[I,J] .+= subA is equivalent to A[I,J] .= A[I,J] .+ subA. The slice A[I,J] still allocates a new array for a copy of the slice. Because the .= and .+ are fusing operations in Julia 0.6, however, no new array is allocated for the result of .+, and instead it is written in-place into A[I,J].
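A rough sketch of the two lowerings (the `tmp`/`res` names are illustrative, not what the compiler actually emits):

```julia
A = rand(8, 8); subA = rand(2, 3)
I, J = 1:2, 1:3
B = copy(A)                    # a second copy to demo the 0.6-style lowering

# Julia 0.5 lowering of A[I,J] .+= subA (sketch):
tmp = A[I,J]                   # allocation 1: slicing makes a copy
res = tmp .+ subA              # allocation 2: the elementwise sum
A[I,J] = res                   # write the result back into A

# Julia 0.6 lowering (sketch): .= broadcasts into a view of the LHS, fusing the .+
tmp2 = B[I,J]                  # still one allocation for the slice copy
view(B, I, J) .= tmp2 .+ subA  # fused: no separate result array is allocated

@assert A[I,J] ≈ B[I,J]        # both spellings compute the same values
```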

If you use @views A[I,J] .+= subA, then the slice A[I,J] on the right-hand side will instead produce a view, so the entire operation will occur in a single in-place loop with no temporary arrays allocated. See the Consider using views for slices section of the performance tips.
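A minimal sketch of the `@views` version (sizes chosen small for illustration):

```julia
A = zeros(4, 4); subA = ones(2, 2)
I, J = 1:2, 1:2

@views A[I,J] .+= subA   # the RHS slice becomes a view: one fused in-place loop

@assert A[1,1] == 1.0 && A[2,2] == 1.0
@assert A[3,3] == 0.0    # elements outside I×J are untouched
```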

That being said, the “manual” loops will still be a bit faster, even with @views. First, they avoid the overhead of allocating the small view object. More importantly, your manual loop can take advantage of the fact that the A[i,j] on the left and right-hand sides of the assignment operation are the same, and can probably share more operations in the compiled version of the indexing operations. The array accesses may also be slightly more efficient than going through a view object.


#3

Thanks for your detailed answer; now it makes sense! Is there a reason for so many distinctions around an operation like +=, which I would intuitively expect to be an element-wise in-place addition? I can’t really think of a case where a temporary slice A[I,J] or a temporary matrix A[I,J] + subA would be desirable, but perhaps I’m missing some limitations or use cases.


#4

Short answer: because a += b is syntactic sugar for a = a + b.
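A small illustration of what that sugar means for arrays: += rebinds the name rather than mutating the array it pointed to.

```julia
a = [1, 2]
orig = a                 # a second name for the same array
a += [10, 10]            # sugar for a = a + [10, 10]: allocates a new array

@assert a == [11, 12]
@assert orig == [1, 2]   # the original array was never mutated
@assert a !== orig       # a is now bound to a different array
```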

Longer answer: Python defines += to be a separate operator (__iadd__) that is sometimes mutating and sometimes not (e.g. if the left-hand-side is an immutable object). We decided not to do this for several reasons. One is the inconsistency of += being only sometimes mutating. Another is that the behavior isn’t actually that useful: in real applications, you are usually doing more complicated operations than simply adding two vectors, so in-place updating operations are very unsatisfying until you have fusion with other operations. That’s why we didn’t get in-place updates in Julia until “dot fusion” was implemented, so that a .+= 3 .* sqrt.(b) .- a.^2 (or @. a += 3sqrt(b) - a^2) is fused into a single in-place loop.
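For example (arbitrary small vectors), the fused update runs as a single in-place loop over the elements, with no temporary arrays:

```julia
a = rand(5); b = rand(5)
a0 = copy(a)              # keep the old values to check against

@. a += 3sqrt(b) - a^2    # one fused in-place loop

@assert a ≈ a0 .+ 3 .* sqrt.(b) .- a0 .^ 2
```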

See also the discussion here:

It is a separate question whether slices should be views by default. See https://github.com/JuliaLang/julia/pull/9150 and https://github.com/JuliaLang/julia/issues/13157 … in short, it is tricky, because this makes some things faster and some things slower. So you either need an opt-in (currently with @views) or some kind of opt-out.