Why is a manual in-place addition so much faster than += and .+= on range-indexed arrays?


#1

On Julia v0.5 and v0.6 we observe the following timings:

function add1!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   checkbounds(A, I.stop, J.stop)
   checkbounds(subA, length(I), length(J))
   k = 1
   @inbounds for j in J
       l = 1
       @simd for i in I
           A[i,j] += subA[l,k]
           l += 1
       end
       k += 1
   end
   return nothing
end

function add2!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   A[I,J] += subA
   return nothing
end

function add3!{T}(A::Matrix{T}, subA::Matrix{T}, I::UnitRange{Int}, J::UnitRange{Int})
   A[I,J] .+= subA
   return nothing
end

function test()
   A = rand(10000,10000)
   I = 1:100
   J = 1:200
   subA = rand(100,200)
   ctr = 1
   info("A[I,J]+=subA (loops over I,J)")
   @time begin
       for i = 1:10000
           add1!(A,subA,I,J)
           ctr += length(A)
       end
   end
   info("A[I,J] += subA")
   ctr = 1
   @time begin
       for i = 1:10000
           add2!(A,subA,I,J)
           ctr += length(A)
       end
   end
   info("A[I,J] .+= subA")
   ctr = 1
   @time begin
       for i = 1:10000
           add3!(A,subA,I,J)
           ctr += length(A)
       end
   end
end

test()

version 0.5.0:

julia> test()
INFO: A[I,J]+=subA (loops over I,J)
  0.120754 seconds
INFO: A[I,J] += subA
  2.149707 seconds (100.00 k allocations: 2.984 GB, 18.07% gc time)
INFO: A[I,J] .+= subA
  1.985320 seconds (200.01 k allocations: 2.987 GB, 11.53% gc time)

version 0.6.0-pre.beta.217:

julia> test()
INFO: A[I,J]+=subA (loops over I,J)
  0.123705 seconds
INFO: A[I,J] += subA
  2.085398 seconds (80.00 k allocations: 2.983 GiB, 21.45% gc time)
INFO: A[I,J] .+= subA
  1.010136 seconds (40.04 k allocations: 1.491 GiB, 13.06% gc time)

Is there a reason why the range-indexing expressions are so much slower and allocate so much more memory than the loop-based version? And why does .+= allocate twice as much as += on 0.5, but only half as much on 0.6?


#2

Try @views A[I,J] .+= subA.

In Julia 0.5, A[I,J] .+= subA is equivalent to A[I,J] = A[I,J] .+ subA. This first makes a copy of A[I,J] on the right hand side, since slicing makes a copy in Julia, then allocates a new array for A[I,J] .+ subA, then assigns the result to A[I,J].

In Julia 0.6, A[I,J] .+= subA is equivalent to A[I,J] .= A[I,J] .+ subA. The slice A[I,J] still allocates a new array for a copy of the slice. Because the .= and .+ are fusing operations in Julia 0.6, however, no new array is allocated for the result of .+, and instead it is written in-place into A[I,J].
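A rough sketch of the two lowerings (the `tmp`/`res` names are illustrative, not what the compiler actually emits):

```julia
A = rand(8, 8); subA = rand(2, 3)
I, J = 1:2, 1:3
B = copy(A)                    # a second copy to demo the 0.6-style lowering

# Julia 0.5 lowering of A[I,J] .+= subA (sketch):
tmp = A[I,J]                   # allocation 1: slicing makes a copy
res = tmp .+ subA              # allocation 2: the elementwise sum
A[I,J] = res                   # write the result back into A

# Julia 0.6 lowering (sketch): .= broadcasts into a view of the LHS, fusing the .+
tmp2 = B[I,J]                  # still one allocation for the slice copy
view(B, I, J) .= tmp2 .+ subA  # fused: no separate result array is allocated

@assert A[I,J] ≈ B[I,J]        # both spellings compute the same values
```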

If you use @views A[I,J] .+= subA, then the slice A[I,J] on the right-hand side will instead produce a view, so the entire operation will occur in a single in-place loop with no temporary arrays allocated. See the Consider using views for slices section of the performance tips.
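A minimal sketch of the `@views` version (sizes chosen small for illustration):

```julia
A = zeros(4, 4); subA = ones(2, 2)
I, J = 1:2, 1:2

@views A[I,J] .+= subA   # the RHS slice becomes a view: one fused in-place loop

@assert A[1,1] == 1.0 && A[2,2] == 1.0
@assert A[3,3] == 0.0    # elements outside I×J are untouched
```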

That being said, the “manual” loops will still be a bit faster, even with @views. First, they avoid the overhead of allocating the small view object. More importantly, your manual loop can take advantage of the fact that the A[i,j] on the left and right-hand sides of the assignment operation are the same, and can probably share more operations in the compiled version of the indexing operations. The array accesses may also be slightly more efficient than going through a view object.


#3

Thanks for your detailed answer; now it makes sense! Is there a reason for so many distinctions around an operation like +=, which I would intuitively expect to be an element-wise in-place addition? I can’t really think of a case where a temporary slice A[I,J] or a temporary matrix A[I,J] + subA would be desirable, but perhaps I’m missing some limitations or use cases.


#4

Short answer: because a += b is syntactic sugar for a = a + b.
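A small illustration of what that sugar means for arrays: += rebinds the name rather than mutating the array it pointed to.

```julia
a = [1, 2]
orig = a                 # a second name for the same array
a += [10, 10]            # sugar for a = a + [10, 10]: allocates a new array

@assert a == [11, 12]
@assert orig == [1, 2]   # the original array was never mutated
@assert a !== orig       # a is now bound to a different array
```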

Longer answer: Python defines += to be a separate operator (__iadd__) that is sometimes mutating and sometimes not (e.g. if the left-hand-side is an immutable object). We decided not to do this for several reasons. One is the inconsistency of += being only sometimes mutating. Another is that the behavior isn’t actually that useful: in real applications, you are usually doing more complicated operations than simply adding two vectors, so in-place updating operations are very unsatisfying until you have fusion with other operations. That’s why we didn’t get in-place updates in Julia until “dot fusion” was implemented, so that a .+= 3 .* sqrt.(b) .- a.^2 (or @. a += 3sqrt(b) - a^2) is fused into a single in-place loop.
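For example (arbitrary small vectors), the fused update runs as a single in-place loop over the elements, with no temporary arrays:

```julia
a = rand(5); b = rand(5)
a0 = copy(a)              # keep the old values to check against

@. a += 3sqrt(b) - a^2    # one fused in-place loop

@assert a ≈ a0 .+ 3 .* sqrt.(b) .- a0 .^ 2
```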

See also the discussion here:

It is a separate question whether slices should be views by default. See https://github.com/JuliaLang/julia/pull/9150 and https://github.com/JuliaLang/julia/issues/13157 … in short, it is tricky, because this makes some things faster and some things slower. So you either need an opt-in (currently with @views) or some kind of opt-out.