# Counterintuitive array update allocations

I’ve run into a somewhat counterintuitive behavior while coding an example. I have an array `A_field` that I want to update by adding certain values at certain positions. These are provided by an array of arrays of (flattened) indices `Part` and an array of arrays of values `A`, as follows:

```julia
N = 10
shapeA = (N, N)
A_field = zeros(shapeA)

Part = [[i, i+1] for i in 1:2:prod(shapeA)]
A = [rand(length(J)) for J in Part]
```

You may want to think of this as having split some big problem for `A_field` into small chunks, solved them independently and then wanting to join the results.

The most straightforward implementation of the updating I can think of is the following:

```julia
function update1!(A_field, Part, A)
    for (i, J) in enumerate(Part)
        A_field[J] += A[i]
    end
end
```

But, strangely, this results in as many allocations as there are entries of `A_field`:

```julia
julia> @time update1!(A_field, Part, A)
  0.000010 seconds (100 allocations: 9.375 KiB)
```

This of course is not nice for big problems. (Disclaimer: I tried every combination of `@views`, `.+=` instead of `+=`, and so on that I could think of, and this is the version that allocated the least.) A second possibility is to introduce a second loop over each `J`:

```julia
function update2!(A_field, Part, A)
    for (k, J) in enumerate(Part)
        for (i, j) in enumerate(J)
            A_field[j] += A[k][i]
        end
    end
end
```

This is slightly more cumbersome to read and understand, but it introduces no allocations:

```julia
julia> @time update2!(A_field, Part, A)
  0.000006 seconds
```

Finally, reshaping the array and using views seems to get rid of the allocations (apart from the ones needed for the reshaping, of course):

```julia
function update3!(A_field, Part, A)
    A_temp = reshape(A_field, :)
    for (i, J) in enumerate(Part)
        @views A_temp[J] .+= A[i]
    end
end
```
```julia
julia> @time update3!(A_field, Part, A)
  0.000009 seconds (2 allocations: 80 bytes)
```

So at this point I am a bit confused and have a lot of questions.

1. Why does the 3rd version work well but not the 1st?
2. Is there any way (macro, trick) to obtain the performance of the 2nd version without sacrificing the readability of the 1st version?

Thanks!

Using `@views` in the first version does not help because it basically does the `reshape` part of the third version, but on every loop iteration. This is because a `view` of a matrix with “1D indices” (i.e., `A[x]` instead of `A[x,y]`) will internally reshape the matrix as a vector, and reshaping an array allocates.
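This can be checked directly; a minimal sketch (the exact internals may differ across Julia versions):

```julia
M = zeros(3, 3)
J = [1, 2]

# A view of a Matrix taken with linear (1-D) indices wraps a reshaped,
# vector form of the parent; that reshape is what allocates each iteration.
v = view(M, J)
parent(v) isa Vector{Float64}  # the view's parent is the reshaped matrix

# Reshaping once up front (what update3! does) avoids repeating it:
Mv = vec(M)         # one reshape, paid once
w = view(Mv, J)
parent(w) === Mv    # no further reshape needed inside the loop
```

So `update3!` simply hoists the reshape out of the loop, which is why it pays only the two small allocations seen above.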


Is this a consequence of “reshape(::Array, ...) is slow and allocates” (Issue #36313 · JuliaLang/julia · GitHub)?


Thanks for your answer, and sorry for the late reply. Then, as far as I understand, to modify a slice of an N-d array addressed by its flattened indices, one either has to explicitly loop over the flattened indices (`update2!`) or to allocate (`update1!`). Is this correct?

`A_field[J] += A[i]` is just `A_field[J] = A_field[J] + A[i]`: the slice `A_field[J]` on the right-hand side copies, and the sum allocates a second intermediate array. The third version creates neither, because the slice on the right-hand side is now a view – it is equivalent to `@views A_field[J] .= A_field[J] .+ A[i]`, which additionally writes the result directly back into `A_field[J]`.
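To make the two lowerings concrete on a plain vector (a sketch):

```julia
A_field = zeros(4)
J = [1, 3]
x = [0.5, 0.25]

# Non-broadcast update: `A_field[J] += x` lowers to
# `A_field[J] = A_field[J] + x`, so the slice on the right copies
# and the sum allocates a second temporary array.
A_field[J] += x

# View-based broadcast: reads and writes go through the same view,
# equivalent to `@views A_field[J] .= A_field[J] .+ x`; no temporaries.
@views A_field[J] .+= x

A_field  # [1.0, 0.0, 0.5, 0.0]
```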

That depends on how general it has to be. Personally, I think `update3!` is the best version you’re going to get for the most general cases, but further improvements may depend on how `Part` slices into your arrays in general.
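On the readability question, one option (a hypothetical helper, not from the thread) is to move the inner scalar loop of `update2!` into its own small function, which keeps the call site as compact as `update1!` while still avoiding slices and temporaries:

```julia
# Hypothetical helper: scalar indexing only, so nothing is allocated.
function addat!(dest, J, src)
    for (i, j) in enumerate(J)
        dest[j] += src[i]
    end
    return dest
end

function update2b!(A_field, Part, A)
    for (i, J) in enumerate(Part)
        addat!(A_field, J, A[i])  # one readable line per chunk
    end
end
```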

I don’t know what your real application is, but if the indices array `Part` is that simple, you could use a simpler and much faster version as `update4!` below. You don’t need to use `Part` at all in this case.

```julia
using BenchmarkTools

function update4!(A_field, A)
    for i in eachindex(A)
        @views A_field[2i-1:2i] = A[i]
    end
end

N = 10
shapeA = (N, N)
Part = [[i, i+1] for i in 1:2:prod(shapeA)]
A = [rand(length(J)) for J in Part]
A_field = zeros(shapeA)

@btime update4!(y, $A) setup = (y = copy(A_field))
# 218.200 ns (0 allocations: 0 bytes)
```

The actual partition `Part` can be quite general, so in the real implementation I can’t do this. But thanks a lot for taking the time, it’s really nice to see what works and what doesn’t.

That’s very interesting, especially in view of @Seif_Shebl’s answer below (Counterintuitive array update allocations - #7 by Seif_Shebl), which implies that for ranges there are no allocations… Somehow it still feels like an odd difference in behavior.

It doesn’t have anything to do with whether or not what you’re indexing with is a range (and in any case, your original `J` was not a range anyway). Note also that in the code from @Seif_Shebl, there’s no `+=`, only an assignment. I bet it would have the same performance characteristics without the `@views`, since single-index indexing doesn’t allocate (after all, there’s no intermediate collection to be created; only a single element has to be retrieved) – only slicing/indexing with arrays does. In your original code, it’s the expansion from `A[J] += B` to `A[J] = A[J] + B` and the subsequent slicing on the right-hand side of `=` that allocates, since `J` is an array here, not a single index.
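A sketch of that distinction (scalar indexing vs. indexing with an array):

```julia
x = zeros(10)

# Scalar indexing: a single element is read and written back; there is
# no intermediate collection, so nothing is allocated.
x[3] += 1.0

# Indexing with an array: `x[J] + y` first builds a copied slice and then
# a new sum array before assigning back, so both steps allocate.
J = [1, 2]
y = [1.0, 2.0]
x[J] += y
```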


Yes, you’re right, @Seif_Shebl’s solution adapted to sum the entries allocates the same as the original. Thank you very much for the detailed discussion!

The only solution I can think of that keeps the highest performance with zero allocations is using `StaticArrays`, but the caveat is that the length of the subvectors should not exceed about 100 to gain the benefit of using it.

```julia
using StaticArrays, BenchmarkTools

function update5!(A_field, Part, A)
    for (i, j) in pairs(Part)
        A_field[j] += A[i]
    end
end

N = 4000
Part = [SA[i, i+1] for i in 1:2:N*N]
A = [SA[rand(), rand()] for j in Part]
A_field = zeros(N*N)

@btime update5!(y, $Part, $A) setup = (y = copy(A_field))
# 25.975 ms (0 allocations: 0 bytes)
```

I wonder why Julia doesn’t support this out of the box; the only reason for me personally to continue using Fortran to this day is that Fortran’s arrays are heavily optimized in cases like this out of the box, without me even thinking about it.
Not to derail this discussion, but here is a simple Fortran program supporting my claims. Note that I used a 1D array for `A_field` for simplicity, but it makes no difference if I use a 2D array.

```fortran
program test_update
  type i_nvec
    integer :: a(2)
  end type
  type r_nvec
    real(8) :: a(2)
  end type

  integer, parameter :: N = 4000
  integer :: i, t0, t1, count_rate, count_max
  real(8) :: A_field(N*N)
  type(i_nvec) :: Part(N*N/2)
  type(r_nvec) :: A(N*N/2)

  do i = 1, N*N/2
    Part(i)%a = [2*i-1, 2*i]
    call random_number(A(i)%a)
  end do
  A_field = 0

  call system_clock(t0, count_rate, count_max)
  do i = 1, size(Part)
    A_field(Part(i)%a) = A_field(Part(i)%a) + A(i)%a
  end do
  call system_clock(t1)

  print '(a,f7.5)', 'Time: ', real(t1-t0)/count_rate
  print *, A_field(1:5)
end

! Time: 0.01500  (gfortran on Windows 10, oscillates from 15 ms to 31 ms)
! Time: 0.02500  (gfortran on Linux)
!   0.99755959      0.56682470      0.96591537     0.74792768      0.3673908973
```

StrideArrays.jl may also be an option for fast, larger stack-allocated arrays.