Counterintuitive array update allocations

I’ve run into some rather counterintuitive behavior while coding up an example. I have an array A_field, which I want to update by adding certain values at certain positions. These are provided by an array of arrays of (flattened) indices, Part, and an array of arrays of values, A, as follows:

N = 10
shapeA = (N, N)
A_field = zeros(shapeA)

Part = [[i, i+1] for i in 1:2:prod(shapeA)]
A = [rand(length(J)) for J in Part]

You may want to think of this as having split some big problem for A_field into small chunks, solved them independently, and now wanting to join the results.

The most straightforward implementation of the updating I can think of is the following:

function update1!(A_field, Part, A)
    for (i, J) in enumerate(Part)
        A_field[J] += A[i]
    end
end

But, strangely, this results in as many allocations as there are entries in A_field:

julia> @time update1!(A_field, Part, A)
  0.000010 seconds (100 allocations: 9.375 KiB)

This is of course not nice for big problems. (Disclaimer: I tried all the combinations of @views, .+= instead of +=, and so on that I could think of, and this is the variant that allocated the least.) A second possibility is to introduce a second loop over each J:

function update2!(A_field, Part, A)
    for (k, J) in enumerate(Part)
        for (i, j) in enumerate(J)
            A_field[j] += A[k][i]
        end
    end
end

This is slightly more cumbersome to read and understand, but of course it introduces no allocations:

julia> @time update2!(A_field, Part, A)
  0.000006 seconds

Finally, reshaping the array and using views seems to get rid of the allocations (apart from the ones needed for the reshaping, of course):

function update3!(A_field, Part, A)
    A_temp = reshape(A_field, :)
    for (i, J) in enumerate(Part)
        @views A_temp[J] .+= A[i]
    end
end

julia> @time update3!(A_field, Part, A)
  0.000009 seconds (2 allocations: 80 bytes)

So at this point I am a bit confused and have a lot of questions.

  1. Why does the 3rd version work well and not the 1st one?
  2. Is there any way (macro, trick) to obtain the performance of the 2nd version without sacrificing the readability of the 1st version?

Thanks!

Using @views in the first version does not help because it basically does the reshape part of the third version, but on every loop iteration. This is because a view of a matrix with “1D indices” (i.e., A[x] instead of A[x,y]) will internally reshape the matrix as a vector, and reshaping an array allocates.
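
For concreteness, here is a sketch of what the @views variant presumably looked like (a reconstruction with a made-up name, not the exact code from the question); it still allocates a small reshape wrapper on every iteration:

function update1_views!(A_field, Part, A)   # illustrative name
    for (i, J) in enumerate(Part)
        # view(A_field, J) has to reshape the matrix to a vector first,
        # and that reshape allocates a small wrapper object each time
        @views A_field[J] .+= A[i]
    end
end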

Is this a consequence of reshape(::Array, ...) is slow and allocates · Issue #36313 · JuliaLang/julia · GitHub?

Thanks for your answer, and sorry for the lateness. Then, as far as I understand, to modify a slice of an N-d array addressed by its flattened indices, one either has to explicitly loop over the flattened indices (update2!) or allocate (update1!). Is this correct?

A_field[J] += A[i] is just A_field[J] = A_field[J] + A[i], which results in an intermediary that has to be allocated (plus A_field[J] creates a slice, which copies). The third version doesn’t create that intermediary because the slice on the right hand side is now a view - it is equivalent to @views A_field[J] .= A_field[J] .+ A[i], so it additionally writes the result directly back into A_field[J].
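
A quick way to see this (a minimal sketch with made-up helper names; exact byte counts depend on the Julia version):

M = zeros(10, 10)
J = [1, 2]
a = rand(2)

slice_update!(M, J, a) = (M[J] += a; nothing)          # M[J] = M[J] + a
fused_update!(M, J, a) = (@views M[J] .+= a; nothing)  # view(M, J) .= view(M, J) .+ a

slice_update!(M, J, a); fused_update!(M, J, a)  # compile first

@allocated slice_update!(M, J, a)  # > 0: the slice copy and the sum are both new arrays
@allocated fused_update!(M, J, a)  # only the small reshape wrapper for the view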

That depends on how general it has to be. Personally, I think update3! is the best version you’re going to get for the most general cases, but further improvements may depend on how Part slices into your arrays in general.

I don’t know what your real application is, but if the indices array Part is that simple, you could use a simpler and much faster version like update4! below. You don’t need to use Part at all in this case.

function update4!(A_field, A)
    for i in eachindex(A)
        @views A_field[2i-1:2i] = A[i]
    end
end

using BenchmarkTools

N = 10
shapeA = (N, N)
Part = [[i, i+1] for i in 1:2:prod(shapeA)]
A = [rand(length(J)) for J in Part]
A_field = zeros(shapeA)

@btime update4!(y, $A) setup = (y = copy(A_field))
  218.200 ns (0 allocations: 0 bytes)

The actual partition Part can be quite general, so in the real implementation I can’t do this. But thanks a lot for taking the time, it’s really nice to see what works and what doesn’t :slight_smile:

That’s very interesting, especially in view of Seif_Shebl’s answer below: Counterintuitive array update allocations - #7 by Seif_Shebl, which implies that for ranges there are no allocations… Somehow it still feels like a bit of an odd difference in behavior.

It doesn’t have anything to do with whether or not what you’re indexing with is a range or not (and in any case, your original J was not a range anyway). Note also that in the code from @Seif_Shebl , there’s no += but only an assignment. I bet it would have the same performance characteristics without the @views, since single index indexing doesn’t allocate (after all, there’s no intermediary collection to be created, only a single element has to be retrieved), only slicing/indexing with arrays does. In your original code, it’s the expansion from A[J] += B to A[J] = A[J] + B and the subsequent slicing on the right hand side of = that allocates, since J is an array here, not a single index.
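
To make that concrete, a minimal sketch (using a plain vector so no reshape is involved; byte counts may vary across Julia versions, and the helper names are made up):

x = zeros(100)
J = [1, 2]
b = rand(2)

assign!(x, J, b) = (x[J] = b; nothing)  # setindex! only: nothing is read or copied
read_slice(x, J) = x[J]                 # getindex with an array of indices: copies

assign!(x, J, b); read_slice(x, J)      # compile first

@allocated assign!(x, J, b)   # 0 bytes: writing through array indices doesn't copy
@allocated read_slice(x, J)   # > 0 bytes: the slice is a new array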

Yes, you’re right, @Seif_Shebl’s solution adapted to sum the entries allocates the same as the original. Thank you very much for the detailed discussion!
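
For reference, the adaptation was roughly along these lines (a sketch; the name is made up):

function update4_sum!(A_field, A)
    for i in eachindex(A)
        # back to the expanded form: the right-hand side slice is copied
        # and the sum is a second temporary, so it allocates per iteration
        A_field[2i-1:2i] += A[i]
    end
end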

The only solution I can think of that keeps the highest performance with zero allocations is using StaticArrays; the caveat is that the subvectors should not be longer than about 100 elements to benefit from it. (As far as I understand, indexing an ordinary array with an SVector of indices returns another SVector, so the slice and the intermediate sum can live on the stack instead of the heap.)

using StaticArrays, BenchmarkTools

function update5!(A_field, Part, A)
    for (i,j) in pairs(Part)
        A_field[j] += A[i]
    end
end
N = 4000
Part = [SA[i, i+1] for i in 1:2:N*N]
A = [SA[rand(),rand()] for j in Part]
A_field = zeros(N*N)

@btime update5!(y, $Part, $A) setup = (y = copy(A_field))
  25.975 ms (0 allocations: 0 bytes)

I wonder why Julia doesn’t support this out of the box. The only reason I personally still use Fortran today is that Fortran’s arrays are highly optimized in cases like this out of the box, without me even having to think about it.
Not to derail this discussion, but here is a simple Fortran program supporting my claims. Note that I used a 1D array for A_field for simplicity, but it makes no difference if I use a 2D array.

program test_update
    type i_nvec
        integer :: a(2) 
    end type    
    type r_nvec
        real(8) :: a(2) 
    end type

    integer, parameter :: N = 4000
    integer :: i, t0, t1, count_rate, count_max
    real(8) :: A_field(N*N)
    type(i_nvec) :: Part(N*N/2)
    type(r_nvec) :: A(N*N/2)

    do i = 1, N*N/2 
        Part(i)%a = [2*i-1,2*i]
        call random_number(A(i)%a)
    end do
    A_field = 0

    call system_clock(t0, count_rate, count_max)
    do i = 1,size(Part)
        A_field(Part(i)%a) = A_field(Part(i)%a) + A(i)%a
    end do
    call system_clock(t1)

    print '(a,f7.5)', 'Time: ', real(t1-t0)/count_rate
    print *, A_field(1:5)
end

! Time: 0.01500  (gfortran on Windows 10, oscillates from 15 ms to 31 ms)
! Time: 0.02500  (gfortran on Linux)
!   0.99755959      0.56682470      0.96591537     0.74792768      0.3673908973

StrideArrays.jl may also be an option for fast, larger stack-allocated arrays.