Understanding the allocation behavior on arrays vs scalars

I am confused about the behavior of the allocation, as a significant gain in efficiency seems to be achieved by accumulating values into 2 scalars rather than a vector of 2 values.

For the context, the function scans over a vector x of values and accumulate values of the associated matrix δ which has 2 columns and number of rows = size(x)

First function tracks values on a vector and performs poorly. Second function tracks the values of each column in 2 seperate scaler and performs much better.

I have difficulty understanding why so many allocations are made in first function given the usage of the .+= which I though would simply mutate the tracked values. I also attempted using a view on δ but it didn’t had any impact.

function find_split_1(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = zeros(S, 2)
    for i in 1:(size(x, 1) - 1)
        x1 .+= δ[i, :]
    end
    return x1
end

function find_split_2(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = zero(S)
    x2 = zero(S)
    for i in 1:(size(x, 1) - 1)
        x1 += δ[i,1]
        x2 += δ[i,2]
    end
    return x1, x2
end

x = rand(1000000)
δ = rand(1000000, 2)

@time find_split_1(x, δ)
@time find_split_2(x, δ)

First function should return something similar to:

0.056580 seconds (1.00 M allocations: 91.553 MiB, 11.92% gc time)

And the second, much more efficient:

0.001558 seconds (5 allocations: 192 bytes)

Is it possible for function 1 to achieve same performance as function 2 while maintaining the “vectorized” approach? Some rationale on the difference in behavior would be helpful.

1 Like

Use StaticArrays for small fixed-size arrays and it will be efficiently unrolled by the compiler as if you had used individual variables.

When you use a general Array type to hold a small vector like x1, the size of the array is a runtime value — the compile has to allocate the array on the heap, and every vector operation like x1 .+= 1 gets turned into a loop like for j = 1:length(x1); x1[j] += 1; end. For such a small operation (adding two numbers), the overhead of checking/updating the loop counter, and dereferencing the array x1[j] to access a memory location in the heap, is significant.

Also, δ[i, :] allocates a copy (a 2-element array on the heap). @views δ[i, :] gets rid of this copy, but currently still allocates a view object (a SubArray object) on the heap, so you still have the overhead of an allocation. Compared to the cost of adding two numbers, allocating any object on the heap, no matter how small, is expensive.

In contrast, when x1 and x2 are scalars, the compiler probably stores them directly in registers. There is no for j = 1:2 loop counter that has to be incremented and checked. There is no memory load/store to read/write x1[j]. You get the raw cost of two additions plus the cost of reading δ[i,1] and `δ[i,2] from memory.

A StaticArray type is a type of array where the length of the array is part of the type, so that it is known to the compiler — the compiler can completely unroll any loops over a StaticArray (no loop counters), and can store the array members on the stack or in registers (no heap access), just as if you had declared the elements of the array as scalar variables.

PS. Using @btime find_split_1($x, $δ) with the @btime macro from the BenchmarkTools package will generally give more accurate timing results.

6 Likes

Thanks so much for the detailed explanations.
I’m afraid I have misunderstood the consequences of them as my attempts to integrate a static arrays resulted in the same poor performance as for the normal arrays (1M allocations).

function find_split_static3(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = @SVector zeros(S, 2)
    for i in 1:(size(x, 1) - 1)
        x1 += δ[i, :]
    end
    return x1
end

I guess that the addition operation on a static and normal arrays is causing issues, though I’m struggling figuring out how the update could be otherwise performed (broadcasting doesn’t seem to work). Sorry, it feels like I need to be taken by the hand on this one!

Yes, to get the full benefit you need to tell the compiler the size of the δ slices too.

One elegant and efficient approach is to make δ an array of SVector as well, e.g.

function find_split(δ::AbstractVector{<:SVector{L,S}}) where {L,S}
    x1 = zero(SVector{L,S})
    for i in eachindex(δ)
        x1 += δ[i]
    end
    return x1
end

δ = rand(SVector{2,Float64}, 1000000)

I then get:

julia> @btime find_split($δ);
  1.175 ms (0 allocations: 0 bytes)
2 Likes

Wow, I’d never realized large collection of static arrays was functional. That’s an eye opener!

An array of SVectors is actually extremely efficient: all the elements are stored “inline” consecutively in memory. (It’s not an array of “pointers” to other objects.)

For example, the array [SVector(1,2,3), SVector(4,5,6), SVector(7,8,9)] is stored in memory exactly like the array [1,2,3,4,5,6,7,8,9].

This sort of thing is part of the power of Julia’s immutable types in general.

5 Likes