I am confused by the allocation behavior below: a significant gain in efficiency seems to come from accumulating values into 2 scalars rather than into a vector of 2 values.
For context, the function scans over a vector x and accumulates values from the associated matrix δ, which has 2 columns and size(x) rows.
The first function accumulates into a vector and performs poorly. The second tracks each column in a separate scalar and performs much better.
I have difficulty understanding why so many allocations are made in the first function, given the use of .+=, which I thought would simply mutate the tracked vector in place. I also attempted using a view on δ, but it didn't have any impact.
function find_split_1(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = zeros(S, 2)
    for i in 1:(size(x, 1) - 1)
        x1 .+= δ[i, :]
    end
    return x1
end
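For completeness, here is roughly what the view-based attempt mentioned above could look like (a hypothetical reconstruction, since the exact attempted code isn't shown; the function name find_split_1_view is mine):

```julia
# Hypothetical view-based variant: @view avoids copying the row slice,
# but the SubArray wrapper itself may still be heap-allocated inside a
# loop on some Julia versions, so allocations can persist.
function find_split_1_view(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = zeros(S, 2)
    for i in 1:(size(x, 1) - 1)
        x1 .+= @view δ[i, :]   # no copy of the row, broadcast in place into x1
    end
    return x1
end
```

Note also that δ is stored column-major, so slicing rows (δ[i, :]) strides across memory, whereas the scalar version in find_split_2 reads each column contiguously.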
function find_split_2(x::AbstractArray{T, 1}, δ::AbstractArray{S, 2}) where {T<:Real, S<:AbstractFloat}
    x1 = zero(S)
    x2 = zero(S)
    for i in 1:(size(x, 1) - 1)
        x1 += δ[i, 1]
        x2 += δ[i, 2]
    end
    return x1, x2
end
x = rand(1000000)
δ = rand(1000000, 2)
@time find_split_1(x, δ)
@time find_split_2(x, δ)
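As an aside on measurement (not part of the original timings): a first @time call includes compilation, and timing with non-constant globals can skew results. A more stable measurement uses BenchmarkTools with $-interpolated arguments, assuming the two functions above are already defined:

```julia
using BenchmarkTools

x = rand(1_000_000)
δ = rand(1_000_000, 2)

# $-interpolation passes the globals by value so the benchmark sees
# concrete types; @btime runs many samples and excludes compilation.
@btime find_split_1($x, $δ)
@btime find_split_2($x, $δ)
```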
The first function should report something similar to:
0.056580 seconds (1.00 M allocations: 91.553 MiB, 11.92% gc time)
And the second, much more efficient:
0.001558 seconds (5 allocations: 192 bytes)
Is it possible for function 1 to achieve the same performance as function 2 while maintaining the "vectorized" approach? Some rationale on the difference in behavior would be helpful.