# A question about how arrays work, how memory is allocated, and what happens when chunks of code inside a function are moved into another function

I am trying to understand how to speed up some code when going multithreaded, and I posted a question about this a few days ago here.

While trying things out I encountered some behavior that I can't explain; here it goes.

```julia
using BenchmarkTools
```

Let’s say we have a structure of the type

```julia
struct C3D{P}
    Nx::NTuple{P,Array{<:Number,1}}
    Ny::NTuple{P,Array{<:Number,1}}
    Nz::NTuple{P,Array{<:Number,1}}
    wgt::NTuple{P,Number}
end

function C3D(P, N)
    Nx  = tuple([randn(N) for ii = 1:P]...)
    Ny  = tuple([randn(N) for ii = 1:P]...)
    Nz  = tuple([randn(N) for ii = 1:P]...)
    wgt = tuple(randn(P)...)
    C3D{P}(Nx, Ny, Nz, wgt)
end;
```

and a function that does some operations on it, of the kind:

```julia
function getϕ_a(elem::C3D{P}, N=24) where P
    wgt = elem.wgt
    res = zeros(N)
    Kt  = zeros(N, N)
    ∇F  = zeros(N, 9)

    for ii = 1:P
        Nx, Ny, Nz  = elem.Nx[ii], elem.Ny[ii], elem.Nz[ii]
        ∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
        ∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
        ∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz

        for jj = 1:9, i1 = 1:N
            res[i1] += wgt[ii] * ∇F[i1,jj]
            for kk = 1:9, i2 = 1:N
                Kt[i1,i2] += wgt[ii] * ∇F[i1,jj] * ∇F[i2,kk]
            end
        end
    end
    res, Kt
end;
```

Let's create an instance of the struct `C3D`:

```julia
elem = C3D(8, 8);
```

If I try to benchmark `getϕ_a` I get

```julia
@btime getϕ_a(elem);
```
```
  118.326 ms (2621452 allocations: 40.01 MiB)
```

But if I move the loops over `jj` into a separate function in this way:

```julia
function getϕ_b(elem::C3D{P}, N=24) where P
    wgt = elem.wgt
    res = zeros(N)
    Kt  = zeros(N, N)
    ∇F  = zeros(N, 9)

    for ii = 1:P
        Nx, Ny, Nz  = elem.Nx[ii], elem.Ny[ii], elem.Nz[ii]
        ∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
        ∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
        ∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz

        accumulate!(res, Kt, wgt[ii], ∇F, N)
    end
    res, Kt
end

function accumulate!(res, Kt, wgt, ∇F, N)
    for jj = 1:9, i1 = 1:N
        res[i1] += wgt * ∇F[i1,jj]
        for kk = 1:9, i2 = 1:N
            Kt[i1,i2] += wgt * ∇F[i1,jj] * ∇F[i2,kk]
        end
    end
end;
```

I get this performance:

```julia
@btime getϕ_b(elem, 24);
```
```
  493.000 μs (84 allocations: 9.06 KiB)
```

which is about 240 times faster, with over 2.6 million fewer allocations!

My question is: why is the first version so bad, and why does it need to allocate all of that memory, if I am only operating on elements of existing arrays?

Or conversely, why does that not happen in the second version, and is there a way to obtain the same performance without moving those loops into a separate function? (I would like to keep things compact and to control what is going on.)

Many thanks to all for reading, and particularly to those who'll respond.

This is a type stability problem. What you want is

```julia
struct C3D{P, T<:Number}
    Nx::NTuple{P, Array{T,1}}
    Ny::NTuple{P, Array{T,1}}
    Nz::NTuple{P, Array{T,1}}
    wgt::NTuple{P, T}
end
```

Otherwise, every access to a field of `elem` is type unstable. The reason your second version is faster is that it adds a function barrier, so `accumulate!` is type stable. With this change (and the corresponding change to the constructor), `getϕ_a` takes `253.749 μs (4 allocations: 6.67 KiB)`.
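For completeness, the corresponding constructor change might look like this (a sketch: it keeps the original `C3D(P, N)` signature and passes the element type explicitly, `Float64` here since that is what `randn` produces):

```julia
# Sketch of a constructor matching the parameterized struct above.
# randn returns Float64, so the element-type parameter is Float64.
function C3D(P, N)
    Nx  = ntuple(ii -> randn(N), P)
    Ny  = ntuple(ii -> randn(N), P)
    Nz  = ntuple(ii -> randn(N), P)
    wgt = ntuple(ii -> randn(), P)
    C3D{P,Float64}(Nx, Ny, Nz, wgt)
end
```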


Thanks, that makes things clearer!

But if `struct`s are immutable, why is the first definition type unstable? After all, after instantiating an object its type cannot change and should remain the type it had at construction time.

Also, is there a way to make chunks of code within a function type stable without enclosing them in another function?


It's a matter of representation. The `struct` that you wrote can be very useful if you are trying to lower compilation time and have a large number of types that might be used. The one I wrote will have better type stability and runtime performance.

Yes. For example:

```julia
function f(x::Vector{Any})
    y = x[1]
    if y isa Int
        return y + 1
    end
end
```

In this case, Julia will be smart enough to know that within the `if` statement, `y` is an `Int`.
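In the same spirit, a type assertion on the unstable field access should also work inside `getϕ_a` itself, without a separate function (a sketch, assuming you know the vectors are `Vector{Float64}`):

```julia
for ii = 1:P
    # Asserting the concrete type here restores type stability for the
    # rest of the loop body, without a function barrier.
    Nx = elem.Nx[ii]::Vector{Float64}
    Ny = elem.Ny[ii]::Vector{Float64}
    Nz = elem.Nz[ii]::Vector{Float64}
    # ... same loop body as in getϕ_a ...
end
```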


If you create a new `C3D{P}` object with `P = 3` and three `Vector{Int}`, you have an object of type `C3D{3}`. If you create a new `C3D{P}` object with `P = 3` and three `Vector{Float64}`, you also have an object of type `C3D{3}`. The way the struct is defined does not allow the compiler to optimize based on the type of the object, because objects carrying completely distinct `Vector`s will have the same type. Julia often optimizes on the type of the object (for all objects of the same type) not on a specific instance of the object.


The `P` in my definition of `C3D` is the number of elements in the tuples; the types of the vectors within the tuples can differ between `Nx`, `Ny` and `Nz`, but once an object is instantiated they cannot change, so I must be missing something with respect to type stability.

Sorry, I did not see that `P` was supposed to be a number, but this does not change my explanation.

You can have two objects of type `C3D{5}`, one with `Vector{Int}` inside `Nx`, `Ny`, and `Nz` and another with `Vector{Float64}` instead. You are right that after each of these objects is instantiated the types of the `Vector`s inside it cannot change; however, this is not relevant for type stability, because type stability means the code inside a method is compiled based on the types of the arguments passed to the function. If you call a function passing an object of type `C3D{5}` to it, the function will be compiled only with the information available in the type `C3D{5}`, which means it will not know the types of the `Vector`s inside it and will not be able to optimize on them. It does not matter that the specific object that triggered the compilation of the method body happens to have `Vector{Int}`s inside it, because this information is not available from the type alone, and the type is the only thing the compiler looks at when it specializes.
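You can see this directly by asking whether the field types are concrete (a sketch; `C3Dtyped` is a hypothetical name for the parameterized definition above, since the two definitions cannot both be called `C3D`):

```julia
# With the original definition the compiler only sees an abstract field type:
isconcretetype(fieldtype(C3D{5}, :Nx))               # false
# With the parameterized definition the field type is fully known:
isconcretetype(fieldtype(C3Dtyped{5,Float64}, :Nx))  # true
```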

Does this help your understanding of the topic?


I think I understand better now. So the reason the external function is faster and does not allocate additional memory is that I am passing `wgt[ii]` and `∇F` separately, thus the compiler is aware of their types; but in `getϕ_a` those tuples are hidden within `C3D`, and the compiler cannot specialize the code.

That means that when defining a structure one should make the types of its members visible as structure parameters, or one might face this sort of problem?!

Thanks for the explanation!


Exactly.


Yes, perfect. Your definition of `C3D` allows fields to have distinct types without this being reflected in the outer type `C3D` or its type parameters, effectively hiding these types from most compiler optimizations. Note this is not a big problem if these fields are rarely used (i.e., you can use this intentionally to cut compilation costs, accepting some overhead whenever such a field is read from the struct and used). All that we discussed is more or less what is explained in this performance tip: Performance Tips · The Julia Language


Thanks for the explanation and for the link!
