# A question about how arrays work, how memory is allocated, and what happens when chunks of code inside a function are moved into another function

I am trying to understand how to speed up some code when going multithreaded, and I posted a question about this a few days ago here.

While trying things out I encountered some behavior that I can't explain; here it goes.

```julia
using BenchmarkTools
```

Let’s say we have a structure of the type

```julia
struct C3D{P}
    Nx::NTuple{P,Array{<:Number,1}}
    Ny::NTuple{P,Array{<:Number,1}}
    Nz::NTuple{P,Array{<:Number,1}}
    wgt::NTuple{P,Number}
end

function C3D(P, N)
    Nx  = tuple([randn(N) for ii = 1:P]...)
    Ny  = tuple([randn(N) for ii = 1:P]...)
    Nz  = tuple([randn(N) for ii = 1:P]...)
    wgt = tuple(randn(P)...)
    C3D{P}(Nx, Ny, Nz, wgt)
end;
```

and a function that does some operations on it, of the kind:

```julia
function getϕ_a(elem::C3D{P}, N=24) where P
    wgt = elem.wgt
    res = zeros(N)
    Kt  = zeros(N, N)
    ∇F  = zeros(N, 9)

    for ii = 1:P
        Nx, Ny, Nz  = elem.Nx[ii], elem.Ny[ii], elem.Nz[ii]
        ∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
        ∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
        ∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz

        for jj = 1:9, i1 = 1:N
            res[i1] += wgt[ii] * ∇F[i1,jj]
            for kk = 1:9, i2 = 1:N
                Kt[i1,i2] += wgt[ii] * ∇F[i1,jj] * ∇F[i2,kk]
            end
        end
    end
    res, Kt
end;
```

Let's create an instance of the struct `C3D`:

```julia
elem = C3D(8, 8);
```

If I try to benchmark `getϕ_a` I get

```julia
@btime getϕ_a(elem);
```
```
  118.326 ms (2621452 allocations: 40.01 MiB)
```

But if I move the loops over `jj` into a separate function in this way:

```julia
function getϕ_b(elem::C3D{P}, N=24) where P
    wgt = elem.wgt
    res = zeros(N)
    Kt  = zeros(N, N)
    ∇F  = zeros(N, 9)

    for ii = 1:P
        Nx, Ny, Nz  = elem.Nx[ii], elem.Ny[ii], elem.Nz[ii]
        ∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
        ∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
        ∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz

        accumulate!(res, Kt, wgt[ii], ∇F, N)
    end
    res, Kt
end

function accumulate!(res, Kt, wgt, ∇F, N)
    for jj = 1:9, i1 = 1:N
        res[i1] += wgt * ∇F[i1,jj]
        for kk = 1:9, i2 = 1:N
            Kt[i1,i2] += wgt * ∇F[i1,jj] * ∇F[i2,kk]
        end
    end
end;
```

I get this performance:

```julia
@btime getϕ_b(elem, 24);
```
```
  493.000 μs (84 allocations: 9.06 KiB)
```

which is about 240 times faster, with over 2.6 million fewer allocations!

My question is: why is the first version so bad, and why does it need to allocate all of that memory, if I am only operating on elements of existing arrays?

Or conversely, why does that not happen in the second version, and is there a way to obtain the same performance without moving those loops into a separate function? (I would like to keep things compact and to control what is going on.)

Many thanks to all for reading, and particularly to those who'll respond.

This is a type stability problem. What you want is

```julia
struct C3D{P, T<:Number}
    Nx::NTuple{P, Array{T,1}}
    Ny::NTuple{P, Array{T,1}}
    Nz::NTuple{P, Array{T,1}}
    wgt::NTuple{P, T}
end
```

Otherwise, every access to a field of `elem` is type unstable. The reason your second version is faster is that it adds a function barrier, so `accumulate!` is type stable. With this change (and the corresponding change to the constructor), `getϕ_a` takes `253.749 μs (4 allocations: 6.67 KiB)`.
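For completeness, the corresponding constructor change might look like this (a sketch: it keeps the original `C3D(P, N)` signature and passes the element type explicitly, `Float64` here since that is what `randn` produces):

```julia
# Sketch of a constructor matching the parameterized struct above.
# randn returns Float64, so the element-type parameter is Float64.
function C3D(P, N)
    Nx  = ntuple(ii -> randn(N), P)
    Ny  = ntuple(ii -> randn(N), P)
    Nz  = ntuple(ii -> randn(N), P)
    wgt = ntuple(ii -> randn(), P)
    C3D{P,Float64}(Nx, Ny, Nz, wgt)
end
```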


Thanks, that makes things clearer!

But if `struct`s are immutable, why is the first definition type unstable? After all, after instantiating an object its type cannot change and should remain the type it had at construction time.

Also, is there a way to make chunks of code within a function type stable without enclosing them in another function?


It's a matter of representation. The `struct` that you wrote can be very useful if you are trying to lower compilation time and have a large number of types that might be used. The one I wrote will have better type stability and runtime performance.

Yes. For example:

```julia
function f(x::Vector{Any})
    y = x[1]
    if y isa Int
        return y + 1
    end
end
```

In this case, Julia will be smart enough to know that within the `if` statement, `y` is an `Int`.
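In the same spirit, a type assertion on the unstable field access should also work inside `getϕ_a` itself, without a separate function (a sketch, assuming you know the vectors are `Vector{Float64}`):

```julia
for ii = 1:P
    # Asserting the concrete type here restores type stability for the
    # rest of the loop body, without a function barrier.
    Nx = elem.Nx[ii]::Vector{Float64}
    Ny = elem.Ny[ii]::Vector{Float64}
    Nz = elem.Nz[ii]::Vector{Float64}
    # ... same loop body as in getϕ_a ...
end
```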


If you create a new `C3D{P}` object with `P = 3` and three `Vector{Int}`, you have an object of type `C3D{3}`. If you create a new `C3D{P}` object with `P = 3` and three `Vector{Float64}`, you also have an object of type `C3D{3}`. The way the struct is defined does not allow the compiler to optimize based on the type of the object, because objects carrying completely distinct `Vector`s will have the same type. Julia often optimizes on the type of the object (for all objects of the same type) not on a specific instance of the object.


The `P` in my definition of `C3D` is the number of elements in the tuples; the types of the vectors within the tuples can differ between `Nx`, `Ny` and `Nz`, but once an object is instantiated they cannot change, so I must be missing something with respect to type stability.

Sorry, I did not see that `P` was supposed to be a number, but this does not change my explanation.

You can have two objects of type `C3D{5}`, one with `Vector{Int}` inside `Nx`, `Ny`, and `Nz` and another with `Vector{Float64}` instead. You are right that after each of these objects is instantiated the types of the `Vector`s inside it cannot change; however, this is not relevant for type stability, because type stability means the code inside a method is compiled based on the types of the arguments passed to the function. If you call a function passing an object of type `C3D{5}` to it, the function will be compiled only with the information available in the type `C3D{5}`, which means it will not know the types of the `Vector`s inside it and will not be able to optimize on them. It does not matter that the specific object that triggered the compilation of the method body happens to have `Vector{Int}`s inside it, because this information is not available from the type alone, and the type is the only thing the compiler looks at when it specializes.
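You can see this directly by asking whether the field types are concrete (a sketch; `C3Dtyped` is a hypothetical name for the parameterized definition above, since the two definitions cannot both be called `C3D`):

```julia
# With the original definition the compiler only sees an abstract field type:
isconcretetype(fieldtype(C3D{5}, :Nx))               # false
# With the parameterized definition the field type is fully known:
isconcretetype(fieldtype(C3Dtyped{5,Float64}, :Nx))  # true
```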

Does this help your understanding of the topic?


I think I understand better now. So the reason the external function is faster and does not allocate additional memory is that I am passing `wgt[ii]` and `∇F` separately, thus the compiler is aware of their types; but in `getϕ_a` those tuples are hidden within `C3D`, and the compiler cannot specialize the code.

That means that when defining a structure one should make the types of its members visible as structure parameters, or one might face this sort of problem?!

Thanks for the explanation!


Exactly.


Yes, perfect. Your definition of `C3D` allows fields to have distinct types without this being reflected in the outer type `C3D` or its type parameters, effectively hiding these types from most compiler optimizations. Note this is not a big problem if these fields are rarely used (i.e., you can use this intentionally to cut compilation costs, accepting some overhead whenever such a field is read from the struct and used). All that we discussed is more or less what is explained in this performance tip: Performance Tips · The Julia Language


Thanks for the explanation and for the link!
