Memory usage in nested loops

Hey everyone,

i would like to have some advise how to perform loops over vector multiplication.
I need to do this for alot of vectors, so i try to use multi threading.
The memory consumption of my function is really high and i do not really understand why.

To be more specific, I have 4 arrays with data (a,b,c,d) with dimension (L, N, N).
I use loops over the indices (N,N) and elementwise multiplication in the direction (L).
For multi threading i creat 2 auxillary arrays (aux1,aux2) for each thread which store elementwise multiplication along L-direction of two element of (a,b,c,d).
Afterwards I take a shifted scalar product of the auxillary arrays.

Increasing N leads to a rapid growth in the memory consumption, which is far larger than the increase in size of the arrays. I would appreciate alot suggestions how to improve here.

The example code looks like this

function test()

    in_p=100
    
 N=12
 L=2^14

    a=rand(Complex{Float64},L,N,N)
    b=rand(Complex{Float64},L,N,N)
    c=rand(Complex{Float64},L,N,N)
    d=rand(Complex{Float64},L,N,N)
    
   aux=zeros(Complex{Float64},Threads.nthreads() )
   aux1=[ zeros(Complex{Float64},L) for i=1:Threads.nthreads() ] 
   aux2=[ zeros(Complex{Float64},L) for i=1:Threads.nthreads() ] 
        
Threads.@threads for i1=1:N        
        t=Threads.threadid()      
      for i2=1:N , i3=1:N  , i4=1:N
    aux1[t] .=  a[:,i1,i2] .* b[:,i3,i4]
    aux2[t] .=  c[:,i2,i3] .* d[:,i4 ,i1]             
    aux[t] +=  transpose(@views aux1[t][1:end-in_p]) * (@views aux2[t][1+in_p:end])  
            end
end
    
    return sum(aux)
end
aux1[t] .= a[:,i1,i2] .* b[:,i3,i4]
aux2[t] .= c[:,i2,i3] .* d[:,i4 ,i1]

causes memory allocation on rhs before broadcasting *. It doesn’t seem like but

aux[t] += transpose(@views aux1[t][1:end-in_p]) * (@views aux2[t][1+in_p:end])

also allocates in rhs with aux[t] + transpose(...)*(...)

You can just put @views in front of function test() in order to use views for slices everywhere in the function. (Creating a view still allocates a small amount of memory.)

In any case, doing 3 array passes just to compute a dot product at the end seems pretty wasteful, especially when you don’t even use all of the products in your aux1 and aux2 arrays. It looks like you can get the same result with just:

s = aux[t]
for i = 1:L-in_p
    s += a[i,i1,i2]*b[i,i3,i4] * c[i+in_p,i2,i3]*d[i+in_p,i4,i1]
end
aux[t] = s

(You might also want @simd for ... on this loop as well as @inbounds s += ..., for speed.)

5 Likes