Memory usage in nested loops

M_Kaj · December 10, 2019, 1:43pm

Hey everyone,

i would like to have some advise how to perform loops over vector multiplication.
I need to do this for alot of vectors, so i try to use multi threading.
The memory consumption of my function is really high and i do not really understand why.

To be more specific, I have 4 arrays with data (a,b,c,d) with dimension (L, N, N).
I use loops over the indices (N,N) and elementwise multiplication in the direction (L).
For multi threading i creat 2 auxillary arrays (aux1,aux2) for each thread which store elementwise multiplication along L-direction of two element of (a,b,c,d).
Afterwards I take a shifted scalar product of the auxillary arrays.

Increasing N leads to a rapid growth in the memory consumption, which is far larger than the increase in size of the arrays. I would appreciate alot suggestions how to improve here.

The example code looks like this

function test()

    in_p=100
    
 N=12
 L=2^14

    a=rand(Complex{Float64},L,N,N)
    b=rand(Complex{Float64},L,N,N)
    c=rand(Complex{Float64},L,N,N)
    d=rand(Complex{Float64},L,N,N)
    
   aux=zeros(Complex{Float64},Threads.nthreads() )
   aux1=[ zeros(Complex{Float64},L) for i=1:Threads.nthreads() ] 
   aux2=[ zeros(Complex{Float64},L) for i=1:Threads.nthreads() ] 
        
Threads.@threads for i1=1:N        
        t=Threads.threadid()      
      for i2=1:N , i3=1:N  , i4=1:N
    aux1[t] .=  a[:,i1,i2] .* b[:,i3,i4]
    aux2[t] .=  c[:,i2,i3] .* d[:,i4 ,i1]             
    aux[t] +=  transpose(@views aux1[t][1:end-in_p]) * (@views aux2[t][1+in_p:end])  
            end
end
    
    return sum(aux)
end

tomaklutfu · December 10, 2019, 2:41pm

aux1[t] .= a[:,i1,i2] .* b[:,i3,i4]
aux2[t] .= c[:,i2,i3] .* d[:,i4 ,i1]

causes memory allocation on rhs before broadcasting *. It doesn’t seem like but

aux[t] += transpose(@views aux1[t][1:end-in_p]) * (@views aux2[t][1+in_p:end])

also allocates in rhs with aux[t] + transpose(...)*(...)

stevengj · December 10, 2019, 3:44pm

You can just put @views in front of function test() in order to use views for slices everywhere in the function. (Creating a view still allocates a small amount of memory.)

In any case, doing 3 array passes just to compute a dot product at the end seems pretty wasteful, especially when you don’t even use all of the products in your aux1 and aux2 arrays. It looks like you can get the same result with just:

s = aux[t]
for i = 1:L-in_p
    s += a[i,i1,i2]*b[i,i3,i4] * c[i+in_p,i2,i3]*d[i+in_p,i4,i1]
end
aux[t] = s

(You might also want @simd for ... on this loop as well as @inbounds s += ..., for speed.)

Topic		Replies	Views
Large allocations using @Threads.threads in a loop leads to slow down New to Julia multithreading	17	1005	August 19, 2023
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	419	November 7, 2022
Elementwise multiplication of arrays across many cores General Usage parallel	5	2338	April 14, 2017
Poor performance multiplying many (large) matrices multithreaded Performance question , linearalgebra	11	2480	July 13, 2020
Speeding up array multiplication General Usage multithreading , high-performance	7	420	March 9, 2025

Memory usage in nested loops

Related topics