Hey everyone,

I would like some advice on how to perform loops over vector multiplications.

I need to do this for a lot of vectors, so I am trying to use multithreading.

The memory consumption of my function is really high and I do not really understand why.

To be more specific, I have 4 arrays with data (a,b,c,d) with dimension (L, N, N).

I use loops over the indices (N,N) and elementwise multiplication in the direction (L).

For multithreading I create 2 auxiliary arrays (aux1, aux2) for each thread, which store the elementwise product along the L direction of two of the arrays (a,b,c,d).

Afterwards I take a shifted scalar product of the auxiliary arrays.
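
Written out, the quantity computed by the example code in this post should be the following, where $p$ denotes the shift (`in_p` in the code) and the scalar product is taken without complex conjugation, as `transpose` is used:

```latex
S = \sum_{i_1, i_2, i_3, i_4 = 1}^{N} \; \sum_{l = 1}^{L - p}
    a_{l, i_1, i_2} \, b_{l, i_3, i_4} \, c_{l + p, i_2, i_3} \, d_{l + p, i_4, i_1}
```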

Increasing N leads to a rapid growth in memory consumption, which is far larger than the increase in the size of the arrays. I would really appreciate suggestions on how to improve this.

The example code looks like this

```
function test()
    in_p = 100   # shift used in the scalar product
    N = 12
    L = 2^14
    a = rand(Complex{Float64}, L, N, N)
    b = rand(Complex{Float64}, L, N, N)
    c = rand(Complex{Float64}, L, N, N)
    d = rand(Complex{Float64}, L, N, N)
    # per-thread accumulator and per-thread work buffers
    aux = zeros(Complex{Float64}, Threads.nthreads())
    aux1 = [zeros(Complex{Float64}, L) for i = 1:Threads.nthreads()]
    aux2 = [zeros(Complex{Float64}, L) for i = 1:Threads.nthreads()]
    Threads.@threads for i1 = 1:N
        t = Threads.threadid()
        for i2 = 1:N, i3 = 1:N, i4 = 1:N
            # elementwise products along the L direction
            aux1[t] .= a[:, i1, i2] .* b[:, i3, i4]
            aux2[t] .= c[:, i2, i3] .* d[:, i4, i1]
            # shifted scalar product (transpose, so no complex conjugation)
            aux[t] += transpose(@views aux1[t][1:end-in_p]) * (@views aux2[t][1+in_p:end])
        end
    end
    return sum(aux)
end
```
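
One possible explanation, not confirmed by profiling here: in Julia, indexing such as `a[:,i1,i2]` creates a copy of the slice. Under that assumption, each inner iteration copies 4 slices of 2^14 complex numbers (about 1 MiB in total), and with N^4 = 20736 iterations this adds up to roughly 20 GiB of temporary allocations, growing like N^4 while the arrays themselves only grow like N^2. Below is a minimal sketch of a variant that uses views and a single fused loop instead. The function name `shifted_sum` and its argument list are hypothetical, chosen only for this illustration, and the per-`i1` accumulator replaces the `threadid()` indexing.

```julia
# Sketch only: same sum as test(), but without copying slices.
# `shifted_sum` is a hypothetical name; arrays are passed in as arguments.
function shifted_sum(a, b, c, d, in_p)
    L, N, _ = size(a)
    partial = zeros(ComplexF64, N)   # one result slot per i1, so no threadid() is needed
    Threads.@threads for i1 in 1:N
        acc = zero(ComplexF64)       # local accumulator for this i1
        for i2 in 1:N, i3 in 1:N, i4 in 1:N
            # views refer to the existing data instead of copying it
            va = @view a[:, i1, i2]
            vb = @view b[:, i3, i4]
            vc = @view c[:, i2, i3]
            vd = @view d[:, i4, i1]
            # fused elementwise product and shifted scalar product (no conjugation)
            @inbounds for l in 1:(L - in_p)
                acc += va[l] * vb[l] * vc[l + in_p] * vd[l + in_p]
            end
        end
        partial[i1] = acc
    end
    return sum(partial)
end
```

It could be called as `shifted_sum(a, b, c, d, 100)` with the arrays from `test()`. Comparing both versions with `@time` (or `@allocated`) should show whether the slice copies are really what drives the memory use.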