I am trying to understand how to speed up some code when going multithreading and I posted a question about this a few days ago here.

while trying things out I encountered some behavior that I can’t explain, here it goes.

```
using BenchmarkTools
```

Let’s say we have a structure of the type

```
struct C3D{P}
Nx::NTuple{P,Array{<:Number,1}}
Ny::NTuple{P,Array{<:Number,1}}
Nz::NTuple{P,Array{<:Number,1}}
wgt::NTuple{P,Number}
end
function C3D(P, N)
Nx = tuple([randn(N) for ii=1:P]...)
Ny = tuple([randn(N) for ii=1:P]...)
Nz = tuple([randn(N) for ii=1:P]...)
wgt = tuple(randn(P)...)
C3D{P}(Nx,Ny,Nz,wgt)
end
;
```

and a function that does some operations on it, of the kind:

```
function getϕ_a(elem::C3D{P}, N=24) where P
wgt = elem.wgt
res = zeros(N)
Kt = zeros(N,N)
∇F = zeros(N,9)
for ii=1:P
Nx,Ny,Nz = elem.Nx[ii],elem.Ny[ii],elem.Nz[ii]
∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz
for jj=1:9,i1=1:N
res[i1] += wgt[ii]*∇F[i1,jj]
for kk=1:9,i2=1:N
Kt[i1,i2] += wgt[ii]*∇F[i1,jj]*∇F[i2,kk]
end
end
end
res, Kt
end
;
```

let’s create an instance of the struct `C3D`

```
elem = C3D(8,8);
```

If I try to benchmark `getϕ_a`

I get

```
@btime getϕ_a(elem);
```

```
118.326 ms (2621452 allocations: 40.01 MiB)
```

but if I move the loops over `jj`

into a separate function in this way:

```
function getϕ_b(elem::C3D{P}, N=24) where P
wgt = elem.wgt
res = zeros(N)
Kt = zeros(N,N)
∇F = zeros(N,9)
for ii=1:P
Nx,Ny,Nz = elem.Nx[ii],elem.Ny[ii],elem.Nz[ii]
∇F[1:3:N,1] = ∇F[2:3:N,2] = ∇F[3:3:N,3] = Nx
∇F[1:3:N,4] = ∇F[2:3:N,5] = ∇F[3:3:N,6] = Ny
∇F[1:3:N,7] = ∇F[2:3:N,8] = ∇F[3:3:N,9] = Nz
accumulate!(res, Kt, wgt[ii], ∇F, N)
end
res, Kt
end
function accumulate!(res, Kt, wgt, ∇F, N)
for jj=1:9,i1=1:N
res[i1] += wgt*∇F[i1,jj]
for kk=1:9,i2=1:N
Kt[i1,i2] += wgt*∇F[i1,jj]*∇F[i2,kk]
end
end
end
;
```

I get this type of performance

```
@btime getϕ_b(elem, 24);
```

```
493.000 μs (84 allocations: 9.06 KiB)
```

which is 240 time faster with 100k less allocations!

My question is why the first version is so bad, and why does it need to allocate all of that memory, if I am only operating on elements of existing arrays ?

or conversely why does that not happen in the second version, and is there a way to obtain the same performances without the moving those loops into a separate function? (I would like to keep things compact and to control what is going on)

many thanks to all for reading and particularly to those who’ll respond