I'm running into a problem when using multithreading to build a matrix. I have a function that returns a single matrix element after operating on large arrays. I'm using @views to avoid copying the data.

```
function F(m, n, phi0, K_array, P_list)::ComplexF64
    return @views ...do simple computation over phi0, K_array, P_list
end
```

Applying F once gives (using @btime from BenchmarkTools):

```
55.524 μs (700 allocations: 39.84 KiB)
```

Now I use a machine with 100 cores to construct a matrix of size Nsite*Nsite using Threads.@threads, as follows:

```
function F_array(phi0, K_array, P_list, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    # Only the strict upper triangle (n > m) is filled; each thread writes to distinct entries.
    Threads.@threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F(m, n, phi0, K_array, P_list)
        end
    end
    return result
end
```

When I run F_array with Nsite = 4000, I get

```
416.801 s (8381104705 allocations: 345.49 GiB)
```

That is a huge amount of allocation, which makes the computation very slow.
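As a rough sanity check, the sketch below scales the single-call @btime figures to the number of F calls that F_array makes (one per entry of the strict upper triangle). It only uses the numbers quoted above. It assumes every call costs the same as the benchmarked one, which may not hold.

```julia
# Back-of-the-envelope estimate: scale the single-call @btime figures
# to the number of F calls made by F_array (strict upper triangle only).
Nsite = 4000
ncalls = Nsite * (Nsite - 1) ÷ 2      # number of (m, n) pairs with n > m

# Single-call figures quoted from the @btime output above.
allocs_per_call = 700
kib_per_call = 39.84
time_per_call_us = 55.524

est_allocs = ncalls * allocs_per_call
est_gib = ncalls * kib_per_call / 1024^2          # KiB -> GiB
est_serial_s = ncalls * time_per_call_us * 1e-6   # μs -> s, single-threaded

println("calls: ", ncalls)
println("estimated allocations: ", est_allocs)
println("estimated memory (GiB): ", round(est_gib, digits=1))
println("estimated serial time (s): ", round(est_serial_s, digits=1))
```

Under that assumption the estimate is about 5.6 billion allocations, about 304 GiB, and about 444 s of single-threaded work. These are the same order of magnitude as the measured totals, which would suggest that most of the allocation comes from the calls to F themselves. The measured wall time is also close to the serial estimate.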

I have also tried replacing F with a simple function that has a similar single-call run time and allocation count. Constructing the matrix with that simple function takes roughly 500 ms, so I expected a similar time for my F_array function.
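For anyone who wants to reproduce this kind of comparison, a minimal self-contained sketch could look like the following. F_simple, F_array_simple, and data are hypothetical stand-ins, not my actual function or arrays, and the per-call cost here is not tuned to match the benchmark above. The sketch also prints the number of threads Julia was started with, since that is worth confirming before comparing timings.

```julia
using Base.Threads

# Hypothetical stand-in for F: allocates a small temporary on every call.
function F_simple(m, n, data)::ComplexF64
    tmp = data[1:8] .* (m + n)   # slicing without @views copies, so this allocates
    return sum(tmp)
end

# Same loop structure as F_array, with the stand-in function.
function F_array_simple(data, Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    @threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = F_simple(m, n, data)
        end
    end
    return result
end

data = ones(ComplexF64, 1000)
println("threads available: ", nthreads())

result = F_array_simple(data, 200)   # first call also triggers compilation
@time F_array_simple(data, 200)      # second call shows run time and allocations
```

A small Nsite is used here so the sketch runs quickly; the sizes and the body of F_simple would need to be adjusted to resemble the real workload.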

I suspect this has to do with the fact that F operates on large arrays. Is there a way to fix this problem?

Any other insights would be greatly appreciated.