I have a function F that does some computations on its input arrays. First I’m trying to see if I can make this function as fast as possible as this function will be computed 50 times in one of possibly many iterations so making it fast is essential. With the dimension of the matrices I have(~10^4), it takes roughly 60 seconds to compute F. I’m thinking I could use multithreading for running F 50 times as these operations are independent. Here is a MWE for my problem

```
#arrays used in the calculation
using LinearAlgebra, BenchmarkTools
phi0 = rand(ComplexF64, 1000, 10, 50)
P = rand(ComplexF64, 10, 10, 50)
K = rand(ComplexF64, 1000, 1000, 50)
#preallocated arrays for the operation of F
F_out = zeros(ComplexF64, 1000, 1000)
F_mul_tmp = zeros(ComplexF64, 10, 1000)
F_mul_tmp2 = zeros(ComplexF64, 1000, 1000)
#The function F
function F(k, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
fill!(F_out, 0.0 + 0.0im)
for q in 1:50
@views mul!(F_mul_tmp, P[:, :, q], adjoint(phi0[:, :, q]))
@views mul!(F_mul_tmp2, phi0[:, :, q], F_mul_tmp)
@views F_out .+= broadcast!(*, F_mul_tmp2, K[:, :, k], F_mul_tmp2)
end
return F_out
end
@btime F(2, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out) #benchmark F
#Compute F many times in parallel
function F_threading(phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
output = zeros(ComplexF64, 1000, 1000, 50)
@Threads.threads for i in 1:50
@views output[:, :, i] = F(i, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
end
return output
end
```

There is already a problem with my multithreading part of the code as the answer seems to change with every run. I guess this is a case for race conditions but each run of F should be independent of each other. Shouldn’t it be straightforward to do this kind of computation in parallel? Any suggestions for how to optimize F and how to make multithreading work will be appreciated.