Hi, I’m new to Julia and parallel programming, and I’m having trouble with a parallel implementation. I need to assemble a matrix from 10 individual blocks, each one taking approximately 18 seconds to compute in a single-threaded session. I want to parallelize this by computing each block on a separate available thread, so I implemented the following functions:

```julia
using LinearAlgebra, SparseArrays
using Base.Threads: @spawn

# upper triangle of a symmetric kernel block
function construct_U_block(sol, sigma)
    A = zeros(length(sol), length(sol))
    @inbounds for i in 1:length(sol), j in 1:i
        A[j, i] = k(sol[i], sol[j], sigma)
    end
    return Symmetric(A)
end

# rectangular cross-kernel block
function K_matrix(sol1, sol2, sigma)
    Kcom = zeros(length(sol1), length(sol2))
    @inbounds for i in 1:length(sol2), j in 1:length(sol1)
        Kcom[j, i] = k(sol1[j], sol2[i], sigma)
    end
    return Kcom
end

function fill_K_completely(soln, sigma, iOO, iOF, iFO, iFF)
    # diagonal blocks
    D1 = @spawn @views construct_U_block(soln[iOO], sigma)
    D2 = @spawn @views construct_U_block(soln[iOF], sigma)
    D3 = @spawn @views construct_U_block(soln[iFO], sigma)
    D4 = @spawn @views construct_U_block(soln[iFF], sigma)
    rD1 = fetch(D1)
    rD2 = fetch(D2)
    rD3 = fetch(D3)
    rD4 = fetch(D4)
    # off-diagonal blocks
    D5 = @spawn @views K_matrix(soln[iOO], soln[iOF], sigma)
    D6 = @spawn @views K_matrix(soln[iOF], soln[iFO], sigma)
    D7 = @spawn @views K_matrix(soln[iFO], soln[iFF], sigma)
    D8 = @spawn @views K_matrix(soln[iOO], soln[iFO], sigma)
    rD5 = fetch(D5)
    rD6 = fetch(D6)
    rD7 = fetch(D7)
    rD8 = fetch(D8)
    D9 = @spawn @views K_matrix(soln[iOF], soln[iFF], sigma)
    D10 = @spawn @views K_matrix(soln[iOO], soln[iFF], sigma)
    rD9 = fetch(D9)
    rD10 = fetch(D10)
    t(A) = transpose(A)
    return Symmetric([rD1     rD5     rD8     rD10;
                      t(rD5)  rD2     rD6     rD9;
                      t(rD8)  t(rD6)  rD3     rD7;
                      t(rD10) t(rD9)  t(rD7)  rD4]),
           blockdiag(sparse(rD1), sparse(rD2), sparse(rD3), sparse(rD4))
end
```

where `soln` is an array containing 10000 50×50 `Float64` matrices, and `iOO`, `iOF`, `iFO`, `iFF` are index vectors selecting certain elements of `soln`. However, when I run this parallel version with 4 threads I only get an improvement of about 20 seconds, i.e. I now wait 160 seconds instead of the 180 seconds of the single-threaded version. I expected the time saved by parallelization to be much greater, since each block takes a while to compute, so I wonder whether my parallel implementation is correct and efficient, or whether there is a pitfall in my code.
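To make sure I understand the `@spawn`/`fetch` pattern itself, I also checked it on a self-contained toy (the names and workload here are made up just for illustration, not my real kernel):

```julia
using Base.Threads: @spawn

# toy stand-in for one block computation
block_work(v) = sum(x -> x^2, v)

data = [rand(10_000) for _ in 1:10]

# launch all ten tasks, then collect their results
tasks = [@spawn block_work(d) for d in data]
results = fetch.(tasks)

println(length(results))
```

This behaves as I expect, which is why I suspect the problem is specific to my real workload rather than the task machinery.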

Note: I tested this implementation with a function that returns a random matrix after `sleep(1)` to simulate the compute time, and it worked: computing the full matrix took about 4 seconds. I don’t know where the bottleneck is in the real example above.
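Roughly, that mock test looked like the following sketch (`mock_block` is a placeholder that sleeps for one second and returns a random matrix, standing in for the real block computation):

```julia
using Base.Threads: @spawn

# mock block: ~1 s of simulated work, then a random 50x50 matrix
mock_block() = (sleep(1.0); rand(50, 50))

t0 = time()
tasks = [@spawn mock_block() for _ in 1:10]  # spawn all ten mock blocks
blocks = fetch.(tasks)                        # wait for all of them
println("elapsed: ", round(time() - t0; digits = 2), " s")
```

Since `sleep` yields the thread, the tasks overlap and the total is far less than 10 seconds, which matched what I saw.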