I want to construct a matrix (roughly of size 10^4 × 10^4) whose elements are calculated from a rule. Constructing this matrix serially takes a lot of time, so I thought of using multithreading to construct it in the following way:

function F_array(Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    # Fill the strictly upper-triangular part, one row per loop iteration
    Threads.@threads for m in 1:Nsite
        for n in m+1:Nsite
            result[m, n] = sum(rand(ComplexF64) for i in 1:10000)
        end
    end
    return result
end

This function is already much faster than the serial approach, but I'm wondering whether I could do better by distributing the calculation over multiple machines in addition to using multithreading. I'm working on a computer cluster, but I'm not sure how to do that.

This very much depends on how expensive the computation of the matrix elements is. Different machines don't share memory, and I assume you want the final matrix to live in the memory of one machine. You would hence need to build parts of the matrix on different machines and then transfer the results back to the "master" machine to assemble the final matrix. Whether this approach will be faster depends on the cost of data communication versus the benefit of parallelizing the computation of the matrix elements.
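To get a feeling for the communication side, here is a rough back-of-the-envelope estimate of how much data would have to travel back to the master. The matrix size is taken from the question; the bandwidth figure `assumed_bandwidth` is purely a hypothetical placeholder, so substitute whatever your cluster's network actually delivers.

```julia
# Rough estimate of the data volume that must be sent to the master process.
Nsite = 10_000

# A ComplexF64 occupies 16 bytes (two Float64 values).
bytes_full  = sizeof(ComplexF64) * Nsite^2                 # whole matrix
bytes_upper = sizeof(ComplexF64) * Nsite * (Nsite - 1) ÷ 2 # strictly upper triangle only

# ASSUMPTION: hypothetical effective network bandwidth in bytes per second.
assumed_bandwidth = 1e9

transfer_seconds = bytes_upper / assumed_bandwidth

println("full matrix:    ", bytes_full / 1e9, " GB")
println("upper triangle: ", bytes_upper / 1e9, " GB")
println("transfer time at assumed bandwidth: ", transfer_seconds, " s")
```

If the time to compute all elements on one machine is much larger than this transfer time, distributing the work is likely to pay off; if the two are comparable, it probably is not.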

For distributed computing, we have the Distributed standard library and MPI.jl. The former is probably easier to get started with; the latter is the de facto standard for "large-scale" distributed computing (it also uses fast interconnects, if available in your HPC cluster). The big disadvantage of MPI.jl is that you pretty much can't use it interactively, so it forces you to change your workflow (it's a different programming paradigm).
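As a starting point with the Distributed standard library, something along these lines might work. It is only a minimal sketch: `element`, `row_upper` and `F_array_distributed` are hypothetical names introduced here, `element` stands in for your real matrix-element rule, and the two local workers from `addprocs(2)` are a placeholder for however you launch workers on your cluster's nodes.

```julia
using Distributed

# ASSUMPTION: two local workers for illustration. On a cluster you would
# instead start workers on the other nodes.
nprocs() == 1 && addprocs(2)

@everywhere begin
    # Hypothetical stand-in for the expensive element rule from the question.
    element(m, n) = sum(rand(ComplexF64) for _ in 1:10_000)

    # Compute the strictly upper-triangular entries of row m.
    # The inner loop is threaded if the worker was started with several threads.
    function row_upper(m, Nsite)
        row = zeros(ComplexF64, Nsite)
        Threads.@threads for n in (m + 1):Nsite
            row[n] = element(m, n)
        end
        return row
    end
end

function F_array_distributed(Nsite)
    result = zeros(ComplexF64, Nsite, Nsite)
    # pmap hands out rows to workers dynamically, which helps with the
    # uneven row lengths of the triangular loop.
    rows = pmap(m -> row_upper(m, Nsite), 1:Nsite)
    # Assemble the final matrix on the master process.
    for m in 1:Nsite
        result[m, :] = rows[m]
    end
    return result
end

A = F_array_distributed(8)
```

Workers are single-threaded unless told otherwise, so to combine both levels of parallelism you would pass something like `exeflags="--threads=4"` to `addprocs`. Each row is sent back in full here, zeros included, which is simple but wasteful; returning only the entries `m+1:Nsite` would roughly halve the communication.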