I have a complex numerical simulation that I want to parallelise on large machines. I am struggling a bit to find the optimal parallelisation strategy, and I am hoping to get some hints from the community on what to try and what to avoid.

To understand Julia’s multi-processing and multi-threading capabilities, I use a simplified workload which calculates `exp.(X*Y)` for random square matrices `X` and `Y` (of dimension 3072).
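For concreteness, the setup looks like this (a sketch; `Z_ref` is just my name for a serial reference result computed via BLAS):

```julia
# Problem setup; the serial BLAS-based result serves only as a
# correctness reference for the parallel implementations below.
n = 3072
X = rand(n, n)
Y = rand(n, n)
Z_ref = exp.(X * Y)  # element-wise exp of the matrix product
```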

My multi-threading implementation of that workload is

```julia
function mat_mult_multi_threaded!(Z, X, Y)
    # Flatten the (i, j) index space so it splits evenly across threads.
    Threads.@threads :static for iter in 0:prod(size(Z))-1
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        z = 0.0
        for k in axes(X, 2)
            z += X[i, k] * Y[k, j]
        end
        Z[i, j] = exp(z)
    end
end
```

My multi-processing implementation of the workload is

```julia
using Distributed, SharedArrays
@everywhere using SharedArrays

@everywhere function mat_mult_chunk!(Z, chunk, X, Y)
    W = zeros(length(chunk))  # local work buffer
    for (idx, iter) in enumerate(chunk)
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        z = 0.0
        for k in axes(X, 2)
            z += X[i, k] * Y[k, j]
        end
        W[idx] = exp(z)
    end
    # Second pass: write the buffered results into the shared array.
    for (idx, iter) in enumerate(chunk)
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        Z[i, j] = W[idx]
    end
end

function mat_mult_distributed!(Z, X, Y)
    n_iters = prod(size(Z))
    n_workers = nworkers()
    # Ceiling division: round the chunk size up so all iterations are covered.
    chunk_size = cld(n_iters, n_workers)
    idx_chunks = Iterators.partition(0:n_iters-1, chunk_size)
    Z_ = SharedArray{Float64}(size(Z))
    @sync for (chunk, pid) in zip(idx_chunks, workers())
        @async remotecall_fetch(mat_mult_chunk!, pid, Z_, chunk, X, Y)
    end
    Z .= Z_
end
```
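I time each implementation roughly along these lines (a minimal sketch; `time_impl!` is my own helper name, and the compilation warm-up is excluded from the measurement):

```julia
# Rough timing helper: run once so JIT compilation is not measured,
# then report the wall time of a second call.
function time_impl!(f!, Z, X, Y)
    f!(Z, X, Y)             # warm-up
    return @elapsed f!(Z, X, Y)
end
```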

I run the calculations on an AWS hpc6a.48xlarge instance. The machine is supposed to have 96 vCPUs (without hyperthreading).

Now, I see the following run times:

1/ I find it surprising that multi-processing is ~10% faster than multi-threading for a small number of threads/processes. Any thoughts on why this is the case?

2/ Scaling of multi-processing deteriorates beyond 24 processes. Is this expected behaviour due to the increased communication overhead?

3/ Suppose I want to combine the better performance of multi-processing at small process counts with the scaling of multi-threading. Would it be a viable approach to add multi-threading to `mat_mult_chunk!`, which is called via multi-processing?
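To make question 3 concrete, this is the kind of hybrid I have in mind (a sketch; `mat_mult_chunk_threaded!` is a hypothetical name, and it assumes each worker process is started with several threads, e.g. `addprocs(4; exeflags="--threads=24")`):

```julia
using Distributed

# Hypothetical hybrid kernel: each worker process threads over its own
# chunk of the flattened (i, j) index space.
@everywhere function mat_mult_chunk_threaded!(Z, chunk, X, Y)
    Threads.@threads :static for idx in eachindex(chunk)
        iter = chunk[idx]
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        z = 0.0
        for k in axes(X, 2)
            z += X[i, k] * Y[k, j]
        end
        Z[i, j] = exp(z)
    end
end
```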

Any comments and suggestions are appreciated.