Parallel without communication using MPI

using MPI

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    id = MPI.Comm_rank(comm)

    # warm up: invert a small matrix first so compilation (and BLAS
    # initialization) is not included in the timing below
    inv(rand(100,100));
    A = rand(3500,3500)

    @time inv(A);
    println("$id of $(MPI.Comm_size(comm))")

    MPI.Finalize()
end

main()

In this code each process should create and invert a random matrix, independently of the other processes. However, the timings are not independent of the number of processes launched with mpirun:

~/julia/parallel$ mpirun -n 1 julia inv_parallel.jl
  0.960765 seconds (7 allocations: 95.196 MiB, 4.07% gc time)
0 of 1
~/julia/parallel$ mpirun -n 2 julia inv_parallel.jl
  2.218972 seconds (7 allocations: 95.196 MiB, 2.19% gc time)
0 of 2
  2.389026 seconds (7 allocations: 95.196 MiB, 1.92% gc time)
1 of 2
~/julia/parallel$ mpirun -n 4 julia inv_parallel.jl
  6.949281 seconds (7 allocations: 95.196 MiB, 0.74% gc time)
3 of 4
  7.425516 seconds (7 allocations: 95.196 MiB, 0.89% gc time)
1 of 4
  7.520218 seconds (7 allocations: 95.196 MiB, 1.00% gc time)
2 of 4
  7.576838 seconds (7 allocations: 95.196 MiB, 0.73% gc time)
0 of 4

Instead they scale with the number of processes?! Can somebody explain this behaviour to me?

The only explanation I can come up with is that MPI is not placing each process on its own core. Have you verified that your MPI implementation isn’t using just a single core, and your OS isn’t limiting you somehow?
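If you want to rule that out: with Open MPI (just an assumption, your launcher and its flags may differ) you can ask mpirun to pin each rank to its own core and report the bindings, for example:

~/julia/parallel$ mpirun --bind-to core --report-bindings -n 4 julia inv_parallel.jl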

Make sure to set OPENBLAS_NUM_THREADS=1 in the shell before launching Julia, otherwise you get quadratic oversubscription: each MPI process loads OpenBLAS, which launches as many threads as it thinks there are cores, so you have m MPI processes times n OpenBLAS threads.
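For example (assuming a bash-like shell; some launchers need the variable forwarded explicitly, e.g. Open MPI's -x flag):

~/julia/parallel$ export OPENBLAS_NUM_THREADS=1
~/julia/parallel$ mpirun -x OPENBLAS_NUM_THREADS -n 4 julia inv_parallel.jl

Alternatively, you can pin the BLAS thread count from inside the script, before the timed inversion:

using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread per MPI rank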

Yep, that’s it! Now using one process is slower than before (I guess because BLAS is now restricted to a single thread?), 2 s instead of 1 s, but the time no longer increases with n. Thanks!