Speed issues with MPI.jl on a Slurm cluster

Hello,
I wrote MPI.jl parallel code (my first) to be used on a Slurm cluster.
I ran the code on my notebook and on the cluster using a single node.
It turns out that the code executes 6 times slower on the cluster, and I cannot understand why.

I also have a non-MPI version of the code: its execution time is roughly the same on the cluster and on my notebook, so I figure the CPUs or the memory are not the problem.

So I guess the problem is the MPI implementation (since I only use one node, the network bandwidth should not matter). The MPI versions differ slightly:
the cluster uses mpiexec (OpenRTE) 4.0.4 and my notebook mpiexec (OpenRTE) 4.1.2.
Can this cause the difference?

I am pretty sure that my Julia MPI setup is OK, since I checked that Julia uses these implementations by looking at MPI.Get_library_version(). This means that I set it up correctly, right?
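
For reference, this is roughly the check I run on both machines (just a minimal sketch):

using MPI

MPI.Init()

# Print the MPI library that MPI.jl is actually linked against
println(MPI.Get_library_version())

MPI.Finalize()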

Any ideas what else I could check?
Thanks

That’s hard to say without knowing more.

Your best bet would be to profile the code and find where the bottlenecks are. Nvidia's Nsight Systems is a fairly powerful MPI profiler that works with Julia; see JuliaGPU/NVTX.jl (https://github.com/JuliaGPU/NVTX.jl), Julia bindings for NVTX, for instrumenting code for the Nvidia Nsight Systems profiler.
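
As a rough sketch (the function and variable names here are made up), you wrap the regions you care about in NVTX ranges so they show up as named spans on the Nsight Systems timeline:

using NVTX

# Wrap a hot region in an NVTX range so it appears as a named
# span on the Nsight Systems timeline.
function expensive_step!(data)
    NVTX.@range "expensive_step!" begin
        data .= data .^ 2   # stand-in for the real work
    end
    return data
end

expensive_step!(rand(1000))

and then launch it with something like nsys profile mpiexec -n 4 julia script.jl (the exact launch command depends on your setup).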

I finally found the issue. It was a non-MPI problem.

But thanks for the hint about the MPI profiler. I'll have a look at it!

Mind sharing what the problem was, in case it is useful for other users in the future?


Sure, the slow performance on the cluster was actually a hardware-related issue.
The problem was in one specific function.

In my non-MPI-parallelized code I use a thread-parallelized for loop to speed up the computations.
In my MPI-parallelized code, that code block is structured differently: because of MPI, an expensive part moved to another place in the code. That was the issue: the cluster hardware needed a lot of time for this part, whereas my notebook did not.

Here is a simplified version of the function whose execution time was much slower on the cluster (it is called very often). The objects Irk[z], Fr[z,tp] and RHS[ichunk] are arrays with roughly 50-200 entries.

function MPIfunction!(.....)
    for tp in (tmone-1):-1:simdata.indices[1]
        # split the range tp+1 : tmone-1 into chunks for the threads (smallest index, number of indices)
        simdata.threadranges2 .= splitter(tp+1, (tmone-1) - (tp+1) + 1, simdata.nchunks)
        Threads.@threads for ichunk in 1:simdata.nchunks
            # each thread accumulates its partial sum into its own buffer
            simdata.RHS[ichunk] .= 0
            for z in simdata.threadranges2[ichunk]
                s = thesign(z, tp, simdata.NstepsinMemory)
                simdata.RHS[ichunk] .-= real.(s .* simdata.Irk[z] .* simdata.Fr[z, tp])
            end
        end
        # collect the thread contributions
        simdata.Irk[tp] .= 0
        for ichunk in 1:simdata.nchunks
            simdata.Irk[tp] .+= simdata.RHS[ichunk]
        end
        simdata.Irk[tp] .*= disc.dt
        # add the local contribution
        simdata.Irk[tp] .+= simdata.Fr[tmone, tp] * thesign(tmone, tp, simdata.NstepsinMemory)
    end
end

Maybe someone knows why this is faster on some hardware and slower on other hardware?

For now I removed the threaded for loop, and the code is fast on the cluster as well. This part of the code therefore uses only one thread.
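
For the record, the serial variant looks roughly like this (a sketch only; the argument list simdata, disc, tmone and the name MPIfunction_serial! are assumptions, and splitter is no longer needed):

function MPIfunction_serial!(simdata, disc, tmone)
    for tp in (tmone-1):-1:simdata.indices[1]
        # accumulate directly into Irk[tp]; no per-chunk RHS buffers needed
        simdata.Irk[tp] .= 0
        for z in (tp+1):(tmone-1)
            s = thesign(z, tp, simdata.NstepsinMemory)
            simdata.Irk[tp] .-= real.(s .* simdata.Irk[z] .* simdata.Fr[z, tp])
        end
        simdata.Irk[tp] .*= disc.dt
        # add the local contribution
        simdata.Irk[tp] .+= simdata.Fr[tmone, tp] * thesign(tmone, tp, simdata.NstepsinMemory)
    end
end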

Thanks for the hint about the Nsight Systems profiler. I already tried it, and it is very easy to use.
