I wrote my first parallel code with MPI.jl, to be used on a Slurm cluster.
I ran the code on my notebook and on the cluster using one node.
It turns out that the code runs 6 times slower on the cluster, and I cannot understand why.
I also have a non-MPI version of the code: its execution time is roughly the same on the cluster and on my notebook, so I figure that neither the CPUs nor the memory are the problem.
So I guess the problem is the MPI implementation (since I use only one node, the actual network bandwidth does not matter). The MPI versions used differ slightly:
The cluster uses mpiexec (OpenRTE) 4.0.4 and my notebook mpiexec (OpenRTE) 4.1.2.
Can this cause the difference?
I am pretty sure that my Julia MPI setup is OK, since I checked that Julia uses these implementations by looking at MPI.Get_library_version(). This means that I set it up correctly, right?
Any ideas what else I could check?
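A minimal sketch of such a setup check, assuming MPI.jl is installed and the script is started through the same launcher as the real job (for example `mpiexec -n 4 julia check_mpi.jl`, where the file name is only a placeholder). Besides the library version, it prints the number of ranks and the number of Julia threads per rank. If every process reports a single rank, the launcher and the library that MPI.jl loads probably do not match.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

if rank == 0
    # Version string of the MPI library that MPI.jl actually loaded
    println(MPI.Get_library_version())
    println("MPI ranks:        ", nranks)
    println("threads per rank: ", Threads.nthreads())
    println("logical CPUs:     ", Sys.CPU_THREADS)
end

MPI.Finalize()
```

Comparing ranks times threads against the number of logical CPUs on the node also shows whether the job asks for more threads than the node provides.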
That’s hard to say without knowing more.
Your best bet would be to profile the code and find where the bottlenecks are. Nvidia’s Nsight Systems is a fairly powerful MPI profiler that works with Julia. See NVTX.jl (JuliaGPU/NVTX.jl on GitHub), the Julia bindings for NVTX, for instrumenting code for the Nvidia Nsight Systems profiler.
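As a rough sketch of what such instrumentation can look like, assuming the NVTX.jl package is installed: the `NVTX.@range` macro marks a named region that appears on the Nsight Systems timeline. The functions `hot_loop` and `instrumented_step` are hypothetical stand-ins, not code from this thread.

```julia
using NVTX

# Hypothetical stand-in for an expensive function in the real code
function hot_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

function instrumented_step(n)
    local s
    # Everything inside the block is recorded as one named range
    NVTX.@range "hot_loop" begin
        s = hot_loop(n)
    end
    return s
end

println(instrumented_step(10^6))
```

The script is then run under the profiler, e.g. with something like `nsys profile --trace=nvtx,mpi julia script.jl`; the exact flags depend on the installed Nsight Systems version, so treat this command as an assumption to check against its documentation.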
I finally found the issue. It was not an MPI problem.
But thanks for the hint about the MPI profiler. I'll have a look at it!
Mind sharing what the problem was, in case that can be useful for other users in the future?
Sure, the slow performance on the cluster was actually caused by hardware.
The problem was in one specific function.
In my non-MPI code I use a thread-parallelized for loop to speed up computations.
In my MPI-parallelized code, the body of that for loop is different: an expensive part was moved to another place in the code because of MPI. That was the issue: the cluster hardware needed a lot of time to run the remaining threaded loop, but my notebook did not.
Here is a simplified version of the function whose execution time was much slower on the cluster (it is called very often). The objects Irk[z], Fr[z,tp], and RHS[ichunk] are arrays with roughly 50-200 entries each.
for tp in (tmone-1):-1:simdata.indices
    # split the indices (tp+1):(tmone-1) into nchunks ranges, one per chunk
    simdata.threadranges2 .= splitter(tp+1, (tmone-1) - (tp+1) + 1, simdata.nchunks) # smallest index, nr of indices
    @Threads.threads for ichunk in 1:simdata.nchunks
        simdata.RHS[ichunk] .= 0
        for z in simdata.threadranges2[ichunk]
            s = thesign(z, tp, simdata.NstepsinMemory)
            simdata.RHS[ichunk] .-= real.( s .* simdata.Irk[z] .* simdata.Fr[z,tp] )
        end
    end
    # collect thread contribution
    simdata.Irk[tp] .= 0
    for ichunk in 1:simdata.nchunks
        simdata.Irk[tp] .+= simdata.RHS[ichunk]
    end
    simdata.Irk[tp] .*= disc.dt
    # add local contribution
    simdata.Irk[tp] .+= simdata.Fr[tmone,tp] * thesign(tmone, tp, simdata.NstepsinMemory)
end
Maybe someone knows why this is faster or slower depending on the hardware?
I have now removed the threaded for loop, and the code is fast on the cluster as well. This part of the code therefore uses only 1 thread now.
Thanks for the hint about the Nsys profiler. I already tried it, and it is very easy to use.
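To follow up on the threaded loop in the function above, here is a small self-contained sketch that compares a chunked, threaded accumulation with a plain serial one on arrays of similar size. All names (`chunks`, `accumulate_threaded!`, `accumulate_serial!`) and the random data are hypothetical stand-ins, not the original code. The thread does not establish why the cluster was slower; one possibility worth measuring is that, with only 50-200 entries per array and very frequent calls, the cost of starting and synchronizing the threads is comparable to the arithmetic itself, and that cost can vary between machines.

```julia
using Base.Threads

# Hypothetical sizes: vectors of n entries, nz terms, nchunks chunks
n, nz, nchunks = 100, 64, 4
Irk = [rand(n) for _ in 1:nz]
Fr  = [rand(ComplexF64, n) for _ in 1:nz]
RHS = [zeros(n) for _ in 1:nchunks]

# Split 1:nz into nchunks contiguous ranges (stand-in for `splitter`)
chunks(nz, nchunks) = [(1 + div((i - 1) * nz, nchunks)):div(i * nz, nchunks) for i in 1:nchunks]

function accumulate_threaded!(out, RHS, Irk, Fr, ranges)
    @threads for ichunk in eachindex(ranges)
        RHS[ichunk] .= 0
        for z in ranges[ichunk]
            RHS[ichunk] .-= real.(Irk[z] .* Fr[z])
        end
    end
    # collect the per-chunk contributions
    out .= 0
    for r in RHS
        out .+= r
    end
    return out
end

function accumulate_serial!(out, Irk, Fr)
    out .= 0
    for z in eachindex(Irk)
        out .-= real.(Irk[z] .* Fr[z])
    end
    return out
end

ranges = chunks(nz, nchunks)
out_t = zeros(n)
out_s = zeros(n)

# Call once first so that compilation is not included in the timings
accumulate_threaded!(out_t, RHS, Irk, Fr, ranges)
accumulate_serial!(out_s, Irk, Fr)

t_thr = @elapsed for _ in 1:10_000; accumulate_threaded!(out_t, RHS, Irk, Fr, ranges); end
t_ser = @elapsed for _ in 1:10_000; accumulate_serial!(out_s, Irk, Fr); end
println("threads = ", nthreads(), ", threaded: ", t_thr, " s, serial: ", t_ser, " s")
```

Running the same script on the notebook and on a cluster node, with the same `julia -t` setting as the real job, shows whether the threaded version pays off at this problem size on each machine.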