Hello,
I wrote MPI.jl parallel code (my first one) to be used on a Slurm cluster.
I ran the code on my notebook and on the cluster using one node.
It turns out that the code runs about 6 times slower on the cluster, and I cannot understand why.
I also have a non-MPI version of the code: its execution time is roughly the same on the cluster and on my notebook, so I figure the CPUs or the memory are not the problem.
So I guess the problem is the MPI implementation (since I use only one node, the actual network bandwidth should not matter). The MPI versions used differ slightly:
The cluster uses mpiexec (OpenRTE) 4.0.4 and my notebook mpiexec (OpenRTE) 4.1.2.
Can this cause the difference?
I am pretty sure that my Julia MPI setup is fine, since I checked that Julia really uses these implementations by looking at MPI.Get_library_version(). That means I set it up correctly, right?
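For reference, this is roughly the check I do (a minimal sketch; MPI.versioninfo() only exists in newer MPI.jl versions, so drop that line if yours is older):

using MPI

# Library string reported by the MPI implementation MPI.jl loaded,
# e.g. "Open MPI v4.0.4 ..." on the cluster vs. "Open MPI v4.1.2 ..." on the notebook.
println(MPI.Get_library_version())

# Newer MPI.jl versions also offer a fuller summary (which binary/ABI is used).
MPI.versioninfo()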
Any ideas what else I could check?
Thanks
That’s hard to say without knowing more.
Your best bet would be to profile the code and find where the bottlenecks are. Nvidia's Nsight Systems is a fairly powerful MPI profiler that works with Julia; see JuliaGPU/NVTX.jl (Julia bindings for NVTX) for instrumenting your code for the Nsight Systems profiler.
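For example, you can wrap the suspicious regions in NVTX ranges so they show up as named spans on the Nsight Systems timeline. A minimal sketch (the region names and the surrounding structure are placeholders, not your actual code):

using MPI, NVTX

MPI.Init()

# Named ranges appear on the Nsight Systems timeline when the script
# is run under nsys profile; the labels below are placeholders.
NVTX.@range "communication" begin
    # ... MPI sends/receives ...
end

NVTX.@range "local computation" begin
    # ... the expensive loop ...
end

MPI.Finalize()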
I finally found the issue. It was a non-MPI problem.
But thanks for the hint about the MPI profiler, I'll have a look at it!
Mind sharing what was the problem, in case that can be useful for other users in the future?
Sure, the slow performance on the cluster was actually hardware-related.
The problem was in one specific function.
In my non-MPI-parallelized code I use a thread-parallelized for loop to speed up computations.
In my MPI-parallelized code, the code block for that for loop is different: because of MPI, an expensive part moved to another place in the code. That was the issue: the cluster hardware needed a lot of time for this part, while my notebook did not.
Here is a simplified version of the function whose execution time was much slower on the cluster (it is called very often). The objects Irk[z], Fr[z,tp], and RHS[ichunk] are arrays with roughly 50-200 entries.
function MPIfunction!(.....)
    for tp in (tmone-1):-1:simdata.indices[1]
        simdata.threadranges2 .= splitter(tp+1, (tmone-1) - (tp+1) + 1, simdata.nchunks) # smallest index, nr of indices
        Threads.@threads for ichunk in 1:simdata.nchunks
            simdata.RHS[ichunk] .= 0
            for z in simdata.threadranges2[ichunk]
                s = thesign(z, tp, simdata.NstepsinMemory)
                simdata.RHS[ichunk] .-= real.( s .* simdata.Irk[z] .* simdata.Fr[z,tp] )
            end
        end
        # collect thread contributions
        simdata.Irk[tp] .= 0
        for ichunk in 1:simdata.nchunks
            simdata.Irk[tp] .+= simdata.RHS[ichunk]
        end
        simdata.Irk[tp] .*= disc.dt
        # add local contribution
        simdata.Irk[tp] .+= simdata.Fr[tmone,tp] * thesign(tmone, tp, simdata.NstepsinMemory)
    end
end
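The splitter helper is not shown above; it just splits a contiguous index range into nchunks pieces, one per thread chunk. A hypothetical sketch of what it does (the real implementation in my code may differ in detail):

# Split the index range first:(first + n - 1) into nchunks contiguous
# UnitRanges of (nearly) equal length. Purely illustrative.
function splitter(first::Int, n::Int, nchunks::Int)
    base, rem = divrem(n, nchunks)
    ranges = Vector{UnitRange{Int}}(undef, nchunks)
    start = first
    for i in 1:nchunks
        len = base + (i <= rem ? 1 : 0)   # spread the remainder over the first chunks
        ranges[i] = start:(start + len - 1)
        start += len
    end
    return ranges
end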
Maybe someone knows why this is faster or slower depending on the hardware?
For now I have removed the threaded for loop, and the code is fast on the cluster as well; this part of the code therefore uses only one thread.
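For completeness, the de-threaded version looks roughly like this (same simdata fields and thesign helper as in the snippet above; the function name and argument list are simplified for illustration):

# Single-threaded variant: without chunking there is no per-thread RHS buffer,
# the contributions are accumulated directly into Irk[tp].
# Names mirror the snippet above; argument list simplified for illustration.
function MPIfunction_serial!(simdata, disc, tmone)
    for tp in (tmone-1):-1:simdata.indices[1]
        simdata.Irk[tp] .= 0
        for z in (tp+1):(tmone-1)
            s = thesign(z, tp, simdata.NstepsinMemory)
            simdata.Irk[tp] .-= real.( s .* simdata.Irk[z] .* simdata.Fr[z,tp] )
        end
        simdata.Irk[tp] .*= disc.dt
        # add local contribution
        simdata.Irk[tp] .+= simdata.Fr[tmone,tp] * thesign(tmone, tp, simdata.NstepsinMemory)
    end
end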
Thanks for the hint about the Nsight Systems profiler. I already tried it, and it is very easy to use.