Speed issues with MPI.jl on slurm cluster

gerhardu · August 22, 2023, 2:10pm

Hello,
I wrote MPI.jl parallel code (my first one) to be used on a Slurm cluster.
I ran the code on my notebook and on the cluster using one node.
It turns out that the code executes 6 times slower on the cluster and I cannot understand why.

I also have a non-MPI version of the code. Its execution time is roughly the same on the cluster and on my notebook, so I figure that neither the CPUs nor the memory is the problem.

So I guess the problem is the MPI implementation (since I use only one node, the actual network bandwidth does not matter). The MPI versions used differ slightly:
The cluster uses mpiexec (OpenRTE) 4.0.4 and my notebook mpiexec (OpenRTE) 4.1.2.
Can this cause the difference??

I am pretty sure that my Julia MPI setup is ok, since I checked that Julia uses these implementations by having a look at MPI.Get_library_version(). This means that I set it up correctly, right?

Any ideas what else I could check?
Thanks

simonbyrne · August 22, 2023, 2:46pm

That’s hard to say without knowing more.

Your best bet would be to profile the code and find where the bottlenecks are. Nvidia’s Nsight Systems is a fairly powerful MPI profiler that works with Julia. See NVTX.jl (GitHub: JuliaGPU/NVTX.jl), the Julia bindings for NVTX, for instrumenting code for the Nvidia Nsight Systems profiler.
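Before setting up a full profiler, a coarse first pass can narrow down which part of the program is slow. The sketch below is not Nsight Systems or NVTX.jl; it is a plain Base-only timing helper, and `timeit!`, `compute_phase` and `reduce_phase` are hypothetical names standing in for real phases of a program. Running the same script on the notebook and on the cluster and comparing the per-section numbers is one way to see where the two machines diverge.

```julia
# Accumulate wall-clock time per named section (Base only).
# `timeit!` is a hypothetical helper, not part of any package.
function timeit!(f, timings::Dict{String,Float64}, name::String)
    t0 = time_ns()
    result = f()
    timings[name] = get(timings, name, 0.0) + (time_ns() - t0) / 1e9
    return result
end

# Hypothetical stand-ins for two phases of a real program.
compute_phase(n) = sum(sqrt(i) for i in 1:n)
reduce_phase(v) = sum(v)

function run_once(n)
    timings = Dict{String,Float64}()
    x = timeit!(timings, "compute") do
        compute_phase(n)
    end
    v = rand(n)
    y = timeit!(timings, "reduce") do
        reduce_phase(v)
    end
    return timings, x, y
end

run_once(10)                          # warm-up, so compilation is not measured
timings, x, y = run_once(1_000_000)

# Print the sections, slowest first.
for (name, t) in sort(collect(timings); by = last, rev = true)
    println(rpad(name, 10), round(t * 1e3; digits = 3), " ms")
end
```

In an MPI program each rank would fill its own `timings` dictionary, so the numbers can be compared rank by rank.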

gerhardu · August 23, 2023, 8:21am

I finally found the issue. It was not an MPI problem.

But thanks for the hint about the MPI profiler. I'll have a look at it!

giordano · August 23, 2023, 10:19am

Mind sharing what was the problem, in case that can be useful for other users in the future?

gerhardu · August 25, 2023, 9:55am

Sure, the slow performance on the cluster was actually caused by hardware.
The problem was in one specific function.

In my non-MPI code I use a thread-parallelized for loop to speed up computations.
In my MPI-parallelized code, the code block inside that for loop is different: some expensive part went to another part of the code because of MPI. That was the issue: the cluster hardware needed a lot of time to do this, but my notebook did not.

Here is a simplified version of the function whose execution time was much slower on the cluster (it is called very often). The objects Irk[z], Fr[z,tp] and RHS[ichunk] are arrays with roughly 50-200 entries.

function MPIfunction!(.....)
    for tp in (tmone-1):-1:simdata.indices[1]
        # split the indices into chunks, one per thread (arguments: smallest index, nr of indices, nr of chunks)
        simdata.threadranges2 .= splitter(tp+1, (tmone-1) - (tp+1) + 1, simdata.nchunks)
        Threads.@threads for ichunk in 1:simdata.nchunks
            simdata.RHS[ichunk] .= 0
            for z in simdata.threadranges2[ichunk]
                s = thesign(z, tp, simdata.NstepsinMemory)
                simdata.RHS[ichunk] .-= real.(s .* simdata.Irk[z] .* simdata.Fr[z,tp])
            end
        end
        # collect thread contribution
        simdata.Irk[tp] .= 0
        for ichunk in 1:simdata.nchunks
            simdata.Irk[tp] .+= simdata.RHS[ichunk]
        end
        simdata.Irk[tp] .*= disc.dt
        # add local contribution
        simdata.Irk[tp] .+= simdata.Fr[tmone,tp] * thesign(tmone, tp, simdata.NstepsinMemory)
    end
end
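To make the pattern above easier to test in isolation, here is a minimal self-contained sketch of the same idea: one accumulator per chunk filled in a threaded loop, then a serial sum over the chunks, compared against a plain serial loop. Everything in it is an assumption for illustration: `splitter` is a hypothetical reimplementation (the original was not posted), `thesign` and the `simdata` fields are left out, and the data are random vectors of 100 entries. With arrays this small, the per-iteration work may be tiny compared to the cost of starting a threaded loop, so timing both variants on each machine is probably the quickest way to see whether threading pays off there.

```julia
using Base.Threads

# Hypothetical stand-in for `splitter`: split `n` consecutive indices
# starting at `lo` into `nchunks` contiguous ranges (some may be empty).
function splitter(lo::Int, n::Int, nchunks::Int)
    ranges = Vector{UnitRange{Int}}(undef, nchunks)
    base, extra = divrem(n, nchunks)
    start = lo
    for c in 1:nchunks
        len = base + (c <= extra ? 1 : 0)
        ranges[c] = start:(start + len - 1)
        start += len
    end
    return ranges
end

# Threaded variant: one accumulator per chunk, summed serially afterwards.
function accumulate_threaded!(out, partial, data, ranges)
    @threads for ichunk in eachindex(ranges)
        acc = partial[ichunk]
        fill!(acc, 0.0)
        for z in ranges[ichunk]
            acc .-= data[z]
        end
    end
    fill!(out, 0.0)
    for acc in partial
        out .+= acc
    end
    return out
end

# Serial variant: a single loop over all indices.
function accumulate_serial!(out, data, r)
    fill!(out, 0.0)
    for z in r
        out .-= data[z]
    end
    return out
end

# Time `reps` calls of `f` after one warm-up call.
function time_calls(f, reps)
    f()
    t = @elapsed for _ in 1:reps
        f()
    end
    return t
end

nentries, nsteps, nchunks = 100, 64, 4
data = [rand(nentries) for _ in 1:nsteps]
partial = [zeros(nentries) for _ in 1:nchunks]
ranges = splitter(1, nsteps, nchunks)
out_t = accumulate_threaded!(zeros(nentries), partial, data, ranges)
out_s = accumulate_serial!(zeros(nentries), data, 1:nsteps)

t_thr = time_calls(() -> accumulate_threaded!(out_t, partial, data, ranges), 1000)
t_ser = time_calls(() -> accumulate_serial!(out_s, data, 1:nsteps), 1000)
println("threads: ", nthreads(), "  threaded: ", t_thr, " s  serial: ", t_ser, " s")
```

Start Julia with `-t N` (or set `JULIA_NUM_THREADS`) to vary the thread count between runs.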

Maybe someone knows why this is faster or slower on some hardware?

I have now removed the threaded for loop, and the code is fast on the cluster as well. This part of the code therefore uses only one thread.
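One thing that may be worth ruling out in a situation like this (it is only a guess, nothing in my tests confirms it) is whether each MPI rank starts more Julia threads than the cores Slurm gives it. The sketch below compares `Threads.nthreads()` with `Sys.CPU_THREADS` and with the `SLURM_CPUS_PER_TASK` environment variable, which Slurm sets only when `--cpus-per-task` is requested. `thread_budget` is a hypothetical helper name.

```julia
# Compare the Julia thread count of this process with the CPUs it may use.
# `thread_budget` is a hypothetical helper; `env` can be replaced by a Dict
# to try it outside of a Slurm job.
function thread_budget(; env = ENV)
    nthreads = Threads.nthreads()
    ncores = Sys.CPU_THREADS                     # logical CPUs seen by Julia
    cpus_per_task = tryparse(Int, get(env, "SLURM_CPUS_PER_TASK", ""))
    allowed = cpus_per_task === nothing ? ncores : cpus_per_task
    return (; nthreads, ncores, cpus_per_task,
            oversubscribed = nthreads > allowed)
end

budget = thread_budget()
println("Julia threads:       ", budget.nthreads)
println("logical CPUs:        ", budget.ncores)
println("SLURM_CPUS_PER_TASK: ", something(budget.cpus_per_task, "not set"))
println("oversubscribed:      ", budget.oversubscribed)
```

If the last line prints `true` on every rank, the threads of one rank have to share cores, which could make a frequently called threaded loop slower than its serial version.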

Thanks for the hint to Nsys profiler. I already tried it, and it is very easy to use.
