Julia on cluster, only MPI transport allowed

Hi all,

I’m on a PBS cluster where I cannot simply provide a machine file to julia. After a lot of trial and error, I found that only MPI transport works between nodes, and I managed to adapt an MPIClusterManagers.jl example that uses MPI_TRANSPORT_ALL to run on our cluster.

However, I’ve come across a rather strange phenomenon: if I request 4 CPUs from PBS and start my MPI job as

mpirun -np 4 julia myscript.jl

where myscript.jl contains the following MWE

using MPIClusterManagers, Distributed
import MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
size = MPI.Comm_size(MPI.COMM_WORLD)

# With MPI_TRANSPORT_ALL, rank 0 becomes the Julia master and the remaining
# ranks become Distributed workers; Julia's messages travel over MPI.
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

@info "workers are $(workers())"

rmprocs(workers())
MPIClusterManagers.stop_main_loop(manager)
exit()

I get [ Info: workers are [2, 3, 4]

I don’t get to use the 4th CPU! Of course, if I ask PBS for 5 CPUs, I get to use 4. So with every PBS request I’m paying for one more CPU than I actually use. addprocs(1) simply oversubscribes without using the last CPU.

Any ideas on how to fix this?


What is size? Do you get a communicator size equal to the number of requested processes? Recall that workers() returns one less than the total number of processes (i.e. it excludes the main process).
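
As a quick illustration of that last point (plain Distributed only, no MPI involved):

using Distributed

addprocs(3)      # 1 master + 3 workers = 4 Julia processes in total
@show nprocs()   # 4
@show nworkers() # 3
@show workers()  # [2, 3, 4] -- the master (id 1) is not a worker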


You can try submitting a 4-CPU job but running with:

mpirun --oversubscribe -np 5

I think that this is normal for MPIManager. From https://github.com/JuliaParallel/MPIClusterManagers.jl, one of the modes executes code only on the workers:

"MPIManager: only workers execute MPI code

An example is provided in examples/juliacman.jl. The julia master process is NOT part of the MPI cluster. The main script should be launched directly, MPIManager internally calls mpirun to launch julia/MPI workers. All the workers started via MPIManager will be part of the MPI cluster."
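
For reference, a minimal sketch in the spirit of that README example (the np value and the hello-world body are just for illustration): the script is launched with plain julia, MPIManager spawns the MPI ranks via mpirun, and @mpi_do runs MPI code on the workers only.

using MPIClusterManagers, Distributed

# The master is NOT an MPI rank; MPIManager calls mpirun internally.
manager = MPIManager(np = 4)
addprocs(manager)

# Run MPI code on every worker rank.
@mpi_do manager begin
    using MPI
    comm = MPI.COMM_WORLD
    println("Hello from rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
end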


@mcreel, that’s almost the right answer; you sent me down the right path. I wasn’t using the paradigm you linked, where only the workers execute MPI code. I was instead using the mode where the master and workers all execute MPI code, with MPI as the transport (MPI_TRANSPORT_ALL).

When I switched to the workers-only-use-MPI mode, this MWE produced the results I expected:

using MPIClusterManagers, Distributed

# MPIManager launches the MPI workers itself via mpirun,
# so this script is started with plain julia, not with mpirun.
manager = MPIManager(np=4)
addprocs(manager)

@info "workers are $(workers())"
exit()

I get [ Info: workers are [2, 3, 4, 5]

Please note that this is NOT run through mpirun. After requesting 4 CPUs from PBS, I simply ran

julia myscript.jl >& outfile

My parallel Julia code, using @sync/@async and remotecall_fetch, worked as expected with near 90% CPU utilisation on all 4 requested cores on the PBS-assigned node.
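
For anyone curious, a minimal sketch of that pattern, assuming workers have already been added (e.g. via the MPIManager above); work here is just a hypothetical stand-in for the real computation:

using Distributed

@everywhere work(i) = sum(abs2, rand(10_000)) + i   # placeholder computation

results = Vector{Float64}(undef, nworkers())
@sync for (k, w) in enumerate(workers())
    @async results[k] = remotecall_fetch(work, w, k)   # run work(k) on worker w
end
@show results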


Hi Sparrowhawk. Can you say more about what you mean when you say you cannot supply a machine file?

Is this related to the MPI job launch mechanism? Are you using an authentication method called Munge?

Hi @johnh, I’m not sure what the authentication method is. However, if I log in with an interactive PBS job using qsub -I and try specifying a machine file like so, I get an error:

julia --machine-file=$PBS_NODEFILE
Host key verification failed.

The PBS_NODEFILE does exist, and I can echo $PBS_NODEFILE just fine.
An admin on our cluster told me that only MPI communication is allowed between nodes, which is why I resorted to using MPIClusterManagers.jl.

Cheers

It does sound like Munge authentication is being used here, but I may be on the wrong track.