I’m on a PBS cluster where I cannot simply provide a machine file to julia. After a lot of trial and error, I found that only MPI transport works on the cluster, and I was able to adapt an MPIClusterManagers.jl example using MPI_TRANSPORT_ALL to our cluster.
However, I’ve come across a rather strange phenomenon: if I request 4 CPUs from PBS and start my MPI job as below,
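(roughly like this; the script name is a placeholder, and the pattern follows the package’s MPI_TRANSPORT_ALL example)

```
mpirun -np 4 julia mpi_transport_all.jl
```

where the script hands every rank over to the manager, as in this sketch:

```julia
using MPIClusterManagers, Distributed
import MPIClusterManagers: start_main_loop, stop_main_loop, MPI_TRANSPORT_ALL

# every MPI rank enters here; rank 0 continues as the Julia master,
# the remaining ranks block and serve as Distributed workers
manager = start_main_loop(MPI_TRANSPORT_ALL)

# ... parallel work driven from the master goes here ...

stop_main_loop(manager)
```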
I don’t get to use the 4th CPU! Of course, if I ask PBS for 5 CPUs, I get to use 4. So I’m paying for one extra CPU (1/ncpus_requested more CPU time) with every PBS request! And addprocs(1) simply oversubscribes without ever using the last CPU.
What is size? Do you get a communicator size equal to the number of requested processes? Recall that `workers()` returns one less than the number of processes (i.e., it excludes the master).
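A quick way to see the distinction, assuming at least one worker has been added:

```julia
using Distributed
# nprocs() counts the master plus all workers;
# workers()/nworkers() exclude the master
@show nprocs()    # N + 1
@show nworkers()  # N
# under MPI_TRANSPORT_ALL every rank is a Julia process,
# so MPI.Comm_size(MPI.COMM_WORLD) should match nprocs()
```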
“An example is provided in examples/juliacman.jl. The julia master process is NOT part of the MPI cluster. The main script should be launched directly, MPIManager internally calls mpirun to launch julia/MPI workers. All the workers started via MPIManager will be part of the MPI cluster.”
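For reference, that example’s pattern looks roughly like this (a sketch, not the verbatim file):

```julia
using MPIClusterManagers, Distributed

# the master launches the workers itself; mpirun is invoked internally
manager = MPIManager(np=4)
addprocs(manager)

# the workers (but not the master) form an MPI communicator,
# so MPI calls can be run on them with @mpi_do
@mpi_do manager begin
    using MPI
    comm = MPI.COMM_WORLD
    println("rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
end
```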
@mcreel, that’s almost the right answer – you sent me down the right path. I wasn’t using the paradigm you linked, where only the workers run MPI code. I was instead using the one where the master and the workers are all part of the MPI cluster (MPI_TRANSPORT_ALL, as opposed to TCP/IP transport).
When I switched to the workers-only-use-MPI mode, this MWE produced the results I expected:
```julia
using MPIClusterManagers, Distributed

# launch 4 MPI worker processes; the master is not an MPI rank
manager = MPIManager(np=4)
addprocs(manager)

@info "workers are $(workers())"
exit()
```
I get `[ Info: workers are [2,3,4,5]`
Please note that this is NOT run through mpirun. I simply ran it after requesting 4 CPUs from PBS, via `julia myscript.jl >& outfile`.
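For completeness, the job submission looked something like this sketch (assuming PBS Pro select syntax; Torque would use `-l nodes=1:ppn=4` instead):

```bash
#!/bin/bash
#PBS -l select=1:ncpus=4
cd $PBS_O_WORKDIR
julia myscript.jl >& outfile
```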
My parallel Julia code with `@sync`/`@async` and `remotecall_fetch` worked as expected, with near 90% CPU utilisation on all 4 requested cores of the PBS-assigned node.
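The driving pattern is the usual one, sketched here with a stand-in for my real work function:

```julia
using Distributed

# fan one task out to each worker and collect the results;
# @sync blocks until every @async task has finished
results = Vector{Float64}(undef, nworkers())
@sync for (i, w) in enumerate(workers())
    @async results[i] = remotecall_fetch(w) do
        sum(rand(10^6))   # hypothetical placeholder computation
    end
end
```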
Hi @johnh, I’m not sure what the authentication method is. However, if I log in with an interactive PBS job using `qsub -I` and try specifying a machine file like so, I get an error:
```
julia --machine-file=$PBS_NODEFILE
Host key verification failed.
```
The PBS_NODEFILE does exist, though, and I can `echo $PBS_NODEFILE`.
An admin on our cluster told me that only MPI communication is allowed between nodes, which would explain the SSH host-key failure above and is why I resort to using MPIClusterManagers.jl.