I’ve used OpenMP in C/C++ and multi-threading in Julia, but have never learned MPI or other distributed frameworks (excepting submitting arrays of jobs to clusters for trivially parallelizable problems). What is the “Julian” way to do distributed memory parallel programming? What are the recommended packages?
Short answer: Julia has built-in distributed computing functionality (see "Multi-processing and Distributed Computing" in the Julia manual, and maybe this Jupyter notebook). Otherwise, there is MPI.jl (JuliaParallel/MPI.jl on GitHub), which provides MPI wrappers for Julia.
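To give a feel for the built-in route, here is a minimal sketch using the Distributed standard library. It assumes everything runs on one machine with two local workers; the function name square is made up for illustration.

```julia
using Distributed

# Start two extra worker processes on the local machine.
addprocs(2)

# Define the function on every process, not just the master.
@everywhere square(x) = x^2

# pmap hands out the inputs to the available workers and collects the results in order.
results = pmap(square, 1:4)
println(results)

# Shut the workers down again.
rmprocs(workers())
```

The same code should carry over to remote workers once addprocs is given a list of machines or a cluster manager instead of a plain number.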
My opinion: Julia's built-in features are great for small-scale parallelisation of certain tasks. If you want to do actual HPC across hundreds of nodes of a supercomputer, I'd go with MPI.jl because it utilises InfiniBand and (please someone correct me if I'm wrong) no one has really pushed Julia's built-in distributed computing facilities to the limit and shown them to be competitive. (If anyone has, then perhaps the https://clima.caltech.edu people?)
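For comparison, an MPI.jl program looks much like its C counterpart. The following is only a sketch, assuming MPI.jl is installed and configured against a working MPI library; each rank contributes one number and all ranks receive the sum.

```julia
using MPI

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)    # 0-based id of this process
nranks = MPI.Comm_size(comm)  # total number of processes

# Every rank contributes rank + 1; Allreduce returns the global sum on all ranks.
total = MPI.Allreduce(rank + 1, +, comm)

if rank == 0
    println("Sum over $nranks ranks: $total")
end

MPI.Finalize()
```

It would typically be launched with something like mpiexec -n 4 julia script.jl (script.jl being a placeholder name), with the exact launcher depending on the cluster.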
Skimming through your first link, I noticed that Julia launches remote workers via password-less SSH. How do I let Julia know which cluster nodes are available for use, given that only a subset of nodes is assigned to me by the cluster queuing system when I request a multi-node run? I suppose ClusterManagers.jl can launch workers one by one by submitting separate jobs to the queue, but that is not very feasible given the unpredictable wait time in the queue, and I would prefer to use the set of machines assigned to me in a single multi-node request. Or do I inevitably end up with MPI.jl, since the mpirun command will know which nodes to use?
No. Say you use the SlurmManager: it uses srun to start workers within your SLURM allocation (this came up here as well). It doesn't submit a new job to the queue. Note that SLURM also stores the allocation details (which nodes, how many CPUs, etc.) in environment variables (e.g. SLURM_JOB_NODELIST) that you can query in your job submission script and/or in Julia itself. See, for example, the section “OUTPUT ENVIRONMENT VARIABLES” here or the table here.
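As a rough sketch of how this could look from inside a job script, assuming ClusterManagers.jl is installed and the script is run within an existing SLURM allocation that was requested with an explicit task count (so that SLURM_NTASKS is set):

```julia
using Distributed, ClusterManagers

# Allocation details exported by SLURM; the fallbacks only apply
# when a variable is missing from the environment.
nodelist = get(ENV, "SLURM_JOB_NODELIST", "unknown")
ntasks = parse(Int, get(ENV, "SLURM_NTASKS", "1"))
println("Allocated nodes: ", nodelist, ", tasks: ", ntasks)

# Start one worker per task. SlurmManager launches them via srun,
# which is expected to stay inside the current allocation.
addprocs(SlurmManager(ntasks))

# Ask each worker which host it ended up on.
hosts = [remotecall_fetch(gethostname, w) for w in workers()]
println(hosts)

rmprocs(workers())
```

Whether to leave one task free for the master process is a judgment call that depends on how the allocation was requested.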