I am preparing a weak scaling test of a distributed Julia application. The test will use up to a couple of thousand compute nodes and will need to run multiple times for each configuration (to obtain statistically significant data). The application is parallelized with `MPI.jl` (and uses `CUDAdrv.jl`). SLURM is the available job scheduler.
Introductory high-level questions
- How can I minimize the job setup time and avoid potential congestion when thousands of processes need to access the same files of the Julia installation?
- How can I ensure that all precompilation is done before job submission and then only read from cache? (Is it enough to run the same application once on a single node before running it at scale?)
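To make the single-node warm-up idea concrete, here is the kind of pre-submission step I have in mind (just a sketch; `my_app` and the project path are placeholders, and I am assuming that `Pkg.precompile()` plus one run of the application is enough to populate the cache):

```shell
#!/bin/bash
# Warm-up before the large-scale job: instantiate the project and
# precompile all dependencies, then run the application once so that
# any remaining runtime-triggered compilation is cached as well.
export JULIA_PROJECT=$HOME/my_app          # placeholder project path
julia -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
srun -N 1 -n 1 julia $JULIA_PROJECT/main.jl   # one small single-node run
```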
Entering the specifics
Julia is installed in my home directory (a GPFS filesystem), which is mounted on the compute nodes (note that I could instead install it on scratch, a Lustre filesystem, which is also mounted on the compute nodes). The compute nodes do not have any persistent storage, only a RAM disk that is wiped at the end of every job.
In the past, I have done such large-scale scaling tests with MPI-CUDA C applications. To minimize the job setup time, it proved very effective to broadcast the executable and all its dependencies (found with `ldd`) to the node-local RAM disk at the beginning of the job. The executable could then be run with its dependencies loaded exclusively from the node-local RAM disk (by modifying `LD_LIBRARY_PATH`). Could a similar approach work for a Julia application? If all precompilation is cached (by first running the application once on a single node), can I broadcast the precompilation cache and the Julia installation to the RAM disks of the compute nodes and make Julia use this node-local cache? How can I change the cache folder that is used, etc.?
Thanks for sharing your experiences and thoughts!