Overall context
I am preparing a weak scaling test of a distributed Julia application. The test will use up to a couple of thousand compute nodes and will need to run multiple times for each configuration (to obtain statistically significant data). The application is parallelized with MPI.jl (and uses CUDAnative.jl/CuArrays.jl/CUDAdrv.jl). SLURM is the available job scheduler.
Introductory high-level questions
How can I minimize the job setup time and avoid potential congestion when thousands of processes need to access the same files of the Julia installation?
How can I ensure that all precompilation is done before the job submission, so that it is only read from the cache at run time? (Is it enough to run the same application once on a single node before running it at scale?)
Entering the specifics
Julia is installed in my home directory (a GPFS filesystem), which is mounted on the compute nodes (note that I could instead install it on the scratch, a Lustre filesystem, which is also mounted on the compute nodes). The compute nodes do not have any persistent storage, only a RAM disk which is cleaned at the end of every job.
In the past, I have done such large-scale scaling tests with MPI-CUDA C applications. To minimize the job setup time, it proved very effective to broadcast the executable and all its dependencies (found with ldd) to the node-local RAM disk at the beginning of the job. The executable could then be run loading its dependencies exclusively from the node-local RAM disk (by modifying LD_LIBRARY_PATH). Could a similar approach somehow be used with a Julia application? If all precompilation is cached (by first running the application once on a single node), can I broadcast the precompilation cache and the Julia executable to the RAM disks of the compute nodes and make Julia use this node-local cache? How can I change the cache folder that is used, etc.?
The first step would be to use PackageCompiler to compile a custom system image (a shared library). To do this, you will need to specify all the functions, and all of their argument types, that will be called at runtime. SnoopCompile is a useful tool for creating such a list. Note that you have to be careful about run-time vs. compile-time initialization of external libraries when doing this (i.e., you have to use the __init__() function where required).
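For illustration, a rough sketch of what such a build could look like with a recent PackageCompiler.jl (the exact function names have changed between versions; the package name MyApp, the image path, and the precompile script are placeholders, not something from this thread):

```julia
# Hypothetical sketch: MyApp, my_sysimage.so and precompile_run.jl are placeholders.
using PackageCompiler

create_sysimage([:MyApp];
                sysimage_path = "my_sysimage.so",
                # Script that exercises the code paths used at runtime, so that
                # the corresponding methods are compiled into the image (this is
                # the kind of list SnoopCompile can help generate):
                precompile_execution_file = "precompile_run.jl")
```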
Once you have the system image, you launch julia on the compute nodes with julia -J /path/to/systemimage.so. Any functions compiled into the system image won't need to be compiled on the compute node.
If you want to do the dependency broadcasting to the compute-node ramdisk, you can do that as normal because the system image is a regular shared library.
One thing I am not sure about is whether libraries that are called via ccall from functions compiled into the system image are handled. I don't know if they show up in the output of ldd or not (I suspect they are dlopen'd). It is also possible that some Julia packages that wrap C libraries record the path to the shared library when the package is installed, rather than figuring it out at runtime. Even if you copy the library to the ramdisk and update LD_LIBRARY_PATH, it may still load the library from the original location.
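One way to find out what is actually loaded (including libraries pulled in via dlopen, which ldd on the julia binary will not show) is to ask a running Julia process itself; a small sketch, not from the thread:

```julia
# Run this after loading the application's packages (MPI.jl, CUDAdrv.jl, ...).
# Libdl.dllist() returns the paths of all shared libraries currently loaded by
# this process, including ones opened via dlopen/ccall; these are the
# candidates to copy to the node-local RAM disk.
using Libdl

for lib in Libdl.dllist()
    println(lib)
end
```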
Something else to be careful of is architecture mismatches. Some HPC machines have front-end nodes with a different architecture than the compute nodes, and you will want to compile the system image on the same architecture as the compute nodes.
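If the versions at hand support it, one can also pin the target microarchitecture explicitly instead of relying on where the build runs (the cpu_target keyword and the target string below are assumptions, not verified against this setup):

```julia
# Compare what the login node and a compute node report:
println(Sys.CPU_NAME)            # e.g. "skylake-avx512" (placeholder)

# And/or build the image explicitly for the compute-node architecture:
using PackageCompiler
create_sysimage([:MyApp];
                sysimage_path = "my_sysimage.so",
                cpu_target = "skylake-avx512")   # placeholder target string
```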
I should also add that I haven't actually done any of this. My application takes around 3 minutes to compile, so I have just been eating the compile time cost rather than using PackageCompiler and SnoopCompile.
Thanks @JaredCrean2! This would certainly be the most complete solution to this problem. However, it seems to require quite some work to get this working properly.
Thus, I would prefer a solution that is only partial but simpler. I would like to find a simple technique to avoid congestion effects when thousands of processes run a Julia application and therefore need to access the same files. It is not a big issue if each process does some compilation, as long as this does not interfere with other processes and cause congestion at scale. To put it simply, a certain constant job setup time is fine; however, it should not increase significantly at large scale.
I would hope to find a solution along the following lines:
Precompile all required modules by invoking Base.compilecache(modulename) [1].
Broadcast the precompiled modules, the Julia executable (and its dependencies found with ldd) and the main application code [2] to the RAM disks of the compute nodes.
Set the Julia variable DEPOT_PATH[1] to point to the local RAM disk (the precompiled modules must be inside DEPOT_PATH[1]/compiled/ [1]). A rough sketch of these three steps is given below.
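To make this concrete, here is a rough, untested sketch of how the three steps might be wired together; the package name (MyApp), the RAM-disk path (/dev/shm/julia_depot), the tar-based packing, and the exact MPI.jl call signatures (which depend on the MPI.jl version) are all assumptions:

```julia
# Step 1 (login node, done once before submission): force precompilation into
# the home depot. "MyApp" is a placeholder for the application's package.
Base.compilecache(Base.identify_package("MyApp"))   # or simply: using MyApp

# Steps 2-3 (inside the job, ideally one Julia process per node):
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

localdepot = "/dev/shm/julia_depot"      # node-local RAM disk (assumed mount)

# Rank 0 reads a packed copy of the depot (packages/ and compiled/) from the
# shared filesystem; all other ranks receive it via MPI, so only one process
# touches GPFS.
data = rank == 0 ? read("/path/to/home/depot.tar") : UInt8[]
len  = MPI.bcast(length(data), 0, comm)
rank != 0 && (data = Vector{UInt8}(undef, len))
MPI.Bcast!(data, 0, comm)

mkpath(localdepot)
write(joinpath(localdepot, "depot.tar"), data)
run(`tar -xf $(joinpath(localdepot, "depot.tar")) -C $localdepot`)

# Make the node-local copy the primary depot before loading the application.
pushfirst!(DEPOT_PATH, localdepot)
```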
To my understanding, there is (at least) one issue with this: the Julia processes will not directly take the precompiled modules from DEPOT_PATH[1]/compiled/, but they will first check inside DEPOT_PATH[1]/packages/ if the module needs to be precompiled again. So, I would like to ask you:
Do I understand the situation right?
If yes, is there a way to deactivate this check, i.e. to force the usage of these precompiled modules without any check?
And if this check cannot be deactivated, would a symlink in DEPOT_PATH[1]/packages/ to the original packages path (~/.julia/packages/) fix it (unfortunately, all processes would then need to access these same files simultaneously, which could cause congestion)?
Do you see any other issues with this approach? What else will all the Julia processes need to access from the installation in my home directory?
Thanks!!!
[1] Modules · The Julia Language
[2] The majority of the application code should also be put in a module itself so that it gets precompiled (the application script could then contain little more than a call to main()).
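As an illustration of [2], a minimal (entirely hypothetical) layout could be:

```julia
# MyApp.jl -- wrap essentially all application code in a module so it can be
# precompiled; module and function names are placeholders.
module MyApp

using MPI   # plus CUDAnative/CuArrays/CUDAdrv, etc.

function main()
    MPI.Init()
    # ... actual application ...
    MPI.Finalize()
end

end # module
```

The application script submitted to SLURM would then contain little more than `using MyApp; MyApp.main()`.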
If the goal is only to avoid filesystem congestion, then there may be a simpler solution, although I am not familiar enough with the internals of code loading/precompile file loading to know what it is. Hopefully someone more knowledgeable about this topic can chime in. I am also curious to know more about exactly how this works.
The answer seems to be to create a local cache. You can do this even if the compute nodes run from a RAM disk: you should still have a writable /tmp.
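A minimal sketch of that, assuming /tmp is writable on the compute nodes and that the home depot stays reachable for the package sources (the paths are placeholders):

```julia
# Put a node-local directory first in the depot stack: newly written precompile
# cache files then land under /tmp/.../compiled/ instead of on the shared
# filesystem, while ~/.julia further down the stack still provides the package
# sources. This must run before the first `using` of the application packages.
localcache = joinpath("/tmp", ENV["USER"], "julia_depot")
mkpath(localcache)
pushfirst!(DEPOT_PATH, localcache)
```

The same effect can be achieved from the job script by exporting JULIA_DEPOT_PATH with the node-local directory first and ~/.julia second.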
I'm not sure this is a complete solution, because the docs for module precompilation suggest the mtime of all source files will be checked, and metadata operations may not scale well on parallel file systems. It avoids cache files overwriting each other, but not the scaling issue.
Thanks @jonh for looking up the thread. Creating a local cache is what I had in mind in the solution sketched in my second post.
Good to know that DEPOT_PATH can be set via JULIA_DEPOT_PATH (it does not appear under the environment variables in the Julia documentation, but only in the glossary of the Pkg documentation).
To my understanding, --compiled-modules=no completely deactivates the usage of the precompilation cache [1], rather than forcing the usage of the available precompiled modules without any check. So, unfortunately, this will not solve the issues raised in my second post.