Hello everyone,
I’m currently experiencing a significant slowdown when launching Julia with MPI on my cluster. I suspect the issue might be due to long HDD access times.
To mitigate this, I am considering copying the Julia program to the SSD-based scratch space shared among the nodes. The scratch space is wiped after every job, so I would like to avoid the overhead of re-precompiling packages each time. Therefore, I want to copy the precompiled environment as well.
Here are my specific questions:
- Which Julia files or directories should I copy to the scratch space to ensure smooth and fast startup?
- Which environment variables need to be set or adjusted when running Julia from the scratch space?
- Are there any best practices or recommendations for managing precompiled packages in this context?
I would greatly appreciate any advice or insights from those who have faced similar challenges.
Thank you in advance!
That’s what we’ve been doing as well. To answer your specific questions: what you need to copy is the depot, and the environment variable to set is JULIA_DEPOT_PATH.
Go to a node with access to local scratch, e.g., a single compute node. Set JULIA_DEPOT_PATH to a folder on the local scratch (note: if you’re using Julia v1.10, it needs to be a stable path, i.e., one that exists on all nodes). Set up your depot and precompile all packages. Then tar the entire depot folder and save the archive to regular (non-temporary) storage.
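For example, a minimal sketch of this preparation step (the scratch location /scratch/local/$USER and the project path /path/to/myproject are placeholders you would adapt to your cluster):

```bash
# One-time preparation on a node that can see the local scratch.
# Placeholder paths: adapt /scratch/local/$USER and /path/to/myproject.
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot

# Build the depot: instantiate the project and precompile all packages.
julia --project=/path/to/myproject -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'

# Archive the whole depot and keep it on permanent (non-scratch) storage.
tar -czf "$HOME/julia-depot.tar.gz" -C /scratch/local/$USER julia-depot
```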
Then, when running a job, have your jobscript first run a once-per-node step that unpacks the prepared depot archive into local scratch again. Make sure to keep the paths the same, otherwise you will need to precompile again (this was fixed in v1.11 or v1.12, though I do not recommend these versions for production runs for various reasons). Set JULIA_DEPOT_PATH to that location, then execute your regular, parallel job, benefiting from local, ultra-fast package loading without precompilation.
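In a Slurm-based jobscript this could look roughly like the following sketch (Slurm and the placeholder paths are assumptions on my part; `srun --ntasks-per-node=1` is one way to run the unpacking once on each allocated node):

```bash
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32

# Unpack the prepared depot once per node, into the SAME path it was built with.
srun --ntasks-per-node=1 tar -xzf "$HOME/julia-depot.tar.gz" -C /scratch/local/$USER

# Point Julia at the unpacked depot on local scratch.
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot

# Run the actual MPI job; package loading now hits only the local SSD.
srun julia --project=/path/to/myproject my_mpi_program.jl
```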
Note: We are currently in the process of writing this up properly for publication.
I’m also just running into what I think is this issue. Without some workaround it looks like Julia is almost unusable above a few hundred MPI ranks. Very glad to hear someone is working on it already! I’d vote for creating an issue on MPI.jl to document the problem and your progress on solutions @sloede?
In that case I think your parallel file system is not properly set up. Even the worst-case setup we’ve found so far did not see any measurable slowdowns below 2000 MPI ranks. What are you using as the parallel FS - something HPC-worthy such as Lustre or GPFS, or rather something like NFS? The latter is known to scale rather poorly if not set up properly - in that case it might make sense to move your Julia depot from your home to a scratch directory with a better I/O system.
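If such a scratch file system is available, trying that can be as simple as (with a placeholder path):

```bash
# Use the scratch file system instead of $HOME for the Julia depot (path is a placeholder).
export JULIA_DEPOT_PATH=/scratch/$USER/.julia
```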
Sorry, I didn’t check carefully enough - I was trying to do strong/weak scaling scans and saw lots of failures. Looking more carefully, 2048 processes is OK, and I have problems* at 4096, which sounds like it fits with what you’re saying (the file system is Lustre).
Unfortunately I don’t see a per-node scratch on this system. I might have to contact the cluster admins.
* by ‘problems’ I mean that the job runs for 30 minutes (which was the timeout I’d set in the submission script) without completing when it should have finished in about 10. After about 10 minutes, my code does start running (I get the output from a print statement), but then a few warnings like
┌ Warning: failed to remove pidfile on close
│ path = ".../.julia/logs/manifest_usage.toml.pid"
│ removed = false
└ @ FileWatching.Pidfile .../julia-1.11.5/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:347
(where I’ve truncated the file paths for privacy), then nothing until the job times out.