Hello everyone,
I’m currently experiencing a significant slowdown when launching Julia with MPI on my cluster. I suspect the issue might be due to long HDD access times.
To mitigate this, I am considering copying the Julia program to the SSD-based scratch space shared among the nodes. The scratch space is wiped after every job, so I would like to avoid the overhead of re-precompiling packages each time. Therefore, I want to copy the precompiled environment as well.
Here are my specific questions:
- Which Julia files or directories should I copy to the scratch space to ensure smooth and fast startup?
- Which environment variables need to be set or adjusted when running Julia from the scratch space?
- Are there any best practices or recommendations for managing precompiled packages in this context?
I would greatly appreciate any advice or insights from those who have faced similar challenges.
Thank you in advance!
That’s what we’ve been doing as well. To answer your specific questions: what you need to copy is the depot, and the environment variable to set is JULIA_DEPOT_PATH.
Go to a node with access to local scratch, e.g., a single compute node. Set JULIA_DEPOT_PATH to a folder on the local scratch (note: if you’re using Julia v1.10, it needs to be a stable path, i.e., one that exists on all nodes). Set up your depot and precompile all packages. Then tar the entire depot folder and save the archive to regular (non-temporary) storage.
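For example, a minimal sketch of this preparation step (the scratch location /scratch/local/$USER and the project path /path/to/myproject are placeholders you would adapt to your cluster):

```bash
# One-time preparation on a node that can see the local scratch.
# Placeholder paths: adapt /scratch/local/$USER and /path/to/myproject.
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot

# Build the depot: instantiate the project and precompile all packages.
julia --project=/path/to/myproject -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'

# Archive the whole depot and keep it on permanent (non-scratch) storage.
tar -czf "$HOME/julia-depot.tar.gz" -C /scratch/local/$USER julia-depot
```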
Then, when running a job, have your jobscript first run a once-per-node step that unpacks the prepared depot archive into local scratch again. Make sure to keep the paths the same, otherwise you will need to precompile again (this was fixed in v1.11 or v1.12, though I do not recommend these versions for production runs for various reasons). Set JULIA_DEPOT_PATH to that location, then execute your regular, parallel job, benefiting from local, ultra-fast package loading without precompilation.
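In a Slurm-based jobscript this could look roughly like the following sketch (Slurm and the placeholder paths are assumptions on my part; `srun --ntasks-per-node=1` is one way to run the unpacking once on each allocated node):

```bash
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32

# Unpack the prepared depot once per node, into the SAME path it was built with.
srun --ntasks-per-node=1 tar -xzf "$HOME/julia-depot.tar.gz" -C /scratch/local/$USER

# Point Julia at the unpacked depot on local scratch.
export JULIA_DEPOT_PATH=/scratch/local/$USER/julia-depot

# Run the actual MPI job; package loading now hits only the local SSD.
srun julia --project=/path/to/myproject my_mpi_program.jl
```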
Note: We are currently in the process of writing this up properly for publication.
I’m also just running into what I think is this issue. Without some workaround it looks like Julia is almost unusable above a few hundred MPI ranks. Very glad to hear someone is working on it already! I’d vote for creating an issue on MPI.jl to document the problem and your progress on solutions @sloede?
In that case I think your parallel file system is not properly set up. Even the worst-case setup we’ve found so far did not see any measurable slowdowns below 2000 MPI ranks. What are you using as the parallel FS - something HPC-worthy such as Lustre or GPFS, or rather something like NFS? The latter is known to scale rather poorly if not set up properly - in that case it might make sense to move your Julia depot from your home to a scratch directory with a better I/O system.
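If such a scratch file system is available, trying that can be as simple as (with a placeholder path):

```bash
# Use the scratch file system instead of $HOME for the Julia depot (path is a placeholder).
export JULIA_DEPOT_PATH=/scratch/$USER/.julia
```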
Sorry, I didn’t check carefully enough - I was trying to do strong/weak scaling scans and saw lots of failures. Looking more carefully, 2048 processes is OK, and I have problems* at 4096, which sounds like it fits with what you’re saying (the file system is Lustre).
Unfortunately I don’t see a per-node scratch on this system. I might have to contact the cluster admins.
* by ‘problems’ I mean that the job runs for 30 minutes (which was the timeout I’d set in the submission script) without completing when it should have finished in about 10. After about 10 minutes, my code does start running (I get the output from a print statement), but then a few warnings like
┌ Warning: failed to remove pidfile on close
│ path = ".../.julia/logs/manifest_usage.toml.pid"
│ removed = false
└ @ FileWatching.Pidfile .../julia-1.11.5/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:347
(where I’ve truncated the file paths for privacy), then nothing until the job times out.