I’m running a batch script on an HPC cluster where, inside a bash loop, each Julia execution is expected to take <1 min. I ran into this weird “precompilation gridlock”, shown below. Has anybody experienced something like this?
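For context, the loop is shaped roughly like this (the script and file names are placeholders, not my actual ones):

```bash
# Illustrative only: many short, independent Julia runs driven from bash
for input in data/*.jld2; do
    julia --project=/path/to/project process.jl "$input"   # each run should take <1 min
done
```

And here is what each worker prints while stuck: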
Precompiling MLJBase
Progress [=======================================> ] 35/36
✓ FixedPointNumbers
✓ ColorTypes
? Distributions → DistributionsChainRulesCoreExt
◐ MLJBase Being precompiled by another machine (hostname: worker6062, pid: 603518, pidfile: /mnt/home/mcranmer/.julia/compiled/v1.10/MLJBase/jaWQl…
Basically it looks like all the workers wait for one worker to finish precompiling. Then, when that worker finally finishes[1], the next one decides that the precompilation cache has been invalidated and needs to precompile again. This process repeats over and over.
The result is that out of 3200 cores across the cluster, only 1 is ever in use, since precompilation ends up taking longer than the processing itself.
One other clue is that this gridlock only started after I interacted with the environment from another interactive node to visualize some results; that seemed to invalidate the cache (maybe due to -O2 vs -O3). It’s a shared filesystem, so the cache is shared between my workers and the interactive REPL. That seemed to send the workers into a loop where each one kept invalidating the cache the previous one had just written.
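If the flag mismatch is really what invalidated the cache, one mitigation I can think of is to launch the interactive session with exactly the same Julia flags as the workers, so both map to the same cache entries. A minimal sketch, assuming the workers run at -O3 (I haven’t confirmed which side actually differs):

```bash
# Start the REPL on the interactive node with the same optimization level the
# batch jobs use, so it doesn't rewrite the shared cache with a different flag hash.
# The -O3 and the project path are assumptions.
julia -O3 --project=/path/to/project
```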
Yeah, I want to add that IMO it’s overly tight. In the GitHub issue (last link above), the login and remote nodes use the same CPUs, but one has 2 NUMA nodes and the other has only 1; I don’t believe that should have changed the compile cache hash.
Just to note: my cluster is not heterogeneous. The node I am visualising things from is the same type as the nodes I am running on.
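One check I could run from both the login node and a worker node is whether each of them considers the existing MLJBase cache valid for its own flags/CPU target. A sketch, assuming Julia 1.10’s Base.isprecompiled and a placeholder project path:

```bash
# Run on both node types and compare the output; a true/false disagreement would
# point at a flag or CPU-target mismatch between the two environments.
julia --project=/path/to/project -e '
    id = Base.identify_package("MLJBase")
    println(Base.isprecompiled(id))
'
```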
My guess is that it’s either:
1. My login shell prescribes different Julia options (like -O2), which triggered a re-precompilation (and thus caused the workers to get stuck waiting for it to finish), or
2. I added a package while the batch script was already running (a guard against this is sketched below).
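If it’s the second one, the obvious guard is to resolve and precompile the environment once, up front, and then not touch it while the job is running. A sketch (project path is a placeholder):

```bash
# Run once on the login node before submitting the batch job, so the workers find a
# complete, already-precompiled cache and never need to build one themselves.
julia --project=/path/to/project -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'
```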
The weird thing is that even after one worker precompiled, the others started precompiling again, one after another (even though they are identical worker nodes). I don’t understand that. Maybe it depends on where each worker was in its own precompilation the moment the global mutable cache changed, so that the cache entry it had just built was immediately invalid?
In either case I’d like to figure out how to prevent this. Is there a way I can freeze the precompilation cache when I execute my job, so that it doesn’t interact with a global mutable cache? Or just turn off precompilation altogether?
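For concreteness, the kind of thing I’m imagining looks roughly like this; the scratch path and the Slurm job-ID variable are guesses for my setup, not something I’ve tested:

```bash
# Option A (sketch): give each job its own writable depot layered in front of the
# shared one. New precompile caches land in the job-local depot, so jobs stop
# fighting over (and invalidating) the shared cache in ~/.julia.
# /scratch and $SLURM_JOB_ID are assumptions; substitute your scheduler's variable.
export JULIA_DEPOT_PATH="/scratch/$SLURM_JOB_ID/julia_depot:$HOME/.julia"
julia --project=/path/to/project process.jl "$input"

# Option B (sketch): --pkgimages=no disables the native-code part of the cache,
# so there is less to build and less to invalidate (at the cost of slower runtime
# compilation in each short job).
julia --pkgimages=no --project=/path/to/project process.jl "$input"

# Option C (most drastic): --compiled-modules=no skips loading and saving compiled
# module caches entirely, i.e. effectively "turn off precompilation", with much
# slower startup per run.
julia --compiled-modules=no --project=/path/to/project process.jl "$input"
```

Option A is probably closest to “freezing” the cache per job; B and C are closer to turning precompilation off, which for <1 min jobs might cost more than it saves.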