Distributed startup on large clusters

I note that IPv6 is supported for launch_on_machine() in Julia 1.5, which is a good thing.
https://github.com/JuliaLang/julia/pull/34430

This leads me to revive an old topic.
Julia uses ssh connections to start processes on other machines. Over in MPI land, process launching on large clusters is now done with other mechanisms that cope better with startup times across large numbers of machines. Now that we are in the era of half-exascale HPC clusters, should we be looking to implement these mechanisms?
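
For reference, the ssh-based launch I mean looks roughly like the sketch below. The node names are made up, and it assumes passwordless ssh to each host with Julia installed at the same path on every node.

```julia
using Distributed

# Hypothetical node names; substitute hosts reachable over passwordless ssh.
hosts = ["node001", "node002", "node003"]

# A machine specification can be a (host, count) tuple: here 4 workers per node.
machines = [(h, 4) for h in hosts]

# With a vector of machine specifications, addprocs starts the workers
# on the listed hosts over ssh and returns their process ids.
pids = addprocs(machines)

println("started ", length(pids), " workers")
```

On a handful of nodes this is fine. My question is about what happens when the host list runs to thousands of entries.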

Munge is commonly used for authentication on HPC systems that use the Slurm batch scheduler:
https://linux.die.net/man/7/munge

I do realise I started another thread on this topic some time ago. I forget what the response was.
