I’m in a similar situation: I’m starting to look at deploying our simulations on clusters with several nodes connected by InfiniBand, so I look forward to any responses from people with this kind of experience.
(As far as I understand, InfiniBand plays much the same role as good old Ethernet, just with much higher bandwidth and lower latency, so it reduces the bottleneck of internode communication.)
Possibly relevant: Running Julia in a SLURM Cluster - #2 by CameronBieganek