What's needed for an HPC cluster & Julia?

I have been asked to describe what I would like in a new HPC environment that my University might contribute to. 90% of my work is now done in Julia: mostly modelling (GLM.jl, Turing.jl, MixedModels.jl, and Gibbs samplers coded by hand, etc.), data wrangling (DataFrames.jl), and graph analysis (LightGraphs.jl). I have no idea how to do any of that in a cluster setting (SLURM was mentioned). I also do some deep learning, which I would like to move to Flux.jl in the future. I would be very grateful for any suggestions on what I should ask for. At the moment I use a laptop to connect to a single PC via SSH, and that works OK so far, but I don't know how that scales.
Thanks!

Not sure what you are really asking here. You don't need anything in particular from Julia's point of view. In fact, Julia doesn't even need a system-wide installation, and you can run everything from your home directory.

From the system perspective, the system administrators will likely set up the cluster in the traditional way, making use of a workload scheduler like SLURM or PBS. The main requirement that makes Julia much easier to use is passwordless access to all the compute nodes from the head node.

From Julia's perspective, you can use ClusterManagers.jl to interface with a workload scheduler such as SLURM. It's fairly easy.
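For example, a minimal sketch assuming ClusterManagers.jl is installed; the partition name, time limit, and worker count are placeholders for whatever your site actually configures:

```julia
using Distributed, ClusterManagers

# Ask SLURM for 32 worker processes; extra keyword arguments are passed
# through to srun (partition, time limit, ...).
addprocs(SlurmManager(32), partition="compute", t="01:00:00")

@everywhere using Statistics

# Each worker handles one independent task; pmap scatters them across
# whatever nodes SLURM handed out.
results = pmap(1:32) do seed
    mean(randn(10^6) .+ seed)   # stand-in for a real model fit / chain
end

rmprocs(workers())
```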

2 Likes

Some questions to think about:

  • Do you anticipate having single jobs that'll require more compute/memory than a single node can provide? If so, can the job be arranged so that the nodes don't need to communicate, or will you need high-speed interconnects between machines? (See the sketch after this list.)
  • If you’re doing DL, will you want access to GPUs?
  • Will the hardware be heterogeneous from node to node? Debugging is always easier on homogeneous hardware, even if that means the facility will need to be split up into multiple clusters (particularly: arranging a separate GPU cluster). That’s a luxury that academic clusters can’t always afford, though.
  • Will nodes be interactive (for debugging/precompiling/etc.), batch-only (no REPL - only output logs for debugging), or a mix thereof?
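
To make the first bullet concrete: if the tasks are independent, Distributed.jl's `pmap` over workers obtained from the scheduler is usually enough and the interconnect barely matters. If ranks must exchange data mid-computation, MPI.jl is the usual route and the interconnect starts to matter a lot. A hedged sketch of the latter, assuming MPI.jl and a cluster-provided MPI library, launched with something like `mpiexec -n 4 julia script.jl`:

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Each rank computes a partial result, then all ranks combine them;
# this reduction is the step that actually exercises the interconnect.
partial = sum(rand(1_000_000))
total   = MPI.Allreduce(partial, +, comm)

rank == 0 && println("sum over $nranks ranks: $total")

MPI.Finalize()
```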
2 Likes

Everything I have done on my HPC cluster amounts to very minimalist supercomputing, I'd say. I just submit a job that takes up a single compute node, which is much more powerful than my personal computer. In that case, I don't think anything special needs to be done at all. The tricky stuff comes in when, as stated above, you need communication between compute nodes.
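
For example, on the Julia side a single-node job can be as simple as threading over independent tasks. A minimal sketch; the workload is a placeholder, and the batch script just needs to start Julia with `-t auto` (or set `JULIA_NUM_THREADS`) so it sees all the node's cores:

```julia
using Base.Threads

n = 10_000
results = Vector{Float64}(undef, n)

# Each iteration is independent, so the threads never need to talk to
# each other -- exactly the "minimalist supercomputing" case.
@threads for i in 1:n
    results[i] = sum(sin, 1:i)   # stand-in for one independent task
end

println("finished on ", nthreads(), " threads")
```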

2 Likes

Good questions. The first thing I would ask is what proportion of your work will need GPUs.
The second is the size of your data sets. Compute is relatively easy to specify; storage is harder, and a robust shared storage environment is key to HPC. That's not to say it has to cost $$$ - a simple NFS share may be sufficient. For Julia, though, with its precompilation requirements, something with a few more go-faster stripes is a good idea.

Regarding Slurm, I would definitely recommend it. Also think of Kubernetes, of course.

Please drop me a message on here. I work for Dell as a Principal Engineer in our HPC Team.
I promise not to give you the company line, but I would love to become involved in speccing this project.
We can also arrange benchmarking time on Intel and AMD CPUs if you want to kick the tyres on some big systems.

5 Likes

I think Julia derives extra benefits from a fast network file system compared to other languages, due to the precompilation cache. It seems like disk access speeds can vary by orders of magnitude between different clusters.
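
If the shared filesystem does turn out to be slow, one workaround is to point the Julia depot (and with it the precompile cache) at faster scratch or node-local storage. A hedged sketch; the path is purely hypothetical, and setting the `JULIA_DEPOT_PATH` environment variable before launching Julia achieves the same thing:

```julia
# Run before any packages are loaded (e.g. at the top of the job script).
scratch_depot = joinpath(get(ENV, "SCRATCH", "/tmp"), "julia_depot")
mkpath(scratch_depot)
pushfirst!(DEPOT_PATH, scratch_depot)   # new precompile caches land here
```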

2 Likes

Thank you all for your answers! As @affans has noted, I am not 100% certain what I am asking here. I've been asked to make suggestions for such an HPC environment and state my needs. As part of that, I would like to make sure Julia is supported, but it sounds like that is fairly easy.
@stillyslalom, I think my problem at the moment is that my experience with parallel computing is limited to @spawn and @threads on a single computer. I need to think about what the logic of my programs requires. Essentially, this is also meant to make possible things that aren't possible right now, so it's kinda tricky.
@johnh GPUs are definitely in the mix and seem to be what everyone is requesting. :wink: Storage is definitely an issue: we currently have 40 TB altogether, and that is barely enough. I should also mention that I have no power to make any purchasing decisions (and there seem to be some longstanding contracts in place), but if you are interested I will come back to you once I know more.