I would like to set up a “compute cluster” with 4-5 computers in my workgroup. It should be able to run more or less arbitrary code, so it’s perhaps not a totally Julia-specific question, but in the end I would of course like to run Julia code on it. The prerequisites and requirements:
- 5 computers with 1 or 2 GPUs of different types, connected by ordinary gigabit Ethernet (I can give more detail if needed).
- 1 data storage server
- Ubuntu 20, dual-booted with Windows, as all nodes but one are also used by students for other work.
- Since the computers are used for other work, it should be possible to connect/disconnect nodes dynamically.
- Mostly embarrassingly parallel workloads (Python/Julia scripts for deep learning and satellite image processing)
- Maaayybee distributed deep learning training
- Each node needs a few hundred GB of data to process, coming from the data storage server, so quite a bit of data movement.
- I think what is needed is “just some sort of queue” (a rough sketch of what I mean follows this list)
- No money available to buy commercial products (of course…)
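To make “just some sort of queue” more concrete, here is a rough sketch of what I imagine: a file-based queue in Julia, where jobs are dropped as Julia scripts into a shared directory on the storage server, and each node runs a small worker loop that atomically claims a job by moving its file and then runs it. All paths and the directory layout here are just assumptions:

```julia
using Dates

# Hypothetical shared directories on the storage server (e.g. an NFS mount).
const QUEUE_DIR   = "/mnt/storage/queue/pending"
const CLAIMED_DIR = "/mnt/storage/queue/claimed"
const DONE_DIR    = "/mnt/storage/queue/done"

# Claim the oldest pending job by atomically moving its file.
# Returns the claimed path, or `nothing` if the queue is empty.
function claim_job()
    for f in sort(readdir(QUEUE_DIR; join=true))
        dest = joinpath(CLAIMED_DIR, "$(gethostname())_$(basename(f))")
        try
            mv(f, dest)   # rename should be atomic on one filesystem (worth verifying for NFS)
            return dest
        catch
            continue      # another node grabbed this job first; try the next one
        end
    end
    return nothing
end

# Worker loop running on every node: poll the queue, run claimed jobs.
while true
    job = claim_job()
    if job === nothing
        sleep(30)         # queue is empty; check again later
    else
        @info "Running $job at $(now())"
        run(`julia $job`) # the job is an arbitrary Julia script
        mv(job, joinpath(DONE_DIR, basename(job)))
    end
end
```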
A concrete example: a student has a training job running on the main computer, using both GPUs. I am connected remotely to the same computer. I now want to start my own training run, also using two GPUs, but I don’t want to ssh into the next available PC and start everything manually from there. So a queue which automatically moves my work to the next free resources would be nice.
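For this use case, the worker loop from the sketch above could simply refuse to claim jobs while the local GPUs are busy. Something like this, assuming NVIDIA GPUs with `nvidia-smi` on the PATH (the 500 MB threshold is an arbitrary guess for “nothing serious is running”):

```julia
# True if every local GPU is (roughly) idle, judged by used memory.
function gpus_free(; threshold_mb = 500)
    out = read(`nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`, String)
    used = parse.(Int, split(strip(out), '\n'))
    return all(used .< threshold_mb)
end
```

The loop would then only call `claim_job()` when `gpus_free()` returns true, so my run would automatically land on whichever node has idle GPUs.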
Second example: there are N satellite images to process using deep learning. I would like a central way of starting the processing jobs for these images. While a job is running, someone comes along and urgently needs one computer plus Windows, and restarts the node without notice. The interrupted job should then be finished by some other node.
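For this second case I wonder whether Julia’s Distributed standard library is already enough: workers can be added and removed dynamically, and `pmap` can retry items whose worker disappeared. A minimal sketch, with hypothetical node names and a placeholder `process_image` function:

```julia
using Distributed

# Hypothetical node hostnames; workers can be added/removed while jobs run.
addprocs(["node1", "node2", "node3"]; exeflags = "--project")

@everywhere function process_image(path)
    # ... the actual deep-learning / image-processing code would go here ...
    return "processed $path"
end

images = readdir("/mnt/storage/images"; join = true)

# If a node is rebooted mid-job, retry the failed items (up to 3 times here)
# on the remaining workers instead of aborting the whole run.
results = pmap(process_image, images;
               retry_delays = ExponentialBackOff(n = 3))
```

Whether this retry mechanism is robust enough when a node vanishes without notice is exactly the kind of thing I am unsure about.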
I would be happy if you could give me guidance towards the right approach, the right software, and perhaps some general advice! E.g. should I move completely to the cloud, like AWS or Azure? Is all this overkill for only 5 PCs? Of course I can also go into more detail if needed.