Setting up a small compute cluster

Hi everybody!
I would like to set up a “compute cluster” with 4-5 computers in my workgroup. It should be able to run more or less arbitrary code, so this is maybe not a totally Julia-specific question, but in the end I would of course like to run Julia code on it :slight_smile: On to the prerequisites and requirements:

Hardware:

  • 5 computers with 1 or 2 GPUs of different types, connected by normal gigabit ethernet (I can give more detail if needed).
  • 1 data storage server

Software:

  • Ubuntu 20, dual-booted with Windows, as all but one node are also used by students for other work.

Requirements:

  • As the computers are used for other work it should be possible to dynamically connect/disconnect the nodes.
  • Mostly embarrassingly parallel workload (python / julia scripts for deep learning and satellite image processing)
  • Maaayybee distributed deep learning training
  • Each node needs a few hundred GB of data to process, coming from the data storage server, so there is quite a bit of data movement.
  • I think what is needed is “just some sort of queue”
  • No money available to buy commercial products (of course…) :smiley:

Typical situation:

  • A student has a training job running on the main computer, using both GPUs. I am connected remotely to the same computer. I now want to start my own training run, also using two GPUs, but I don’t want to manually ssh into whichever PC happens to be free and start everything from there. A queue which automatically moves my work to the next free resources would be nice.

  • There are N satellite images to process with deep learning stuff. I would like a central way of starting the processing jobs for these images. While the jobs are running, someone comes along who urgently needs one computer with Windows and restarts the node without notice. The interrupted job should then be finished by some other node.

I would be happy if you could give me guidance towards the right approach, the right software and perhaps some general advice! :slight_smile: E.g. should I completely move to the cloud, like AWS or Azure? Is all this overkill for only 5 PCs? Of course I can also go into more detail if needed.


I am happy to give some guidance. I build HPC clusters for Dell.
I will talk about the network in the next post.

Firstly, please don’t lock yourself into Ubuntu as the OS.
Bright Computing now offer their cluster manager for free for up to 8 compute nodes - Easy 8.


Just register and download.
Happily though, Bright now supports Ubuntu 18.04, so that might be a good choice for you!

Regarding the network, 1 Gbps Ethernet is just fine. It looks like you are thinking of distributing jobs to the compute servers rather than doing much parallel work across nodes.
Just for completeness, 25 Gbps Ethernet costs the same as 10 Gbps Ethernet these days, so start at 25 if you are going for a higher-bandwidth network.
Also, if you can scrounge InfiniBand cards and cables, it is possible to ‘daisy chain’ the servers without a switch, which would make for a neat small cluster.

As regards storage, you could create a RAID set on your cluster head node and use NFS. That will be fine.
However, if you want to look at BeeGFS, that is easily done, and Bright can configure it out of the box.
It’s actually quite easy to set up by yourself too.
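
Just to show how little is involved, here is a minimal sketch of a plain NFS export from the head node - assuming the shared data lives under /data and the compute nodes sit on a hypothetical 192.168.1.0/24 network; adjust paths and addresses to your own setup.

```bash
# On the head node (Ubuntu): install the NFS server and export /data
sudo apt install nfs-kernel-server
echo '/data 192.168.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra                 # re-read /etc/exports

# On each compute node: mount the share (add it to /etc/fstab to make it permanent)
sudo apt install nfs-common
sudo mkdir -p /data
sudo mount -t nfs headnode:/data /data
```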

Which leads me to ask - what hardware do you have for the cluster master? Are there spare slots for some storage drives? Does it have a hardware RAID controller? Though ZFS software RAID might be good here.
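
If ZFS appeals, pooling a few spare drives on the head node is only a couple of commands. A sketch, assuming four hypothetical drives /dev/sdb through /dev/sde in a single-parity raidz layout:

```bash
sudo apt install zfsutils-linux    # ZFS support on Ubuntu
sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
sudo zfs create -o mountpoint=/data tank/data
sudo zpool status tank             # check pool health
```

The resulting /data filesystem is then exactly what you would export over NFS as above.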


There are a couple of alternatives to Bright :stuck_out_tongue_winking_eye:

OpenHPC - which works great, but you have to do a bit of self-learning. It’s CentOS/Red Hat/SUSE, though.
https://openhpc.community/

Qlustar - Debian/Ubuntu/CentOS


I am less familiar with Qlustar.

My advice - give some serious thought to Bright and Qlustar. Try an installation of both and see which suits you the best.
Remember that Easy 8 and Qlustar are community supported - but you are ready for that.

Thank you for the advice! It will probably take me until the beginning of next year to really try it out, as I’m still waiting for my PhD position. So please be patient with me :slight_smile:

That’s nice! All the computers are Dell Precision towers with varying hardware configurations :slight_smile: My university has a framework contract with Dell for hardware, that’s why.

Regarding your questions and answers (total noob warning): Bright and Qlustar seem to somehow underlie the OS. I didn’t really figure out what the software stack looks like. Is it cluster manager > OS > workload manager? I expected more of an answer like “just install Slurm/SGE/etc. and you’re good to go”. What’s the relation between a workload manager and a cluster manager? Pointing me to the answer is enough of an answer :wink: And do the two really work alongside Windows in a dual-boot config?

We use Ubuntu as the OS for two reasons: 1) most people (including me) are familiar with deb-based distros, and 2) it actually installs on these machines. Debian doesn’t (keyboard freezes during install), Manjaro/Arch doesn’t (fails to recognize partitions after install), no matter what I do - but I haven’t tried Red Hat/Fedora etc. However, for cluster purposes PXE is available. These computers are divas :smiley:

Regarding Ethernet: thanks for the hint! I would have to talk to the university IT administration to get the switches upgraded in case I don’t go for the daisy chain. Currently there are just 1 Gb/s Intel cards, but you’re right: at least used cards are rather cheap nowadays.

Head node: The computer I have in mind as head node is a Dell Precision 7920 tower with a 12-core Xeon and 96 GB RAM, but no hardware RAID as far as I know. I think one could cram in some more hard drives and go for software RAID, but there are also 1 TB SSDs available which could be used for prefetching during idle times or so. The storage server is just a Synology with 32 disks or so; I don’t know its capabilities. Also, what is better: using the fastest computer as the head node, or keeping it among the workers?

Thanks again!

Requirements like yours - machines that are also in interactive use during the day - were traditionally met by installing a Condor pool.
In answer to your question - yes, just install the Slurm queue system and you are good to go. (Slurm is the workload manager; a cluster manager such as Bright or Qlustar sits a level below it and handles provisioning the OS and software stack onto the nodes.)
Remember though that you need consistent UIDs across all the nodes and the head node, and that Slurm uses an authentication service called MUNGE - the same munge key has to be copied to every node. But it’s all easily set up.
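
To make that concrete for the two situations in your original post, here is a minimal sketch of a Slurm batch script, assuming the GPUs have been configured as a GRES (generic resource); the script name, paths and resource numbers are placeholders for your own. --gres=gpu:2 asks for two free GPUs on whatever node has them, --requeue lets Slurm rerun the job elsewhere if the node it was on is rebooted, and --array fans the same script out over your N satellite images.

```bash
#!/bin/bash
#SBATCH --job-name=train           # shows up in squeue
#SBATCH --gres=gpu:2               # two GPUs on whichever node has them free
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --requeue                  # re-run on another node if this one goes down
#SBATCH --array=1-100              # optional: one task per satellite image

# Hypothetical workload: process the image belonging to this array task
julia --project train.jl /data/images/tile_${SLURM_ARRAY_TASK_ID}.tif
```

Submit it with sbatch, watch it with squeue, and the scheduler does the job of finding free GPUs for you.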

The requirement for a dual-boot Windows/Linux system is a bit more tricky. I know this is possible with Bright, but I have never done it.
What you CAN do with Bright and OpenHPC is a stateless installation - which means the compute node boots via PXE and runs off a RAM disk or an NFS-mounted root drive.
This may be a good option for you.
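
Bright does this for you, and OpenHPC’s standard recipes use Warewulf for it, so you would not write this yourself - but just to illustrate what stateless PXE boot involves underneath, it is essentially DHCP plus TFTP pointing each node at a boot image. A rough sketch with dnsmasq, assuming a hypothetical cluster network on 192.168.1.0/24:

```
# /etc/dnsmasq.conf on the head node (illustrative only)
interface=eno2                        # cluster-facing NIC
dhcp-range=192.168.1.100,192.168.1.150,12h
enable-tftp
tftp-root=/srv/tftp                   # holds the bootloader, kernel and initrd
dhcp-boot=pxelinux.0                  # file the nodes fetch and boot
```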

This article describes dual booting with Bright. The suggestion there is to use a second hard drive. With a bit of work I think you could shrink the Windows partition and use the freed space. Or do a stateless boot, as I said.