Setting up a small compute cluster

maxfreu · November 23, 2020, 11:59am

Hi everybody!
I would like to set up a “compute cluster” with 4-5 computers in my workgroup. It should be able to run more or less arbitrary code, so it’s maybe not a totally Julia-specific question, but in the end I would of course like to run Julia code on it. Here are the prerequisites and requirements:

Hardware:

5 computers with 1 or 2 GPUs of different types, connected by normal gigabit ethernet (I can give more detail if needed).
1 data storage server

Software:

Ubuntu 20, used in dual boot with Windows, as all but one of the nodes are used by students for other work.

Requirements:

As the computers are used for other work it should be possible to dynamically connect/disconnect the nodes.
Mostly embarrassingly parallel workload (python / julia scripts for deep learning and satellite image processing)
Maaayybee distributed deep learning training
Each node needs a few 100 GB of data to process, coming from a data storage server, so quite a bit of data movement.
I think what is needed is “just some sort of queue”
No money available to buy commercial products (of course…)

Typical situation:

A student has a training job running on the main computer, using both GPUs. I am connected via remote to the same computer. I now want to start my own training run, also using two GPUs, but don’t want to ssh into the next best pc and start everything manually from there. So a queue which automatically moves my work to the next free resources would be nice.
There are N satellite images to process using deep learning stuff. I would like a central way of starting the processing jobs for these images. When the job is running, someone comes along and urgently needs one computer + windows. He restarts the node without notice. The interrupted job should be finished by some other node.
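The second situation (a node disappears mid-job and another node should finish the work) can be sketched as a queue that hands out jobs on a time-limited lease. This is only an in-memory illustration of the idea, not a real scheduler: the class name `LeaseQueue`, the file names, the node names and the lease length are all hypothetical.

```python
import time
from collections import deque


class LeaseQueue:
    """In-memory job queue where a claimed job is leased, not removed.

    If a worker does not report completion before its lease expires
    (for example because the node was rebooted), the job goes back to
    the pending list and another worker can claim it.
    """

    def __init__(self, jobs, lease_seconds):
        self.pending = deque(jobs)
        self.leased = {}  # job -> (worker, lease deadline)
        self.done = set()
        self.lease_seconds = lease_seconds

    def _reclaim_expired(self, now):
        # Put every job whose lease has run out back into the queue.
        for job, (_, deadline) in list(self.leased.items()):
            if now >= deadline:
                del self.leased[job]
                self.pending.append(job)

    def claim(self, worker, now=None):
        """Give the next free job to `worker`, or None if nothing is free."""
        now = time.monotonic() if now is None else now
        self._reclaim_expired(now)
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.leased[job] = (worker, now + self.lease_seconds)
        return job

    def complete(self, worker, job):
        """Mark `job` as done; only the current lease holder may do so."""
        holder = self.leased.get(job)
        if holder is None or holder[0] != worker:
            return False
        del self.leased[job]
        self.done.add(job)
        return True


# Hypothetical example: node1 claims an image and is then rebooted.
queue = LeaseQueue(["img_a.tif", "img_b.tif"], lease_seconds=10)
lost = queue.claim("node1", now=0)       # node1 never reports back
other = queue.claim("node2", now=1)
queue.complete("node2", other)
retry = queue.claim("node2", now=11)     # lease expired, job is handed out again
```

A workload manager would normally do this bookkeeping (plus resource matching for the GPUs) for you; the sketch only shows why an interrupted job does not have to be lost.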

I would be happy if you could give me guidance towards the right approach, the right software and perhaps some general advice! E.g. should I completely move to the cloud, like AWS or azure? Is all this overkill for only 5 pcs? Of course I can also go more into detail if needed.

johnh · November 23, 2020, 1:59pm

I am happy to give some guidance. I build HPC clusters for Dell.
I will talk about the network in the next post.

Firstly, please uncouple from insisting on Ubuntu as the OS.
Bright Computing now offer their cluster manager for free for up to 8 compute nodes, under the name Easy 8.

Just register and download.
Happily, though, Bright now support Ubuntu 18.04, so that might be a good choice for you!

johnh · November 23, 2020, 2:04pm

Regarding the network, 1Gbps Ethernet is just fine. It looks like you are thinking of distributing jobs to the compute servers, rather than doing much parallel work across nodes.
Just for completeness, 25Gbps Ethernet is the same cost as 10Gbps Ethernet these days, so start at 25 if you are going for a higher bandwidth network.
Also, if you can scrounge Infiniband cards and cables it is possible to ‘daisy chain’ for servers without a switch, which would make for a neat small cluster.
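To put numbers on the data movement, here is a rough back-of-the-envelope calculation. It assumes 300 GB per node (the original post only says “a few 100 GB”) and that about 90% of the line rate is usable; both figures are assumptions, and the storage server's disks may well be slower than the network.

```python
def transfer_hours(data_gb, link_gbps, efficiency=0.9):
    """Hours needed to move `data_gb` gigabytes over a `link_gbps` Gbit/s link.

    `efficiency` is an assumed usable fraction of the line rate
    (protocol overhead etc.), not a measured value.
    """
    if data_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed or efficiency")
    seconds = data_gb * 8 / (link_gbps * efficiency)  # 8 bits per byte
    return seconds / 3600


# Assumed 300 GB per node, for the link speeds mentioned above.
for gbps in (1, 10, 25):
    print(f"{gbps:>2} Gbps: {transfer_hours(300, gbps):.2f} h")
```

Under these assumptions a single node needs well under an hour even on 1Gbps, which fits the point that gigabit is fine for distributing independent jobs. Note that several nodes pulling data from the storage server at once share its link.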

As regards storage, you could create a RAID set on your cluster head node and use NFS. That will be fine.
However, if you want to look at BeeGFS this is easily done, and Bright can configure it out of the box.
Actually quite easy to set up by yourself too.

Which leads me to ask - what hardware do you have for the cluster master? Are there spare slots for some storage drives? Does it have a hardware RAID controller? Though ZFS software RAID might be good here.

johnh · November 23, 2020, 2:08pm

There are a couple of alternatives to Bright:

OpenHPC - which works great, but you have to do a bit of self learning. CentOS/Redhat/SuSE though
https://openhpc.community/

Qlustar - Debian/Ubuntu/CentOS

I am less familiar with Qlustar.

My advice - give some serious thought to Bright and Qlustar. Try an installation of both and see which suits you the best.
Remember that Easy 8 and Qlustar are community supported - but you are ready for that.

maxfreu · November 23, 2020, 3:11pm

Thank you for the advice! It will probably take me until the beginning of next year to really try it out, as I’m still waiting for my PhD position. So please be patient with me.

That’s nice! All the computers are Dell Precision towers with varying hardware configs. My university has a frame contract for hardware with Dell, that’s why.

Regarding your questions and answers (total noob warning): Bright and Qlustar seem to somehow underlie the OS. I didn’t really figure out what the software stack looks like. Is it cluster manager > OS > workload manager? I expected more of an answer like “just install slurm/sge/etc. and you’re good to go”. What’s the relation between a workload manager and a cluster manager? Pointing me to the answer is enough of an answer. And do the two really work alongside Windows in a dual-boot config?

We use Ubuntu as the OS for two reasons: 1) most people (including me) are familiar with deb-based distros and 2) it is installable. Debian isn’t (keyboard freezes during install), Manjaro/Arch isn’t (fails to recognize partitions after install), no matter what I do - but I haven’t tried RedHat/Fedora etc. However, for cluster purposes there is PXE available. These computers are divas.

Regarding ethernet: Thanks for the hint! I would have to talk to the university IT administration to get the switches upgraded in case I don’t go for the daisy chain. Currently there are just 1Gb/s Intel cards, but you’re right: at least used cards are rather cheap nowadays.

Head node: The computer I have in mind as head node is a Dell Precision 7920 tower with a 12-core Xeon and 96GB RAM, but no hardware RAID as far as I know. I think one could cram in some more hard drives and go for software RAID, but there are also 1TB SSDs available which may be used for prefetching during idle times or so. The storage server is just a Synology with 32 disks or so; I don’t know its capabilities. Also, what is best: taking the fastest computer as head node or keeping it among the workers?

Thanks again!

johnh · November 23, 2020, 3:40pm

Requirements like yours, with interactive use of the machines during the day, have traditionally been met by installing a Condor pool.
In answer to your question - yes just install the Slurm queue system and you are good to go.
Remember though that you need consistent UIDs across all the compute nodes and the head node, and that Slurm uses something called munge for authentication between nodes. But it is all easily set up.
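As an illustration of the UID requirement, here is a small sketch that compares `/etc/passwd`-style text collected from each machine and reports accounts whose UID differs or is missing somewhere. The node names, user names and UID values are made up, and how you collect the files (ssh, a shared directory, ...) is left open.

```python
def parse_passwd(text):
    """Return {user name: UID} from /etc/passwd-style text."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")  # name:password:UID:GID:GECOS:home:shell
        if len(fields) < 3:
            continue
        try:
            uids[fields[0]] = int(fields[2])
        except ValueError:
            continue  # skip malformed entries
    return uids


def uid_mismatches(passwd_by_node, users):
    """Return {user: {node: UID or None}} for users that are inconsistent.

    A user is reported if the UID differs between nodes or if the
    account is missing (None) on at least one node.
    """
    tables = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    problems = {}
    for user in users:
        seen = {node: table.get(user) for node, table in tables.items()}
        if len(set(seen.values())) > 1 or None in seen.values():
            problems[user] = seen
    return problems


# Hypothetical data: "alice" was created by hand on each machine.
example = {
    "head": "alice:x:1001:1001::/home/alice:/bin/bash\n",
    "node1": "alice:x:1002:1002::/home/alice:/bin/bash\n",
}
print(uid_mismatches(example, ["alice"]))
```

A cluster manager, or a central directory service, avoids this problem by creating the accounts in one place; the check is mainly useful when the nodes were installed by hand.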

The requirement for a dual boot Windows/Linux system is a bit more tricky. I know this is possible with Bright but I have never done it.
What you CAN do with Bright and OpenHPC is a stateless installation - which means the compute node boots via PXE and runs off a RAM disk or an NFS-mounted root drive.
This may be a good option for you.

This article describes dual booting with Bright. The suggestion is a second hard drive. With a bit of work you could shrink the Windows partition I think and use the free part. Or stateless boot as I say.

maxfreu · September 7, 2021, 2:59pm

The time has come: Today I tried to install Bright’s Easy8 and Qlustar on a Dell Precision 3640, and both fail in different ways.

Prerequisites:

Secure boot is off
Legacy boot mode is not available in bios
USB3 key with iso written via dd (or rufus, failed as well)

Failure modes:
Easy8 (ubuntu 20 “dell version” downloaded): After selecting GUI or text installation from GRUB, the screen goes black and stays like that. I tried reinserting the USB stick but it doesn’t help.

Qlustar 12: Before the install script appears I have to reinsert the USB stick, because pre-install hangs on USB device discovery (it is 2021, we have been to the moon). Then I provide all the settings (centos8 or ubuntu20, with or without packages selected), check them on the last page, hit enter and BANG! big wall of running text. I took a photo and could decipher: “stop_fail_installation task_error task_savelog die” monolithically filling the screen. Glorious.

Do you have any advice on what to try next?

johnh · September 8, 2021, 12:42pm

Hmm. Let us slow down here please.
First of all, I can help with Bright Easy 8. Which OS did you choose with the download?
Can we then check which CPUs are in the Dell 3640?

maxfreu · September 8, 2021, 1:06pm

I would prefer Bright as well, as the documentation is best for a noob like me. I chose Bright 9.1 with Ubuntu 20.04. Bright lets you choose the hardware vendor; I selected “dell EMC” and also tried out “generic”. The CPU is an Intel i7-10700.

johnh · September 8, 2021, 2:57pm

If I had to make a call here I would say the black screen is an incompatibility between the Ubuntu kernel and the CPU. Which would be strange, as that CPU is 1 year old.

Ubuntu has newer LTS enablement kernels for older releases - but again 20.04 is up to date
https://wiki.ubuntu.com/Kernel/LTSEnablementStack

johnh · September 8, 2021, 3:00pm

Dell only say up to 18.04.3 LTS is supported
https://www.dell.com/support/home/en-uk/drivers/supportedos/precision-3640-workstation

Ah - 20.04 is supported says Ubuntu

johnh · September 8, 2021, 3:08pm

How do you set the 3640 to boot from USB?
I think you press F12 during the boot sequence.
The architecture for a Bright cluster is that you have one ‘head node’. This is the one you install from USB.
All the other compute nodes PXE boot from the head node and no USB is used.

maxfreu · September 8, 2021, 3:09pm

Thanks for your replies! The computer came pre-installed with Ubuntu 20.04, so I thought it was a safe choice. I just downloaded the Easy8 CentOS 8.4 version. This option takes me a little further: it loads a kernel and then stops working, telling me the installation medium cannot be found (but it’s right there, plugged in…). From there I can switch to a console, which would probably allow me to bootstrap myself… As a last resort I just started to burn a good old DVD…

maxfreu · September 8, 2021, 3:11pm

Correct, on the to-be head node (the 3640) I press F12 and select the boot device from there.

johnh · September 8, 2021, 4:02pm

One thing to ask - do you have a fancy graphics card?
I see these have DisplayPort connectors by default - do we have a graphics driver incompatibility?
My advice - download a live USB with Ubuntu and boot from that. The aim is just to check out the hardware, not to install any HPC stuff.

johnh · September 8, 2021, 4:08pm

Sorry - I re-read your message above. You are trying CentOS 8.4.
You should be aware that there has been somewhat of a … shall we say fuss… over CentOS which is now CentOS Stream which is not an exact clone of Redhat. Lots of folks are waiting to transition to Rocky Linux.

maxfreu · September 17, 2021, 9:54am

Hi! Sorry, it seems that what I wrote was a bit confusing. To clarify and to give you an update: I tried to install bright 9 ubuntu 18, bright 9.1 ubuntu 20 and bright 9.1 centos 8.4 via USB. The only thing that worked in the end is bright 9.1 centos - and to get it working I had to install from a DVD, not USB.

I read your last post regarding CentOS as a mild warning. So I made a last, and futile, attempt to install Ubuntu: I took the SSD from the head node to one of the nodes which has legacy boot, installed in legacy mode with MBR and took the SSD back to the head node. Using a live CD I then changed the partition table to GPT, reinstalled GRUB and tadaa - it boots into Ubuntu. But unfortunately the system is unusable: it’s sluggish for some reason, the CMDaemon doesn’t work properly, Bright View doesn’t work… so I went back to CentOS. Seems like it’s cursed, maybe the Ubuntu installer is bugged?
