Error when using "too many" workers

Most cloud servers will have either two or four CPUs (chips). Each CPU chip resides in a single “socket” and has some number of cores on the chip. The total number of cores is therefore the number of CPUs times the number of cores per CPU. It sounds like you have two CPUs (sockets) with 24 cores per CPU, for a total of 48 cores; this would be a common, realistic, albeit somewhat high-end configuration for a cloud server.

Most types of CPUs (and, in particular, Intel Xeon CPUs) support Hyperthreading. Hyperthreading allows each core to execute two threads (programs or different parts of the same program) at once, interleaved with each other. Hyperthreading usually speeds up performance, but there are some cases where it makes performance worse. On the system you describe, I suspect you have a total of 96 hyperthreads on the 48 cores (two per core), and this is where the number “CPU(s) 96” comes from in your post above. Sometimes hyperthreads get referred to as “cores”, although this is not technically accurate; Amazon EC2 refers to hyperthreads as “vCPUs”.
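To make the arithmetic concrete, here is a small sketch in Julia. The helper name `logical_cpus` is my own invention for illustration, and the 2 × 24 × 2 layout is only my guess at your hardware; `Sys.CPU_THREADS` is what Julia itself reports for the machine it runs on.

```julia
# Hypothetical helper (not part of Julia): logical CPUs =
# sockets * cores per socket * hardware threads per core.
logical_cpus(sockets, cores_per_socket, threads_per_core) =
    sockets * cores_per_socket * threads_per_core

# Assumed layout: 2 sockets, 24 cores each.
physical_cores = logical_cpus(2, 24, 1)  # without Hyperthreading
hyperthreads   = logical_cpus(2, 24, 2)  # with 2 hyperthreads per core

println("physical cores: ", physical_cores)
println("hyperthreads:   ", hyperthreads)

# The number of logical CPUs Julia detects on the current machine.
println("Sys.CPU_THREADS = ", Sys.CPU_THREADS)
```

If my guess about your server is right, the last line should print the same 96 that you saw reported as “CPU(s)”.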

So I suspect that what you really want to do is run “julia -p96”. Anything more may overload the cores with too many workers/threads: each core then has to switch between many tasks, and the overhead of doing those switches causes a large loss of efficiency.
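If you would rather not hard-code the number, one option is to cap the worker count at whatever Julia detects. This is only a sketch: `capped_worker_count` is a hypothetical helper I made up, and the request for 200 workers is just an example value. The `addprocs` call is left commented out so the snippet does not start any processes unless you want it to.

```julia
using Distributed

# Hypothetical helper: never ask for more workers than there are logical CPUs.
# `available` defaults to the logical CPU count Julia detects on this machine.
capped_worker_count(requested::Integer, available::Integer = Sys.CPU_THREADS) =
    min(requested, available)

n = capped_worker_count(200)  # 200 is an arbitrary example request
println("Would start ", n, " workers on ", Sys.CPU_THREADS, " logical CPUs")

# Uncomment to actually start the workers from within a running session:
# addprocs(n)
```

On the machine you describe, I would expect this to settle on 96 workers, the same as “julia -p96”.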

One exception to what I say above is if you are using multiple cloud servers (nodes) together with MPI and a machinefile: in this case, you’ll have more cores available in total, but the different cloud servers will have to communicate with each other across the network, which adds some overhead. You will need to run your Julia code differently if this is the case; let us know if it is the case and if you’d like details.

Another exception to what I say above is if the cloud server you are using has one or more Intel Xeon Phi accelerator cards in it (Knights Corner / Knights Landing). Having such a card would be very unusual, but it can provide more cores on a single machine than would otherwise be possible.

I hope this helps.
