Computer-specific slowdown with multi-threading on a computer cluster (Linux)?

I have a code that runs very fast on my laptop (~20 seconds). As a test, I put it onto a new cluster that I started using. The code took 30 minutes to run.

This code uses multi-threading in many places. If I turn the number of threads in Julia down to 1 (the laptop uses 4), then the code is fast again. Anything over 1 seems to cause this slowdown on the new cluster (running Linux with the x86_64 build of Julia).

Is there a special set of instructions that I should be following to make sure that Julia is fast on a computer cluster? Is there some additional module that should be loaded? I tried a few (GCC, BLAS, LLVM, etc.), but nothing made the code fast.

Any ideas?

Edit: I should also mention that I have used this code on another cluster and it was fine (same job submission script is used, so it’s not that).

Very strange to see a difference of this magnitude. However, if spawning new threads is done sequentially by the main process, and you are spawning a lot of threads for very little data, then the overhead of thread spawning and communication could well be the origin of the problem.
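
As a toy illustration of that effect (purely a made-up example, not your code): when there is very little work per iteration, the threaded loop can end up no faster, or even slower, than the serial one.

using Base.Threads

function serial!(x)
    for i in eachindex(x)
        x[i] = sin(x[i])
    end
    return x
end

function threaded!(x)
    @threads for i in eachindex(x)
        x[i] = sin(x[i])
    end
    return x
end

x = rand(1_000)
serial!(x); threaded!(x)   # first calls trigger compilation of both versions
@time serial!(x)           # very little work per iteration, so the serial loop is cheap
@time threaded!(x)         # every call pays task spawn/synchronization overhead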

Is there a large difference in the number of cores between the two clusters?

I don’t use threads in many places; there’s really only one core function that is much faster with them. However, if this were the cause, I would expect the code to be equally slow on both machines.

But, I agree. Very strange.

When you say cluster, I presume you don’t actually mean that it is running on multiple computers?
What are the platforms? Laptop versus the cluster computer.

That’s right, I was only reserving one node (with 2, 4, and 8 cores).

The laptop is a Mac with the latest operating system. The cluster machine runs a generic Linux distribution (RedCent perhaps). I’ve run the code on both operating systems before and never found an issue.

I keep thinking it must have something to do with how Julia was installed? Maybe it doesn’t see how to call threads for BLAS correctly?

Now, how are you starting the Julia process? Do you use the command-line parameters to set the number of threads?

Good question. I was initializing them directly in the script file. I set all three of JULIA_NUM_THREADS, MKL_NUM_THREADS, and BLAS_NUM_THREADS to the number of cores that I want (2, 4, 8, etc.). I’ll post the Slurm submission script here:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=test
#SBATCH --output=%x-%j.out
#SBATCH --mem=1G
#SBATCH --nodes=1
#SBATCH --ntasks=4

module load Julia/1.5.4-linux-x86_64

export JULIA_NUM_THREADS=4
export MKL_NUM_THREADS=4
export BLAS_NUM_THREADS=4

julia --check-bounds=no ./program.jl

This is completely consistent between the two computer clusters I’ve run this on.

PS The operating system appears to be CentOS, not Redcent…

Note that Julia and BLAS/MKL multithreading are currently separate and don’t cooperate efficiently. Typically you want to set the number of BLAS/MKL threads to 1 when using multiple Julia threads to avoid oversubscription and scheduling inefficiencies.
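
For example, inside the Julia script itself (instead of, or in addition to, the environment variables), something like:

using LinearAlgebra

@show Threads.nthreads()   # Julia threads, controlled by JULIA_NUM_THREADS

# Keep BLAS single-threaded so its thread pool doesn't compete with Julia's threads.
BLAS.set_num_threads(1)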

I don’t know whether this can explain the vastly different machine-dependent runtimes though, since you seem to be making this “mistake” consistently. In that regard, you haven’t answered the question by @Henrique_Becker: How do the two clusters (i.e. the resources that you actually request) and your laptop compare to each other in terms of number of CPUs? What kind of CPUs are we talking about?

BTW, a cluster node with 2-8 cores sounds strange to me. I’m used to something like 24 cores per CPU and typically 2 CPUs per node, i.e. 48 cores per node (96 if you count hyperthreading).

Thank you and everyone so much for the replies!

Oh my! That’s a great detail to know! I’m sad that I missed it. Setting only JULIA_NUM_THREADS to 4 and everything else to 1 doesn’t make much of a difference in this application, though. I just ran a test and this wasn’t it… but it is good to know.

@Henrique_Becker: How do the two clusters (i.e. the resources that you actually request) and your laptop compare to each other in terms of number of CPUs? What kind of CPUs are we talking about?

Sure, here are a few more details. The laptop has a regular Intel CPU (not ARM): an Intel i7 at 2.8 GHz with 16 GB of RAM. There are 4 physical cores (so 8 hyper-threads, but I only use 4).

The cluster nodes have Intel Xeon 6138 20-core 2.0 GHz processors with 192 GB of RAM. There are two processors on each node, so there are 40 physical cores in total on a single node.

BTW, a cluster node with 2-8 cores sounds strange to me. I’m used to something like 24 cores per CPU and typically 2 CPUs per node, i.e. 48 cores per node (96 if you count hyperthreading).

This is true, the cluster has 40 cores per node. I’m only reserving some of them.

Here’s one additional detail: I tried running the algorithm twice. The code runs slowly (about 10x slower) on the first run. Then it runs quickly on the second (although perhaps not quite as fast as it should). I could put in a simple dummy calculation as a first pass, but this seems a little unnecessary.

Do these two runs happen in the same process? Julia compiles any function the first time it is called. This is common to all Julia code (not just multithreaded code). So a script that touches a lot of code (including code inside libraries) but does little actual work will take many times longer on the first run than on subsequent runs. However, 20 s to 30 min is absurd, and the compilation time should not differ much depending on whether you run the code in parallel or not.

I do this for all scientific experiments in Julia. It is necessary to force Julia to compile all the code on a dummy instance first if you want to report timings that make sense in a journal.
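
A minimal sketch of that pattern (work here is just a stand-in for the real computation):

function work(n)
    A = rand(n, n)
    return sum(abs2, A * A)
end

work(10)           # small dummy call: compiles `work` and everything it calls
@time work(2_000)  # this timing now reflects runtime, not compilation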

Potentially related discussion here: @everywhere takes a very long time when using a cluster - #9 by pfarndt

The problem is likely tied to your cluster’s filesystem - if many processes are trying to access the same set of precompilation cache files at the same time, you may see a significant slowdown if the filesystem’s handling of parallel access is particularly poor (more so if the precompilation cache is hosted in your home directory, which might be accessible to the cluster over a relatively slow network link). @johnh is the resident expert in alleviating this sort of pain - he’ll probably recommend something like copying your .julia/ directory to the node before launching Julia to avoid network contention, or at least making sure .julia/ is located on a fast filesystem (e.g. Lustre).
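
A quick diagnostic along these lines (not a fix in itself): print where the depot, and therefore the precompilation cache, actually lives. If it sits in a network-mounted home directory, that would support this theory.

# The first entry of DEPOT_PATH is normally ~/.julia unless JULIA_DEPOT_PATH is set;
# precompiled package caches live under <depot>/compiled/.
foreach(println, DEPOT_PATH)
println(joinpath(first(DEPOT_PATH), "compiled"))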

Yes, I have some familiarity with this aspect of Julia. The code is very large and has many iterations in it. Compilation would apply to the first iteration, and that one is slow, but subsequent iterations are fast on the other computer. So I still think the problem is with the computer itself.

Aha… that would be a good candidate for a solution here. I had read the post you linked to, but I wasn’t using @everywhere, so I thought it must have been something else. In fact, this would make a lot of sense. The cluster is particularly new, so I’m very curious whether this is the issue. Very keen!

Is your suggestion to simply copy the .julia directory into the working directory? If so, could you provide a link on how to point Julia at that copy instead of the one in the home directory? I like this idea!

Take a look at this thread: Run a julia application at large scale (on thousands of nodes) - #5 by johnh

@stillyslalom …and is there any advice that could be given for the manager of a cluster? I think the person I was speaking with would be interested to know about this.

Ok, I will implement this and report back.

It may be even faster to build a custom sysimage for your application and copy just that to your node(s) instead of copying the entirety of .julia/ (which will contain a bunch of unrelated files, and will still require some additional runtime compilation).
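
A rough sketch of that route using PackageCompiler.jl; the package names and file paths below are placeholders, and the exact API depends on the PackageCompiler version:

using PackageCompiler

# Bake the application's packages and hot code paths into a custom system image.
create_sysimage(
    ["SomePackage", "AnotherPackage"];           # placeholders for program.jl's packages
    sysimage_path = "sys_program.so",
    precompile_execution_file = "warmup.jl",     # placeholder script exercising the hot code
)

The job script would then launch with julia --sysimage=sys_program.so ./program.jl after copying the image to the node.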

Just to avoid potential confusion: @everywhere and such are used in multi-processing, whereas in this thread we have so far only been talking about multi-threading (with just a single Julia process).

Another possible source of this problem: when working on servers that use a NUMA architecture (very common in HPC these days), I have had trouble using multithreading. I have usually solved this by passing special instructions to the scheduler to schedule the entire task on a single NUMA node. You might try that to see if it improves things.
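
I can’t give the exact scheduler flags for your Slurm setup, but as a complement, on recent Julia versions the ThreadPinning.jl package (assuming it is available on the cluster) can pin the Julia threads to fixed cores from inside the script, which at least stops the OS from migrating them across sockets/NUMA domains:

using ThreadPinning

# Pin the Julia threads to the first N cores ("compact" pinning) so the OS
# scheduler does not move them between sockets/NUMA domains mid-run.
pinthreads(:cores)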

Please forgive the long reply. Today was a busy day requiring my attention on other things.

I tried this suggestion (loading the necessary commands via the submission script), but it did not help the slowdown.

I should also amend one statement I made. There are two functions in the code. The first one is always slow (this is the “big” function). The other function, which has a few experimental optimizations, is faster, and that version of the code actually relies more heavily on the Threads.@threads macro. So the first, slow function must be more related to BLAS and MKL.

I’m not sure what is going on here now…but the code is still slow.