Julia code becomes slower when running on supercomputers and does not scale well when parallelizing with Base.Threads

There is the following issue: when the runtime decides to run GC, it needs to wait until all threads have reached a GC safepoint. The compiler makes no attempt to bound the amount of compute between safepoints (as opposed to, e.g., the JVM).

Hence you can run into the situation where you have 80 threads, one of which is in a long-running tight loop without safepoints, and the other 79 wait for it to finish the loop before they can finally GC.
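A minimal sketch of that failure mode (the loop body is made up): an allocation-free inner loop may contain no GC safepoints, so every other thread that wants to collect has to wait for it. GC.safepoint() inserts an explicit safepoint to bound that wait.

function tight_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i)        # no allocation here, so possibly no implicit safepoint
        # GC.safepoint()   # uncommenting this lets GC stop the thread periodically
    end
    return s
end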

It used to be that time spent waiting was not counted towards the GC percentage.

I am not 100% sure that this is still correct.

TL;DR: even if you see 0.1% GC time reported on an 80-thread machine, that tiny amount of CPU time may effectively account for 98.75% of wall-clock time (79 of 80 threads idle while the last one finishes its loop).

You will see this in top or htop as ~100% CPU utilization (instead of the ideal 8000%). You can also see it by running time julia ... and looking at the times reported by the kernel.

Can you @Uranium238 post the output of time julia ...? That should show us whether the non-ideal scaling is due to threads waiting on locks.

PS. In an output of time like the following

real	0m0.107s
user	0m0.061s
sys	0m0.056s

you can compare real (wall-clock time) to user (CPU-seconds) to gauge CPU utilization (ideal multithreading means user == real * ncpus). If sys (CPU-seconds spent in the kernel that are attributed to you) is large, then something is very broken and you need to think hard about what went wrong; you will get this, e.g., if you write single characters to unbuffered stdout. You would start debugging that via strace.
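For example (hypothetical numbers): a 10-thread run that reports real = 60 s and user = 480 s has a utilization of 480 / (60 * 10) = 80%, i.e. on average only 8 of the 10 threads were doing work.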


Slightly off topic - it is always interesting to run the ‘lstopo’ utility to get a graphical view of the internal layout of your servers:
Portable Hardware Locality (hwloc) (open-mpi.org)

Also agree with @foobar_lv2 - run top or htop

FWIW, this is what I get here:

Original version:

leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6.jl 
Number of threads = 10
The rate constant is 2.3048765809185653e9
343.218980 seconds (101.98 M allocations: 3.746 TiB, 1.95% gc time, 13.97% compilation time: <1% of which was recompilation)

real    5m45,960s
user    52m58,936s
sys     0m8,036s

Memory-efficient version:

leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6_2.jl 
Number of threads = 10
311.750008 seconds (37.96 M allocations: 20.854 GiB, 0.17% gc time, 15.86% compilation time)
The rate constant is 2.304876582130397e9
312.690927 seconds (40.85 M allocations: 21.053 GiB, 0.18% gc time, 16.10% compilation time: <1% of which was recompilation)

real    5m15,405s
user    46m39,963s
sys     0m2,472s

(using dd = 1000)

I’m somewhat intrigued by the compilation time, which seems to increase (in absolute terms) with system size.


I tried the exact same code on the supercomputer in question. It took 41.769s !

Did you try both versions, with and without using MKL ?

On my desktop PC (Ryzen 7950X) I get:

ufechner@ufryzen:~/repos/bench$ julia --project bench.jl 
  3.475 s (6 allocations: 767.90 MiB)

with MKL.

My reported value was without using MKL; I shall report the value with MKL as soon as I am able to get access to the cluster.


I tried the exact same code on the supercomputer in question. It took 41.769s !

I would honestly say to start by running ‘top’ or ‘htop’ when this application is running.
Look for any other applications running on the compute node at the same time.
Look for processes hopping around between different CPU cores.

Are you running into ‘time to first X’ here? Maybe you have a very slow storage system?

Cannot be, my benchmark uses the @btime macro:

# using MKL            # uncomment to switch the BLAS/LAPACK backend to MKL
using BenchmarkTools

n = 10000
m = rand(n, n)          # dense 10000×10000 matrix
@btime inv($m)          # interpolate m so setup is not counted in the timing

@ufechner7 my question was directed to @Uranium238
Apologies

The Fortran code (as you might have seen from the git project I linked in this thread) uses MPI (not OpenMP). As I stated before, I completely realize that it is not right for me to compare threaded (Base.Threads) performance with the Fortran code using MPI; I should have used MPI.jl.

As has been stated in this thread, Base.Threads has the lowest overhead of all the parallelization techniques that can be employed in Julia, so switching to MPI.jl should definitely not improve the performance. And if the threaded Julia code is slower than the MPI Fortran code, I do not feel that switching to MPI.jl would serve my primary purpose, which is to make the Julia code run faster than the Fortran code while using the same number of processors.

We do not have enough information to really know what is going on on the cluster, but your code's bottleneck is, by far, the matrix multiplication. That suggests communication among processes is not really what is making your code slower, so I would not bet on MPI vs. Threads differences here. The Julia code might become slower because the original code allocates a lot of memory, which has to be constantly garbage-collected; hence the need to check whether the more memory-friendly version solves the scalability issue.

A 10% difference when you go from one machine to another (with the same number of cores?) is not unusual and may not be too important.

Even on shared-memory hardware, if you aren’t careful about memory allocation & access, then using threads with shared memory can hurt you because you have more cache contention (compared to MPI with completely separate address spaces for each process). And this effect varies depending on the machine and its memory architecture, so it could easily change dramatically between your desktop and your server.

You are also allocating zillions of temporary arrays in your inner loop rather than re-using buffers and working in-place, which hurts performance (both serial and parallel). Note also that slices in Julia, like MJSJ[mp,:], allocate copies unless you mark your code with @views (which can be applied to the whole function). This is all covered in the Julia performance tips.
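A minimal, hypothetical sketch of what I mean (the names MJSJ and mp are taken from your snippet, the rest is made up): take views instead of allocating slice copies, and reuse a preallocated output buffer instead of creating temporaries in the loop.

using LinearAlgebra

function rowdots!(out, MJSJ, v)                 # hypothetical helper
    @views for mp in axes(MJSJ, 1)
        out[mp] = dot(MJSJ[mp, :], v)           # MJSJ[mp, :] is a view, not a copy
    end
    return out
end

MJSJ = rand(100, 50); v = rand(50)
out = Vector{Float64}(undef, size(MJSJ, 1))     # allocate once, outside the hot loop
rowdots!(out, MJSJ, v)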

(All of your type declarations accomplish exactly nothing for performance, by the way, but missing const on globals like T and t may hurt you. Also, there is no point in using plan_fft if you are going to use the plan only once, though this does not look performance-critical for you anyway.)
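To illustrate the const point, with made-up values for those globals:

T = 300.0              # non-const global: its type can change, so reads are type-unstable
const t = 1.0e-15      # const global: the compiler knows this is always a Float64

scaled(x) = x * t      # type-stable, fast
slow_scaled(x) = x * T # works, but the type of T is unknown at compile time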


In my “supercomputer” (a 12-year-old Xeon cluster), I can test the scalability on a machine with 16 real cores. I get:

Original version (with dd = 100):

With 8 cores:

leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6.jl
Number of threads = 8
The rate constant is 2.175134435831123e9
129.216527 seconds (18.28 M allocations: 384.064 GiB, 3.39% gc time, 71.59% compilation time)

With 16 cores:

leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6.jl 
Number of threads = 16
The rate constant is 2.175134435831123e9
 87.336051 seconds (18.43 M allocations: 384.077 GiB, 3.95% gc time, 210.67% compilation time)

An improvement of 1.4x.

The memory-optimized version (with dd = 100):

leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6_2.jl
Number of threads = 8
119.344040 seconds (12.83 M allocations: 2.667 GiB, 0.52% gc time, 90.66% compilation time)
The rate constant is 2.1751344364095407e9

leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6_2.jl 
Number of threads = 16
 69.198384 seconds (12.96 M allocations: 2.679 GiB, 0.83% gc time, 303.68% compilation time)
The rate constant is 2.1751344364095407e9

An improvement of 1.72x for 2x the number of cores.

With a larger system that scaling is probably better. Those compilation times do not make sense (but that’s a separate issue I’m trying to report).


I’ve seen similar things with threads. I think the (absolute) compilation time is essentially multiplied by the number of threads (perhaps each thread compiles its own version?). This is then compared to the runtime of the function (which is not multiplied by the thread count). Not sure about this, though! I think I experimented at some point and for that example the math worked out, but I can’t recreate it now, unfortunately.


Could you please elaborate on why that is the case? As far as I know, it is always better to declare the type of every variable that is going to be passed to a function.


No. Julia specializes the code based on the types you call the function with. The annotations on functions only control dispatch.

Type declarations can be helpful, but they do not actually help performance directly in Julia.

Indirectly, they can help ensure your code is type-stable, but the same type-stable code without typed arguments or type assertions would execute just the same. The main effect of adding types to arguments is restricting the set of allowed arguments and controlling multiple dispatch.

Julia uses a late-binding call model, so functions are specialized based on the arguments they are called with, not the arguments they are declared with.
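A minimal sketch of that point: both methods below get specialized for Vector{Float64} at the first call, so the annotation only restricts which arguments are accepted, and the timings should be essentially identical.

using BenchmarkTools

sumsq_untyped(v) = sum(abs2, v)
sumsq_typed(v::Vector{Float64}) = sum(abs2, v)

v = rand(10^6)
@btime sumsq_untyped($v)
@btime sumsq_typed($v)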


An increase in execution time on an HPC multisocket compute node can be expected for memory-bound code. This is due to the shared-memory model used on multisocket systems, Non-Uniform Memory Access (NUMA).

I have found that the issue is particularly pronounced on Intel Skylake and Cascade Lake Xeons. If you have access to the hardware, you might find that a single Xeon system is actually faster than multiple Xeons. In my experience, I got a speedup by removing one Xeon from a two Xeon system.

From a pure software perspective, you can reduce this effect by reducing memory allocations.
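One additional mitigation, sketched here under the assumption that the ThreadPinning.jl package is available (it is not mentioned above): pinning Julia's threads so they stop migrating between sockets and away from the memory they allocated.

using ThreadPinning

pinthreads(:numa)   # distribute threads across NUMA domains and pin them to cores
threadinfo()        # print which core each Julia thread ended up on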

Here are a few reports about poor memory performance with dual-socket Xeons:


I am trying to run the same two codes on AMD systems to get an idea of how much of the problem is caused by the Xeon chipset.

You are getting ahead of your skis here.
Take my advice please - ask your system admins for help in running the ‘lstopo’ utility to produce graphical images of the internal layout of the Xeon and AMD systems.
Dual socket Xeon and AMD both have NUMA setups.
The exact layout in NUMA terms can be altered by BIOS settings.
PowerEdge: NUMA Nodes per Socket | CPU Best Practices | Dell Technologies Info Hub

Yes, the NPS setting does have an effect on performance (I cannot remember if Skylake has an NPS setting).

I would be very surprised if the effects you are seeing are down to the difference in AMD and Xeon chipsets.

Please consider the effects of hyperthreading - which you have enabled.
You really only have 40 physical cores, not the 80 which the system presents.
I can't say how hyperthreading behaves with your code.
Also, please consider and tell us the BLAS libraries being used on each of your systems. From the discussion above, BLAS may be dominating things here.
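A quick way to check that from the Julia side (using the LinearAlgebra stdlib; available in recent Julia versions):

using LinearAlgebra

BLAS.get_config()        # which BLAS backend is loaded (OpenBLAS, MKL, ...)
BLAS.get_num_threads()   # how many threads BLAS itself uses
Sys.CPU_THREADS          # hardware threads the OS reports (80 here, with 40 physical cores)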