Julia code becomes slower when running on supercomputers and does not scale well when parallelizing with Base.Threads

There is the following issue: when the runtime decides to run GC, it needs to wait until all threads have reached a GC safepoint. The compiler makes no attempt to bound the amount of compute between safepoints (as opposed to, e.g., the JVM).

Hence you can run into the situation where you have 80 threads, one of which is in a long-running tight loop without safepoints, and the other 79 wait for it to finish the loop before they can finally GC.
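A minimal sketch of that failure mode (the loop body is made up): an allocation-free inner loop may contain no GC safepoints, so every other thread that wants to collect has to wait for it. GC.safepoint() inserts an explicit safepoint to bound that wait.

function tight_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i)        # no allocation here, so possibly no implicit safepoint
        # GC.safepoint()   # uncommenting this lets GC stop the thread periodically
    end
    return s
end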

It used to be that time spent waiting was not counted towards the GC percentage.

I am not 100% sure that this is still correct.

TL;DR: even if you see 0.1% GC time reported on an 80-thread machine, that tiny amount of CPU time may effectively account for 98.75% of wall-clock time (79 of 80 threads idle while the last one finishes its loop).

You will see this in top or htop as ~100% CPU utilization (instead of the ideal 8000%). You can also see it by running time julia ... and looking at the times reported by the kernel.

Can you @Uranium238 post the output of time julia ...? That should show us whether the non-ideal scaling is due to threads waiting on locks.

PS. In an output of time like the following

real	0m0.107s
user	0m0.061s
sys	0m0.056s

you can compare real (wall-clock time) to user (CPU-seconds) to gauge CPU utilization (ideal multithreading means user == real * ncpus). If sys (CPU-seconds spent in the kernel that are attributed to you) is large, then something is very broken and you need to think hard about what went wrong; you will get this, e.g., if you write single characters to unbuffered stdout. You would start debugging that via strace.
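For example (hypothetical numbers): a 10-thread run that reports real = 60 s and user = 480 s has a utilization of 480 / (60 * 10) = 80%, i.e. on average only 8 of the 10 threads were doing work.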


Slightly off topic - it is always interesting to run the ‘lstopo’ utility to get a graphical view of the internal layout of your servers:
Portable Hardware Locality (hwloc) (open-mpi.org)

Also agree with @foobar_lv2 - run top or htop

FWIW, this is what I get here:

Original version:

leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6.jl 
Number of threads = 10
The rate constant is 2.3048765809185653e9
343.218980 seconds (101.98 M allocations: 3.746 TiB, 1.95% gc time, 13.97% compilation time: <1% of which was recompilation)

real    5m45,960s
user    52m58,936s
sys     0m8,036s

Memory-efficient version:

leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6_2.jl 
Number of threads = 10
311.750008 seconds (37.96 M allocations: 20.854 GiB, 0.17% gc time, 15.86% compilation time)
The rate constant is 2.304876582130397e9
312.690927 seconds (40.85 M allocations: 21.053 GiB, 0.18% gc time, 16.10% compilation time: <1% of which was recompilation)

real    5m15,405s
user    46m39,963s
sys     0m2,472s

(using dd = 1000)

I’m somewhat intrigued by the compilation time, which seems to increase (in absolute terms) with system size.


I tried the exact same code on the supercomputer in question. It took 41.769s !

Did you try both versions, with and without using MKL ?

On my desktop PC (Ryzen 7950X) I get:

ufechner@ufryzen:~/repos/bench$ julia --project bench.jl 
  3.475 s (6 allocations: 767.90 MiB)

with MKL.

My reported value was without using MKL; I shall report the value with MKL as soon as I am able to get access to the cluster.


I tried the exact same code on the supercomputer in question. It took 41.769s !

I would honestly say to start by running ‘top’ or ‘htop’ when this application is running.
Look for any other applications running on the compute node at the same time.
Look for processes hopping around between different CPU cores.

Are you running into ‘time to first X’ here? Maybe you have a very slow storage system?

Cannot be, my benchmark uses the @btime macro:

# using MKL            # uncomment to switch the BLAS/LAPACK backend to MKL
using BenchmarkTools

n = 10000
m = rand(n, n)          # dense 10000×10000 matrix
@btime inv($m)          # interpolate m so setup is not counted in the timing

@ufechner7 my question was directed to @Uranium238
Apologies

The Fortran code (as you might have seen from the git project I linked in this thread) uses MPI (not OpenMP). As I stated before, I completely realize that it is not right for me to compare threaded (Base.Threads) performance with the Fortran code using MPI; I should have used MPI.jl.

As has been stated in this thread, Base.Threads has the lowest overhead of all the parallelization techniques that can be employed in Julia, so switching to MPI.jl should definitely not improve the performance. And if the threaded Julia code is slower than the MPI Fortran code, I do not feel that switching to MPI.jl would serve my primary purpose, which is to make the Julia code run faster than the Fortran code while using the same number of processors.

We do not have enough information to really know what is going on on the cluster, but your code's bottleneck is, by far, the matrix multiplication. That suggests communication among processes is not really what is making your code slower, so I would not bet on MPI vs. Threads differences here. The Julia code might become slower because the original code allocates a lot of memory, which has to be constantly garbage-collected; hence the need to check whether the more memory-friendly version solves the scalability issue.

A 10% difference when you go from one machine to another (with the same number of cores?) is not unusual and may not be too important.

Even on shared-memory hardware, if you aren’t careful about memory allocation & access, then using threads with shared memory can hurt you because you have more cache contention (compared to MPI with completely separate address spaces for each process). And this effect varies depending on the machine and its memory architecture, so it could easily change dramatically between your desktop and your server.

You are also allocating zillions of temporary arrays in your inner loop rather than re-using buffers and working in-place, which hurts performance (both serial and parallel). Note also that slices in Julia, like MJSJ[mp,:], allocate copies unless you mark your code with @views (which can be applied to the whole function). This is all covered in the Julia performance tips.
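A minimal, hypothetical sketch of what I mean (the names MJSJ and mp are taken from your snippet, the rest is made up): take views instead of allocating slice copies, and reuse a preallocated output buffer instead of creating temporaries in the loop.

using LinearAlgebra

function rowdots!(out, MJSJ, v)                 # hypothetical helper
    @views for mp in axes(MJSJ, 1)
        out[mp] = dot(MJSJ[mp, :], v)           # MJSJ[mp, :] is a view, not a copy
    end
    return out
end

MJSJ = rand(100, 50); v = rand(50)
out = Vector{Float64}(undef, size(MJSJ, 1))     # allocate once, outside the hot loop
rowdots!(out, MJSJ, v)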

(All of your type declarations accomplish exactly nothing for performance, by the way, but missing const on globals like T and t may hurt you. Also, there is no point in using plan_fft if you are going to use the plan only once, though this does not look performance-critical for you anyway.)
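To illustrate the const point, with made-up values for those globals:

T = 300.0              # non-const global: its type can change, so reads are type-unstable
const t = 1.0e-15      # const global: the compiler knows this is always a Float64

scaled(x) = x * t      # type-stable, fast
slow_scaled(x) = x * T # works, but the type of T is unknown at compile time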


In my “supercomputer” (a 12-year-old Xeon cluster), I can test the scalability on a machine with 16 real cores. I get:

Original version (with dd = 100):

With 8 cores:

leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6.jl
Number of threads = 8
The rate constant is 2.175134435831123e9
129.216527 seconds (18.28 M allocations: 384.064 GiB, 3.39% gc time, 71.59% compilation time)

With 16 cores:

leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6.jl 
Number of threads = 16
The rate constant is 2.175134435831123e9
 87.336051 seconds (18.43 M allocations: 384.077 GiB, 3.95% gc time, 210.67% compilation time)

An improvement of 1.4x.

The memory-optimized version (with dd = 100):

leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6_2.jl
Number of threads = 8
119.344040 seconds (12.83 M allocations: 2.667 GiB, 0.52% gc time, 90.66% compilation time)
The rate constant is 2.1751344364095407e9

leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6_2.jl 
Number of threads = 16
 69.198384 seconds (12.96 M allocations: 2.679 GiB, 0.83% gc time, 303.68% compilation time)
The rate constant is 2.1751344364095407e9

An improvement of 1.72x for 2x the number of cores.

With a larger system that scaling is probably better. Those compilation times do not make sense (but that’s a separate issue I’m trying to report).


I’ve seen similar things with threads. I think the (absolute) compilation time is essentially multiplied by the number of threads (perhaps each thread compiles its own version?). This is then compared to the runtime of the function (which is not multiplied by the thread count). Not sure about this, though! I think I experimented at some point and for that example the math worked out, but I can’t recreate it now, unfortunately.


Could you please elaborate on why that is the case? As far as I know, it is always better to declare the type of every variable that is going to be passed to a function.


No. Julia specializes the code based on the types you call the function with. The annotations on functions only control dispatch.

Type declarations can be helpful, but they do not actually help performance directly in Julia.

Indirectly, they can help ensure your code is type-stable, but the same type-stable code without typed arguments or type assertions would execute just the same. The main effect of adding types to arguments is restricting the set of allowed arguments and controlling multiple dispatch.

Julia uses a late-binding call model, so functions are specialized based on the arguments they are called with, not the arguments they are declared with.
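A minimal sketch of that point: both methods below get specialized for Vector{Float64} at the first call, so the annotation only restricts which arguments are accepted, and the timings should be essentially identical.

using BenchmarkTools

sumsq_untyped(v) = sum(abs2, v)
sumsq_typed(v::Vector{Float64}) = sum(abs2, v)

v = rand(10^6)
@btime sumsq_untyped($v)
@btime sumsq_typed($v)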


An increase in execution time on an HPC multisocket compute node can be expected for memory-bound code. This is due to the shared-memory model used on multisocket systems, Non-Uniform Memory Access (NUMA).

I have found that the issue is particularly pronounced on Intel Skylake and Cascade Lake Xeons. If you have access to the hardware, you might find that a single Xeon system is actually faster than multiple Xeons. In my experience, I got a speedup by removing one Xeon from a two Xeon system.

From a pure software perspective, you can reduce this effect by reducing memory allocations.
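One additional mitigation, sketched here under the assumption that the ThreadPinning.jl package is available (it is not mentioned above): pinning Julia's threads so they stop migrating between sockets and away from the memory they allocated.

using ThreadPinning

pinthreads(:numa)   # distribute threads across NUMA domains and pin them to cores
threadinfo()        # print which core each Julia thread ended up on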

Here are a few reports about poor memory performance with dual-socket Xeons:


I am trying to run the same two codes on AMD systems to get an idea of how much of the problem is caused by the Xeon chipset.

You are getting ahead of your skis here.
Take my advice please - ask your system admins for help in running the ‘lstopo’ utility to produce graphical images of the internal layout of the Xeon and AMD systems.
Dual socket Xeon and AMD both have NUMA setups.
The exact layout in NUMA terms can be altered by BIOS settings.
PowerEdge: NUMA Nodes per Socket | CPU Best Practices | Dell Technologies Info Hub

Yes, the NPS setting does have an effect on performance (I cannot remember if Skylake has an NPS setting).

I would be very surprised if the effects you are seeing are down to the difference in AMD and Xeon chipsets.

Please consider the effects of hyperthreading - which you have enabled.
You really only have 40 physical cores, not the 80 which the system presents.
I can't say how hyperthreading behaves with your code.
Also, please consider and tell us the BLAS libraries being used on each of your systems. From the discussion above, BLAS may be dominating things here.
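A quick way to check that from the Julia side (using the LinearAlgebra stdlib; available in recent Julia versions):

using LinearAlgebra

BLAS.get_config()        # which BLAS backend is loaded (OpenBLAS, MKL, ...)
BLAS.get_num_threads()   # how many threads BLAS itself uses
Sys.CPU_THREADS          # hardware threads the OS reports (80 here, with 40 physical cores)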