There is the following issue: when the runtime decides to run GC, it needs to wait until all threads have reached a GC safepoint. The compiler makes no attempt to bound the amount of compute between safepoints (as opposed to e.g. the JVM).
Hence you can run into the situation where you have 80 threads, one of which is in a long-running tight loop without safepoints, while the other 79 wait for that loop to finish before they can finally GC.
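For concreteness, a minimal sketch of such a loop (the loop body is hypothetical); since it never allocates, it contains no safepoints, and an explicit GC.safepoint() call is what bounds the stall:

```julia
# Hypothetical illustration: a non-allocating tight loop has no GC
# safepoints, so a collection requested by another thread must wait
# for the whole loop to finish.
function tight_loop(n)
    s = 0.0
    for i in 1:n
        s += sin(i)          # no allocation => no safepoint
        # GC.safepoint()     # uncommenting this bounds the stall
    end
    return s
end
```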
It used to be that time spent waiting was not counted towards the GC percentage.
I am not 100% sure that this is still correct.
TL;DR: even if 0.1% GC time is reported on an 80-thread machine, that tiny amount of CPU time may account for an effective 98.75% of wall-clock time (79 of 80 threads stalled waiting for the safepoint).
Running top or htop, you will see this as ~100% CPU utilization (instead of the ideal 8000%). You can also see it by running time julia ... and looking at the times reported by the kernel.
Can you @Uranium238 post the output of time julia ...? That should show us whether the non-ideal scaling is due to threads waiting on locks.
PS. In an output of time like the following:

```
real 0m0.107s
user 0m0.061s
sys  0m0.056s
```
you can compare real (wall-clock time) to user (CPU-seconds in user mode) to gauge CPU utilization (ideal multithreading means user == real * ncpus). If sys (CPU-seconds spent in the kernel on your behalf) is large, then something is very broken and you need to think hard about what went wrong. You will get this, e.g., if you write single characters to unbuffered stdout. You would start debugging that with strace.
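As a worked example of that arithmetic (the numbers are taken from the time output above; the core count is a made-up assumption):

```julia
# Utilization check from `time` output; ncpus is hypothetical.
real_s = 0.107                           # wall-clock seconds
user_s = 0.061                           # CPU-seconds in user mode
ncpus  = 8
utilization = user_s / (real_s * ncpus)  # 1.0 would be ideal scaling
```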
```
leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6.jl
Number of threads = 10
The rate constant is 2.3048765809185653e9
343.218980 seconds (101.98 M allocations: 3.746 TiB, 1.95% gc time, 13.97% compilation time: <1% of which was recompilation)

real 5m45,960s
user 52m58,936s
sys 0m8,036s
```
Memory-efficient version:
```
leandro@m3g:~/Downloads/uranium% time julia -t 10 --project pal-rv6_2.jl
Number of threads = 10
311.750008 seconds (37.96 M allocations: 20.854 GiB, 0.17% gc time, 15.86% compilation time)
The rate constant is 2.304876582130397e9
312.690927 seconds (40.85 M allocations: 21.053 GiB, 0.18% gc time, 16.10% compilation time: <1% of which was recompilation)

real 5m15,405s
user 46m39,963s
sys 0m2,472s
```
(using dd = 1000)
I’m somewhat intrigued by the compilation time, which seems to increase (in absolute terms) with system size.
I tried the exact same code on the supercomputer in question. It took 41.769 s!
I would honestly say to start by running ‘top’ or ‘htop’ when this application is running.
Look for any other applications running on the compute node at the same time.
Look for processes hopping around between different CPU cores.
Are you running into 'time to first X' here? Maybe you have a very slow storage system?
The Fortran code (as you might have seen from the git project I linked in this thread) uses MPI (not OpenMP). As I stated before, I completely realize that it is not right for me to compare threaded (Base.Threads) performance with the Fortran code using MPI; I should have used MPI.jl.
As has been stated in this thread, Base.Threads has the lowest overhead of all the parallelization techniques that can be employed in Julia, so switching to MPI.jl should definitely not improve performance. And if the threaded Julia code is slower than the MPI Fortran code, I do not feel that switching to MPI.jl would serve my primary purpose, which is to make the Julia code run faster than the Fortran code while using the same number of processors.
We do not have enough information to really know what is going on on the cluster, but your code's bottleneck is the matrix multiplication, by far. That suggests communication among processes is not really what might be making your code slower, so I would not bet on MPI vs. Threads differences here. The Julia code might become slower because the original code allocates a lot of memory, which has to be constantly garbage-collected. Hence the need to check whether the more memory-friendly version solves the scalability issue.
A 10% difference when you go from one machine to another (with the same number of cores?) is not unusual and may not be too important.
Even on shared-memory hardware, if you aren’t careful about memory allocation & access, then using threads with shared memory can hurt you because you have more cache contention (compared to MPI with completely separate address spaces for each process). And this effect varies depending on the machine and its memory architecture, so it could easily change dramatically between your desktop and your server.
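To illustrate the kind of cache contention meant here, a minimal false-sharing sketch (a toy example, not taken from the code under discussion): threads updating adjacent array slots fight over the same cache line, while padding gives each thread its own line.

```julia
using Base.Threads

# Accumulators packed together: neighboring threads invalidate each
# other's cache line on every write ("false sharing").
function sum_contended(n)
    acc = zeros(nthreads())
    @threads :static for i in 1:n    # :static keeps threadid() stable
        acc[threadid()] += 1.0
    end
    return sum(acc)
end

# Same logic, but each accumulator is padded to its own 64-byte line.
function sum_padded(n)
    pad = 8                          # 8 Float64 = 64 bytes
    acc = zeros(nthreads() * pad)
    @threads :static for i in 1:n
        acc[(threadid() - 1) * pad + 1] += 1.0
    end
    return sum(acc)
end
```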
You are also allocating zillions of temporary arrays in your inner loop rather than re-using buffers and working in-place, which hurts performance (both serial and parallel). Note also that slices in Julia, like MJSJ[mp,:], allocate copies unless you mark your code with @views (which can be applied to a whole function). This is all covered in the Julia performance tips.
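For example, a sketch of the two patterns (MJSJ here is just a stand-in matrix, not the actual code):

```julia
# Allocating version: every slice copies a row.
function rowsums_alloc(MJSJ)
    s = 0.0
    for mp in axes(MJSJ, 1)
        row = MJSJ[mp, :]   # allocates a fresh copy each iteration
        s += sum(row)
    end
    return s
end

# @views on the whole function turns every slice into a view.
@views function rowsums_noalloc(MJSJ)
    s = 0.0
    for mp in axes(MJSJ, 1)
        row = MJSJ[mp, :]   # a view: no copy, no allocation
        s += sum(row)
    end
    return s
end
```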
(All of your type declarations accomplish exactly nothing for performance, by the way. But missing const on globals like T and t may hurt you. Also, there is no point in using plan_fft if you are going to use the plan only once, though this does not look performance-critical for you anyway.)
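A minimal sketch of both points, with stand-in values for the globals and an arbitrary FFT size (this assumes FFTW.jl, which the code already uses for plan_fft):

```julia
T = 300.0           # non-const global: its type may change at any time,
                    # so every function reading it uses dynamic dispatch
const t = 1.0e-15   # const global: the compiler can rely on its type

f() = 2T            # infers as Any
g() = 2t            # infers as Float64

using FFTW
x = rand(ComplexF64, 4096)
p = plan_fft(x)     # planning only pays off when the plan is reused:
for _ in 1:100
    y = p * x       # repeated transforms of the same size amortize it
end
fft(x)              # for a one-off transform, plain fft is simpler
```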
In my “supercomputer” (a 12-year-old Xeon cluster), I can test the scalability on a machine with 16 real cores. I get:
Original version (with dd = 100):
With 8 cores:
```
leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6.jl
Number of threads = 8
The rate constant is 2.175134435831123e9
129.216527 seconds (18.28 M allocations: 384.064 GiB, 3.39% gc time, 71.59% compilation time)
```
With 16 cores:
```
leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6.jl
Number of threads = 16
The rate constant is 2.175134435831123e9
87.336051 seconds (18.43 M allocations: 384.077 GiB, 3.95% gc time, 210.67% compilation time)
```
An improvement of 1.48x.
The memory-optimized version (with dd = 100):
```
leandro@gn003:~/Downloads/uranium% julia -t8 --project pal-rv6_2.jl
Number of threads = 8
119.344040 seconds (12.83 M allocations: 2.667 GiB, 0.52% gc time, 90.66% compilation time)
The rate constant is 2.1751344364095407e9
leandro@gn003:~/Downloads/uranium% julia -t16 --project pal-rv6_2.jl
Number of threads = 16
69.198384 seconds (12.96 M allocations: 2.679 GiB, 0.83% gc time, 303.68% compilation time)
The rate constant is 2.1751344364095407e9
```
An improvement of 1.72x for 2x the number of cores.
With a larger system, the scaling is probably better. Those compilation times do not make sense (but that's a separate issue I'm trying to report).
I've seen similar things with threads. I think the (absolute) compilation time is essentially multiplied by the number of threads (perhaps each thread compiles its own version?), and is then compared to the runtime of the function (which is not multiplied by the thread count). I'm not sure about this, though! I think I experimented at some point and the math worked out for that example, but unfortunately I can't recreate it.
Could you please elaborate on why that is the case? As far as I know, it is always better to declare the type of every variable that is going to be passed to a function.
Type declarations can be helpful, but they do not actually help performance directly in Julia.
Indirectly, they can help ensure your code is type-stable, but the same type-stable code without typed arguments or type assertions would execute just the same. The main effect of adding types to arguments is restricting which arguments a method accepts, i.e., controlling multiple dispatch.
Julia uses a late-binding call model: functions are specialized on the types of the arguments they are called with, not the types they are declared with.
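A quick way to see this for yourself (a toy example):

```julia
using InteractiveUtils  # for @code_llvm outside the REPL

f_untyped(x) = 2x + 1
f_typed(x::Float64) = 2x + 1

# Both calls compile to the same specialized code for a Float64
# argument; the annotation only restricts which arguments dispatch
# to the method.
@code_llvm f_untyped(1.5)
@code_llvm f_typed(1.5)
```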
An increase in execution time on an HPC multisocket compute node can be expected for memory-bound code. This is due to the shared-memory architecture of multisocket systems: Non-Uniform Memory Access (NUMA).
I have found that the issue is particularly pronounced on Intel Skylake and Cascade Lake Xeons. If you have access to the hardware, you might find that a single Xeon system is actually faster than multiple Xeons. In my experience, I got a speedup by removing one Xeon from a two Xeon system.
From a pure software perspective, you can reduce this effect by reducing memory allocations.
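Another software-side mitigation, if the third-party ThreadPinning.jl package is available, is to pin threads so they do not migrate across sockets (a sketch, not a prescription):

```julia
using ThreadPinning   # assumes ThreadPinning.jl is installed

pinthreads(:cores)    # one Julia thread per physical core
# pinthreads(:numa)   # or spread threads across NUMA domains
threadinfo()          # show the resulting thread-to-core mapping
```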
Here are a few reports about poor memory performance with dual-socket Xeons:
You are getting out over your skis here.
Take my advice, please: ask your system admins for help in running the 'lstopo' utility to produce graphical images of the internal layout of the Xeon and AMD systems.
Dual-socket Xeon and AMD systems both have NUMA layouts.
The exact layout in NUMA terms can be altered by BIOS settings. PowerEdge: NUMA Nodes per Socket | CPU Best Practices | Dell Technologies Info Hub
Yes, the NPS setting does have an effect on performance (I cannot remember whether Skylake has an NPS setting).
I would be very surprised if the effects you are seeing come down to the difference between the AMD and Xeon chipsets.
Please also consider the effects of hyperthreading, which you have enabled.
You really only have 40 physical cores, not the 80 which the system presents.
I can't say how hyperthreading behaves with your code.
Also, please consider and tell us the BLAS libraries being used on each of your systems. From the above discussion, BLAS may be dominating things here.