Scaling for multi-threading

A question for @luraess and @samo .

In the mini-course you gave here, you presented a lot of different codes to solve the diffusion equation in 3D, which was a lot of fun. I know that you have shown that multi-GPU can be efficient on thousands of cores when using MPI. Do you know how high an efficiency we can get with multi-threading?

We have done some preliminary tests and, even with large arrays (512^3), we seem to get low efficiencies and saturation at 16 cores. I am tempted to try playing with the threaded code you discussed yesterday, but thought I would ask you first in case you have any experience you can share.

Thank you again.

A very stupid question from me…
With the threaded code on 16 CPU cores, have you checked whether hyperthreading is disabled on that system?
It is also worth logging into the system and running the `htop` utility to check that your threads are in fact running on separate cores. They should be, of course.
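One hedged way to do that check (the script name is a placeholder, and `JULIA_EXCLUSIVE` behaviour may vary across Julia versions):

```shell
# Launch Julia with 16 threads; JULIA_EXCLUSIVE=1 asks the runtime to pin
# threads to separate cores rather than letting the OS migrate them.
JULIA_EXCLUSIVE=1 julia -t 16 my_diffusion_script.jl

# While it runs, inspect core utilisation from another terminal:
htop
# or, without htop, list which core (psr column) each thread is on:
ps -eLo pid,psr,comm | grep julia
```

If two threads keep landing on the same physical core (or on hyperthread siblings), the memory-bound kernel will look like it stops scaling well before 16 cores.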


Can you post a MWE of such tests?

@johnh I didn’t actually do the calculations but I will do them myself and check this out.

@leandromartinez98 : Happy to post the code. You can find it at the following repo.

Hi @francispoulin, thanks for reaching out!

Do you know how high of efficiency we can get with multi-threading?

It depends on how you define efficiency. In the JuliaCon2021 workshop you refer to, we defined efficiency via the effective memory throughput metric T_eff. Looking at memory-bound configurations (mostly the case when solving PDEs on modern, many-core hardware): 1) GPUs have a much larger peak memory throughput T_peak compared to CPUs, and 2) computations utilising over 90% of the peak memory bandwidth are achievable on the GPU without restrictive and non-portable optimisations.
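As a minimal sketch of what T_eff measures (the timing value below is a made-up example, not a measurement): A_eff counts only the memory that *must* move per iteration, here one read and one write of the solution array.

```julia
# Effective memory throughput: T_eff = A_eff / t_it.
# For a simple diffusion update reading H and writing H2 once per iteration:
nx = ny = nz = 512
A_eff = 2 * nx * ny * nz * sizeof(Float64) / 1e9  # GB moved per iteration
t_it  = 0.030                                     # seconds per iteration (example value)
T_eff = A_eff / t_it                              # GB/s
```

Comparing this T_eff against the hardware's T_peak (from vendor specs or a streaming benchmark) tells you how far the implementation is from the memory-bandwidth bound.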

On the CPU, achieving T_eff rates close to T_peak seems more challenging. Besides Base.Threads.@threads, I am aware of LoopVectorization.jl and KernelAbstractions.jl (amongst certainly others) to expose multi-threading support. Multi-threaded (and more, such as AVX-vectorised) CPU applications show implementation- and problem-specific variability in efficiency (speaking in terms of T_eff) that I have not further investigated and understood yet. Getting T_eff close to T_peak on the CPU seems more challenging than on the GPU.
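For reference, a minimal sketch of how Base.Threads.@threads can be applied to the local problem; names (`H`, `H2`, `diffusion_step!`) are illustrative, not taken from the workshop codes:

```julia
using Base.Threads

# One explicit-Euler step of 3D linear diffusion, threaded over the outer
# (z) dimension; boundary cells are left untouched.
function diffusion_step!(H2, H, D, dt, dx)
    nx, ny, nz = size(H)
    @threads for iz in 2:nz-1
        for iy in 2:ny-1, ix in 2:nx-1
            @inbounds H2[ix, iy, iz] = H[ix, iy, iz] + dt * D / dx^2 * (
                H[ix-1, iy, iz] + H[ix+1, iy, iz] +
                H[ix, iy-1, iz] + H[ix, iy+1, iz] +
                H[ix, iy, iz-1] + H[ix, iy, iz+1] - 6H[ix, iy, iz])
        end
    end
    return H2
end
```

Run with e.g. `julia -t 16` so `Threads.nthreads()` matches the core count; a kernel like this is memory-bound, so its T_eff will typically saturate well below a 16x speedup.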

So, given the higher T_peak of GPUs compared to CPUs and the fact that getting T_eff values close to T_peak is possible there, we have for now prioritised the GPU backend in ParallelStencil and in the workshop.

Workshop recall: the presented 2D “stencil-compiler” implementations solving nonlinear diffusion with ParallelStencil.jl deliver, on the Base.Threads.@threads backend, about 9 GB/s using 4 threads (one thread per core) on a 4-core Intel i5 (peak memory bandwidth 25 GB/s, thus ~36% of T_peak), and 780 GB/s on an Nvidia Tesla V100 16GB PCIe GPU (peak memory throughput 840 GB/s, thus ~92% of T_peak).


There are a lot of ways of computing efficiency, and sorry that I wasn’t more specific.

The efficiency that I had in mind is what you have shown with ImplicitGlobalGrid: when you run it on p cores, you find that it is close to p times faster. With MPI we have found this works pretty well. With threading, this seems to be a lot harder. I was just curious whether you might have looked at this kind of efficiency, or speedup, using your diffusion codes, or other codes for that matter.

Thanks for the precision.

The efficiency that I had in mind is what you have shown with ImplicitGlobalGrid in that when you run it on p cores […]

The first figure from the ImplicitGlobalGrid README shows a weak scaling to assess parallel efficiency (i.e. how much the execution time deviates from a reference time on a single GPU, without MPI, while growing the number of GPUs used (and thus MPI processes) proportionally to the number of grid points). The assumption there is that the local problem (in that case, what executes on a single GPU) executes optimally (in terms of T_eff).
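Spelled out, the quantity plotted there is E(p) = t(1) / t(p): the single-process reference time over the time on p processes with the problem size grown proportionally. A small sketch with made-up timings (not measurements from the README):

```julia
# Weak-scaling parallel efficiency: E(p) = t_ref / t(p).
t_ref = 1.00                       # seconds on 1 GPU, no MPI (reference)
t_p   = [1.00, 1.02, 1.05, 1.08]   # seconds on growing process counts
E     = t_ref ./ t_p               # ideal scaling gives E == 1.0 throughout
```

Any drop of E below 1 comes from communication and synchronisation overhead, which is exactly what hiding communication behind computation tries to remove.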

I was just curious whether you might have looked at this kind of efficiency, or speedup, using your diffusion codes

With CPUs, more options may be available: one could either solve the local problem in a single-threaded fashion on one CPU core and scale purely with MPI, or try to make use of multi-threading for solving the local problem before scaling with MPI. Finding the optimal local problem size for single-/multi-threaded CPU execution is not something we (@samo and myself) extensively investigated, as it goes back to

Getting T_eff close to T_peak on the CPU seems more challenging than on the GPU

from my previous reply and that T_peak is more than one order of magnitude larger on GPUs than CPUs.

Also, note that the @hide_communication feature, needed to get close to ideal weak scaling, is currently a pure GPU capability combining ParallelStencil.jl and ImplicitGlobalGrid.jl.

I think it is important to clearly separate distributed parallelization, i.e. multi-GPU/CPU scaling, from shared memory parallelization, i.e. optimal usage of a single GPU or CPU (or of all the CPUs on a node, if there are multiple and you want to program NUMA-aware). Note that with “CPU” I always mean a physical CPU (having multiple cores) and never a logical CPU (which can be a core or a hyperthread).
On a single CPU you cannot in all cases expect a (close-to) linear scaling with the number of cores, as the cores normally share performance-essential resources such as memory bandwidth and higher-level caches. So the per-core scaling curve does not seem of much interest to me. What counts is the efficiency of usage of the CPU resources as a whole. Our effective throughput metric (T_eff) gives an idea of it when compared with the peak throughput (T_peak) of the CPU.
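A toy model makes this concrete (both bandwidth numbers below are assumed, purely illustrative): aggregate throughput grows linearly only until the cores collectively hit the socket's memory bandwidth, after which adding cores buys nothing for a memory-bound kernel.

```julia
# Toy model of memory-bandwidth saturation on a multi-core CPU.
T_core = 8.0    # GB/s a single core can draw on its own (assumed)
T_peak = 25.0   # GB/s peak memory bandwidth of the socket (assumed)
T_eff(n) = min(n * T_core, T_peak)          # aggregate throughput on n cores
speedup  = [T_eff(n) / T_eff(1) for n in 1:8]
# Scaling is linear only until n * T_core reaches T_peak
# (here between 3 and 4 cores); beyond that the speedup curve is flat.
```

This is why a flat speedup curve past a few cores can still correspond to a perfectly healthy implementation: the CPU's memory resources as a whole are already fully used.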


Interesting - there was a topic on the Beowulf list recently where someone said that AVX might not give as good gains as you would think, due to the downclocking that occurs when the AVX units are active.
Also from the same topic: there are many varieties of AVX - really too many.

Also interesting comments on memory bandwidth of CPUs versus GPUs.
Fugaku, currently the fastest supercomputer, uses ARM CPUs with high-bandwidth memory.
It would be great to see how things run there, but I guess most of us will never get the chance.

Thanks @samo for your comments and sorry if I was using the wrong language.

I know that ImplicitGlobalGrid scales well on many cores, whether it be CPU or GPU, thanks to distributed (MPI) parallelism.

The question that I think I wanted to ask is: are there any scaling results that you know of for solving the diffusion equation (or other such problems) using multi-threading (shared memory parallelism) in Julia?