Poor scaling results with `ImplicitGlobalGrid.jl`

I didn’t wait for the scaling experiment on Rome to finish. This is very odd; the scaling is just as bad on Rome:
[scaling plots]

Milan delivers better scaling, but it still doesn’t make sense in this situation either.

This shows that something is fundamentally wrong with how you are launching the jobs. I suppose you are not launching the number of threads that you intend to launch and/or the threads are not each pinned 1-to-1 to a core.

If I remember right, @carstenbauer has created some tools around thread affinity. @carstenbauer, would you mind chiming in?

You can check which cores your Julia threads (of each MPI rank) are running on with ThreadPinning.jl’s `getcpuids()`, and pin them with `pinthreads` if necessary.
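For example, a minimal check-and-pin sketch (assuming ThreadPinning.jl is installed; run it within each MPI rank):

```julia
using ThreadPinning

# Which cores are the Julia threads of this rank currently running on?
@show getcpuids()

# Pin the threads 1-to-1 to cores if necessary. Note: with several MPI ranks
# per node, make sure the ranks do not all pin to the same set of cores.
pinthreads(:cores)
@show getcpuids()
```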

However, I suspect that this is a (arguably subtle) NUMA / first-touch policy issue. AFAICS, for the CPU case, `@zeros(nx, ny, nz)` just evaluates to a regular `zeros(nx, ny, nz)`. That’s likely an issue because the first thread alone will initialise the array(s) with zeros and hence “first touch” all the elements. Consequently, the memory pages will get established in one NUMA domain - the one that the core running this first Julia thread belongs to - and accessing (parts of) the array(s) from other threads later in the parallel stencil computation will be slow because they’ll need to fetch the data from this single (potentially far-away) NUMA domain. The memory bandwidth of that single NUMA domain becomes the bottleneck (the other NUMA domains are not used at all).
Note that this issue is particularly pronounced on AMD clusters because they natively have, e.g., 4 NUMA domains per CPU, i.e. 8 NUMA domains per node (while a regular Intel CPU node has 2 NUMA domains, one per CPU).
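To make the problematic pattern concrete, here is a minimal sketch (the array size and dummy kernel are illustrative, not taken from the actual code):

```julia
using Base.Threads

nx = ny = nz = 256

# Serial first touch: the thread calling `zeros` initialises every element,
# so all memory pages of A end up in that thread's NUMA domain.
A = zeros(nx, ny, nz)

# Later parallel access: the other threads have to fetch their slices of A
# from that single NUMA domain, whose bandwidth becomes the bottleneck.
partial = zeros(nthreads())
@threads :static for iz in 1:nz
    @inbounds partial[threadid()] += sum(view(A, :, :, iz))
end
```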

How to fix it? Ideally, you should initialise (that is fill with zeros) the array in exactly the same manner as you will later access it in parallel. See STREAMBenchmark.jl for an example:

Initialisation: STREAMBenchmark.jl/benchmarks.jl at master · JuliaPerf/STREAMBenchmark.jl · GitHub
Parallel computation: STREAMBenchmark.jl/kernels_nthreads.jl at master · JuliaPerf/STREAMBenchmark.jl · GitHub

Note that if you replaced the parallel initialisation with simple `zeros` calls, the memory bandwidth obtained by STREAMBenchmark.jl would be dramatically lower (by a factor of 5 or so!).
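As a concrete illustration, here is a minimal sketch of such a parallel (first-touch) initialisation, assuming the later kernel parallelises over the outer dimension with static scheduling:

```julia
using Base.Threads

# Allocate without touching the memory, then fill with zeros using the same
# thread-to-index mapping as the parallel kernel that runs later.
function zeros_firsttouch(nx, ny, nz)
    A = Array{Float64}(undef, nx, ny, nz)
    @threads :static for iz in 1:nz
        @inbounds A[:, :, iz] .= 0.0   # each thread first-touches "its" pages
    end
    return A
end

A = zeros_firsttouch(256, 256, 256)
```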

Of course, in real applications the parallel kernel might be much more complicated, in which case it isn’t at all obvious which parallel initialisation scheme (and thus which NUMA mapping) is optimal. The good news is that in many cases any parallel initialisation is already dramatically better than serial initialisation. Example: julia-dgemm-noctua/mkl_job_128.out at master · carstenbauer/julia-dgemm-noctua · GitHub. Here, I perform a regular matrix multiplication (dgemm) on an AMD Milan system using Intel MKL and 128 threads (2 CPUs with 64 threads each).
Naive initialisation (i.e. serial): ~668 GFLOPs
Linear parallel initialisation: ~2831 GFLOPs
Note that a parallel matmul is certainly accessing the input matrices (and the output matrix) in a non-linear way.
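For illustration, a hypothetical sketch of such a linear parallel initialisation before a threaded BLAS dgemm (the sizes are arbitrary and not from the linked job):

```julia
using LinearAlgebra, Base.Threads

n = 4096
A, B, C = (Matrix{Float64}(undef, n, n) for _ in 1:3)

# "Linear" parallel first touch: each thread initialises a contiguous block of
# columns, so the memory pages get spread over all NUMA domains.
for M in (A, B, C)
    @threads :static for j in axes(M, 2)
        @inbounds M[:, j] .= 1.0
    end
end

# The threaded dgemm does not access the matrices linearly, but it still
# benefits from the pages being distributed across the NUMA domains.
mul!(C, A, B)
```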

Hope this helps.

(BTW, AMD gives one the option to tune the effective number of NUMA domains per socket, see the “NPS” setting here).


In any case, you should always run a process per CPU as you did. On Intel CPUs you are fine that way. For the AMD CPU case, we will need to add something along the lines of what Carsten noted. Until we have it implemented (I think we can add it to the next release of ParallelStencil), you can just initialize the arrays yourself in the CPU case (following Carsten’s notes).
An alternative for the AMD CPUs would be to run a process per NUMA domain instead, with the number of threads matching the number of cores available per NUMA domain. Of course, this needs careful pinning of the threads to the right cores.


Indeed, I should’ve mentioned this above as a workaround as well.

Taking the NUMA architecture of the CPU into account, I’ve run my experiments again, launching multiple MPI processes per CPU, and the scaling is perfectly linear and near-ideal now!! :smiley:
The default pinning policy was apparently sufficient.

Thank you very much @samo and @carstenbauer for your help!


Could you share your launch sequence, i.e. how you pinned your processes?

> how you pinned your processes?

I did not tune it myself; I let the resource manager take care of it. The default process distribution policy is block by node and block by socket.

The launching sequence is something along the lines of:

`mpirun -p <my cluster> -x -n <nb processes> -c <cores per process> julia -t <cores per process>`