I didn’t wait for the scaling experiment on Rome to finish.
This is very odd; the scaling is just as bad on Rome.
Milan delivers better scaling, but it still doesn’t make sense in this situation.
This shows that something is fundamentally wrong with how you are launching the jobs. I suppose you are not launching the number of threads that you intend to launch, and/or the threads are not each pinned 1-to-1 to a core.
If I remember right, @carstenbauer has created some tools around thread affinity. @carstenbauer, would you mind chiming in?
You can check on which cores your Julia threads (of each MPI rank) are running with ThreadPinning.jl’s getcpuids() (and pin them with pinthreads if necessary).
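For example (a minimal sketch; :cores is just one of ThreadPinning.jl’s pinning strategies, others like :numa or :sockets exist):

```julia
using ThreadPinning

getcpuids()        # on which CPU-cores are the Julia threads of this process currently running?

pinthreads(:cores) # pin the threads explicitly, e.g. one Julia thread per physical core

getcpuids()        # verify the new placement
```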
However, I suspect that this is an (arguably subtle) NUMA / first-touch policy issue. AFAICS, for the CPU case, @zeros(nx, ny, nz) just evaluates to a regular zeros(nx, ny, nz). That’s likely an issue because the first thread alone will initialise the array(s) with zeros and hence “first touch” all the elements. Consequently, the memory pages will get established in one NUMA domain - the one that the core running this first Julia thread belongs to - and accessing (parts of) the array(s) from other threads later in the parallel stencil computation will be slow, because they’ll need to grab the data from this single (potentially far away) NUMA domain. The memory bandwidth of this single NUMA domain will become the bottleneck (the other NUMA domains are not used at all).
Note that this issue is particularly pronounced on AMD clusters because they natively have, e.g., 4 NUMA domains per CPU, i.e. 8 NUMA domains per node (while a regular Intel CPU node has 2 NUMA domains, one per CPU).
How to fix it? Ideally, you should initialise the array (that is, fill it with zeros) in exactly the same manner as you will later access it in parallel; see STREAMBenchmark.jl for an example (and the sketch after the links below):
Initialisation: STREAMBenchmark.jl/benchmarks.jl at master · JuliaPerf/STREAMBenchmark.jl · GitHub
Parallel computation: STREAMBenchmark.jl/kernels_nthreads.jl at master · JuliaPerf/STREAMBenchmark.jl · GitHub
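To illustrate the idea (this is only a sketch, not the actual STREAMBenchmark.jl code; the slab-wise chunking over the last dimension is just one possible scheme):

```julia
using Base.Threads

# Allocate without touching the data, then "first touch" each slab from the thread
# that will later work on it, so its memory pages land in that thread's NUMA domain.
function alloc_and_init(nx, ny, nz)
    A = Array{Float64}(undef, nx, ny, nz)
    @threads :static for k in axes(A, 3)
        @inbounds A[:, :, k] .= 0.0
    end
    return A
end

# Hypothetical compute kernel: it must traverse the array with the same
# thread-to-slab mapping (:static scheduling over k) as the initialisation above.
function compute!(B, A)
    @threads :static for k in axes(A, 3)
        @inbounds @views B[:, :, k] .= 2.0 .* A[:, :, k]
    end
    return B
end
```

With :static scheduling both loops chunk the k-range identically, so each thread first-touches exactly the slabs it will later compute on.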
Note that if you replace the parallel initialisation with simple zeros calls, the memory bandwidth obtained by STREAMBenchmark.jl will be dramatically lower (by a factor of 5 or similar!).
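For reference, measuring the multi-threaded bandwidth on a node should be something along the lines of (using the package’s memory_bandwidth() entry point; start Julia with as many threads as you want to test):

```julia
using STREAMBenchmark

# runs the STREAM kernels with the Julia threads of the current session
# and reports the measured memory bandwidth
memory_bandwidth()
```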
Of course, in real applications the parallel kernel might be much more complicated, in which case it isn’t at all obvious which parallel initialisation scheme (and thus which NUMA mapping) is optimal. The good news is that in many cases any parallel initialisation is already dramatically better than serial initialisation. Example: julia-dgemm-noctua/mkl_job_128.out at master · carstenbauer/julia-dgemm-noctua · GitHub. Here, I perform a regular matrix multiplication (dgemm) on an AMD Milan system using Intel MKL and 128 threads (2 CPUs with 64 threads each).
Naive initialisation (i.e. serial): ~668 GFLOPs
Linear parallel initialisation: ~2831 GFLOPs
Note that a parallel matmul is certainly accessing the input matrices (and the output matrix) in a non-linear way.
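By “linear parallel initialisation” I mean something along these lines (the matrix size and the use of random numbers are purely illustrative; the point is only that each thread first-touches a contiguous block of columns):

```julia
using Base.Threads, LinearAlgebra

# Each thread fills (and hence first-touches) a contiguous block of columns,
# spreading the memory pages across all NUMA domains.
function parallel_init!(M)
    @threads :static for j in axes(M, 2)
        @inbounds M[:, j] .= rand.()
    end
    return M
end

n = 10_000                                             # illustrative size
A, B, C = (Matrix{Float64}(undef, n, n) for _ in 1:3)
foreach(parallel_init!, (A, B, C))

mul!(C, A, B)  # dgemm; BLAS accesses the matrices in a blocked, non-linear pattern
```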
Hope this helps.
(BTW, AMD gives one the option to tune the effective number of NUMA domains per socket, see the “NPS” setting here).
In any case, you should always run a process per CPU as you did. On Intel CPUs you are fine that way. For the AMD CPU case, we will need to add something along the lines of what Carsten noted. Until we have it implemented (I think we can add it to the next release of ParallelStencil), you can just initialize the arrays yourself in the CPU case (following Carsten’s notes).
An alternative for the AMD CPUs would be to run a process per NUMA domain instead, with the number of threads matching the number of cores available per NUMA domain. Of course, this needs careful pinning of the threads to the right cores.
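With Slurm, for example, the launch could look something along the lines of the sketch below for a node with 2 × 64-core CPUs and 8 NUMA domains of 16 cores each (the exact flags and numbers depend on your scheduler and machine, so treat this only as a sketch):

```
# one MPI rank per NUMA domain, 16 cores (and 16 Julia threads) per rank
srun --ntasks-per-node=8 --cpus-per-task=16 --cpu-bind=cores julia -t 16 <my script>
```

Whether the 16 cores of each rank really coincide with one NUMA domain should be verified, e.g. with getcpuids() as mentioned above.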
Indeed, should’ve mentioned this above as a workaround as well.
Taking the NUMA architecture of the CPU into account, I’ve done my experiments again, launching multiple MPI processes per CPU, and the scaling is perfectly linear and near-ideal now!
The default pinning policy was apparently sufficient.
Thank you very much @samo and @carstenbauer for your help!
Could you share your launch sequence, i.e. how you pinned your processes?
I did not tune it myself; I let the resource manager take care of it. The default process distribution policy is block by node and block by socket.
The launching sequence is something along the lines of:
mpirun -p <my cluster> -x -n <nb processes> -c <cores per process> julia -t <cores per process>