I have tried running one of the examples of ImplicitGlobalGrid.jl on 1, 2 and 4 GPUs to reproduce some scaling results. I started with diffusion3D_multigpu_CuArrays_novis.jl on a coarser grid to see how it does. Below you will find my results.
For this weak-scaling problem I compute the efficiencies on 2 and 4 GPUs to be 57 and 35 percent, respectively (see the quick check after the timings below). I realize this grid is much coarser than the original setup, but is this surprising?
I have ensured that the system has CUDA-aware MPI, so I don’t think that is the problem.
$ mpiexec -np 1 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 32x32x32 (nprocs: 1, dims: 1x1x1)
Simulation time = 41.547354999
$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x32x32 (nprocs: 2, dims: 2x1x1)
Simulation time = 72.528739102
Simulation time = 72.494481903
$ mpiexec -np 4 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x62x32 (nprocs: 4, dims: 2x2x1)
Simulation time = 116.549162142
Simulation time = 116.549022381
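For reference, here is how I computed the efficiencies quoted above (a quick check; for weak scaling the single-GPU time is the baseline):

```julia
# Weak-scaling efficiency E_N = T_1 / T_N (fixed local problem size per GPU).
T1, T2, T4 = 41.547, 72.529, 116.549   # simulation times (s) from the runs above
E2 = T1 / T2                           # ≈ 0.57 -> 57% on 2 GPUs
E4 = T1 / T4                           # ≈ 0.36 -> roughly 35% on 4 GPUs
```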
I was informed by @luraess that there are two problems with what I have done. Below is a copy of the points from the issue that I initially posted.
The weak scaling timings you report are to be expected since:
you are running a very small local problem size (32x32x32), so the ratio of boundary points to total grid points is large and communication time is no longer negligible; and
the diffusion3D_multigpu_CuArrays_novis.jl example does not implement the @hide_communication feature to overlap and hide MPI communication with inner-domain grid point computations.
I am now retrying it with a larger grid, 256^3, as the code was originally set up, so I know that should improve things. However, the second point is a bit harder to figure out. Is there an example of this code that uses @hide_communication that I could run instead?
Hi @francispoulin, thanks for your question and interest in ImplicitGlobalGrid!
You can find examples of using the @hide_communication feature in the multi-XPU miniapp codes from ParallelStencil.jl, e.g. acoustic3D, and in the diffusion example from the JuliaCon2021 workshop on solving PDEs on GPUs.
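In short, the relevant part of the time loop typically looks like this (a sketch only; the kernel and array names are placeholders, and the boundary width (16, 2, 2) is illustrative):

```julia
# Sketch of the @hide_communication pattern (ParallelStencil + ImplicitGlobalGrid):
# the boundary regions are computed and communicated while the inner points are
# being computed, so the MPI halo exchange is hidden behind computation.
@hide_communication (16, 2, 2) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    update_halo!(T2)
end
```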
I found the example below, which seems to do what I want using @hide_communication, and I’m going to play with it to see how I can get good scaling for the diffusion equation.
I added the line select_device() to the diffusion code and ran it on the same server with 1, 2 and 4 GPUs with a grid of 256x256x256.
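Roughly, the setup part now looks like this (a minimal sketch, not the full example; array allocation and the time loop are omitted):

```julia
# Minimal sketch (ImplicitGlobalGrid + CUDA): select_device() assigns a distinct
# GPU to each node-local MPI rank, before any device arrays are allocated.
using ImplicitGlobalGrid, CUDA
nx = ny = nz = 256
me, dims = init_global_grid(nx, ny, nz)   # sets up the Cartesian MPI topology
select_device()                           # map this MPI rank to its own GPU on the node
# ... allocate CuArrays and run the diffusion time loop as in the example ...
finalize_global_grid()
```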
I’m happy to say that for both 2 and 4 GPUs the efficiency was 97%. I didn’t use any special optimization and didn’t use @hide_communication, so maybe it can be made even better, but I am very happy with the results.
Is the select_device() line something that you would want to include in the code, as it might help others who are looking for optimized code?
Thanks for sharing your results. Regarding your feedback:
for both 2 and 4 GPUs the efficiency was 97%
Good to read you achieved good performance!
didn’t use @hide_communication so maybe it can be made even better
It would be interesting to see how it changes when using @hide_communication, as you could maybe get close to 100% efficiency.
Is the select_device() line something that you would want to include in the code
Not for now, I guess. select_device() is only useful in configurations where one server hosts multiple GPUs, as then each GPU can be assigned to a node-local MPI rank (determined via MPI shared-memory grouping). If it turns out that in the future such a GPU arrangement is the most standard, we may indeed update the examples to include it “by default”.
Here are some further suggestions to ensure your multi-GPU code runs as expected:
check nvidia-smi or print the device ID to ensure you are running on different GPUs when running multiple MPI processes (see the quick check sketched below)
use the NVIDIA Visual Profiler if you want to verify that the communication/computation overlap functions properly when using @hide_communication
Note that using @hide_communication may not lead to a significant performance increase when running multiple GPU-MPI processes on a single node, as inter-process communication there may be fast given the high connectivity of the GPUs (e.g. NVLink). However, when running on physically different nodes, communication over the network (Ethernet or InfiniBand) may be significantly slower, which brings up the need to hide communication time.
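Something like this can be used to print the device per rank (a quick check, assuming MPI.jl and CUDA.jl are available):

```julia
# Print the GPU used by each MPI rank to verify that the ranks are mapped to
# different devices on the node.
using MPI, CUDA
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
println("MPI rank $rank is using $(CUDA.device())")
MPI.Finalize()
```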
Next, I will try the other diffusion3D code that uses @hide_communication. If it performs better, then I will certainly use that approach in the future. But I must say that 97% is pretty excellent by my standards. However, I imagine the efficiency would decrease more rapidly when going to more GPUs, and probably would not hold up nearly as well when going to 5000+ cores, like you have done before.
Thanks again for writing such good software that is fun to play with!
One more question for @luraess. In the mini-course yesterday you gave a lot of different codes to solve the diffusion equation in 3D, which was a lot of fun. I know that you have shown that multi-GPU can be efficient on thousands of cores when using MPI. Do you know how high an efficiency we can get with multi-threading?
We have done some preliminary tests and even with large arrays, 512^3, we seem to get low efficiencies and saturation at 16 cores. I am tempted to try playing with the threaded code you discussed yesterday, but I thought I would ask you first if you have any experience you can share.
Do you know how high an efficiency we can get with multi-threading?
Would you mind opening another Discourse thread for this different topic? This would help other users find relevant responses in future searches. Doing so, you could also ping @samo, who would be able to give you some feedback on that. Thanks!
@luraess I am having trouble getting good performance on this example, which uses the hide-communication functionality. I am running it on a few nodes composed of four 16-core CPUs each, giving 128 threads per node (USE_GPU=false).
With a script like this:
for N in {1,2,3,4}
do
    mpirun -n ${N} julia -t 128 ../travail/julia-heatdiff/scripts/run.jl
done
I get the following results; the bandwidth decreases when I increase the number of compute nodes (T_eff is sketched after the results below):
Global grid: 512x512x512 (nprocs: 1, dims: 1x1x1)
time_s=5.118296146392822 T_eff=56.64195353063296
Global grid: 1022x512x512 (nprocs: 2, dims: 2x1x1)
time_s=5.26022481918335 T_eff=55.11366955699977
Global grid: 1532x512x512 (nprocs: 3, dims: 3x1x1)
time_s=8.300725936889648 T_eff=34.92589620283643
Global grid: 1022x1022x512 (nprocs: 4, dims: 2x2x1)
time_s=8.083126068115234 T_eff=35.86611046728351
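For reference, T_eff is the effective memory throughput metric used in the ParallelStencil miniapps; roughly, it is computed like this (my sketch; the assumed 3 memory accesses per grid cell, the iteration count and the timing are only illustrative):

```julia
# Rough sketch of the T_eff metric (effective memory throughput, GB/s).
# The access count of 3 per cell (e.g. read T, read Ci, write T2) is an assumption;
# the exact value depends on the kernel.
nx = ny = nz = 512
nt     = 100                                    # number of timed iterations (illustrative)
time_s = 5.0                                    # measured wall time for those iterations (s)
A_eff  = 3 * nx*ny*nz * sizeof(Float64) / 1e9   # GB moved per iteration
t_it   = time_s / nt                            # time per iteration
T_eff  = A_eff / t_it                           # effective memory throughput in GB/s
```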
Thanks @Eurel for reporting this. Nothing wrong on your side, I guess.
The relevant doc section in the ParallelStencil README never explicitly states that the hide-communication feature is currently only enabled when executing on Nvidia GPUs, but it should. This is because on GPUs one can take advantage of the asynchronous execution of streams as a concise way to prioritise concurrent subdomain execution, in order to realise the communication/computation overlap. I will fix it here: https://github.com/omlins/ParallelStencil.jl/issues/56
Thank you for your quick answer, and sorry for my slow response…
Makes sense regarding the @hide_communication point, thank you for making that clear to me.
But it wasn’t exactly what I had in mind, as I didn’t express myself correctly; I am sorry for that.
When I increase the number of procs, shouldn’t the compute time decrease?
Is that why the global grid increases?
That behavior isn’t exactly clear to me, as I expected the global grid to be distributed among the processes; instead the grid seems to be duplicated. Is that right?
When I increase the number of procs, shouldn’t the compute time decrease?
Is that why the global grid increases?
The behaviour you describe may occur in a strong-scaling configuration, i.e., if the global problem size is fixed; the more procs one adds, the smaller the local problem size (on each proc) becomes, and one may observe some (limited) speed-up.
That behavior isn’t exactly clear to me, as I expected the global grid to be distributed among the processes; instead the grid seems to be duplicated. Is that right?
On GPUs in particular, one gets the best performance for a specific local problem size. In that case, it makes sense to look at the distributed-memory problem in a weak-scaling approach; that’s what is done in ImplicitGlobalGrid. You define the optimal local problem size and then scale up the resources to match your global problem, ensuring optimal execution by construction. Hope this helps (see, e.g., this figure).
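A small illustration of how the global grid relates to the fixed local size (assuming ImplicitGlobalGrid's default overlap of 2 grid points between neighbouring processes, which matches the grid sizes you report):

```julia
# Each process keeps the same local grid of nx points per dimension; the global
# grid grows with the process topology instead of a fixed global size being split.
nx      = 512                          # local grid size per process (fixed)
overlap = 2                            # default overlap between neighbouring processes
nx_global(dims_x) = dims_x * (nx - overlap) + overlap
nx_global(1)   # 512
nx_global(2)   # 1022  (cf. "Global grid: 1022x512x512" above)
nx_global(3)   # 1532
```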
I’ve done some scaling experiments on up to 32 64-core CPUs (2048 threads in total) and 12 GPUs, and while the GPU scaling is excellent, the CPU scaling is not nearly as good, even though the code is the same.
Since the scaling on the GitHub page is reported on GPUs, is the same behaviour expected for multi-CPU experiments?
Here are some of my results.
Thanks for your time, help and all of your work on this library.
I am using ParallelStencil.jl for multi-threading. I set the number of threads to 64 with the -t command-line argument; I did not set the environment variable.
In these experiments, I ran up to 32 MPI processes.
The processor I use is a 64-core (hyperthreading disabled) AMD Milan [https://www.amd.com/fr/products/cpu/amd-epyc-7763] (when I plotted the graph, the Milan partitions were unavailable), and I get from 100 GB/s to 120 GB/s on one CPU (one MPI process), out of the 204 GB/s announced on the vendor sheet.
Each node is composed of 2 CPUs (dual socket).
I do not specify to the batch scheduler (custom Slurm) how many nodes I require, only how many processes I need.
In addition, can you comment out all the update_halo! calls in your code, leave all the rest exactly as it is, and redo the scaling experiment (as sketched below)? That way you have an embarrassingly parallel application without communication, which should scale ideally.
This way we know if the performance loss is in the halo updates or if there is something fundamentally wrong with how you are launching the job.
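Schematically, the idea is the following (a minimal sketch with a placeholder update, not your actual code):

```julia
# Minimal sketch of the suggested experiment (assumes ImplicitGlobalGrid.jl).
# With do_halo = false there is no MPI communication at all, so the weak scaling
# of this loop should be close to ideal.
using ImplicitGlobalGrid

do_halo = false                    # set to true to re-enable the halo exchange
nx = ny = nz = 64
me, dims = init_global_grid(nx, ny, nz)
A = zeros(nx, ny, nz)
for it = 1:100
    A .= A .+ 1.0                  # stand-in for the real stencil update
    do_halo && update_halo!(A)     # the call to comment out / skip
end
finalize_global_grid()
```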