Poor scaling results with `ImplicitGlobalGrid.jl`

I have tried running one of the examples from ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs to reproduce some scaling results. I started with diffusion3D_multigpu_CuArrays_novis.jl on a coarser grid to see how it does. Below you will find my results.

For this weak-scaling problem I compute the efficiencies on 2 and 4 GPUs to be 57 and 35 percent respectively. I realize this grid is much coarser than intended, but is this surprising?
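(For reference, since the local problem size per GPU is fixed in weak scaling, the ideal runtime is constant, and the efficiency is simply the single-GPU time divided by the n-GPU time; a quick check with the timings below:)

```julia
# Weak-scaling efficiency: the local problem size per GPU is fixed, so the
# ideal runtime is constant and efficiency = T(1 GPU) / T(n GPUs).
t1, t2, t4 = 41.547, 72.529, 116.549   # measured simulation times [s]
eff(tn) = t1 / tn
println("2 GPUs: ", round(100eff(t2); digits = 1), " %")   # ≈ 57.3 %
println("4 GPUs: ", round(100eff(t4); digits = 1), " %")   # ≈ 35.6 %
```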

I have verified that the system does have CUDA-aware MPI, so that shouldn't be the problem, I don't think.

```
$ mpiexec -np 1 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 32x32x32 (nprocs: 1, dims: 1x1x1)
Simulation time = 41.547354999

$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x32x32 (nprocs: 2, dims: 2x1x1)
Simulation time = Simulation time = 72.528739102

$ mpiexec -np 4 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x62x32 (nprocs: 4, dims: 2x2x1)
Simulation time = Simulation time = Simulation time = Simulation time = 116.549162142
```

I was informed by @luraess that there are two problems with what I have done. Below is a copy of the points from the issue I initially posted.

The weak scaling timings you report are to be expected since:

  • you are running a very small local problem size (32x32x32), so the ratio of boundary points to total grid points is large and communication time is no longer negligible; and
  • diffusion3D_multigpu_CuArrays_novis.jl does not implement the @hide_communication feature to overlap and hide MPI communication with inner-domain grid-point computations.

I am now retrying it with a larger grid, 256^3, as the code was originally set up, so I know that should improve things. The second point, however, is a bit harder to figure out. Is there an example of this code that uses @hide_communication that I could run instead?

Hi @francispoulin, thanks for your question and interest in ImplicitGlobalGrid!
You can find examples of using the @hide_communication feature in the multixpu miniapp codes from ParallelStencil.jl, e.g. the acoustic3D, and in the diffusion example from the JuliaCon2021 workshop on solving PDEs on GPUs.
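In essence, the pattern in those examples looks like this (a sketch; the boundary-width tuple and the kernel arguments are illustrative):

```julia
# Sketch of the @hide_communication pattern from ParallelStencil.jl
# (the boundary width (16, 2, 2) and the kernel signature are illustrative).
# The boundary regions are computed first, so their halo exchange can
# overlap with the computation of the inner grid points.
@hide_communication (16, 2, 2) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    update_halo!(T2)
end
```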

Thanks again @luraess.

I found an example that seems to do what I want using @hide_communication, and I'm going to play with it to see how I can get good scaling for the diffusion equation.


I wanted to share an update.

I added a select_device() call to the diffusion code and ran it on the same server with 1, 2, and 4 GPUs on a 256x256x256 grid.
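For anyone reproducing this, the change amounts to one call right after the grid initialization (a sketch; the grid size matches the run above, and the init_global_grid return values beyond `me` and `dims` are dropped for brevity):

```julia
using ImplicitGlobalGrid

# Sketch: select_device() assigns each local MPI rank its own GPU on the
# node. Grid size matches the 256^3 run above.
me, dims = init_global_grid(256, 256, 256)
select_device()   # map this rank to a distinct GPU
# ... allocate arrays and run the time loop as in
# diffusion3D_multigpu_CuArrays_novis.jl ...
finalize_global_grid()
```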

I'm happy to say that for both 2 and 4 GPUs the efficiency was 97%. I didn't apply any special optimizations and didn't use @hide_communication, so it can perhaps be made even better, but I am very happy with the results.

Is the select_device() line something that you would want to include in the code, as it might help others who are looking for optimized code?

Thanks for sharing your results. Regarding your feedback:

for both 2 and 4 GPUs the efficiency was 97%

Good to read you achieved good performance !

didn’t use @hide_communication so maybe it can be made even better

It would be interesting to see how this changes with @hide_communication, as you could perhaps get close to 100% efficiency.

Is the select_device() line something that you would want to include in the code

Not for now, I guess. select_device() is only useful in configurations where one server hosts multiple GPUs, as then the GPUs can be assigned to the local MPI ranks using a shared memory pool. If it turns out in the future that such a GPU arrangement is the most standard, we may indeed update the examples to include it “by default”.

Here are some further suggestions to ensure your multi-GPU code runs as expected:

  • check nvidia-smi or print the device ID to ensure you are running on different GPUs when launching multiple MPI processes
  • use the Nvidia Visual Profiler if you want to verify that the communication-computation overlap of @hide_communication functions properly
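For the first check, a minimal sketch (assuming the usual CUDA.jl setup, with `me` being the rank returned by init_global_grid):

```julia
using CUDA

# Print which GPU this MPI rank ended up on; with select_device() working
# correctly, each rank on a node should report a different device.
println("MPI rank $me -> ", CUDA.device())
```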

Note that using @hide_communication may not lead to a significant performance increase when running multiple GPU-MPI processes on a single node, as inter-process communication there may be fast thanks to the high connectivity between the GPUs (e.g. NVLink). However, when running on physically different nodes, the communication network (Ethernet or InfiniBand) may be significantly slower, which makes hiding the communication time important.

Thanks for your insights @luraess .

Next, I will try the other diffusion3D code that uses @hide_communication. If it performs better, then I will certainly use that approach in the future. But I must say that 97% is pretty excellent by my standards. I imagine, however, that the efficiency would decrease more rapidly when going to more GPUs, and it probably would not perform nearly as well when scaling to 5000+ cores, as you have done before.

Thanks again for writing such good software that is fun to play with!


One more question for @luraess. In the mini-course yesterday you gave a lot of different codes to solve the diffusion equation in 3D, which was a lot of fun. I know that you have shown that multi-GPU can be efficient on thousands of cores when using MPI. Do you know how high an efficiency we can get with multi-threading?

We have done some preliminary tests, and even with large grids (512^3) we seem to get low efficiencies and saturation at 16 cores. I am tempted to try playing with the threaded code you discussed yesterday, but I thought I would ask first if you have any experience you can share.

Thank you again.

Thanks for your feedback !

Do you know how high of efficiency we can get with multi-threading?

Would you mind opening another Discourse thread for this different topic? That would help other users find relevant responses in future searches. Doing so, you could also ping @samo, who would be able to give you some feedback on that. Thanks!

Very good point @luraess. I will do that right now.