I have tried running one of the examples of ImplicitGlobalGrid.jl on 1, 2 and 4 GPUs to reproduce some scaling results. I started with `diffusion3D_multigpu_CuArrays_novis.jl` on a coarser grid to see how it does. Below you will find my results.
For this weak-scaling problem I compute the efficiencies on 2 and 4 GPUs to be 57 and 35 percent respectively. I realize this grid is much coarser than the example's default, but is this result surprising?
I have ensured that the system does have CUDA-aware MPI, so I don't think that is the problem.
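For completeness, the check below is one way to confirm this from within Julia. It is only a sketch: it assumes the installed MPI.jl provides the `MPI.has_cuda()` query, and my understanding (not verified here) is that ImplicitGlobalGrid only passes device pointers to MPI when the `IGG_CUDAAWARE_MPI=1` environment variable is set.

```julia
# Sketch: ask whether the MPI library reports CUDA (GPU-aware) support.
# Assumes MPI.jl's MPI.has_cuda(); the IGG_CUDAAWARE_MPI remark above is my
# understanding of ImplicitGlobalGrid, not something this snippet checks.
using MPI
MPI.Init()
if MPI.has_cuda()
    println("MPI library reports CUDA support (CUDA-aware MPI available)")
else
    println("MPI library does not report CUDA support; GPU buffers may be staged through host memory")
end
MPI.Finalize()
```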
```
$ mpiexec -np 1 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 32x32x32 (nprocs: 1, dims: 1x1x1)
Simulation time = 41.547354999

$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x32x32 (nprocs: 2, dims: 2x1x1)
Simulation time = Simulation time = 72.528739102
72.494481903

$ mpiexec -np 4 julia --project diffusion3D_multigpu_CuArrays_novis.jl
Global grid: 62x62x32 (nprocs: 4, dims: 2x2x1)
Simulation time = Simulation time = Simulation time = Simulation time = 116.549162142
116.549022381
```
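The efficiencies quoted above are just the usual weak-scaling ratio E(n) = T(1)/T(n), computed from these wall times (a small sketch, with the times hard-coded from the output above):

```julia
# Weak-scaling efficiency E(n) = T(1)/T(n) from the reported wall times
t1, t2, t4 = 41.547, 72.529, 116.549                          # seconds on 1, 2 and 4 GPUs
eff(tn) = round(100 * t1 / tn; digits = 1)
println("2 GPUs: ", eff(t2), " %   4 GPUs: ", eff(t4), " %")  # ≈ 57.3 % and 35.6 %
```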
I was informed by @luraess that there are two problems with what I have done. Below is a copy of the points from the issue that I initially posted.
The weak scaling timings you report are to be expected since:
- you are running a very small local problem size (32x32x32): the ratio of boundary points to total grid points is large, so communication time is no longer negligible; and
- `diffusion3D_multigpu_CuArrays_novis.jl` does not implement the `@hide_communication` feature to overlap and hide MPI communication with inner-domain grid point computations.
I am now retrying it with the larger 256^3 grid that the code was originally set up with, so I expect that to improve things. However, the second point is a bit harder to figure out. Is there an example of this code that uses `@hide_communication` that I could run instead?
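For context, my understanding of the pattern is sketched below: ParallelStencil.jl's `@hide_communication` wraps the `@parallel` kernel call together with ImplicitGlobalGrid's `update_halo!`, so the halo exchange is overlapped with the inner-domain computation. The kernel, grid size, step count and the boundary-width tuple (16, 8, 4) are illustrative choices of mine, not values from the original example.

```julia
# Rough sketch of the @hide_communication pattern, written against my understanding of
# ParallelStencil.jl (@init_parallel_stencil, @parallel, @hide_communication) and
# ImplicitGlobalGrid.jl (init_global_grid, update_halo!, finalize_global_grid).
# Grid size, step count and the boundary width (16, 8, 4) are illustrative, not tuned.
using ImplicitGlobalGrid, ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(CUDA, Float64, 3)

@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2))
    return
end

function diffusion3D()
    lam        = 1.0                                  # thermal conductivity
    nx, ny, nz = 256, 256, 256                        # local (per-GPU) grid size
    nt         = 100                                  # number of time steps
    me, dims   = init_global_grid(nx, ny, nz)         # one MPI rank per GPU
    dx, dy, dz = 1.0/(nx_g()-1), 1.0/(ny_g()-1), 1.0/(nz_g()-1)
    T  = @rand(nx, ny, nz)                            # arbitrary initial temperature
    T2 = copy(T)
    Ci = @ones(nx, ny, nz)                            # inverse heat capacity
    dt = min(dx, dy, dz)^2/lam/8.1
    for it = 1:nt
        @hide_communication (16, 8, 4) begin          # boundary width in x, y, z
            @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
            update_halo!(T2)                          # halo exchange overlapped with inner-point compute
        end
        T, T2 = T2, T
    end
    finalize_global_grid()
end

diffusion3D()
```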