ParallelStencil + ImplicitGlobalGrid with multiple GPUs

Hello,

I am currently starting out with GPU programming and working on a multi-GPU implementation of my electroweak field theory simulation code.

I have followed the examples in the ParallelStencil GitHub repository for multi-GPU implementations using ImplicitGlobalGrid. I have gotten around to writing some macros and having them work on a single GPU, and I am very happy with the speedup compared to my previous MPI Fortran code.
I would like to run this on multiple GPUs, since I am running out of memory when simulating larger lattices (>1024^3). However, I am running into problems trying to get the same code working on multiple GPUs. I have tried using mpirun -n X julia *.jl, srun -p X julia *.jl, and plain julia -p X *.jl, but these all end up throwing either segmentation faults or internal MPI errors.
I assume that I am not using the correct commands when starting the code and would appreciate it if someone could point me in the right direction or offer suggestions.


Hi @Teerthal, thanks for your interest in ParallelStencil and ImplicitGlobalGrid! It’s a bit difficult to provide specific help as many things could potentially go wrong.

  1. I would recommend you first run a Julia MPI test on your cluster/server (such as Hello world · MPI.jl) to make sure you can launch Julia with MPI correctly (see below for a script that selects a GPU based on the MPI local rank, as is done in ImplicitGlobalGrid).
  2. Then, if you have a CUDA-aware MPI installation, you could also test it with the script from this section: Usage · MPI.jl.
  3. Then, make a minimal working example (MWE) from your code in which you initialise the global grid using ImplicitGlobalGrid, select the GPU, and print the MPI rank and the GPU ID, to make sure MPI runs and each rank selects a different GPU (see the sketch after this list).
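
A minimal sketch of what such an MWE could look like (the grid size is a placeholder, and it assumes init_global_grid performs the per-rank GPU selection mentioned in step 1):

using ImplicitGlobalGrid, CUDA

nx = ny = nz = 64                        # local grid size per process (placeholder)
me, dims = init_global_grid(nx, ny, nz)  # initializes MPI and the Cartesian process grid
gpu = CUDA.device()                      # which GPU did this rank end up on?
println("MPI rank $me of $(prod(dims)) processes is using $gpu")
finalize_global_grid()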

Once this is done and works successfully, you should be in a good position to figure out what did not work.

Note that if you want to use CUDA-aware MPI features, you need to

export JULIA_CUDA_MEMORY_POOL=none
export IGG_CUDAAWARE_MPI=1

and you may need to use the absolute path to the Julia executable you call with mpirun or srun.
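
If it is unclear whether these variables actually reach the Julia processes launched by mpirun or srun (a common pitfall on clusters), one quick, hedged check is to print from within the launched script what each process sees:

@show get(ENV, "JULIA_CUDA_MEMORY_POOL", "<unset>")   # expected "none" if the export above propagated
@show get(ENV, "IGG_CUDAAWARE_MPI", "<unset>")        # expected "1" if the export above propagated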

I would suggest you try to work this out and post back if you hit issues on the way.


Select GPU script:

using MPI, CUDA
MPI.Init()
comm = MPI.COMM_WORLD
me   = MPI.Comm_rank(comm)
# select the GPU based on the node-local MPI rank (as done in ImplicitGlobalGrid)
comm_l = MPI.Comm_split_type(comm, MPI.MPI_COMM_TYPE_SHARED, me)
me_l   = MPI.Comm_rank(comm_l)
GPU_ID = CUDA.device!(me_l)
sleep(0.5me)   # stagger the printing so the output of different ranks does not interleave
println("Hello world, I am $(me) of $(MPI.Comm_size(comm)) using $(GPU_ID)")
MPI.Barrier(comm)

Hello,

Thank you for your comprehensive response and debugging steps, and also for the amazing work on making GPU coding more accessible for physics simulations in general. I continue to be amazed at how GPU coding is not as prevalent, in my field at least, as it should be given the clear performance advantages.

I did already try running the hello world example to check that MPI was functioning correctly, and it does seem to recognize all the devices and run. I also used

using CUDA
collect(devices())

and this also returned all the devices correctly.

However, when I tried using the script you mentioned, it would not work unless I used the absolute path as you suggested. That is one mystery solved.

I was also setting

export IGG_CUDAAWARE_MPI=1

and when I tried running the test script you linked, it did not work. When I changed it to 0, everything now functions correctly. So that is the second mystery solved.

I now realize that CUDA-aware MPI is not functioning correctly, and I suppose that once I get it to work, the code would speed up on multiple GPUs.

Thank you very much for helping me solve these.


Thanks for your feedback @Teerthal and glad you could progress on the original issue.

I now realize that the CUDA-aware MPI is not functioning correctly and I suppose once I get that to work correctly, the code on multiple GPUs would speed up.

CUDA-aware (or GPU-aware) MPI is elegant, as the code needed in the update_halo! function becomes lighter. Also, one no longer needs to manually ensure optimal pipelining. However, in ImplicitGlobalGrid we took care of implementing optimal pipelining, so you should not see any significant performance degradation from not using GPU-aware capabilities.
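
In practice the user-facing call is the same in both cases; whether the halos travel through GPU-aware MPI or through host buffers is handled inside ImplicitGlobalGrid. A rough, untested usage sketch (array name and sizes are placeholders):

using ImplicitGlobalGrid, CUDA

nx = ny = nz = 64                        # local grid size (placeholder)
me, dims = init_global_grid(nx, ny, nz)
A = CUDA.zeros(Float64, nx, ny, nz)      # a field living on this rank's GPU
# ... update the inner points of A here ...
update_halo!(A)                          # exchange boundary layers with the neighbor ranks
finalize_global_grid()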

I see. When I went from 1 GPU to 2, and thus to twice as many points along the x direction, the time required to do the same number of iterations doubled.
I expected that, in the ideal case, the time taken should be the same, since the same work that was being done on 1 GPU is now being done on 2.
Is this expected behavior with the update_halo! function?

Interesting. This should not happen. If you increase resources and problem size proportionally (in a weak-scaling fashion), then the ideal execution time should remain close to constant. One common mistake would be to time not a fixed number of iterations or steps, but the execution of, e.g., the simulation of a fixed physical time. The latter case may depend on CFL and other conditions that depend on the local grid resolution and thus show a slowdown proportional to the global grid resolution. Otherwise, there may be another issue.

After reading through all the posts, it looks to me very much like you increased the size of the local grid by a factor of two, which would straightforwardly explain your observation; note that the size of the global grid is implicitly defined by the size of the local grid and the number of processes in each dimension. See here for a quick visual explanation (part of one of my JuliaCon talks). In the near future, we will revise the documentation and see how to bring that out better. Furthermore, I would recommend you watch the whole talk, as it introduces both ParallelStencil and ImplicitGlobalGrid.
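
A rough sketch of what this means in code (values are placeholders): the nx, ny, nz passed to init_global_grid are the local, per-process sizes, and the resulting global size can be queried with nx_g(), ny_g() and nz_g().

using ImplicitGlobalGrid

nx = ny = nz = 256                       # local size per process; keep this fixed for weak scaling
me, dims = init_global_grid(nx, ny, nz)
if me == 0
    println("local grid:  $nx x $ny x $nz per process")
    println("global grid: $(nx_g()) x $(ny_g()) x $(nz_g()) on $(prod(dims)) processes")
end
finalize_global_grid()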

amazing work on making GPU coding more accessible for physics simulations, in general.

… and thanks for your nice words!


Thank you both for your responses and assistance with my questions.
I am still not sure why the scaling is not as expected, as I did not touch the local grid size and only increased the number of processes. Perhaps I am making some other error and will try to work it out.
I did listen to your talks, through which I found these packages. I look forward to your talks this year as well.

Another reason for your observation could be that both processes run on the same GPU rather than on different ones. With the script that @luraess shared with you above, you should be able to check whether this is the case.

Once you have made sure that everything is working fine with MPI.jl independently of ImplicitGlobalGrid, I would recommend you try one of the existing examples (e.g. ParallelStencil.jl/diffusion3D_multigpucpu_novis_noperf.jl at main · omlins/ParallelStencil.jl · GitHub) before continuing with your own application.
