Help with Multi-GPU FDTD implementation: halo exchange and efficient monitors

Hi everyone!

I am a physicist and photonics engineer. Meta recently released a Julia package called Khronos for FDTD (Finite-Difference Time-Domain) simulations. The basic idea is to propagate light by time-stepping Maxwell’s equations on a staggered grid.

Khronos is built on KernelAbstractions.jl, allowing it to run on both CPUs (x86 or Apple Silicon) and GPUs. However, it currently only supports a single GPU, which can be limiting for larger problems.

My goal is to extend Khronos to support multi-GPU simulations. I am reaching out for advice on which Julia packages could be useful for this task. Based on my research, I believe I will need to use something like MPI (MPI.jl) for inter-GPU communication. Following a suggestion from @luraess and @vchuravy on my original issue in the KA.jl GitHub, I am posting here for additional insights.

Here are some of the specific challenges I am facing:

  • Halo Exchange: I need an efficient way to share boundary information (halo exchange) between adjacent subdomains after each time step, i.e. to manage the boundaries between GPUs (a rough MPI.jl sketch of what I mean follows right after this list).
  • Monitors: I need to implement monitors that capture snapshots of the field at regular intervals. One type of monitor will capture the field after a fixed number of iterations, while another will accumulate a Fourier transform of the field. To minimize communication overhead, I am considering processing the monitors locally on each GPU and only outputting the results at the end of the simulation (a sketch of this is a bit further below).
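To make the halo exchange point concrete, here is the kind of exchange I have in mind, written as a rough sketch with plain MPI.jl for a 1D decomposition along z and a halo width of 1. All names (`Ex`, the grid sizes, etc.) are placeholders rather than actual Khronos fields, and I am assuming GPU-aware MPI so that device arrays can be passed to MPI directly:

```julia
using MPI, CUDA

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

nx, ny, nz = 64, 64, 64                     # local interior size on this GPU
Ex = CUDA.zeros(Float32, nx, ny, nz + 2)    # one ghost layer on each side in z

# Exchange the outermost interior z-layers with the two neighbouring ranks and
# write what they send into the local ghost layers.
function exchange_halo_z!(A, rank, nproc, comm)
    nzi  = size(A, 3) - 2                   # number of interior z-layers
    up   = rank + 1 < nproc ? rank + 1 : nothing
    down = rank - 1 >= 0    ? rank - 1 : nothing
    reqs = MPI.Request[]
    if up !== nothing
        send_up = A[:, :, nzi + 1]          # contiguous copy of the top interior layer
        recv_up = similar(send_up)
        push!(reqs, MPI.Isend(send_up, comm; dest=up, tag=0))
        push!(reqs, MPI.Irecv!(recv_up, comm; source=up, tag=1))
    end
    if down !== nothing
        send_dn = A[:, :, 2]                # contiguous copy of the bottom interior layer
        recv_dn = similar(send_dn)
        push!(reqs, MPI.Isend(send_dn, comm; dest=down, tag=1))
        push!(reqs, MPI.Irecv!(recv_dn, comm; source=down, tag=0))
    end
    MPI.Waitall(reqs)
    up   !== nothing && (A[:, :, end] .= recv_up)
    down !== nothing && (A[:, :, 1]   .= recv_dn)
    return A
end

exchange_halo_z!(Ex, rank, nproc, comm)     # would be called after every time step
MPI.Finalize()
```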

Chmy.jl was initially suggested; it seems suitable for managing the halo exchange, but I am unsure whether it is the best choice for handling the monitors efficiently.
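For the Fourier-transform monitor mentioned above, this is roughly the kind of local, per-GPU accumulation I have in mind (an illustrative sketch with made-up names, not Khronos code). Each rank only touches its own subdomain, so no communication is needed until the very end:

```julia
using CUDA

struct DFTMonitor{A}
    freqs::Vector{Float64}   # frequencies to resolve
    acc::A                   # complex accumulator, one volume per frequency
end

DFTMonitor(freqs, nx, ny, nz) =
    DFTMonitor(freqs, CUDA.zeros(ComplexF32, nx, ny, nz, length(freqs)))

# Called once per time step with the local field block living on this GPU:
# a running discrete Fourier transform, F(ω) += E(t) * exp(-iωt) * dt.
function update!(m::DFTMonitor, E, t, dt)
    for (k, f) in enumerate(m.freqs)
        phase = cis(-2π * f * t) * dt
        @views m.acc[:, :, :, k] .+= phase .* E
    end
    return m
end
```

At the end of the run, each rank would then simply write out (or reduce) its own `acc` block.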

Another package that looks promising for multi-GPU parallelization is ImplicitGlobalGrid.jl.

Any advice, package recommendations, or insights would be greatly appreciated!

I will keep digging into the subject on my side, and hopefully I will have some good news to report soon!

Thanks for your help,
Lucas


Yes, ImplicitGlobalGrid.jl is made exactly for halo exchange on staggered grids. Halo updates are implemented using [GPU-aware] MPI. It supports any halo width and overlap, which might be relevant for you. If some feature is missing for your use case, please share it with us!
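Roughly, the usage looks like this (untested sketch; adapt the array types and sizes to your case, and see the keyword arguments of init_global_grid in the docs for setting custom overlaps and halo widths):

```julia
using ImplicitGlobalGrid, CUDA

nx, ny, nz = 64, 64, 64                    # local grid size per GPU
me, dims   = init_global_grid(nx, ny, nz)  # sets up the Cartesian process topology
Ex = CUDA.zeros(Float64, nx, ny, nz)

# ... inside the time loop, after updating the local fields:
update_halo!(Ex)                           # GPU-aware MPI update of the boundary layers

finalize_global_grid()
```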

Concerning monitoring, it provides a gather! function to assemble fields more easily on a master node for visualization or analysis: https://github.com/eth-cscs/ImplicitGlobalGrid.jl
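Continuing the sketch above, that would look something like this at the end of the simulation (untested; check the gather! docstring for the exact size the preallocated global array must have — I am assuming a plain block gather of the local arrays here):

```julia
Ex_host   = Array(Ex)                                                    # local field on the host
Ex_global = (me == 0) ? zeros(nx * dims[1], ny * dims[2], nz * dims[3]) : nothing
gather!(Ex_host, Ex_global)                                              # root rank holds the assembled field
```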
If your global fields don’t fit into the memory of a single compute node, you can use ADIOS2.jl to connect a visualization or analysis application to your main simulation; it can run on the same nodes or on different [fewer] nodes. Here is a short tutorial showing some nice features (with Python on the visualization side, but by now this could also be done in Julia): https://github.com/omlins/adios2-tutorial

Note that ImplicitGlobalGrid.jl provides seamless interoperability with MPI.jl. So, if you are missing a particular feature that is not of general interest, you can just write it yourself using MPI.jl.
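For instance, reducing a locally accumulated monitor quantity at the end of the run is a one-liner, assuming MPI has been initialized (ImplicitGlobalGrid does this by default):

```julia
using MPI

local_power  = sum(abs2, Ex)                                  # some per-rank scalar
global_power = MPI.Allreduce(local_power, +, MPI.COMM_WORLD)  # combined over all GPUs
```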
