[ANN] PencilFFTs: parallel FFTs of MPI-distributed arrays

I’m happy to announce PencilFFTs.jl, a Julia package for performing parallel fast Fourier transforms using MPI.

This package is an attempt to bring parallel FFTs to Julia. It is well adapted for running on thousands of cores in CPU computing clusters (support for GPUs may arrive in the future!). The package uses a similar strategy as “standard” C/C++/Fortran codes that are commonly used in research. Namely, it decomposes a multidimensional array along one or more dimensions, so that each MPI process handles one of the resulting portions of the array. This is illustrated by the image below, which shows the 2D decomposition of a 3D array over 12 MPI processes. This particular configuration is sometimes called “pencil” decomposition, hence the name of this package.

Given a decomposition, each process can perform serial FFTs along the non-decomposed dimensions. Then, to transform along the remaining dimensions, one has to redistribute (or transpose) data among MPI processes as shown in the image. The method is described in a bit more detail in the package documentation.

This kind of approach is commonly used in spectral methods that may use Fourier or Chebyshev expansions to decompose the solution of the governing equations. Applications in physics include solvers for the Navier–Stokes equations to describe fluid turbulence in periodic or wall-bounded domains, and for the Gross–Pitaevskii equation describing Bose–Einstein condensates. See also this very relevant discussion here on Discourse from a few years ago.

As shown in the benchmarks, the performance and scalability of PencilFFTs are close to those displayed by the P3DFFT library written in Fortran, which is probably the most popular library providing FFTs with pencil decomposition. Further performance improvements are possible and will likely arrive in the near future.

The implementation of PencilFFTs is generic in the sense that it can work efficiently with geometries of arbitrary dimension. Moreover, one can choose among all transforms supported by FFTW.jl, including complex-to-complex, real-to-complex and real-to-real transforms. It’s also possible to combine different transforms along each dimension.

The parallel FFT functionality is implemented on top of a Pencils module that can also be used by itself, e.g. if one only wants to conveniently deal with MPI-distributed arrays without needing FFTs. I’m considering making this module a separate package. Further functionality that I’d like to associate to this module include exchange of ghost cells between neighbouring MPI processes, and parallel I/O using HDF5.jl.

More details are available in the documentation, including a tutorial and a few more examples.

15 Likes

:heart: :heart: :heart: :heart:

That’ is fantastic news. I can’t wait to try it out!

That’s awesome, thanks for doing this! I might have an use for it in the near future.

  • Can you comment on the difference between this approach and the one used in the MPI mode of FFTW? (sorry if this is very naive)
  • Although having a pure julia solution is of course the best solution, I guess linking to a binary P3DFFT is also an option? Any advantages for users in doing the former?
  • It looks like you’ve been able to run this on a large number of nodes. How has your experience been doing that?

Great! I hope it can be useful.

Can you comment on the difference between this approach and the one used in the MPI mode of FFTW? (sorry if this is very naive)

The difference is that the MPI FFTW routines can only perform 1D decomposition, which means that you’re more limited in the number of MPI processes that you can use. For instance, if you have 512^3 grid points, you can use up to 512 MPI processes. Of course, in this case you can combine threads and MPI if you want to go beyond. This is what many people do and it works pretty well.

Although having a pure julia solution is of course the best solution, I guess linking to a binary P3DFFT is also an option? Any advantages for users in doing the former?

I guess you could do that, but I have no idea how that would play with MPI.jl, which currently wraps the MPI C API. Also, the functionality of the Fortran P3DFFT is more limited, and for example it cannot do complex-to-complex transforms. I can’t say a lot about other libraries, only that I tried the more recent C++ version of P3DFFT, which has more functionality, and I didn’t have a very good experience. Actually my first benchmarks were done against that version…

It looks like you’ve been able to run this on a large number of nodes. How has your experience been doing that?

I had the same issues that other people have encountered, related to the fact that when you launch Julia with MPI, all processes precompile the same code, and there are race conditions and some other issues. As suggested in one of those threads, I ended up running Julia with --compiled-modules=no to workaround the issue. It’s not ideal, but it works. You can also take a look at the SLURM submitting script that I used for the benchmarks.