Great! I hope it can be useful.
Can you comment on the difference between this approach and the one used in the MPI mode of FFTW? (sorry if this is very naive)
The difference is that the MPI FFTW routines can only perform a 1D (slab) decomposition, which means that you’re more limited in the number of MPI processes that you can use. For instance, if you have 512^3 grid points, you can use at most 512 MPI processes, whereas a 2D (pencil) decomposition lets you go up to 512^2. Of course, with the 1D decomposition you can combine threads and MPI if you want to go beyond that limit. This is what many people do and it works pretty well.
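Just to make that limit concrete, here is a tiny back-of-the-envelope sketch in plain Julia (no libraries; the grid size is just the example from above):

```julia
# Maximum number of MPI processes allowed by a domain decomposition,
# for a cubic grid with N points per dimension.

# 1D (slab) decomposition: the grid is split along a single dimension,
# so at most N processes can each own at least one slab.
max_procs_slab(N) = N

# 2D (pencil) decomposition: the grid is split along two dimensions,
# so up to N^2 processes can each own at least one pencil.
max_procs_pencil(N) = N^2

N = 512
println("slab:   up to ", max_procs_slab(N), " MPI processes")    # 512
println("pencil: up to ", max_procs_pencil(N), " MPI processes")  # 262144
```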
Although having a pure Julia solution is of course ideal, I guess linking to a P3DFFT binary is also an option? Are there any advantages for users in doing the former?
I guess you could do that, but I have no idea how well that would play with MPI.jl, which currently wraps the MPI C API. Also, the Fortran P3DFFT is more limited in functionality; for example, it cannot do complex-to-complex transforms. I can’t say much about other libraries, other than that I tried the more recent C++ version of P3DFFT, which has more functionality, and I didn’t have a very good experience with it. Actually, my first benchmarks were done against that version…
It looks like you’ve been able to run this on a large number of nodes. How has your experience been doing that?
I had the same issues that other people have encountered, related to the fact that when you launch Julia with MPI, all processes try to precompile the same code at the same time, which leads to race conditions and some other problems. As suggested in one of those threads, I ended up running Julia with --compiled-modules=no to work around the issue. It’s not ideal, but it works. You can also take a look at the SLURM submission script that I used for the benchmarks.
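For anyone who hasn’t set this up before, a minimal SLURM script along those lines could look roughly like the sketch below. The job name, module names, resource counts, project path and `benchmark.jl` are placeholders, not the ones from my actual script:

```bash
#!/bin/bash
#SBATCH --job-name=pencil_bench    # placeholder job name
#SBATCH --nodes=16                 # placeholder resources
#SBATCH --ntasks-per-node=32

# Load whatever MPI / Julia modules your cluster provides (placeholders).
module load intel-mpi julia

# Launch one Julia process per MPI task, disabling compiled modules to
# avoid precompilation race conditions when all processes start at once.
srun julia --compiled-modules=no --project=. benchmark.jl
```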