Well, a Google search turns up these slides for pwscf, which show on page 27 (Broadwell row), if I understand them correctly, that diagonalization can be as expensive as the FFTs depending on the system; see slide 32 for a counterexample. I am not yet sure what they lumped into “Other”.
I worked on this for the abinit code a few years ago. For small systems, the FFTs (O(N^2 log N), with N the number of electrons) are usually the bottleneck. For larger ones, the diagonalization (O(N^3)) becomes the bottleneck. Somewhere between the two sits the nonlocal pseudopotential, depending on how it’s implemented (O(N^2) or O(N^3)).
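As a purely illustrative back-of-the-envelope comparison (the prefactors `c_fft` and `c_diag` below are made up, not measured from any code), this is the kind of crossover I mean:

```julia
# Illustrative only: compare the two asymptotic cost models to see where the
# O(N^3) diagonalization overtakes the O(N^2 log N) FFTs. The prefactors
# c_fft and c_diag are made-up numbers, not measured from any code.
c_fft, c_diag = 10.0, 1.0

fft_cost(N)  = c_fft  * N^2 * log(N)   # total FFT work per SCF step (model)
diag_cost(N) = c_diag * N^3            # subspace diagonalization per SCF step (model)

for N in (10, 100, 1_000, 10_000)
    println("N = $N: FFT model ≈ ", round(fft_cost(N), sigdigits=3),
            ", diag model ≈ ", round(diag_cost(N), sigdigits=3))
end
```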
There are two nontrivial dimensions along which you want to parallelize the code: the bands and the plane waves. The FFT is embarrassingly parallel along the bands, which usually leads to good scalability of that operation. The linear algebra (diagonalizations & co) usually stops scaling before the FFTs do (e.g. see Fig. 9 of https://arxiv.org/pdf/1406.4350.pdf, although that comparison is a bit biased by the fact that the nonlocal operator there is O(N^3)).
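A minimal sketch of what band parallelism looks like, assuming MPI.jl and FFTW.jl, with each rank owning a block of bands and transforming them independently (the grid size and band count below are made up):

```julia
using MPI, FFTW

MPI.Init()
comm  = MPI.COMM_WORLD
nproc = MPI.Comm_size(comm)

# Hypothetical sizes: a 32^3 real-space grid and 64 bands in total.
ngrid, nbands = (32, 32, 32), 64
nlocal = div(nbands, nproc)        # bands owned by this rank (assumes it divides evenly)

# Each rank stores only its own bands: ψ[:, :, :, local_band_index].
ψ = randn(ComplexF64, ngrid..., nlocal)

# Embarrassingly parallel step: every rank FFTs its own bands, no MPI calls needed.
plan = plan_fft(ψ, 1:3)
ψG   = plan * ψ

MPI.Finalize()
```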
Parallelizing these algorithms is much easier than parallelizing the FFT/Poisson solver, because the only communication required is typically a reduction (e.g. for parallel dot products), which is trivial with MPI.jl or similar. (Indeed, existing iterative solvers in Julia may well work as-is with DistributedArrays, because the latter already supports norm, dot, and similar linear-algebra operations.)
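For instance, a distributed dot product is just a local dot plus one Allreduce; a minimal sketch with MPI.jl (the vector sizes are arbitrary):

```julia
using MPI, LinearAlgebra

MPI.Init()
comm = MPI.COMM_WORLD

# Each rank holds its local slice of the two distributed vectors.
x_local = randn(ComplexF64, 1000)
y_local = randn(ComplexF64, 1000)

# Local dot product, then a single reduction over all ranks.
local_dot  = dot(x_local, y_local)
global_dot = MPI.Allreduce(local_dot, +, comm)

MPI.Finalize()
```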
It’s a bit more complicated than that, because those are block algorithms. E.g. you need to diagonalize dense matrices, so you need to distribute them and call ScaLAPACK or ELPA. I can provide more details if needed.
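To give an idea of where the block structure comes in, here is a rough sketch of a Rayleigh-Ritz step, assuming MPI.jl and plane-wave coefficients distributed over ranks: the small subspace matrix is assembled with one reduction and then diagonalized densely on every rank (for large blocks, this replicated eigen call is exactly what you would replace with ScaLAPACK or ELPA). The sizes and the random data are placeholders:

```julia
using MPI, LinearAlgebra

MPI.Init()
comm = MPI.COMM_WORLD

# Hypothetical block of trial wavefunctions: plane-wave coefficients are
# distributed over ranks (rows), bands are the columns.
npw_local, nbands = 5000, 32
Ψ_local  = randn(ComplexF64, npw_local, nbands)
HΨ_local = randn(ComplexF64, npw_local, nbands)   # stand-in for H applied to Ψ

# Subspace (Rayleigh-Ritz) matrix: local GEMM followed by one reduction.
Hsub = MPI.Allreduce(Ψ_local' * HΨ_local, +, comm)

# Dense diagonalization of the small nbands×nbands matrix, replicated on
# every rank. For very large blocks one would call ScaLAPACK/ELPA instead.
vals, vecs = eigen(Hermitian(Hsub))

# Rotate the local block into the new eigenbasis.
Ψ_local = Ψ_local * vecs

MPI.Finalize()
```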
It’s easy enough to do manually with MPI. The dream would be to do it all with distributed arrays and library code (e.g. call IterativeSolvers with a DistributedArray), but that seems pretty tricky to me because of the different data distributions involved. Definitely something to try, though. It might also be that, with today’s multicore machines, one can skip one level of data distribution in favor of shared memory, which would be simpler.
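For the shared-memory option, a minimal sketch using Julia threads over bands instead of a second level of MPI distribution (`apply_H!` is a hypothetical stand-in for applying the Hamiltonian to one band):

```julia
using Base.Threads

# Hypothetical placeholder for applying the Hamiltonian to one band;
# a real code would FFT, multiply by the potential, and FFT back.
apply_H!(out, ψ) = (out .= 2 .* ψ)

npw, nbands = 5000, 64
Ψ  = randn(ComplexF64, npw, nbands)
HΨ = similar(Ψ)

# Bands are independent, so the loop parallelizes trivially over threads.
@threads for ib in 1:nbands
    apply_H!(view(HΨ, :, ib), view(Ψ, :, ib))
end
```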
I have not tried MPI.jl, but I don’t see why there should be any problems: MPI.jl just has to call the MPI primitives, which should have little interaction with the Julia runtime and work just as well as they do from Fortran/C.