I have just tagged a new version of MPI.jl. Though the user-facing interface is largely the same, there has been extensive work underneath to internally use the C API (instead of the Fortran one). As a result, the build process is much simpler (it no longer requires CMake or a Fortran compiler). Additionally, it also directly supports CUDA-aware MPI libraries, allowing CuArrays to be passed directly as buffers (thanks to Seyoon Ko).
I would greatly appreciate if people are able to try it out, especially with different clusters and MPI implementations.