Custom transport for Distributed.jl to use InfiniBand and avoid MPI?

I was wondering if there’s any prospect of Distributed.jl and/or Dagger.jl supporting interconnects such as InfiniBand on HPC clusters?

There’s a nice summary of the downsides of MPI.jl and Distributed.jl in this post:

with the main downside of Distributed.jl being the missing InfiniBand support. It would be great to avoid having to switch to MPI just for this.

I found only one post addressing this, saying that a custom transport could be implemented, but that doing so is not easy:

Now I’m no expert on any of this, but I think it would be a big advantage if Julia’s built-in Distributed.jl could leverage InfiniBand interconnects, especially for people (like me) who do not want to rethink their whole code in MPI style. I’d be happy to hear if there are any projects or intentions to support this!

Thanks :slight_smile:


Shooting in the dark, would UCX.jl (JuliaParallel/UCX.jl on GitHub) work here?

I don’t know UCX or what exactly it does, although the UCX GitHub page does mention InfiniBand communication. Looking at the UCX.jl example, it does not seem like a significant improvement over MPI.jl. Or do you mean that UCX could serve as the basis for implementing a transport for Distributed.jl?

UCX implements support for a large number of network fabrics; OpenMPI uses UCX to access those fabrics and other communication schemes. I think that UCX.jl will probably be the future of custom transports for Distributed data transfer.

@vchuravy has been working on getting UCX.jl to underpin Distributed, and has had recent success with implementing a UCX-backed IOStream that Distributed should be able to use.
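
For anyone wondering where such a stream would plug in, here is a minimal sketch of the hook Distributed exposes for custom transports. It is only an illustration under stated assumptions: `SketchManager` and its `hosts` field are hypothetical names, and a plain TCP socket stands in for the fabric-backed IO object, since I don't know what the UCX.jl stream API will look like.

```julia
using Distributed
using Sockets

# Hypothetical manager type; the name and the `hosts` field are illustrative only.
struct SketchManager <: Distributed.ClusterManager
    hosts::Vector{String}
end

# Distributed calls `connect` to obtain the pair of IO streams it uses to
# exchange messages with worker `pid`.
# Assumption: a TCP socket stands in here; a fabric-backed transport would
# return its own IO object from this method instead.
function Distributed.connect(manager::SketchManager, pid::Int,
                             config::Distributed.WorkerConfig)
    sock = Sockets.connect(config.host, config.port)
    return (sock, sock)  # (read stream, write stream)
end
```

This only defines the transport hook. A usable cluster manager would also need `launch` and `manage` methods, and the worker side has to be started so that it speaks the same transport, which is presumably where most of the difficulty mentioned above lies.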