I run iterative solvers on a Slurm cluster using DistributedArrays.jl, and I have to communicate data between workers every iteration. This is really slow because there is no InfiniBand support (10–200 times slower than in-node data transfer).
Sadly, I didn’t know about the InfiniBand issue until the whole program was working. I only found the communication bottleneck recently while optimizing the code, and then the cluster administrator asked me whether my code uses InfiniBand to send messages. By then it was too late.
So, is there any plan to add InfiniBand support to Distributed.jl? Or shall I just move all my code to PartitionedArrays.jl and MPI.jl?
I guess you mean Infiniband…
Have a look at this thread: Custom transport for Distributed.jl to utilize Infiniband and avoid MPI?
Because Distributed.jl is part of Julia, can you create an issue (feature request) at Issues · JuliaLang/julia · GitHub?
Yes, I have noticed the discussion about UCX.jl, but there isn’t any documentation for it and I’m not familiar with UCX at all. I wish I could contribute something to the package, but for now I think I’d better switch my code to MPI to stay on schedule.
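For anyone facing the same switch: a minimal sketch of what per-iteration neighbor communication looks like with MPI.jl instead of Distributed.jl. This is an illustrative example, not code from the thread; the halo-exchange pattern and array sizes are assumptions. With a properly built MPI library, the InfiniBand transport is picked up automatically, which is the whole point of the move.

```julia
# Minimal halo-exchange sketch with MPI.jl (hypothetical example).
# Run with e.g.:  mpiexec -n 4 julia halo.jl
# A UCX- or verbs-enabled MPI will use InfiniBand transparently.
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# Each rank owns a local block of the distributed vector.
local_data = fill(Float64(rank), 8)

# 1-D neighbor topology; MPI.PROC_NULL makes boundary exchanges no-ops.
left  = rank == 0          ? MPI.PROC_NULL : rank - 1
right = rank == nprocs - 1 ? MPI.PROC_NULL : rank + 1

recv_left  = Ref(0.0)
recv_right = Ref(0.0)

# Sendrecv! pairs the send and receive, avoiding deadlock:
# send my right boundary to the right neighbor, receive its left boundary.
MPI.Sendrecv!(Ref(local_data[end]), recv_right, comm; dest=right, source=right)
MPI.Sendrecv!(Ref(local_data[1]),   recv_left,  comm; dest=left,  source=left)

MPI.Finalize()
```

The same pattern repeats inside the solver loop each iteration; only the two boundary values cross the network instead of the whole local block.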