I run iterative solvers on a Slurm cluster using DistributedArrays, and I have to communicate data between workers every iteration. This is really slow because there is no InfiniBand support (10 to 200 times slower than in-node data transfer).
Sadly, I didn’t know about the InfiniBand issue until the whole program was working. I only found the communication bottleneck recently while optimizing the code, and then the cluster administrator asked me whether my code uses InfiniBand to send messages. It’s too late now.
So, is there any plan to add InfiniBand support to Distributed.jl? Or should I just move all my code to PartitionedArrays.jl and MPI.jl?