I am running a distributed workload using Distributed.jl on a cluster that has IP over InfiniBand (IPoIB) set up. However, when the workers are created, they automatically report the IP address of the slow Ethernet interface. A workaround I found is to edit line 1279 of cluster.jl in stdlib/Distributed, changing the bind address from the result of getipaddr() to the IP address of the InfiniBand interface. I am aware that you can change the bind address of a worker via the --bind-to flag; however, this applies only to a single worker and is complicated if you can only access the cluster through a reservation system. So my question is: is there a way to set a preferred interface for Julia workers without having to compile a custom Julia version?
Alternative solutions using for example the ClusterManager are also very welcome.
I had similar problems in the past and solved them by adding these lines in lsf:
cmd = `cd $dir ";" hostname -i "|" xargs $exename $exeflags $(worker_arg) --bind-to `
bsub_cmd = `bsub -I -x $(manager.flags) -cwd $dir -J $jobname "$cmd"`
That way the worker was getting its correct IP address and passing it to --bind-to. There may be a cleaner solution (I was much less familiar with Julia back then, and there appear to be some cleaner options in the chat, YMMV :)), but that system has since been decommissioned and I thankfully didn’t need the workaround anymore.
A general point about “compiling a custom Julia version”: Julia is a dynamic language, so if you only need to edit one line to make this work, you can do
@eval Distributed function init_bind_addr()
That replaces the function, and Julia should automatically recompile any dependent code as needed at runtime.
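To make the mechanism concrete, here is a minimal sketch on a stand-in module rather than on Distributed itself. Everything in it is an assumption for illustration: the module Demo, its function bind_addr, the helper pick_addr, and the "10.1." subnet prefix are all hypothetical. For the real thing you would copy the body of init_bind_addr from the cluster.jl of your own Julia version and change only the getipaddr() call, since that function is an internal and may differ between releases.

```julia
using Sockets

# Stand-in module (hypothetical) playing the role of Distributed.
module Demo
using Sockets
# "Original" behaviour: use whatever getipaddr() reports.
bind_addr() = string(getipaddr())
end

# Hypothetical helper: return the first candidate address whose textual form
# starts with the given subnet prefix, or `nothing` if none matches.
function pick_addr(prefix::AbstractString, candidates = getipaddrs(IPv4))
    i = findfirst(a -> startswith(string(a), prefix), candidates)
    return i === nothing ? nothing : candidates[i]
end

# Replace Demo.bind_addr at runtime. The definition is evaluated inside Demo,
# so getipaddr resolves there; pick_addr is interpolated in from this scope.
@eval Demo function bind_addr()
    addr = $(pick_addr)("10.1.")   # "10.1." is a placeholder IPoIB prefix
    # Fall back to the original behaviour when no interface matches.
    return string(addr === nothing ? getipaddr() : addr)
end
```

Note that an override like this only changes the process in which it is evaluated, so for workers it would have to run on each worker before the bind address is initialised.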
Hope this helps,