How to configure which ports workers listen on

I’m trying to use ClusterManagers.jl to distribute jobs on an LSF cluster, but I have run into issues because the ports used by workers (9000-10000?) are not open.

Is it possible to specify a port range when the `--worker` argument is used, in a similar way to how the IP can be specified with `--bind-to` (perhaps something like `--bind-to $(hostname -i):$(rand[x:y])`)?

If not, where can I find out what the `--worker` argument does, so I can just write my own version of it? Or is this not going to be feasible?

Uhm, I should probably have tried before asking :slight_smile:

`--bind-to <ipaddress>:<port>` seems to work.

EDIT: It works, but port numbers for workers on the same IP must also be unique. How can this be achieved?

I would be interested to hear about your experiences.
I’m now going to get a bit boring - on large clusters the job launch mechanism uses munge for authentication. And large MPI jobs are launched using mechanisms which scale.
I guess by saying such things I ought to put my money where my mouth is and start to investigate them.

I will happily do so, but the rest of your post kinda makes me feel like this post wasn’t meant for me :slight_smile:

I’m kinda fiddling with this in between other tasks and meetings, so progress is not super fast.

One slight annoyance was figuring out how to use the above to generate random port numbers for each worker; mundane things like this:

```julia
julia> `--bind-to \`hostname -i\`:\$\(\(6000+RANDOM%1000\)\)`
`--bind-to `hostname '-i`:$((6000+RANDOM%1000))'`
```

Note the inserted single quotes, which prevent the random expression from being evaluated by the shell.
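To illustrate the quoting point with a standalone bash sketch (not from the thread above; the constant `123` stands in for `$RANDOM` so the result is deterministic): an unquoted arithmetic expansion is evaluated by the local shell on the spot, while a single-quoted one survives as literal text that a remote shell can evaluate later.

```shell
# Single quotes keep the arithmetic expansion as literal text,
# so it can be shipped to a remote shell unevaluated.
literal='$((6000+123%1000))'
echo "$literal"

# Unquoted, the local shell evaluates it immediately.
evaluated=$((6000+123%1000))
echo "$evaluated"
```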

I don’t know enough about ports to know if it is important to randomize them so for now I just try with the same port for all workers.

Ok, when scaling up I realized why one needs different port numbers: Workers on the same IP address of course need to have unique port numbers.

Am I correct in my understanding that it is not possible to use variables like `$LSB_BATCH_JID` or `$RANDOM` in a `Cmd`? Any suggestions on what to do?

I managed to work around the issue by using perl to compute the port number instead of bash. For some reason, any permutation of `bash -c 'echo <expr>'` resulted in unwanted single quotes in the final expression, and this does not happen when using `perl -le 'print <expr>'`.

Here is the final expression I ended up using:

```julia
exeflags=`--bind-to \`hostname -i\`:\`perl -le 'print 6000+(10*$ENV{"LSB_JOBID"}+$ENV{"LSB_JOBINDEX"})%1000'\``
```

Perl? Yay - the camel lives again.
I would make a flippant remark about why not use Julia…
Actually a serious question here - if you limit yourself to Base functions only, what is the startup time of a short script like that?
Note to self - again why not do some work and measure it?

I guess I could, but paying Julia’s startup time just to print a number seems unnecessary.

It would be interesting to have a richer worker startup script, but I just could not muster the effort to dig into what that `--worker` flag does. Do you think it would allow for not running that annoying `@everywhere` expression after workers are added? That could possibly alleviate that other issue I’m investigating.

Adding a script when the worker flag is present seems to prevent the script from executing. If all `--worker` does is print the IP and port, then it is of course not much to duplicate. At one point I was contemplating just feeding it a malformed IP address and following the stack trace to reverse engineer what happens, but once I got a working solution there were other problems to focus on.

I know it is not possible to change your cluster scheduler…
Somehow I think array jobs would be useful here. I am sure LSF supports something similar:

https://slurm.schedmd.com/job_array.html
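To tie the array-job idea back to the port problem: each array task gets its own index (`LSB_JOBINDEX` on LSF, `SLURM_ARRAY_TASK_ID` on Slurm), so a distinct port per task falls out of the same modular arithmetic used above. A pure-bash sketch simulating three task indices (the job id `123456` is made up):

```shell
# Simulate the indices an array job would hand out, and derive
# a distinct port for each task from (job id, task index).
jobid=123456
for index in 1 2 3; do
    port=$((6000 + (10*jobid + index) % 1000))
    echo "task $index -> port $port"
done
```

Each task lands on a different port, so workers launched on the same host no longer clash.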