How to configure which port workers listen on

DrChainsaw · September 17, 2020, 8:16am

I’m trying to use ClusterManagers.jl to distribute jobs on an LSF cluster, but I have run into issues because the ports used by workers (9000-10000?) are not open.

Is it possible to specify a port range when the --worker argument is used, in a similar way to how the IP can be specified with --bind-to (perhaps `--bind-to $(hostname -i):$(rand[x:y])`)?

If not, where can I find what the --worker argument does so I can just write my own version of it? Or is this not going to be feasible?

DrChainsaw · September 17, 2020, 9:17am

Uhm, I should probably have tried before asking

--bind-to <ipaddress>:<port> seems to work.

EDIT: It works, but the port numbers for workers on the same IP must also be unique. How can I achieve this?

johnh · September 17, 2020, 10:22am

I would be interested to hear about your experiences.
I’m now going to get a bit boring - on large clusters the job launch mechanism uses munge for authentication. And large MPI jobs are launched using mechanisms which scale.
I guess by saying such things I ought to put my money where my mouth is and start to investigate them.

DrChainsaw · September 17, 2020, 10:46am

I will happily do so, but the rest of your post kinda makes me feel like this post wasn’t meant for me

I’m kinda fiddling with this in between other tasks and meetings, so progress is not super fast.

One slight annoyance was how to use the above to generate random port numbers for each worker, just mundane things like this:

julia> `--bind-to \`hostname -i\`:\$\(\(6000+RANDOM%1000\)\)`
`--bind-to \`hostname '-i\`:$((6000+RANDOM%1000))'`

Note the inserted single quotes, which prevent the random expression from being evaluated by the shell.
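To see the quoting effect in isolation, here is a minimal shell sketch. It uses a fixed stand-in variable `n` instead of `$RANDOM` (an assumption, made only so the result is reproducible), and it shows only how the shell treats the expression, not how Julia assembles the command.

```shell
# Stand-in for $RANDOM so the result is reproducible (illustrative value).
n=1234

# Inside single quotes the shell performs no expansion: the text stays literal.
quoted='$((6000 + $n % 1000))'

# Without the quotes, the arithmetic expansion is evaluated.
unquoted=$((6000 + $n % 1000))

echo "$quoted"     # prints the expression text unchanged
echo "$unquoted"   # prints 6234
```

So once the single quotes end up around the expression, the worker would receive the literal text instead of a port number.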

I don’t know enough about ports to know if it is important to randomize them so for now I just try with the same port for all workers.

DrChainsaw · September 18, 2020, 8:23pm

Ok, when scaling up I realized why one needs different port numbers: Workers on the same IP address of course need to have unique port numbers.

Am I correct in my understanding that it is not possible to use variables like $LSB_BATCH_JID or $RANDOM in a cmd? Any suggestions on what to do?

DrChainsaw · September 24, 2020, 8:36am

I managed to work around the issue by using perl instead of bash to compute the port number. For some reason, every variation of bash -c 'echo <expr>' that I tried resulted in unwanted single quotes in the final expression, whereas this does not happen with perl -le 'print <expr>'.

Here is the final expression I ended up using:

exeflags=`--bind-to \`hostname -i\`:\`perl -le 'print 6000+(10*$ENV{"LSB_JOBID"}+$ENV{"LSB_JOBINDEX"})%1000'\``
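For readers who want to check what that formula produces, here is a small shell sketch of the same arithmetic. The function name `worker_port` and the job ID and index values are made up for illustration; on the cluster the inputs would come from LSB_JOBID and LSB_JOBINDEX.

```shell
# Hypothetical helper: map a job ID ($1) and a job array index ($2) to a port
# in the range 6000-6999, using the same formula as the perl one-liner above.
worker_port() {
    echo $((6000 + (10 * $1 + $2) % 1000))
}

# Illustrative values standing in for LSB_JOBID and LSB_JOBINDEX.
worker_port 12345 3   # prints 6453
worker_port 12345 4   # prints 6454
```

Note that the range holds only 1000 ports, so collisions are still possible in principle: for example, two jobs whose IDs differ by a multiple of 100 and that share the same index map to the same port.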

johnh · September 24, 2020, 11:22am

Perl? Yay - the camel lives again.
I would make a flippant remark about why not use Julia…
Actually a serious question here - if you limit yourself to Base functions only, what is the startup time of a short script like that?
Note to self - again why not do some work and measure it?

DrChainsaw · September 24, 2020, 12:16pm

I guess I could, but paying the startup time just to print a number seems unnecessary.

It would be interesting to have a richer worker startup script, but I just could not muster the effort to dig out what that --worker flag does. Do you think it would make it possible to skip running that annoying @everywhere expression after workers are added? That could possibly alleviate that other issue I’m investigating.

Adding a script when the worker flag is present seems to prevent the script from executing. If all --worker does is print out the IP and port, then there is of course not much to duplicate. At one point I was contemplating just feeding it a malformed IP address and following the stacktrace to reverse engineer what happens, but once I got a working solution there were other problems to focus on.

johnh · September 24, 2020, 3:49pm

I know it is not possible to change your cluster scheduler…
Somehow I think array jobs would be useful here. I am sure LSF supports something similar:

https://slurm.schedmd.com/job_array.html
