I’m trying to use ClusterManagers.jl to distribute jobs on an LSF cluster, but I have run into issues because the ports used by the workers (9000-10000?) are not open.
Is it possible to specify a port range when the --worker argument is used, in a similar way to how the IP can be specified with --bind-to (perhaps `--bind-to $(hostname -i):$(rand[x:y])`)?
If not, where can I find out what the --worker argument does, so that I can write my own version of it? Or is this not going to be feasible?
I would be interested to hear about your experiences.
I’m now going to get a bit boring - on large clusters the job launch mechanism uses munge for authentication. And large MPI jobs are launched using mechanisms which scale.
I guess by saying such things I ought to put my money where my mouth is and start to investigate them.
I managed to work around the issue by using perl instead of bash to compute the port number. Every permutation of bash -c 'echo <expr>' that I tried left unwanted single quotes in the final expression, whereas perl -le 'print <expr>' does not. I have not worked out why.
Perl? Yay - the camel lives again.
I would make a flippant remark about why not use Julia…
Actually a serious question here - if you limit yourself to Base functions only, what is the startup time of a short script like that?
Note to self - again why not do some work and measure it?
I guess I could, but paying Julia’s startup time just to print a number seems unnecessary.
It would be interesting to have a richer worker startup script, but I could not muster the effort to dig out what the --worker flag does. Do you think it would make it possible to skip that annoying @everywhere expression after workers are added? That could possibly alleviate the other issue I’m investigating.
Adding a script when the --worker flag is present seems to prevent the script from executing. If all --worker does is print the IP and port, then of course there is not much to duplicate. At one point I was contemplating feeding it a malformed IP address and following the stack trace to reverse-engineer what happens, but once I had a working solution there were other problems to focus on.
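On the "printing out the IP and port" part: as far as I can tell from reading the Distributed standard library, a worker announces itself on stdout with a single line of the form `julia_worker:PORT#IP`, which the launching side then parses. Treat that format as an assumption to check against your Julia version. If it holds, a wrapper script could pick the line apart like this (the sample line is made up):

```shell
#!/bin/sh
# Sample line in the assumed announcement format (port and address are invented).
line='julia_worker:9123#10.0.0.5'

# Strip the fixed prefix, then split the remainder on the first '#'.
rest=${line#julia_worker:}
port=${rest%%#*}
ip=${rest#*#}

echo "worker listens on $ip, port $port"
```

This only uses POSIX parameter expansion, so it avoids the nested quoting trouble with bash -c and echo mentioned above.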