I want to do some machine learning using `MLJ` on a cluster of 200+ nodes. To hook up my workers to the main process I want to use an `ElasticManager` from the `ClusterManagers` package (this way I can also use workers that arrive late because they had to wait in our queueing system).
I usually start my cluster workers and have each worker process execute code like:
using ClusterManagers
using XYZ
elastic_worker(cookie, ip, port)
where `cookie`, `ip`, and `port` have the correct values.
In general this "using-first-then-connect-worker" scheme works and makes the package `XYZ` available on the worker. This way I do not have to call `@everywhere using XYZ` after my cluster is up and running. Depending on `XYZ` and the number of nodes, such an `@everywhere using XYZ` can take ages. If you dare, try `@everywhere using DataFrames` on 200+ workers. Performance is miserable, probably because the 200 workers all want to load the very same files at the same time.
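To get a rough feel for this cost, one could time the post-hoc load on a small scale. The sketch below assumes two local workers started with `addprocs` as stand-ins for real cluster nodes, and uses the `Dates` standard library in place of a heavy package like `DataFrames`; both are placeholders and not the setup described in this post.

```julia
using Distributed

# Assumption: two local workers stand in for the real cluster nodes.
pids = addprocs(2)

# Time how long it takes to load a package on all processes after they are connected.
# `Dates` is only a lightweight placeholder for a heavy package such as DataFrames.
t0 = time()
@everywhere using Dates
elapsed = time() - t0

println("Loading on ", nworkers(), " workers took ", round(elapsed; digits=2), " s")

# Clean up the stand-in workers.
rmprocs(pids)
```

On a real cluster the same measurement, taken with the actual package and worker count, would show how much the "using-first-then-connect-worker" scheme saves.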
For a lot of packages this scheme works nicely. However, when `XYZ` is `MLJ` (my preferred machine learning package):
using ClusterManagers
using MLJ
ClusterManagers.init_worker("oo")
I get an error:
ERROR: LoadError: AssertionError: isempty(PGRP.refs)
Stacktrace:
[1] init_worker(::String, ::Distributed.DefaultClusterManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:376
...
This is because `MLJ` is also trying to do something with workers and populates `PGRP.refs`.
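For illustration, a non-empty `PGRP.refs` can be produced without `MLJ`. The sketch below assumes that creating a `RemoteChannel` on the current process is one way an entry ends up in `PGRP.refs`; whether `MLJ` does exactly this at load time is not confirmed here. Note that `Distributed.PGRP` is an internal, undocumented global, so this is meant for diagnosis only.

```julia
using Distributed

# Distributed.PGRP is an internal global (not public API); inspect it only for debugging.
refs_empty_before = isempty(Distributed.PGRP.refs)

# Creating a RemoteChannel owned by this process registers a remote reference locally.
rc = RemoteChannel(() -> Channel{Int}(1))

refs_empty_after = isempty(Distributed.PGRP.refs)

println("refs empty before: ", refs_empty_before)
println("refs empty after:  ", refs_empty_after)
```

Running the same `isempty(Distributed.PGRP.refs)` check right after `using MLJ` in a fresh session would show whether loading the package alone is enough to trip the assertion in `init_worker`.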
Is there a way to load `MLJ` without initiating any parallel functionality?