MLJ and ClusterManagers

I want to do some machine learning using MLJ on a cluster of 200+ nodes. To hook my workers up to the main process I want to use an ElasticManager from the ClusterManagers package (this way I can also utilize workers that arrive late because they had to wait in our queueing system).

I usually initiate my cluster workers and let the worker processes execute some code like

```julia
using ClusterManagers
using XYZ

elastic_worker(cookie, ip, port)
```

where `cookie`, `ip`, and `port` have the correct values.
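For reference, the main-process side looks roughly like this (a minimal sketch, assuming ClusterManagers' `ElasticManager` keyword defaults; the address, port, and cookie are placeholders you would adapt):

```julia
using Distributed, ClusterManagers
using Sockets  # for IPv4

# Listen on all interfaces so queued workers can connect whenever they start.
em = ElasticManager(addr=IPv4("0.0.0.0"), port=9009, cookie="cookie")

# Each worker then runs `elastic_worker("cookie", ip, 9009)`;
# connected workers show up in `workers()` as they arrive.
```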

In general this scheme of “using-first-then-connect-worker” works and makes the package XYZ available on the worker, so I do not have to call `@everywhere using XYZ` after my cluster is up and running. Depending on XYZ and the number of nodes, such an `@everywhere using XYZ` can take ages: just try `@everywhere using DataFrames` on 200+ workers. Performance is miserable, probably because the 200 workers all want to load the very same files at the same time.

For a lot of packages this scheme works nicely. However, when XYZ is MLJ (my preferred machine learning package):

```julia
using ClusterManagers
using MLJ
ClusterManagers.init_worker("oo")
```

I get an error:

```
ERROR: LoadError: AssertionError: isempty(PGRP.refs)
Stacktrace:
 [1] init_worker(::String, ::Distributed.DefaultClusterManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:376
...
```

This is because MLJ is also trying to do something with workers at load time and populates `PGRP.refs`.
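You can see this in a fresh session by peeking at the internal dict the assertion checks (`Distributed.PGRP` is internal, so this is for inspection only):

```julia
using Distributed

Distributed.PGRP.refs  # empty in a fresh session

using MLJ

Distributed.PGRP.refs  # now non-empty -- this is what
                       # `@assert isempty(PGRP.refs)` in `init_worker` trips over
```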

Is there a way to load MLJ but not initiate any parallel functionality?


In this Slack thread @samuel_okon has suggested that the culprit is a remote channel initialised in `MLJBase/src/__init__.jl`. This channel is only used in testing, so there’s probably a work-around to get rid of it.
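For illustration, module-level initialisation of roughly this shape is enough to populate `Distributed.PGRP.refs` as soon as the package is loaded (a sketch of the pattern only, not the actual MLJBase code; `SomePackage` and `TESTING_CHANNEL` are made-up names):

```julia
module SomePackage  # hypothetical package illustrating the pattern

using Distributed

const TESTING_CHANNEL = Ref{RemoteChannel}()

function __init__()
    # Creating a RemoteChannel registers a reference with the local
    # process group, so Distributed.PGRP.refs is no longer empty and
    # a later `init_worker` call fails its assertion.
    TESTING_CHANNEL[] = RemoteChannel(() -> Channel{Bool}(1))
end

end
```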

I’ve created a branch `MLJBase#no-distributed` with the channel initialisation removed. @pfarndt, can you see if this resolves your issue?
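To try it out, something like this should track the branch:

```julia
# In the Pkg REPL (press `]`):
#   pkg> add MLJBase#no-distributed

# Or, equivalently, via the Pkg API:
using Pkg
Pkg.add(PackageSpec(name="MLJBase", rev="no-distributed"))
```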

Yes, `MLJBase#no-distributed` resolves my issue. I hope you can get rid of this workaround. Thanks!


Great, that’s good to hear.

Opened an issue: https://github.com/alan-turing-institute/MLJBase.jl/issues/496

Now sorted: https://github.com/JuliaRegistries/General/pull/26575
