MLJ and ClusterManagers

I want to do some machine learning using MLJ on a cluster of 200+ nodes. To hook my workers up to the main process I want to use an ElasticManager from the ClusterManagers package (this way I can also utilize workers that arrive late because they had to wait in our queueing system).

I usually initiate my cluster workers and let the worker processes execute some code like

```julia
using ClusterManagers
using XYZ

elastic_worker(cookie, ip, port)
```

where `cookie`, `ip`, and `port` have the correct values.
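For reference, the main-process side looks roughly like this (a minimal sketch, assuming ClusterManagers' `ElasticManager` keyword defaults; the address, port, and cookie are placeholders you would adapt):

```julia
using Distributed, ClusterManagers
using Sockets  # for IPv4

# Listen on all interfaces so queued workers can connect whenever they start.
em = ElasticManager(addr=IPv4("0.0.0.0"), port=9009, cookie="cookie")

# Each worker then runs `elastic_worker("cookie", ip, 9009)`;
# connected workers show up in `workers()` as they arrive.
```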

In general this scheme of “using-first-then-connect-worker” works and makes the package XYZ available on the worker, so I do not have to call `@everywhere using XYZ` after my cluster is up and running. Depending on XYZ and the number of nodes, such an `@everywhere using XYZ` can take ages: just try `@everywhere using DataFrames` on 200+ workers. Performance is miserable, probably because the 200 workers all want to load the very same files at the same time.

For a lot of packages this scheme works nicely. However, when XYZ is MLJ (my preferred machine learning package):

```julia
using ClusterManagers
using MLJ
ClusterManagers.init_worker("oo")
```

I get an error:

```
ERROR: LoadError: AssertionError: isempty(PGRP.refs)
Stacktrace:
 [1] init_worker(::String, ::Distributed.DefaultClusterManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:376
...
```

This is because MLJ is also trying to do something with workers at load time and populates `PGRP.refs`.
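You can see this in a fresh session by peeking at the internal dict the assertion checks (`Distributed.PGRP` is internal, so this is for inspection only):

```julia
using Distributed

Distributed.PGRP.refs  # empty in a fresh session

using MLJ

Distributed.PGRP.refs  # now non-empty -- this is what
                       # `@assert isempty(PGRP.refs)` in `init_worker` trips over
```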

Is there a way to load MLJ but not initiate any parallel functionality?


In this Slack thread @samuel_okon has suggested that the culprit is a remote channel initialised in `MLJBase/src/__init__.jl`. This channel is only used in testing, so there’s probably a work-around to get rid of it.
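For illustration, module-level initialisation of roughly this shape is enough to populate `Distributed.PGRP.refs` as soon as the package is loaded (a sketch of the pattern only, not the actual MLJBase code; `SomePackage` and `TESTING_CHANNEL` are made-up names):

```julia
module SomePackage  # hypothetical package illustrating the pattern

using Distributed

const TESTING_CHANNEL = Ref{RemoteChannel}()

function __init__()
    # Creating a RemoteChannel registers a reference with the local
    # process group, so Distributed.PGRP.refs is no longer empty and
    # a later `init_worker` call fails its assertion.
    TESTING_CHANNEL[] = RemoteChannel(() -> Channel{Bool}(1))
end

end
```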

I’ve created a branch `MLJBase#no-distributed` with the channel initialisation removed. @pfarndt, can you see if this resolves your issue?
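To try it out, something like this should track the branch:

```julia
# In the Pkg REPL (press `]`):
#   pkg> add MLJBase#no-distributed

# Or, equivalently, via the Pkg API:
using Pkg
Pkg.add(PackageSpec(name="MLJBase", rev="no-distributed"))
```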

Yes, `MLJBase#no-distributed` resolves my issue. I hope you can get rid of this workaround. Thanks!


Great, that’s good to hear.

Opened an issue: https://github.com/alan-turing-institute/MLJBase.jl/issues/496

Now sorted: https://github.com/JuliaRegistries/General/pull/26575
