Are there any Julia parallel frameworks which allow adding workers while a computation is running?

marius311 · July 31, 2025, 6:55pm

I have a large trivially parallel computation I’m running with pmap. Say it’s been going for a while with 64 workers, then another 64 cores become available on a shared cluster.

Are there any Julia frameworks that allow me to attach 64 new workers to the running process for a total of 128, and have the scheduler just get those into the loop so the pmap finishes the remaining computation twice as fast as it would have?

samtkaplan · July 31, 2025, 8:02pm

You can give the “elastic” parallel map in Schedulers.jl a try. I usually use it with AzManagers.jl, but it should work with any cluster manager. There are some caveats about which kinds of code get auto-loaded onto the new workers, so be careful there.

marius311 · July 31, 2025, 8:52pm

That looks great, thank you! Going to take a look at it.

drandran12 · August 1, 2025, 9:36am

I was chatting on Slack with @jpsamaroo and he proposed:

using Distributed, ClusterManagers
using ProgressMeter  # provides @showprogress
using DrWatson       # assumed source of projectdir()
# Start and initialize a few workers
ws = addprocs(
    SGEManager(2, `-l avx512 -l h_vmem=3G`, projectdir()); topology=:master_worker
)
@everywhere ws begin
    # Init worker
end
# Create initial worker pool
pool = WorkerPool(ws)
t = @async begin
    # Start and initialize additional workers
    new_ws = addprocs(
        SGEManager(100, `-l avx512 -l h_vmem=3G`, projectdir()); topology=:master_worker
    )
    @everywhere new_ws begin
        # Init worker
    end
    # Add to WorkerPool
    for w in new_ws
        push!(pool, w)
    end
end
# pmap takes the worker pool before the collection: pmap(f, pool, c)
result = @showprogress pmap(pool, sweep_range) do parameter
    heavy_computation(parameter)
end

However, when I tried it out today I hit a hurdle: statements like using PackageA need to be at top level, so syntax such as

t = @async begin
    @everywhere using PackageA
end

is not valid. If you don’t need to load packages at every worker the code works.

sgaure · August 1, 2025, 9:43am

Have you tried

t = @async begin
    @everywhere @eval using PackageA
end

drandran12 · August 1, 2025, 9:45am

@samtkaplan How does Schedulers.jl handle the prerequisites that workers need, e.g., @everywhere using PackageA? Will it automatically rerun the previous @everywhere statements?

Edit: Never mind, you can provide an init method.

drandran12 · August 1, 2025, 10:09am

@eval works
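
For anyone landing here later, this is a minimal sketch of the whole pattern with the @eval workaround in place. It makes some assumptions that are not from the posts above: two local workers started with addprocs(2) stand in for a real cluster manager, and the stdlib package Statistics stands in for PackageA.

```julia
using Distributed

# Assumption for this sketch: local workers stand in for a cluster manager.
ws = addprocs(2)
@everywhere ws begin
    using Statistics            # at top level a plain `using` is fine
end
pool = WorkerPool(ws)

t = @async begin
    # Start additional workers while the pmap below is already running
    new_ws = addprocs(2)
    # Inside a block, wrap `using` in @eval so that no top-level-only
    # statement ends up inside the task body
    @everywhere new_ws @eval using Statistics
    # Only hand the workers to the pool once they are initialized
    foreach(w -> push!(pool, w), new_ws)
    new_ws
end

result = pmap(pool, 1:20) do x
    sleep(0.2)                  # stand-in for a heavy computation
    mean([x, x + 2])
end

late_ws = fetch(t)              # ids of the workers that joined late
```

Pushing the new workers into the pool only after the @everywhere call has returned should keep pmap from dispatching work to a worker that has not loaded the package yet.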

samtkaplan · August 1, 2025, 12:20pm

Actually, it doesn’t use the “init” method for that. It tries to detect what is loaded and automatically load those things on new workers. It does not work for everything (e.g. struct definitions that are not defined in a package). There are two functions that try to do this auto loading:

github.com/ChevronETC/Schedulers.jl

src/Schedulers.jl

9736fd4eb

function load_modules_on_new_workers(pid)
    _names = names(Main; imported=true)
    for _name in _names
        try
            if isa(Base.eval(Main, _name), Module) && _name ∉ (:Base, :Core, :InteractiveUtils, :VSCodeServer, :Main, :_vscodeserver)
                remotecall_fetch(Base.eval, pid, Main, :(using $_name))
            end
        catch e
            @debug "caught error in load_modules_on_new_workers for module $_name"
            logerror(e, Logging.Debug)
        end
    end
    nothing
end

function load_functions_on_new_workers(pid)
    ignore = (Symbol("@enter"), Symbol("@run"), :ans, :eval, :include, :vscodedisplay)
    _names = filter(name->name ∉ ignore && isa(Base.eval(Main, name), Function), names(Main; imported=true))

    for _name in _names
        try
            remotecall_fetch(Base.eval, pid, Main, :(function $_name end))
        catch e
            @debug "caught error in load_functions_on_new_workers (function) for pid '$pid' and function '$_name'"
            logerror(e, Logging.Debug)
        end
        # (the embedded excerpt ends here; the rest of the function is not shown)

The “init” method is used for things like loading data from storage onto the new machine.
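
On the caveat about struct definitions that are not in a package: one possible workaround, sketched here with plain Distributed rather than any Schedulers.jl API, is to keep such definitions in a quoted expression and replay it on each late worker, using the same remotecall-plus-eval approach as the functions above. The names WORKER_SETUP, Point, norm2 and setup_worker are hypothetical and only serve the illustration, and a single local worker stands in for one that joins mid-run.

```julia
using Distributed

# Hypothetical definitions that live in a script rather than in a package.
const WORKER_SETUP = quote
    struct Point
        x::Float64
        y::Float64
    end
    norm2(p::Point) = p.x^2 + p.y^2
end

# Define them on the master as well
Core.eval(Main, WORKER_SETUP)

# Hypothetical helper: replay the definitions in Main on one new worker
setup_worker(pid) = remotecall_wait(Core.eval, pid, Main, WORKER_SETUP)

late = addprocs(1)              # stands in for a worker that joins mid-run
foreach(setup_worker, late)

# The late worker now knows both the struct and the function
value = remotecall_fetch(() -> norm2(Point(3.0, 4.0)), late[1])
```

Moving the definitions into a small package is probably the cleaner long-term option, since the auto-loading described above can then pick them up through using.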

marius311 · August 5, 2025, 3:58pm

@samtkaplan just tested Schedulers.jl and it’s exactly what I needed and worked great, many thanks!

#!/usr/bin/env julia -p 2 
# start with 2 workers

using Distributed, ElasticClusterManager, Schedulers

em = ElasticManager(addr=:auto, port=0)
# after the loop is running connect more workers with this command:
println(ElasticClusterManager.get_connect_cmd(em)) 

@everywhere function foo(i)
    println("Did work $i")
    sleep(2)
end

opt = SchedulerOptions(reporttasks=false)
epmap(opt, foo, 1:10)

rmprocs(workers())
