I have a large trivially parallel computation I’m running with pmap. Say its been going for a while with 64 workers, then another 64 cores become available on a shared cluster.
Are there any Julia frameworks that allow me to attach 64 new workers to the running process for a total of 128, and have the scheduler just get those into the loop so the pmap finishes the remaining computation twice as fast as it would have?
You can give the “elastic” parallel map in Schedulers.jl a try. I usually use it with AzManagers.jl. But, it should work with any cluster manager. There are some caveats with respect to what type of code is auto-loaded onto the new workers that you need to be careful of.
I was chatting on slack with @jpsamaroo and he proposed:
using Distributed, ClusterManagers
# Start and initialize a few workers
ws = addprocs(
SGEManager(2, `-l avx512 -l h_vmem=3G`, projectdir()); topology=:master_worker
)
@everywhere ws begin
# Init worker
end
# Create initial worker pool
pool = WorkerPool(ws)
t = @async begin
# Start and initialize additional workers
new_ws = addprocs(
SGEManager(100, `-l avx512 -l h_vmem=3G`, projectdir()); topology=:master_worker
)
@everywhere new_ws begin
# Init worker
end
# Add to WorkerPool
for w in new_ws
push!(pool, w)
end
end
result = @showprogress pmap(sweep_range, pool) do parameter
heavy_computation(parameter)
end
However, I was trying it out today and the hurdle I have now is statement like using PackageA need to be at top level. Hence, syntax as
t = @async begin
@everywhere using PackageA
end
is not valid. If you don’t need to load packages at every worker the code works.
@samtkaplan How does shedulers handle the prerequisites that workers needs, e.g., @everywhere using PackageA? Will it automatic run the previous everywhere statements?
Actually, it doesn’t use the “init” method for that. It tries to detect what is loaded and automatically load those things on new workers. It does not work for everything (e.g. struct definitions that are not defined in a package). There are two functions that try to do this auto loading:
The “init” merhod is used for things like loading data from storage onto the new machine.