Distributed computing for functions in scripts inside a local module?

Nothing I can share publicly unfortunately, but here’s a lightly sanitized version:

module MyStuff

using Distributed
using Pkg

# asynchronously create `n_workers` worker processes;
# returns a Task whose result is the vector of worker PIDs
function create_worker_processes(n_workers, manager=Distributed.LocalManager(); revise=false)
    tasks = map(1:n_workers) do n
        @async create_worker_process(manager; revise, n)
    end

    return @async map(fetch, tasks)
end

function create_worker_process(manager; revise=false, n=nothing)
    n_str = n === nothing ? "" : " $(n)"
    worker_str = string("worker", n_str)
    @info "Requesting $(worker_str)..."
    pid = only(addprocs(manager))
    # make sure we activate the ACTUAL PROJECT that's active on the manager,
    # which may be different than `@.` during e.g. CI runs
    project = Pkg.project().path
    Distributed.remotecall_eval(Main, pid,
                                :(using Pkg; Pkg.activate($(project))))
    if revise
        @info "Loading Revise on $(worker_str)..."
        Distributed.remotecall_eval(Main, pid, :(using Revise))
    end
    @info "Loading MyStuff on $(worker_str)..."
    Distributed.remotecall_eval(Main, pid, :(using MyStuff))
    @info "$(worker_str) ready, PID $(pid)"
    return pid
end

end # module
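Note that create_worker_processes returns a Task rather than the PIDs themselves, so the caller decides when to block on it. A minimal sketch of that nested-task pattern, with a sleep standing in for the slow addprocs call (the names here are illustrative only, not part of MyStuff):

```julia
# Each inner task stands in for one `create_worker_process` call.
tasks = map(1:3) do n
    @async begin
        sleep(0.1)   # placeholder for the slow worker startup
        n * 10       # placeholder for the returned PID
    end
end

# The outer task completes once every inner task has completed.
all_done = @async map(fetch, tasks)

# ...the caller can do other setup here while the workers spin up...

pids = fetch(all_done)   # blocks until everything is ready
```
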

Passing around the manager is a bit of extra cognitive overhead, but very useful when you’re juggling, say, different kinds of K8s resources (GPU-equipped pods for training, CPU-only pods for batching, etc.). In that case, we usually have another layer like:

function provision_workers(config)
    # both calls return a Task immediately, so the two groups spin up concurrently
    train_workers = create_worker_processes(config.n_train_workers, train_manager(config))
    batch_workers = create_worker_processes(config.n_batch_workers, batch_manager(config))
    # block until every worker in both groups is ready
    return (; train_workers = fetch(train_workers), batch_workers = fetch(batch_workers), config)
end

Then we pass this “harness” to the functions that actually do the training/batching work so they know which workers to use. That is, there’s no magical distributed execution: the user/driver script has to specify which resources to use. This just makes it more convenient to set up those resources so that they can be used effectively, without the user having to do @everywhere using MyStuff, set the project, etc.
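As a rough sketch of what consuming such a harness could look like: the harness below is a hand-built stand-in using local processes, and run_batches is a hypothetical function, not something from the actual codebase. The point is only that the work function is told explicitly which PIDs to use, here via a WorkerPool passed to pmap.

```julia
using Distributed

# Hypothetical stand-in for the result of `provision_workers`:
# plain local processes instead of K8s-backed workers.
harness = (; train_workers = addprocs(1), batch_workers = addprocs(2))

# Hypothetical work function: it only uses the workers it is handed.
function run_batches(harness, inputs)
    pool = WorkerPool(harness.batch_workers)
    return pmap(x -> x^2, pool, inputs)
end

results = run_batches(harness, 1:4)

# Targeting a specific worker directly works the same way:
train_pid = remotecall_fetch(myid, first(harness.train_workers))

# Clean up the demo processes.
rmprocs(vcat(harness.train_workers, harness.batch_workers))
```

In a real setup the work functions would call code from MyStuff, which is already loaded on every worker by create_worker_process.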
