Distributed computing for functions in scripts inside a local module?

dave.f.kleinschmidt · March 29, 2023, 2:59pm

Nothing I can share publicly unfortunately, but here’s a lightly sanitized version:

module MyStuff

using Distributed
using Pkg

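# asynchronously create `n_workers` worker processes; returns a Task that
# resolves to the vector of PIDs, so call `fetch` on the result
# (`LocalManager` takes explicit `(np, restrict)` arguments: one process per call here)
function create_worker_processes(n_workers, manager=Distributed.LocalManager(1, true); revise=false)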
    tasks = map(1:n_workers) do n
        @async create_worker_process(manager; revise, n)
    end

    return @async map(fetch, tasks)
end

function create_worker_process(manager; revise=false, n=nothing)
    n_str = n === nothing ? "" : " $(n)"
    worker_str = string("worker", n_str)
    @info "Requesting $(worker_str)..."
    # `only` throws unless the manager launches exactly one process per call
    pid = only(addprocs(manager))
    # make sure we activate the ACTUAL PROJECT that's active on the manager,
    # which may be different than `@.` during e.g. CI runs
    project = Pkg.project().path
    Distributed.remotecall_eval(Main, pid,
                                :(using Pkg; Pkg.activate($(project))))
    if revise
        @info "Loading Revise on $(worker_str)..."
        Distributed.remotecall_eval(Main, pid, :(using Revise))
    end
    @info "Loading MyStuff on $(worker_str)..."
    Distributed.remotecall_eval(Main, pid, :(using MyStuff))
    @info "$(worker_str) ready, PID $(pid)"
    return pid
end

end # module
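
One thing worth spelling out about the module above: create_worker_processes returns a Task rather than the PIDs themselves, so the caller has to fetch it. Here is a minimal sketch of just that pattern. The names fake_create_worker and fake_create_workers are hypothetical stand-ins that sleep instead of calling addprocs, so this runs without spawning any processes:

```julia
# Hypothetical stand-in for `create_worker_process`: sleeps instead of calling `addprocs`.
function fake_create_worker(n)
    sleep(0.2)        # pretend this is slow worker startup
    return 100 + n    # pretend this is the new worker's PID
end

# Same shape as `create_worker_processes`: one task per worker, wrapped in an outer task.
function fake_create_workers(n_workers)
    tasks = map(1:n_workers) do n
        @async fake_create_worker(n)
    end
    return @async map(fetch, tasks)
end

pending = fake_create_workers(4)   # returns right away with a Task
is_task = pending isa Task         # true: nothing has been waited on yet
pids = fetch(pending)              # blocks until all four "workers" are ready
```

Because every startup runs in its own task, the waits overlap instead of adding up, and the caller decides when to block.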

Passing the manager around adds a bit of cognitive overhead, but it’s very useful when you’re juggling, say, different kinds of K8s resources (GPU-equipped pods for training, CPU-only pods for batching, etc.). In that case, we usually have another layer like

function provision_workers(config)
    # both calls return immediately with a Task, so the two pools start up concurrently
    train_workers = create_worker_processes(config.n_train_workers, train_manager(config))
    batch_workers = create_worker_processes(config.n_batch_workers, batch_manager(config))
    # block until every worker in both pools is ready
    return (; train_workers = fetch(train_workers), batch_workers = fetch(batch_workers), config)
end

Then we pass this “harness” to the functions that actually do the training/batching work so they know which workers to use. That is, there’s no magical distributed execution: the user/driver script has to specify which resources to use. This just makes it more convenient to set up those resources, so the user doesn’t need to run @everywhere using MyStuff, set the project, etc. by hand.
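
To make the "pass the harness around" part concrete, here is a rough sketch of what a consuming function could look like. This is not our actual code: run_batches is a hypothetical name, plain local workers from addprocs stand in for the K8s-backed pools, and the harness is built by hand with the same field names as the NamedTuple above.

```julia
using Distributed

# Hypothetical harness: local workers stand in for the GPU/CPU pods.
harness = (; train_workers = addprocs(1), batch_workers = addprocs(2))

# Hypothetical batching function: only uses the batch workers it is handed.
function run_batches(harness, inputs)
    pool = WorkerPool(harness.batch_workers)
    return pmap(x -> x^2, pool, inputs)
end

results = run_batches(harness, 1:6)

# Targeting one specific worker works the same way: pass its PID explicitly.
train_pid = first(harness.train_workers)
where_it_ran = remotecall_fetch(myid, train_pid)

# Shut the stand-in workers down again.
rmprocs(vcat(harness.train_workers, harness.batch_workers))
```

The point is that the worker PIDs are ordinary data: the driver decides which pool a function gets, and the function restricts itself to that pool via WorkerPool instead of relying on the global workers() list.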
