The ultimate guide to distributed computing

@Shazman the repository contains a minimal example that you can download and run. You can modify it to suit your needs, but that is off-topic here.

@stephenll One possible use case is running computations in parallel when the individual computation times vary significantly.

From the @threads documentation:

The iteration space is split among the threads, after which each thread writes its thread ID to its assigned locations

So basically, when running your parallel computation with Threads.@threads, each computation is assigned to a thread a priori, whereas pmap assigns a computation to a worker as soon as one becomes available. E.g. if you have one computation that takes an order of magnitude longer than the others, with Threads.@threads the thread that gets the long one still has to run the other tasks assigned to it, whereas with pmap one worker handles the long computation while the other workers deal with the faster ones.
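The difference between the two assignment strategies can be sketched as a small simulation that needs no threads or workers at all. This is only a toy model: the function names `static_makespan` and `dynamic_makespan` are hypothetical, and the contiguous near-equal chunking is an assumption meant to mimic an up-front split of the iteration space, not a statement about how Julia schedules in every version.

```julia
# Total wall time when items are split up front into contiguous, near-equal
# chunks, one chunk per worker (assumed model of a static split).
function static_makespan(durations::AbstractVector{<:Real}, nworkers::Integer)
    len, extra = divrem(length(durations), nworkers)
    longest = 0.0
    start = 1
    for w in 1:nworkers
        chunk = len + (w <= extra ? 1 : 0)     # first `extra` workers get one more item
        stop = start + chunk - 1
        if chunk > 0
            longest = max(longest, float(sum(durations[start:stop])))
        end
        start = stop + 1
    end
    return longest
end

# Total wall time when each item goes to whichever worker is free first,
# i.e. the pmap-style assignment described above.
function dynamic_makespan(durations::AbstractVector{<:Real}, nworkers::Integer)
    free_at = zeros(Float64, nworkers)         # time at which each worker is idle again
    for d in durations
        w = argmin(free_at)                    # earliest available worker
        free_at[w] += d
    end
    return maximum(free_at)
end

durations = collect(1:10)                      # task i takes i seconds
static_4  = static_makespan(durations, 4)      # 19.0
dynamic_4 = dynamic_makespan(durations, 4)     # 18.0
static_1  = static_makespan(durations, 1)      # 55.0
```

In this model the gap depends heavily on where the long items land: static chunking loses the most when the expensive items cluster in one chunk. For what it's worth, four workers give 18 s and a single worker gives 55 s for durations 1 to 10, which would be consistent with the timings reported in this post if the threaded run used one thread and the pmap run used four workers, though neither count is stated.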

I’ve added an example below. One thing that I have not considered is the possible slowdown due to the additional overhead, as the significance of this depends highly on the application you’re dealing with.

┌ Warning: running threaded
└ @ Main ~/Desktop/parallelcompare.jl:29
#= /Users/bart/Desktop/parallelcompare.jl:30 =# @benchmark(threaded_tasks()) = Trial(55.044 s)
┌ Warning: running distributed
└ @ Main ~/Desktop/parallelcompare.jl:31
#= /Users/bart/Desktop/parallelcompare.jl:32 =# @benchmark(distributed_task()) = Trial(18.013 s)

using Distributed
using BenchmarkTools

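# Define the task on the main process and on every worker.
@everywhere begin
    using Logging
    # Suppress messages at Info level (0) and below; warnings still print.
    Logging.disable_logging(LogLevel(0))
    # Dummy workload: sleeps for `duration` seconds.
    function task(id::Int; duration=1)
        @info "starting task $(id) on $(Threads.threadid()) (duration: $(duration))"
        sleep(duration)
        @info "finished task $(id) on $(Threads.threadid())"
    end
end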

function threaded_tasks(n=10)
    @info "running on $(Threads.nthreads()) threads"
    # Iterate over 1:n (not a hard-coded 1:10) so the argument is respected.
    Threads.@threads for id in 1:n
        task(id, duration=id)
    end
end

function distributed_task(n=10)
    @info "running on $(length(Distributed.workers())) workers"
    # pmap hands each item to the next worker that becomes available.
    pmap(x->task(x, duration=x), 1:n)
end

function main()
    @warn "running threaded"
    @show @benchmark threaded_tasks()
    @warn "running distributed"
    @show @benchmark distributed_task()
end

main()
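
The script above never calls addprocs, so it presumably relies on Julia being started with worker processes already attached (for example via the -p command-line flag). A minimal sketch of adding workers from inside a script instead, assuming two local workers and a hypothetical `slow_square` function standing in for the real workload:

```julia
using Distributed

# Add two local workers unless some are already attached (assumed count).
nprocs() == 1 && addprocs(2)

# The function must be defined on every process before pmap can call it.
# slow_square is a hypothetical stand-in whose cost grows with its input.
@everywhere slow_square(x) = (sleep(0.01x); x^2)

# pmap returns results in input order, regardless of which worker ran what.
squares = pmap(slow_square, 1:8)
```

Without any workers, pmap still runs, but everything executes on the main process, so the worker count is worth checking before comparing timings.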

Just chiming in: check out this package, https://github.com/LCSB-BioCore/DistributedData.jl. It is pretty awesome and makes lots of distributed computing tasks very straightforward. Potentially a good inclusion in the ultimate guide.


There’s ThreadPools.jl to handle the varying work case.
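
For comparison, a similar effect can be sketched without an extra package by using Base's Threads.@spawn, which creates one task per item and lets Julia's scheduler place each task on an available thread. This is not the ThreadPools.jl API, just a Base-only alternative; `work` and `spawned_tasks` are hypothetical names for illustration.

```julia
# Hypothetical stand-in for a computation whose cost varies with its input.
work(id; duration=0.01 * id) = (sleep(duration); id^2)

function spawned_tasks(n=10)
    # One task per item; each task is scheduled onto an available thread
    # instead of being bound to a pre-computed chunk of the range.
    tasks = [Threads.@spawn work(id) for id in 1:n]
    return fetch.(tasks)                       # wait for all tasks, keep input order
end

results = spawned_tasks(4)
```

Spawning one task per item adds some scheduling overhead, so for very many tiny items it may be worth batching several items per task.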


PRs are welcome in the repository to add links to these packages. The more centralized these resources are, the better for new users.
