Reducing latency in parallel computing

rdeits · July 6, 2018, 3:56pm

I have an embarrassingly parallel problem that I’d like to solve with Julia’s parallelism constructs, but I’m seeing a lot of latency and unexpected allocations. To demonstrate my issue, I’ll construct a dummy problem which does the following:

1. On each worker, create a “controller”: a closure which just multiplies its input by some captured scale value.
2. Then, for a given input x, send it to each controller and get the resulting output y.

Here’s my best attempt at that pattern so far, using RemoteChannels:

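using Distributed  # Julia 0.7+: addprocs and RemoteChannel live in the Distributed stdlib (omit on 0.6)
addprocs(4)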

using BenchmarkTools

@everywhere function scalar_controller(scale)
    C = RemoteChannel{Channel{Float64}}
    input_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    output_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    controller = let scale = scale
        function (x)
            x * scale
        end
    end
    let input=input_channel, output=output_channel, controller=controller
        @async while true
            x = take!(input)
            y = controller(x)
            put!(output, y)
        end
    end
    input_channel, output_channel
end

channels = NTuple{2, RemoteChannel{Channel{Float64}}}[]

for i in 2:nprocs()
    push!(channels, @fetchfrom(i, scalar_controller(i)))
end

function scalar_parallel_control(channels, x)
    @sync begin
        for (input, output) in channels
            let input=input, output=output, x=x
                @async begin 
                    put!(input, x)
                    take!(output)
                end
            end
        end
    end
end

Unfortunately, this is pretty slow:

julia> @btime scalar_parallel_control($channels, 1.0)
  1.872 ms (632 allocations: 44.63 KiB)

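I’ve checked @code_warntype and added let blocks to fix various closure-boxing issues, but with no real effect on latency. Running the same code without addprocs(4) gives 411.700 ns (4 allocations: 192 bytes), although in that case 2:nprocs() is empty, so no controllers are created and the call only measures the overhead of @sync. The extra cost would therefore seem to be related to data movement between the master process and the workers.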
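One way to check whether the cost comes from inter-process communication is to time a bare round trip to a single worker, with no channels or tasks involved. The following is only a sketch under some assumptions: it targets Julia 0.7 or later (where these functions live in the Distributed standard library), it uses a local worker, and roundtrip is a hypothetical helper name that does not appear in the code above.

```julia
using Distributed

# Assumption: a single local worker is enough for this measurement.
nprocs() == 1 && addprocs(1)

# Hypothetical helper: send x to worker `pid`, apply `identity` there,
# and block until the result has been sent back.
roundtrip(pid, x) = remotecall_fetch(identity, pid, x)

w = first(workers())
roundtrip(w, 1.0)  # warm-up call so compilation is not included in the timing

n = 1_000
elapsed = @elapsed for _ in 1:n
    roundtrip(w, 1.0)
end
println("mean round trip: ", 1e6 * elapsed / n, " μs")
```

If this number is already a large fraction of the 1.872 ms above, most of the latency is probably in the communication itself rather than in the channel setup. With BenchmarkTools loaded, @btime roundtrip($w, 1.0) should give a comparable figure.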

I guess I have two questions:

1. Is it unreasonable to hope for better than 1 ms latency in any kind of parallel application?
2. Is there something inefficient in the way I’m setting up my problem? The use of RemoteChannels seems kind of wasteful here, but I’m not sure how to avoid it.
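For comparison, here is a rough sketch of what the same pattern might look like without RemoteChannels: each worker keeps its controller in a global Ref, and the master calls it directly with remotecall_fetch. The names CONTROLLER, set_controller!, apply_controller and direct_parallel_control are hypothetical and introduced only for this illustration. The sketch assumes Julia 0.7 or later, and it makes no claim about the resulting latency, which would have to be measured.

```julia
using Distributed

# Assumption: local workers; add some only if none exist yet.
nprocs() == 1 && addprocs(2)

@everywhere begin
    # Hypothetical per-process storage for the controller closure.
    const CONTROLLER = Ref{Function}(identity)

    # Build the closure on the worker itself, capturing `scale`.
    set_controller!(scale) = (CONTROLLER[] = x -> x * scale; nothing)

    # Apply whichever controller is stored on this process.
    apply_controller(x) = CONTROLLER[](x)
end

# Give each worker a controller scaled by its own process id,
# mirroring scalar_controller(i) in the code above.
for pid in workers()
    remotecall_fetch(set_controller!, pid, Float64(pid))
end

# Send x to every worker concurrently and collect the outputs in order.
function direct_parallel_control(pids, x)
    results = Vector{Float64}(undef, length(pids))
    @sync for (i, pid) in enumerate(pids)
        @async results[i] = remotecall_fetch(apply_controller, pid, x)
    end
    results
end

ys = direct_parallel_control(workers(), 1.0)
```

Unlike scalar_parallel_control above, this version returns the outputs, so the two are not doing exactly the same work when benchmarked side by side.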
