Reducing latency in parallel computing

I have an embarrassingly parallel problem that I’d like to solve with Julia’s parallelism constructs, but I’m seeing a lot of latency and unexpected allocations. To demonstrate my issue, I’ll construct a dummy problem which does the following:

  • On each worker, create a “controller”, a closure which just multiplies its input by some captured scale value
  • Then, for a given input x, send it to each controller and get the resulting output y
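
For reference, the serial version of this pattern, with no workers involved, would look roughly like the sketch below. The name make_controller and the scale values 2 through 5 are made up for illustration (the scales mirror the worker ids that the parallel code uses as scale values).

```julia
# Serial sketch of the pattern: one closure per (hypothetical) worker,
# each capturing its own scale value.
make_controller(scale) = x -> x * scale

# Illustrative scales, chosen to match worker ids 2:5 in a 4-worker setup.
controllers = [make_controller(s) for s in 2:5]

# Send the same input x to every controller and collect the outputs y.
x = 1.0
ys = [c(x) for c in controllers]
println(ys)
```

This is the result I would like to reproduce with the controllers living on separate processes.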

Here’s my best attempt at that pattern so far, using RemoteChannels:

using Distributed  # needed on Julia 0.7 and later, where addprocs and @everywhere live
addprocs(4)

using BenchmarkTools

@everywhere function scalar_controller(scale)
    C = RemoteChannel{Channel{Float64}}
    input_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    output_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    controller = let scale = scale
        function (x)
            x * scale
        end
    end
    let input=input_channel, output=output_channel, controller=controller
        @async while true
            x = take!(input)
            y = controller(x)
            put!(output, y)
        end
    end
    input_channel, output_channel
end

channels = NTuple{2, RemoteChannel{Channel{Float64}}}[]

for i in 2:nprocs()
    push!(channels, @fetchfrom(i, scalar_controller(i)))
end

function scalar_parallel_control(channels, x)
    @sync begin
        for (input, output) in channels
            let input=input, output=output, x=x
                @async begin 
                    put!(input, x)
                    take!(output)
                end
            end
        end
    end
end

Unfortunately, this is pretty slow:

julia> @btime scalar_parallel_control($channels, 1.0)
  1.872 ms (632 allocations: 44.63 KiB)

I’ve checked @code_warntype and added let blocks to fix various closure-boxing issues, but with no real effect on latency. Running the same code without addprocs(4) gives 411.700 ns (4 allocations: 192 bytes), so the issue would seem to be related to data movement between workers.
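
To separate the cost of my channel setup from the cost of simply talking to a worker, a bare round trip can be timed along these lines. This is only a sketch: it assumes Julia 0.7 or later (where these functions live in the Distributed standard library), it uses Base's @elapsed rather than BenchmarkTools so that it has no dependencies, and the names w, n, and t are placeholders of my own.

```julia
using Distributed  # assumption: Julia 0.7 or later

# Make sure at least one worker process exists.
nprocs() == 1 && addprocs(1)
w = first(workers())

# Warm-up call so that compilation is not included in the timing.
remotecall_fetch(identity, w, 1.0)

# Time n round trips of a single Float64 through a trivial function.
n = 1_000
t = @elapsed for _ in 1:n
    remotecall_fetch(identity, w, 1.0)
end

println("mean round trip: ", 1e6 * t / n, " µs")
```

Whatever number this reports on a given machine is a rough floor for any scheme that sends a value to a worker and waits for the reply, so it gives something to compare the RemoteChannel version against.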

I guess I have two questions:

  1. Is it unreasonable to hope for better than 1 ms latency in any kind of parallel application?
  2. Is there something inefficient in the way I’m setting up my problem? The use of RemoteChannels seems kind of wasteful here, but I’m not sure how to avoid it.