I have an embarrassingly parallel problem that I’d like to solve with Julia’s parallelism constructs, but I’m seeing a lot of latency and unexpected allocations. To demonstrate my issue, I’ll construct a dummy problem which does the following:
- On each worker, create a “controller”: a closure which just multiplies its input by some captured `scale` value
- Then, for a given input `x`, send it to each controller and get the resulting output `y`
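Concretely, the serial version of that dummy problem would be something like the sketch below (`serial_control` is just an illustrative name, and I've picked scales matching the worker ids used later):

```julia
# Serial reference: one closure ("controller") per scale
controllers = [x -> x * scale for scale in 2:5]

# Send x to every controller and collect the outputs
serial_control(controllers, x) = [c(x) for c in controllers]

serial_control(controllers, 1.0)  # [2.0, 3.0, 4.0, 5.0]
```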
Here’s my best attempt at that pattern so far, using `RemoteChannel`s:
```julia
using Distributed  # needed for addprocs, @everywhere, RemoteChannel, etc.
addprocs(4)
using BenchmarkTools

@everywhere function scalar_controller(scale)
    C = RemoteChannel{Channel{Float64}}
    # One-slot request/response channels, both hosted on the calling worker
    input_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    output_channel::C = RemoteChannel(() -> Channel{Float64}(1), myid())
    # The "controller": a closure over the captured scale
    controller = let scale = scale
        function (x)
            x * scale
        end
    end
    # Serve forever: read an input, apply the controller, write the output back
    let input = input_channel, output = output_channel, controller = controller
        @async while true
            x = take!(input)
            y = controller(x)
            put!(output, y)
        end
    end
    input_channel, output_channel
end

# One controller per worker, using the worker id as its scale
channels = NTuple{2, RemoteChannel{Channel{Float64}}}[]
for i in 2:nprocs()
    push!(channels, @fetchfrom(i, scalar_controller(i)))
end
```
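Each pair of channels then behaves like a tiny request/response server; for example, a single round-trip to the first controller (hosted on worker 2, so its scale is 2) looks like:

```julia
input, output = channels[1]
put!(input, 3.0)  # request goes to worker 2's controller task
take!(output)     # 6.0, i.e. 3.0 * 2
```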
```julia
function scalar_parallel_control(channels, x)
    @sync begin
        for (input, output) in channels
            # let-bind everything so the task closure doesn't box its captures
            let input = input, output = output, x = x
                @async begin
                    put!(input, x)
                    take!(output)  # result is discarded; I only care about latency
                end
            end
        end
    end
end
```
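(In the real application I'd also collect the outputs; a hypothetical variant like the one below does that over the same channel round-trips, so I wouldn't expect it to change the latency picture.)

```julia
# Hypothetical result-collecting variant (scalar_parallel_fetch is a made-up name)
function scalar_parallel_fetch(channels, x)
    tasks = map(channels) do (input, output)
        @async begin
            put!(input, x)
            take!(output)
        end
    end
    fetch.(tasks)  # e.g. [2.0, 3.0, 4.0, 5.0] for x = 1.0
end
```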
Unfortunately, this is pretty slow:
```julia
julia> @btime scalar_parallel_control($channels, 1.0)
  1.872 ms (632 allocations: 44.63 KiB)
```
I’ve checked `@code_warntype` and added `let` blocks to fix various closure-boxing issues, but with no real effect on latency. Running the same code without `addprocs(4)` gives 411.700 ns (4 allocations: 192 bytes), so the issue would seem to be related to data movement between workers.
I guess I have two questions:
- Is it unreasonable to hope for better than 1 ms latency in any kind of parallel application?
- Is there something inefficient in the way I’m setting up my problem? The use of `RemoteChannel`s seems kind of wasteful here, but I’m not sure how to avoid it (the only alternative I can imagine is the `remotecall_fetch` sketch below, which just pays the overhead per call instead).
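Purely as a hypothetical sketch (`multiply` and `scalar_parallel_control_rc` are made-up names), the channel-free version I have in mind looks like this:

```julia
# Hypothetical alternative: no persistent channels, one remote call per request.
# Every call pays remotecall_fetch overhead, so it's not obviously better.
@everywhere multiply(scale, x) = scale * x

function scalar_parallel_control_rc(x)
    @sync for p in workers()
        @async remotecall_fetch(multiply, p, p, x)  # scale == worker id, as above
    end
end
```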