The speed of inter-process communication via RemoteChannels

I wish to roughly measure the latency and throughput of process communication in Julia, to decide how to use multi-processing in my implementation.

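In the following script, I compare main, which runs the work locally, and main_with_channels, which runs the same work on a worker process and sends the result back through a RemoteChannel: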

using Distributed, BenchmarkTools
addprocs(1)

@everywhere function transform!(X)
    # useful work
    for i in 1:50 X .= (div.(X, i) .+ i).^5 end
    X
end

@everywhere function worker(X, results_out)
    result = transform!(X)
    put!(results_out, result)
end

function main(n)
    X = ones(UInt64, n)
    transform!(X)
end

function main_with_channels(n)
    X = ones(UInt64, n)
    id1 = workers()[1]
    rc1 = RemoteChannel(()->Channel{typeof(X)}(1), id1)
    @spawnat id1 worker(X, rc1)
    result1 = take!(rc1)
    result1
end

for i in 4:4:24
    n = 2^i
    println("2^$i -- $(8 * 2^i) bytes")
    @btime main($n)
    @btime main_with_channels($n)
end

### output ###
2^4 -- 128 bytes
  1.738 μs (1 allocation: 192 bytes)
  54.674 μs (176 allocations: 9.05 KiB)
2^8 -- 2048 bytes
  25.941 μs (1 allocation: 2.12 KiB)
  83.650 μs (185 allocations: 25.37 KiB)
2^12 -- 32768 bytes
  411.262 μs (2 allocations: 32.05 KiB)
  504.764 μs (188 allocations: 272.73 KiB)
2^16 -- 524288 bytes
  6.878 ms (2 allocations: 512.05 KiB)
  6.867 ms (194 allocations: 1.01 MiB)
2^20 -- 8388608 bytes
  112.129 ms (2 allocations: 8.00 MiB)
  108.031 ms (210 allocations: 16.01 MiB)
2^24 -- 134217728 bytes
  1.766 s (2 allocations: 128.00 MiB)
  1.870 s (409 allocations: 256.01 MiB)

This is Julia 1.10.1 running on 32 × 13th Gen Intel(R) Core™ i9-13900 on Linux.

The results seem to suggest that the round-trip overhead of a channel is under 100 microseconds for small messages (about 53 μs at 128 bytes), and that it scales fine up to 128 MB (at least). On Windows and/or weaker machines, latency seems to be worse.
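Since the benchmark above mixes computation with communication, it may help to time the round trip on its own. Below is a minimal sketch, assuming a single local worker; roundtrip_seconds is a hypothetical helper, not part of the original script. It sends an array to the worker and gets it straight back via remotecall_fetch, without going through a RemoteChannel, and reports the best of a few repetitions.

```julia
using Distributed
addprocs(1)  # assumption: one local worker, as in the script above

# Hypothetical helper: best-of-`reps` time (in seconds) to send an array of
# `n` UInt64 values to the first worker and receive it back unchanged.
function roundtrip_seconds(n; reps = 20)
    w = workers()[1]
    X = ones(UInt64, n)
    remotecall_fetch(identity, w, X)  # warm-up call, excludes compilation time
    best = Inf
    for _ in 1:reps
        t = @elapsed remotecall_fetch(identity, w, X)
        best = min(best, t)
    end
    return best
end

for i in 4:4:20
    println("2^$i elements: ", roundtrip_seconds(2^i), " s")
end
```

Comparing these timings with the difference between main and main_with_channels should give a rough idea of how much of the overhead comes from the channel itself.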

In my application, I would have around 30 processes, each sending 1k messages to the master process, and the message size is <1 MB. So I guess that channels would suit this well.
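The many-to-one pattern described here could be sketched roughly as follows. This is only an illustration under assumptions: two local workers stand in for the ~30 processes, and produce! and collect_messages are hypothetical names. All workers put! into a single buffered RemoteChannel hosted on the master, which drains it with take!.

```julia
using Distributed
addprocs(2)  # assumption: 2 workers stand in for the ~30 in the real application

# Hypothetical producer: sends `nmsgs` messages of `len` UInt64 values each.
@everywhere function produce!(results, nmsgs, len)
    for k in 1:nmsgs
        put!(results, (myid(), k, fill(UInt64(k), len)))
    end
    return nothing
end

# Hypothetical consumer running on the master process.
function collect_messages(nmsgs, len)
    # One buffered channel hosted on process 1; every worker writes into it.
    results = RemoteChannel(() -> Channel{Tuple{Int,Int,Vector{UInt64}}}(32), 1)
    futures = [remotecall(produce!, w, results, nmsgs, len) for w in workers()]
    received = 0
    for _ in 1:(nmsgs * nworkers())
        (pid, k, payload) = take!(results)
        received += length(payload)  # stand-in for real processing
    end
    foreach(fetch, futures)  # rethrows any exception raised on a worker
    return received
end

collect_messages(100, 1024)
```

The channel capacity (32 here) is an arbitrary choice. A bounded buffer makes fast producers block until the master catches up, which keeps memory use on the master in check.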

However, I am not experienced in multi-process computing. Are the benchmark and conclusions correct, or am I missing something?

EDIT: I only use processes on the host machine.

Thanks!

Any reason why you aren’t using multi-threading here? Then the overhead is ~0 because you have shared memory. This benchmark seems fine (although when actually using Distributed for its primary purpose of multi-node communication, the network speed and cable length will start to matter a lot more).


Any reason why you aren’t using multi-threading here?

I have a multi-threaded version. The computation allocates quite a lot; the reported GC time in the multi-threaded code goes up to 30% with 8 threads. It does not seem to scale very well when I increase the number of threads to, say, 16.

The computation itself is quite optimized for memory allocation, one cannot do much better there (I think).

Ah, right, I am only using local processes, thanks, I will update the post.

Have you tried playing with --gcthreads? The default is half as many gc threads as worker threads, but maybe more gcthreads help in your case?

You can always post it here (best make it a runnable toy problem) and maybe some people will have some ideas how to improve :)


Have you tried playing with --gcthreads? The default is half as many gc threads as worker threads, but maybe more gcthreads help in your case?

I should experiment with --gcthreads more, thanks for the suggestion.
The program I am writing is a package, and asking the user to pick --gcthreads seemed not ideal to me. But in fact we already ask the user to specify --threads, so maybe it’s fine.
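For experimenting with this, a small sketch along the following lines may be useful. It assumes Julia 1.10 or later (where Threads.ngcthreads is available) and uses a made-up allocation-heavy workload, allocate_a_lot, in place of the real computation. It prints the thread configuration and estimates the fraction of wall time spent in GC.

```julia
# Start Julia with, for example: julia --threads=8 --gcthreads=8 script.jl
println("worker threads: ", Threads.nthreads())
println("GC threads:     ", Threads.ngcthreads())

# Hypothetical allocation-heavy workload standing in for the real computation.
function allocate_a_lot(n)
    s = 0
    for i in 1:n
        v = Vector{Int}(undef, 1024)
        fill!(v, i)
        s += v[end]
    end
    return s
end

# Fraction of wall-clock time spent in GC while all threads allocate.
function gc_fraction(n)
    stats = @timed begin
        Threads.@threads for t in 1:Threads.nthreads()
            allocate_a_lot(n)
        end
    end
    return stats.gctime / stats.time
end

gc_fraction(1_000)  # warm-up run, so compilation is not measured
println("GC fraction: ", gc_fraction(100_000))
```

Running the same script with different --gcthreads values gives a quick way to see whether more GC threads reduce the GC share for a given workload.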

You can always post it here (best make it a runnable toy problem) and maybe some people will have some ideas how to improve

Thanks. The picture is a bit technical :^), but here is a gist.

I compute many Groebner bases in parallel using Groebner.jl (I am an author).
Three notes:

  • Groebner bases are huge. A single computation may construct 10 sparse matrices with 1e8 nonzeros each (in total around 10 * 400 MB = 4 GB)

  • In principle, it is possible to pre-allocate and reuse all 10 matrices. This is quite costly in memory

  • What one really should do is create a big chunk of space for 1 matrix (say 400 MB), and reuse this space for the other 9 matrices

Now, I could not do the last step.

Allocating a chunk of memory and then placing new vectors in that chunk, bypassing the GC, already seems very hacky to me and perhaps not worth complicating the code.
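For comparison, a GC-friendly variant of the "one chunk, reused" idea could look roughly like the sketch below. It rests on a strong assumption that may not match Groebner.jl's actual data layout: each matrix fits in a few flat vectors. SparseWorkspace, fill_matrix! and process_all are hypothetical names. The buffers are reserved once with sizehint! and then resized for each matrix, which typically does not reallocate as long as the requested size stays within the reserved capacity.

```julia
# Hypothetical workspace: one set of flat buffers, sized for the largest
# matrix and reused for every matrix in a computation.
struct SparseWorkspace
    colidx::Vector{Int32}   # column indices of the nonzeros
    nzval::Vector{UInt64}   # values of the nonzeros
end

# Reserve capacity up front; the vectors start empty.
SparseWorkspace(capacity::Integer) =
    SparseWorkspace(sizehint!(Int32[], capacity), sizehint!(UInt64[], capacity))

# Overwrite the workspace with a toy "matrix" of `nnz` nonzeros.
function fill_matrix!(ws::SparseWorkspace, nnz::Integer, seed::Integer)
    resize!(ws.colidx, nnz)
    resize!(ws.nzval, nnz)
    for k in 1:nnz
        ws.colidx[k] = Int32(k)
        ws.nzval[k] = UInt64(seed + k)
    end
    return ws
end

# Build and consume several matrices one after another in the same buffers.
function process_all(ws::SparseWorkspace, sizes)
    total = UInt64(0)
    for (i, nnz) in enumerate(sizes)
        fill_matrix!(ws, nnz, i)
        total += sum(ws.nzval)  # stand-in for the real work on the matrix
    end
    return total
end

ws = SparseWorkspace(10_000)
process_all(ws, [10_000, 5_000, 8_000])
```

This only works when the matrices are needed one at a time. If several must be alive at once, each needs its own workspace.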

I also tried Bumper.jl. It is a really nice package, but it does not seem to suit my workflow yet (see "Slowdown when using alloc!", Issue #33 in MasonProtter/Bumper.jl on GitHub).

If you have suggestions I would be glad to try them :^)