I wish to roughly measure the latency and throughput of process communication in Julia, to decide how to use multi-processing in my implementation.
In the following script, I compare main and main_with_channels:
using Distributed, BenchmarkTools

addprocs(1)

@everywhere function transform!(X)
    # useful work
    for i in 1:50
        X .= (div.(X, i) .+ i) .^ 5
    end
    X
end

@everywhere function worker(X, results_out)
    result = transform!(X)
    put!(results_out, result)
end

# Baseline: do the work on the master process.
function main(n)
    X = ones(UInt64, n)
    transform!(X)
end

# Ship the input to a worker and receive the result through a RemoteChannel.
function main_with_channels(n)
    X = ones(UInt64, n)
    id1 = workers()[1]
    rc1 = RemoteChannel(() -> Channel{typeof(X)}(1), id1)
    @spawnat id1 worker(X, rc1)
    result1 = take!(rc1)
    result1
end

for i in 4:4:24
    n = 2^i
    println("2^$i -- $(8 * 2^i) bytes")
    @btime main($n)
    @btime main_with_channels($n)
end
### output ###
2^4 -- 128 bytes
1.738 μs (1 allocation: 192 bytes)
54.674 μs (176 allocations: 9.05 KiB)
2^8 -- 2048 bytes
25.941 μs (1 allocation: 2.12 KiB)
83.650 μs (185 allocations: 25.37 KiB)
2^12 -- 32768 bytes
411.262 μs (2 allocations: 32.05 KiB)
504.764 μs (188 allocations: 272.73 KiB)
2^16 -- 524288 bytes
6.878 ms (2 allocations: 512.05 KiB)
6.867 ms (194 allocations: 1.01 MiB)
2^20 -- 8388608 bytes
112.129 ms (2 allocations: 8.00 MiB)
108.031 ms (210 allocations: 16.01 MiB)
2^24 -- 134217728 bytes
1.766 s (2 allocations: 128.00 MiB)
1.870 s (409 allocations: 256.01 MiB)
This is Julia 1.10.1 running on a 32 × 13th Gen Intel(R) Core™ i9-13900 on Linux.
The results seem to suggest that the latency of a channel transfer is under 100 microseconds, and that it scales fine up to at least 128 MB. On Windows and/or weaker machines, the latency seems to be worse.
In my application, I would have around 30 processes, each sending about 1k messages to the master process, with each message under 1 MB. So I guess that channels would suit this well.
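For context, the pattern I have in mind is a many-to-one setup like the sketch below (made-up message sizes and counts, not my actual code): every worker put!s into a single buffered RemoteChannel owned by the master.

using Distributed
addprocs(30)

@everywhere function produce!(results, m)
    for _ in 1:m
        put!(results, rand(UInt8, 1 << 20))  # ~1 MB message
    end
end

function collect_all(m)
    # Buffered channel owned by the master; all workers send into it.
    results = RemoteChannel(() -> Channel{Vector{UInt8}}(64))
    for w in workers()
        remote_do(produce!, w, results, m)
    end
    for _ in 1:(m * nworkers())
        take!(results)  # handle each message as it arrives
    end
end

collect_all(10)  # small run just to exercise the pattern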
However, I am not experienced in multi-process computing. Are the benchmark and conclusions correct, or am I missing something?
EDIT: I only use processes on the host machine.
Thanks!
Any reason why you aren’t using multi-threading here? Then the overhead is ~0 because you have shared memory. This benchmark seems fine (although when actually using Distributed for its primary purpose of multi-node communication, the network speed and cable length will start to matter a lot more).
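For comparison, a shared-memory counterpart could look roughly like this (a sketch reusing your transform!; start Julia with e.g. --threads=8):

function main_with_threads(n, ntasks)
    # Each task works on its own array: no serialization, just shared memory.
    tasks = [Threads.@spawn transform!(ones(UInt64, n)) for _ in 1:ntasks]
    foreach(fetch, tasks)
end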
Any reason why you aren’t using multi-threading here?
I have a multi-threaded version. The computation allocates quite a lot; the reported GC time in the multi-threaded code goes up to 30% with 8 threads, and it does not seem to scale well when I increase the number of threads to, say, 16.
The computation itself is already quite optimized for memory allocation; one cannot do much better there, I think.
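(In case it is useful: the GC share is easy to read off @timed; work() below is just a hypothetical allocation-heavy stand-in, not my actual computation.)

work() = sum(sum(rand(1000)) for _ in 1:10_000)
stats = @timed work()
println(round(100 * stats.gctime / stats.time; digits = 1), "% of runtime spent in GC")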
Ah, right, I am only using local processes; thanks, I will update the post.
Have you tried playing with --gcthreads? The default is half as many GC threads as worker threads, but maybe more GC threads help in your case?
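For example (assuming Julia 1.10+, where --gcthreads and Threads.ngcthreads() are available):

# started as e.g.: julia --threads=16 --gcthreads=16 script.jl
@show Threads.nthreads()    # compute threads
@show Threads.ngcthreads()  # GC threads; defaults to half the compute threads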
You can always post it here (best to make it a runnable toy problem) and maybe some people will have ideas on how to improve it.
Have you tried playing with --gcthreads? The default is half as many GC threads as worker threads, but maybe more GC threads help in your case?
I should experiment with --gcthreads more, thanks for the suggestion.
The program I write is a package; asking the user to pick --gcthreads seemed not ideal to me. But in fact we already ask them to specify --threads, so maybe it's fine.
You can always post it here (best to make it a runnable toy problem) and maybe some people will have ideas on how to improve it.
Thanks. The picture is a bit technical :^), but here is a gist.
I compute many Groebner bases in parallel using Groebner.jl (I am an author).
Three notes:
- Groebner bases are huge. A single computation may construct 10 sparse matrices with 1e8 nonzeros each (in total around 10 * 400 MB = 4 GB).
- In principle, it is possible to pre-allocate and reuse all 10 matrices, but this is quite costly in memory.
- What one really should do is create a big chunk of space for 1 matrix (say 400 MB) and reuse this space for the other 9 matrices.
Now, I could not do this last step.
Allocating a chunk of memory and then placing new vectors in that chunk, bypassing the GC, already seems very hacky to me, and perhaps not worth complicating the code.
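In the trivial dense case the reuse idea would look something like the toy sketch below (made-up sizes; the real matrices are sparse, which is where it breaks down for me):

# One big reusable chunk: 50_000_000 UInt64 is ~400 MB.
const SCRATCH = Vector{UInt64}(undef, 50_000_000)

function with_scratch(f, len)
    buf = view(SCRATCH, 1:len)  # borrow a slice of the chunk, no new allocation
    fill!(buf, 0)
    f(buf)
end

# Each of the 10 "matrices" borrows the same chunk in turn.
for k in 1:10
    with_scratch(10_000) do buf
        buf .= k  # stand-in for building one matrix
    end
end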
I also tried Bumper.jl; it is a really nice package, but it does not seem to suit my workflow yet (Slowdown when using alloc! · Issue #33 · MasonProtter/Bumper.jl).
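(For reference, the kind of usage I tried follows Bumper.jl's documented @no_escape/@alloc API; roughly:)

using Bumper

function transform_bumped(n)
    @no_escape begin
        X = @alloc(UInt64, n)  # bump-allocated scratch, reclaimed when the block exits
        X .= 1
        for i in 1:50
            X .= (div.(X, i) .+ i) .^ 5
        end
        sum(X)  # X must not escape the @no_escape block
    end
end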
If you have suggestions I would be glad to try them :^)