The graph above shows the execution time per call of the function once in the program below, as a function of the number of workers, on a 32-core/64-thread machine. The green line (left y-axis) corresponds to the program as displayed; the red line (right y-axis) corresponds to the program with the commented line swapped with the line above it. A flat line is good, since it means there is no overhead from having extra workers.
I’m not surprised that the program as presented is faster than the alternative, though I am surprised by how much faster it is, and by the fact that the red curve shows next to no gains from using more than 16 processors.
My questions are:
Is this phenomenon specific to Julia, specific to the machine architecture, both, or neither?
What other surprises does parallel computing in Julia have in store for me that are not covered in the official documentation?
What would be a good, exhaustive source on parallel computing with Julia?
Thanks!
using Distributed, BenchmarkTools

procs = parse(Int64, ARGS[1])              # number of workers, from the command line
addprocs(procs; topology=:master_worker)
R = parse(Int64, ARGS[2])                  # number of calls to once

@everywhere function once(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        for j = 1:10 z[j] = rand() end     # green curve: fill z in place
        #~ z[:] = rand(10)                 # red curve: allocates a fresh vector each iteration
    end
end

@btime pmap(x -> once(x), [j for j = 1:R])
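To isolate the allocation difference outside of pmap, here is a minimal single-process sketch of the two inner-loop variants (the names scalar_fill! and vector_fill! are just for this illustration):

using BenchmarkTools

function scalar_fill!(z)    # green-curve variant: writes into the existing buffer
    for i = 1:1_000_000
        for j = 1:10 z[j] = rand() end
    end
end

function vector_fill!(z)    # red-curve variant: rand(10) allocates a new array every iteration
    for i = 1:1_000_000
        z[:] = rand(10)
    end
end

z = fill(0.0, 10)
@btime scalar_fill!($z)     # essentially allocation-free
@btime vector_fill!($z)     # roughly one allocation per iteration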
Thanks @MatFi. Right, I wasn’t surprised that the version using memory allocation took more time; I was surprised that it scales so poorly with the number of workers.
By the way, your guess turns out to be wrong: your version is as slow as the red version, presumably because rand! does some internal memory allocation. I had the same expectation as you did.
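For reference, the rand! version would presumably be something like this in-place variant (a sketch; the name once_rand! is just for illustration):

@everywhere using Random

@everywhere function once_rand!(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        rand!(z)    # fills the existing buffer; any remaining allocation happens inside the RNG
    end
end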
If you have allocations in threaded code, the different processes may need to communicate so that they are all referring to the correct versions of the data. This can lead to caches being flushed, which can be a massive performance penalty.
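One way to see that pressure directly is @timed, which reports bytes allocated and time spent in garbage collection (a single-process sketch, assuming Julia 1.5 or later, where @timed returns a named tuple):

function alloc_variant()
    z = fill(0.0, 10)
    for i = 1:1_000_000
        z[:] = rand(10)    # allocates a fresh vector each iteration
    end
end

stats = @timed alloc_variant()
println("time = ", stats.time, " s, allocated = ", stats.bytes,
        " bytes, GC time = ", stats.gctime, " s")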
OK, where the confusion comes from is that @btime pmap(x->once(x), 1:R) does not seem to count the allocations happening on the individual workers. When you change your code to
using Distributed

procs = 4
addprocs(procs; topology=:master_worker)
@everywhere using BenchmarkTools    # must come after addprocs so the workers get it too
R = 100

@everywhere function once(x::Int64)
    @btime begin
        z = fill(0.0, 10)
        for i = 1:1_000_000
            # for j = 1:10 z[j] = rand() end
            z[:] = rand(10)
        end
    end
end

pmap(x -> once(x), [j for j = 1:R])    # each worker prints its own @btime report
you’ll see that each single call of once causes (1000001 allocations: 152.59 MiB). So it may be that you are limited by your memory bandwidth (192 × 152 MiB → ~30 GiB/s). But I have only limited knowledge of the details here.
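If you would rather collect the per-call numbers on the master than have each worker print them, something like this sketch should work with the original definition of once (again assuming Julia 1.5 or later, where @timed returns a named tuple):

# Run once on the workers and bring each call's stats back to the master.
results = pmap(1:R) do x
    @timed once(x)
end
println("total allocated across workers: ",
        sum(r.bytes for r in results) / 2^30, " GiB")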
Thanks @MatFi. It must indeed be something like that, though the nominal memory bandwidth is 95 GB/s. I guess my desktop has more computing power relative to memory bandwidth than my laptop does. This example was motivated by a much larger program, which I have now also made more efficient.