Tricks for parallel computing in Julia

[Figure: execution time per call of once vs. number of workers]

The graph above shows execution time per call of the function once in the program below, as a function of the number of workers, on a 32-core/64-thread machine. The green line (left y axis) is the program as displayed; the red line (right y axis) is the program with the commented line swapped with the line above it. A flat line is good, since it means there is no overhead from adding extra workers.

I’m not surprised that the program as presented is faster than the alternative, though I’m surprised by how much faster it is and by the fact that there are next to no gains from using more than 16 processors in the red curve example.

My questions are:

  1. Is this phenomenon specific to Julia, is it specific to the machine architecture, both, or neither?

  2. What other surprises does parallel computing in Julia have in store for me that are not documented in the official Julia documentation?

  3. What would be a good, exhaustive, source for parallel computing with Julia?

Thanks!

using Distributed, BenchmarkTools

# usage: julia <script.jl> <number of workers> <R>
procs = parse(Int64, ARGS[1])
addprocs(procs; topology=:master_worker)
R = parse(Int64, ARGS[2])

@everywhere function once(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        # fill z in place, one element at a time (green curve)
        for j = 1:10 z[j] = rand() end
        # allocating alternative (red curve):
        # z[:] = rand(10)
    end
end

# benchmark the parallel map over R dummy tasks
@btime pmap(x->once(x), [j for j = 1:R])

First guess: z[:] = rand(10) allocates; use Random.rand!(z) instead, and both variants should be equally fast.
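Roughly what I mean (a minimal sketch only, reusing your once; rand! lives in the Random standard library):

@everywhere using Random

@everywhere function once(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        rand!(z)   # overwrite z in place; no new array per iteration
    end
end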


Thanks @MatFi. Right, I wasn’t surprised that the allocating version took more time; I was surprised that it scales so poorly.

btw: your guess turns out to be wrong. Your version is as slow as the red version, presumably because rand! does some internal memory allocation. I had the same expectation as you did.

That doesn’t sound plausible to me. My tests show no allocations with rand!():

julia> @btime pmap(x->once(x),1:100); # for loop
  5.249 s (742 allocations: 29.44 KiB)

julia> @btime pmap(x->once(x),1:100); # rand!(z)
  3.905 s (742 allocations: 29.44 KiB)

julia> @btime pmap(x->once(x),1:100); # z[:] = rand(10)
  16.459 s (100000742 allocations: 14.90 GiB)

You’re right. I forgot to remove the [:]. That version is faster than the one with the loop.

But my questions remain unanswered, e.g. why do the versions with memory allocation scale so poorly?

If you have allocations in threaded code, different processes may need to communicate so that they all refer to the correct versions of the data. This can lead to caches being flushed, which can be a massive performance penalty.
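Even before any inter-process effects, you can see the per-call cost of the allocations on a single process (a rough sketch, not your exact script):

# allocating kernel, mirroring the "red" variant above
function once_alloc(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        z[:] = rand(10)   # allocates a fresh 10-element vector every iteration
    end
end

once_alloc(1)        # warm up / compile first
@time once_alloc(1)  # the "% gc time" in the output shows the GC share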

Sure, thanks @Oscar_Smith, but how is that relevant in the example above?

I guess R in your example increases with the number of workers? Allocations would then increase with the number of workers as well.

Thanks @MatFi. No, R is the same. For R = 192, I get for the inefficient version,

for 32 workers: 1.066 s (13474 allocations: 544.56 KiB)
for 16 workers: 1.051 s (13432 allocations: 532.81 KiB)

Not much of an increase in allocations, but a slight loss in speed.

Ok, where the confusion comes from is that @btime pmap(x->once(x),1:R) does not seem to count the allocations made by the individual workers. When you change your code to

using Distributed, BenchmarkTools

procs = 4
addprocs(procs; topology=:master_worker)
R = 100

# load BenchmarkTools on the workers as well (after addprocs, so they exist)
@everywhere using BenchmarkTools

@everywhere function once(x::Int64)
    # benchmark the body on the worker itself, so that worker-side
    # allocations actually get counted
    @btime begin
        z = fill(0.0, 10)
        for i = 1:1_000_000
            # for j = 1:10 z[j] = rand() end
            z[:] = rand(10)
        end
    end
end

you’ll see that each single call of once causes (1000001 allocations: 152.59 MiB). So it may well be that you are limited by your memory bandwidth (192 * 152 MiB → ~30 GiB/s). But I only have minor knowledge of the details here.
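As a rough back-of-the-envelope check on that (plugging in the 1.066 s you report above for 32 workers and R = 192):

per_call_MiB = 152.59        # allocations per call of once, from @btime above
ncalls       = 192           # R, the number of pmap tasks
runtime_s    = 1.066         # reported wall time for 32 workers

total_GiB = per_call_MiB * ncalls / 1024   # ≈ 28.6 GiB allocated in total
rate      = total_GiB / runtime_s          # ≈ 27 GiB/s of allocation traffic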


Thanks @MatFi. It must indeed be something like that, though the nominal memory bandwidth is 95 GB/s. I guess my desktop has more computing power relative to its memory bandwidth than my laptop does. This example was motivated by a much larger program, which I have now also made more efficient.

Wonder what other issues are lurking out there…