Tricks for parallel computing in Julia

[Figure: execution time per call of once vs. number of workers]

The graph above shows execution time per call of the function once in the program below, as a function of the number of workers, on a 32-core/64-thread machine. The green line (left y axis) is the program as displayed; the red line (right y axis) is the program with the commented line swapped with the line above it. A flat line is good, since it means there is no overhead from adding extra workers.

I’m not surprised that the program as presented is faster than the alternative, though I’m surprised by how much faster it is and by the fact that there are next to no gains from using more than 16 processors in the red curve example.

My questions are:

  1. Is this phenomenon specific to Julia, is it specific to the machine architecture, both, or neither?

  2. What other surprises does parallel computing in Julia have in store for me that are not documented in the official Julia documentation?

  3. What would be a good, exhaustive, source for parallel computing with Julia?

Thanks!

using Distributed, BenchmarkTools

# usage: julia <script.jl> <number of workers> <R>
procs = parse(Int64, ARGS[1])
addprocs(procs; topology=:master_worker)
R = parse(Int64, ARGS[2])

@everywhere function once(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        # fill z in place, one element at a time (green curve)
        for j = 1:10 z[j] = rand() end
        # allocating alternative (red curve):
        # z[:] = rand(10)
    end
end

# benchmark the parallel map over R dummy tasks
@btime pmap(x->once(x), [j for j = 1:R])

First guess: z[:] = rand(10) allocates; use Random.rand!(z) instead, and both variants should be equally fast.
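Roughly what I mean (a minimal sketch only, reusing your once; rand! lives in the Random standard library):

@everywhere using Random

@everywhere function once(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        rand!(z)   # overwrite z in place; no new array per iteration
    end
end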


Thanks @MatFi. Right, I wasn’t surprised that the allocating version took more time; I was surprised that it scales so poorly.

btw: your guess turns out to be wrong. Your version is as slow as the red version, presumably because rand! does some internal memory allocation. I had the same expectation as you did.

That doesn’t sound plausible to me. My tests show no allocations with rand!():

julia> @btime pmap(x->once(x),1:100); # for loop
  5.249 s (742 allocations: 29.44 KiB)

julia> @btime pmap(x->once(x),1:100); # rand!(z)
  3.905 s (742 allocations: 29.44 KiB)

julia> @btime pmap(x->once(x),1:100); # z[:] = rand(10)
  16.459 s (100000742 allocations: 14.90 GiB)

You’re right. I forgot to remove the [:]. That version is faster than the one with the loop.

But my questions remain unanswered, e.g. why do the versions with memory allocation scale so poorly?

If you have allocations in threaded code, different processes may need to communicate so that they all refer to the correct versions of the data. This can lead to caches being flushed, which can be a massive performance penalty.
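Even before any inter-process effects, you can see the per-call cost of the allocations on a single process (a rough sketch, not your exact script):

# allocating kernel, mirroring the "red" variant above
function once_alloc(x::Int64)
    z = fill(0.0, 10)
    for i = 1:1_000_000
        z[:] = rand(10)   # allocates a fresh 10-element vector every iteration
    end
end

once_alloc(1)        # warm up / compile first
@time once_alloc(1)  # the "% gc time" in the output shows the GC share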

Sure, thanks @Oscar_Smith, but how is that relevant in the example above?

I guess R in your example increases with the number of workers? Allocations would then increase with the number of workers as well.

Thanks @MatFi. No, R is the same. For R = 192, I get for the inefficient version,

for 32 workers: 1.066 s (13474 allocations: 544.56 KiB)
for 16 workers: 1.051 s (13432 allocations: 532.81 KiB)

Not much of an increase in allocations, but a slight loss in speed.

Ok, where the confusion comes from is that @btime pmap(x->once(x),1:R) does not seem to count the allocations made by the individual workers. When you change your code to

using Distributed, BenchmarkTools

procs = 4
addprocs(procs; topology=:master_worker)
R = 100

# load BenchmarkTools on the workers as well (after addprocs, so they exist)
@everywhere using BenchmarkTools

@everywhere function once(x::Int64)
    # benchmark the body on the worker itself, so that worker-side
    # allocations actually get counted
    @btime begin
        z = fill(0.0, 10)
        for i = 1:1_000_000
            # for j = 1:10 z[j] = rand() end
            z[:] = rand(10)
        end
    end
end

you’ll see that each single call of once causes (1000001 allocations: 152.59 MiB). So it may well be that you are limited by your memory bandwidth (192 * 152 MiB → ~30 GiB/s). But I only have minor knowledge of the details here.
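As a rough back-of-the-envelope check on that (plugging in the 1.066 s you report above for 32 workers and R = 192):

per_call_MiB = 152.59        # allocations per call of once, from @btime above
ncalls       = 192           # R, the number of pmap tasks
runtime_s    = 1.066         # reported wall time for 32 workers

total_GiB = per_call_MiB * ncalls / 1024   # ≈ 28.6 GiB allocated in total
rate      = total_GiB / runtime_s          # ≈ 27 GiB/s of allocation traffic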


Thanks @MatFi. It must indeed be something like that, though the nominal memory bandwidth is 95 GB/s. I guess my desktop has more computing power relative to its memory bandwidth than my laptop does. This example was motivated by a much larger program, which I have now also made more efficient.

Wonder what other issues are lurking out there…