Is this parallel performance correct?

performance
parallel

#1

I saw some strange performance today and I’m wondering what I’m doing wrong here.

julia> using BenchmarkTools

julia> a = [1:1000000;];

julia> @btime map(log, a);
  13.449 ms (3 allocations: 7.63 MiB)

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> wp=CachingPool(workers())
CachingPool(Channel{Int64}(sz_max:9223372036854775807,sz_curr:4), Set([4, 2, 3, 5]), Dict{Tuple{Int64,Function},RemoteChannel}())

julia> @btime pmap(wp, log, a);
  133.079 s (119748792 allocations: 3.38 GiB)

I did notice that the bottleneck seemed to be the master process (it was consistently at 90+% CPU while the workers sat between 30% and 40%), but I was surprised at how much slower the pmap code was. Any ideas? My guess is that each call to log is so cheap that the run time is dominated by moving data between processes, but it’d be nice to have someone confirm this.
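
To put a rough number on that guess, the two timings above can be turned into a per-element cost. This is only back-of-the-envelope arithmetic on the @btime results from this post, not a new measurement, and the variable names are just for illustration:

```julia
n = 1_000_000                  # length of a

map_time  = 13.449e-3          # seconds, from @btime map(log, a)
pmap_time = 133.079            # seconds, from @btime pmap(wp, log, a)

map_per_elem  = map_time / n   # seconds per element, serial map
pmap_per_elem = pmap_time / n  # seconds per element, pmap
slowdown      = pmap_time / map_time

println("map:  ", map_per_elem * 1e9, " ns per element")
println("pmap: ", pmap_per_elem * 1e6, " µs per element")
println("slowdown: ", round(Int, slowdown), "x")
```

So each element costs on the order of 10 ns in the serial version and over 100 µs through pmap, a slowdown of roughly four orders of magnitude. That would be consistent with per-element overhead, rather than log itself, dominating the run time.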


#2

You might want to use the batch_size keyword argument for pmap. It doesn’t completely remove the communication cost, but it reduces it quite a bit. For example:

julia> using BenchmarkTools

julia> a = [1:1000000;];

julia> @btime map(log, a);
  17.034 ms (3 allocations: 7.63 MiB)

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @btime pmap(log, a);
  67.329 s (93247393 allocations: 2.65 GiB)

julia> @btime pmap(log, a, batch_size=10000);
  3.189 s (7012233 allocations: 185.70 MiB)
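
Even with batch_size=10000, pmap is still much slower than the serial map in the timings above. Going one step further, you can do the batching by hand: split the array into one chunk per worker, so that each worker receives a single message and runs a plain map over its chunk. This is a minimal sketch, assuming Julia 1.0 or later (where addprocs and pmap live in the Distributed standard library) and four local workers as above. The names chunk_len, chunks and result are just ones I picked for illustration, and I haven't benchmarked it here.

```julia
using Distributed

# Assumption: start four local workers if none are running yet.
nprocs() == 1 && addprocs(4)

a = [1:1000000;]

# Size of each chunk so that there is (at most) one chunk per worker.
chunk_len = cld(length(a), nworkers())

# Copy each contiguous index range of a into its own array.
chunks = [a[r] for r in Iterators.partition(eachindex(a), chunk_len)]

# Each pmap task now processes a whole chunk with an ordinary map,
# so only one request and one reply per chunk cross process boundaries.
result = reduce(vcat, pmap(chunk -> map(log, chunk), chunks))
```

With a computation as cheap as log this may still not beat the serial map, since the data has to be copied to the workers and the results copied back.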