Is this parallel performance correct?

I saw some strange performance today and I’m wondering what I’m doing wrong here.

julia> using BenchmarkTools

julia> a = [1:1000000;];

julia> @btime map(log, a);
  13.449 ms (3 allocations: 7.63 MiB)

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> wp=CachingPool(workers())
CachingPool(Channel{Int64}(sz_max:9223372036854775807,sz_curr:4), Set([4, 2, 3, 5]), Dict{Tuple{Int64,Function},RemoteChannel}())

julia> @btime pmap(wp, log, a);
  133.079 s (119748792 allocations: 3.38 GiB)

I did notice that the bottleneck seemed to be the master process (it sat at 90+% CPU the whole time while the workers hovered between 30% and 40%), but I was still surprised by how much slower the pmap version was. Any ideas? My guess is that because each log call is so cheap, the time is dominated by data movement between processes, but it would be nice to have someone confirm that.
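
One way to sanity-check the data-movement guess (a sketch of my own, not from the session above; on recent Julia you also need `using Distributed` for `addprocs`/`pmap`) is to ship one big chunk per worker instead of one element per task. If the chunked version is close to the serial map, the per-element communication really is the cost:

using Distributed
addprocs(4)

a = [1:1_000_000;]

# One chunk per worker: communication happens a handful of times
# instead of once per element.
n = cld(length(a), nworkers())
chunks = [a[i:min(i + n - 1, length(a))] for i in 1:n:length(a)]

# Each worker runs a plain serial map over its chunk; only the chunk
# and its result travel between processes.
result = reduce(vcat, pmap(c -> map(log, c), chunks))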

You might want to use the batch_size keyword argument for pmap. It doesn’t completely remove the communication overhead, but it reduces it quite a bit. For example:

julia> using BenchmarkTools

julia> a = [1:1000000;];

julia> @btime map(log, a);
  17.034 ms (3 allocations: 7.63 MiB)

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @btime pmap(log, a);
  67.329 s (93247393 allocations: 2.65 GiB)

julia> @btime pmap(log, a, batch_size=10000);
  3.189 s (7012233 allocations: 185.70 MiB)
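
If the per-element work really is as cheap as log, another option (just a sketch of mine, not something from the timings above, and it only applies when all workers run on the same machine) is to keep both input and output in shared memory, so no array data has to be serialized between processes at all:

using Distributed
addprocs(4)
using SharedArrays  # loaded after addprocs so the package is also loaded on the workers

# Input and output live in shared memory; the loop ships only loop bounds
# and lightweight SharedArray handles, not the data itself.
a = SharedVector{Int}(1_000_000)
a .= 1:1_000_000
out = SharedVector{Float64}(length(a))

@sync @distributed for i in eachindex(a)
    out[i] = log(a[i])
end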
