Very poor performance with DistributedArrays?


I’m following the Parallel Computing course at JuliaAcademy and trying to replicate its DistributedArrays results. There, a simple sum over 1e7 random numbers takes ~7.12 ms without DistributedArrays, but only ~2.6 ms with DistributedArrays and 4 procs.

But in my case the speedup is minimal (~5.7 ms down to ~4.7 ms) on Julia 1.4.0 (2020-03-21). Any idea why it performs so poorly on my side? Many thanks.

$ julia -p 4
julia> using Distributed, DistributedArrays, BenchmarkTools
julia> addprocs(4);
julia> @everywhere using DistributedArrays
julia> a = rand(10^7); adist = distribute(a);

julia> bd = @benchmark sum($adist)
  memory estimate:  25.20 KiB
  allocs estimate:  662
  minimum time:     4.774 ms (0.00% GC)
  median time:      4.962 ms (0.00% GC)
  mean time:        5.039 ms (0.00% GC)
  maximum time:     7.551 ms (0.00% GC)
  samples:          990
  evals/sample:     1

julia> bn = @benchmark sum($a)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     5.708 ms (0.00% GC)
  median time:      6.232 ms (0.00% GC)
  mean time:        6.304 ms (0.00% GC)
  maximum time:     8.877 ms (0.00% GC)
  samples:          792
  evals/sample:     1

When you launch Julia with julia -p 4, you already have four worker processes:

$ julia -p 4 -e "@show nworkers()"
nworkers() = 4

…so you end up with eight workers after addprocs(4). If you have more workers than physical cores, you’re unlikely to see any speedup for a memory-bandwidth-limited calculation like this one.
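
One way to avoid doubling up is to add workers only when none exist yet. This is a minimal sketch, not from the original thread; note that with no workers attached, nworkers() reports 1 (the master process fills in):

```julia
using Distributed

# Only spawn workers if we weren't already launched with `julia -p 4`.
# nworkers() == 1 means the master is the sole (pseudo-)worker.
if nworkers() == 1
    addprocs(4)
end

@show nworkers()   # 4, not 8, regardless of how Julia was started
```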

Thanks, but I get basically the same result when starting Julia without the -p 4 option (4.5 ms vs. 5.9 ms). I forgot to add that this is running on a box with an old chip (an Intel Xeon W3690), albeit with 6 cores.

Looks like your chip has a maximum memory bandwidth of 32 GB/s. You’re churning through an 80 MB array, so you have a hard floor of 2.5 ms, and as you approach that floor it becomes harder to wring out additional gains by throwing more processors at the problem. Try a more expensive calculation:
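
The 2.5 ms floor is just back-of-the-envelope arithmetic (assuming the 32 GB/s peak figure from the spec sheet):

```julia
# 1e7 Float64s × 8 bytes each = 80 MB to pull through memory
bytes = 10^7 * 8

# Assumed peak memory bandwidth for the Xeon W3690, in bytes/s
bandwidth = 32e9

# Minimum possible time, in ms, if the sum were purely bandwidth-bound
floor_ms = bytes / bandwidth * 1e3   # ≈ 2.5 ms
```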

julia> @btime sum(a -> tan(a), $a)
  155.060 ms (0 allocations: 0 bytes)

julia> @btime sum(a -> tan(a), $adist)
  75.691 ms (310 allocations: 12.44 KiB)

(results from a dual-core, hyperthreaded i5-6267U with four workers)


Perfect, thanks a lot. This being my first time using DistributedArrays, I wasn’t sure where things were going wrong. With your example I get 207 ms vs. 56 ms.

Note that there’s some significant overhead associated with distributing an array, since each worker needs to allocate and copy its portion of the full array.

julia> @btime distribute($a)
  38.067 ms (480 allocations: 76.31 MiB)

In cases where you need to perform many expensive operations on the data, this overhead is easily amortized. In other cases (as with your original example), it’s more productive to use shared-memory multithreading, which doesn’t require copying the data.

julia> using Transducers

julia> @btime reduce(+, Map(identity), $a)
  3.765 ms (148 allocations: 7.48 KiB)
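
If you’d rather not pull in a package, here is a rough Base-only sketch of the same idea (start Julia with `julia -t 4` or similar; `threaded_sum` is a name I made up, and chunked partial sums are one of several reasonable strategies):

```julia
# Chunked threaded sum: split the index range into one contiguous chunk
# per thread, reduce each chunk independently, then combine the partials.
function threaded_sum(x)
    n = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), n)))
    partials = zeros(eltype(x), length(chunks))
    Threads.@threads for i in 1:length(chunks)
        partials[i] = sum(@view x[chunks[i]])
    end
    return sum(partials)
end
```

Per-chunk partials sidestep write contention on a shared accumulator, and the `@view` avoids copying each slice.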

Thanks for the pointer to the Transducers package. Another one to learn about :slight_smile: