Very poor performance with DistributedArrays?

angelv · May 24, 2020, 5:32pm

Hello,

I’m following the Parallel Computing course at JuliaAcademy, and I’m trying to replicate the results in the course when using DistributedArrays. In there a simple sum over 1e7 random numbers, when done without Distributed Arrays takes ~7.12 ms, while only ~2.6 ms when using DistributedArrays with 4 procs.

But in my case the speedup is minimal (~5.7 ms to ~4.7ms) (Julia Version 1.4.0 (2020-03-21)). Any idea why it behaves so poorly on my side? Many thanks.

$ julia -p 4
julia> using Distributed, DistributedArrays, BenchmarkTools                                        
julia> addprocs(4);                               
julia> @everywhere using DistributedArrays       
julia> a = rand(10^7); adist = distribute(a);                                                      

julia> bd = @benchmark sum($adist)
BenchmarkTools.Trial: 
  memory estimate:  25.20 KiB
  allocs estimate:  662
  --------------
  minimum time:     4.774 ms (0.00% GC)
  median time:      4.962 ms (0.00% GC)
  mean time:        5.039 ms (0.00% GC)
  maximum time:     7.551 ms (0.00% GC)
  --------------
  samples:          990
  evals/sample:     1

julia> bn = @benchmark sum($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.708 ms (0.00% GC)
  median time:      6.232 ms (0.00% GC)
  mean time:        6.304 ms (0.00% GC)
  maximum time:     8.877 ms (0.00% GC)
  --------------
  samples:          792
  evals/sample:     1

stillyslalom · May 24, 2020, 6:27pm

When you launch Julia with julia -p 4, you already have four worker processes:

$ julia -p 4 -e "@show nworkers()"
nworkers() = 4

…so you end up with eight workers after addprocs(4). If you have more workers than physical processors, you’re unlikely to see any speedup for a memory bandwidth-limited calculation like this one.

angelv · May 24, 2020, 6:33pm

Thanks, but I get basically the same result when starting julia without the -p 4 option (4.5ms vs. 5.9ms), and forgot to add, that this is running in a box with an old chip (Intel Xeon W3690), but with 6 cores.

stillyslalom · May 24, 2020, 7:33pm

Looks like your chip has a maximum memory bandwidth of 32 GB/s. You’re churning through an 80 MB array, so you have a hard floor of 2.5 ms, and as you approach that floor, it becomes more difficult to wring out additional performance gains by throwing more processors at a problem. Try a more expensive calculation:

julia> @btime sum(a -> tan(a), $a)
  155.060 ms (0 allocations: 0 bytes)
6.156371337551294e6

julia> @btime sum(a -> tan(a), $adist)
  75.691 ms (310 allocations: 12.44 KiB)
6.156371337551294e6

(results from a dual-core, hyperthreaded i5-6267U with four workers)

angelv · May 24, 2020, 7:37pm

Perfect, thanks a lot. Being my first time using DistributedArrays I was not sure where things were going wrong. With your example I get 207 vs 56 ms.

stillyslalom · May 24, 2020, 8:08pm

Note that there’s some significant overhead associated with distributing an array, since each worker needs to allocate and copy its portion of the full array.

julia> @btime distribute($a)
  38.067 ms (480 allocations: 76.31 MiB)

In cases where you need to perform many expensive operations on the data, this overhead is easily amortized. In other cases (as with your original example), it’s more productive to use shared-memory multithreading, which doesn’t require the data to be copied.

julia> using Transducers

julia> @btime reduce(+, Map(identity), $a)
  3.765 ms (148 allocations: 7.48 KiB)

angelv · May 25, 2020, 12:58am

Thanks for the pointer to Transducers package. Another one to learn about

Cheers,

Topic		Replies	Views
Using Distributed: computational efficiency Julia at Scale	5	963	June 24, 2019
Best practice for data-parallel computation using DistributedArrays Julia at Scale	1	835	April 3, 2018
Bad performance using DistributedArray SPMD for 1D diffusion problem Performance parallel	2	827	July 18, 2019
Help Optimizing Slow Distributed Array Computation Performance distributed	7	1925	November 14, 2018
The sum of an array is faster in the sequential version than in the parallelized version Julia at Scale	9	1332	October 11, 2018

Very poor performance with DistributedArrays?

Related topics