DistributedArrays: basic element-wise vector operation is slow / fails

Hi all,

I am new to Julia and did the following setup with DistributedArrays:

julia -p 2

julia> @everywhere using DistributedArrays

julia> nx = 6;

julia> A = zeros(nx-1);

julia> B = (1:nx).^2.0;

julia> A = distribute(A)
5-element DArray{Float64,1,Array{Float64,1}}:
 0.0
 0.0
 0.0
 0.0
 0.0

julia> B = distribute(B)
6-element DArray{Float64,1,Array{Float64,1}}:
  1.0
  4.0
  9.0
 16.0
 25.0
 36.0

Then I ran:

julia> A .= B[2:end] - B[1:end-1]
5-element DArray{Float64,1,Array{Float64,1}}:
  3.0
  5.0
  7.0
  9.0
 11.0

As you can see, it works. However, it is extremely slow (I tried it with large arrays, i.e. a large nx). This suggests that it performs unnecessary allocations and/or data transfer. Ideally, this statement should not allocate at all (A is pre-allocated), and each worker should only need to fetch a single value from the boundary of its neighboring worker's local array.
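For reference, this is the behaviour I would expect: on plain (non-distributed) Arrays, combining views with a fused broadcast makes the same statement essentially allocation-free. A minimal single-process sketch (diff_fused! is my own helper name):

using BenchmarkTools

nx = 1024^2
A = zeros(nx-1)
B = (1:nx).^2.0

# views avoid the slice copies; the fused .- writes directly into A
diff_fused!(A, B) = @views A .= B[2:end] .- B[1:end-1]

@btime diff_fused!($A, $B)   # expect 0 allocations on plain Arrays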

I also tried performing the element-wise subtraction explicitly with the broadcast operator .- to see whether that changes anything, but it failed:

julia> A .= B[2:end] .- B[1:end-1]
ERROR: MethodError: no method matching SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}(::Array{Float64,1})
Closest candidates are:
  SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}(::Any, ::Any, ::Any, ::Any) where {T, N, P, I, L} at subarray.jl:14
Stacktrace:
 [1] empty_localpart(::Type, ::Int64, ::Type) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:66
 [2] macro expansion at ./task.jl:264 [inlined]
 [3] macro expansion at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:84 [inlined]
 [4] macro expansion at ./task.jl:244 [inlined]
 [5] DArray(::Tuple{Int64,Int64}, ::Function, ::Tuple{Int64}, ::Array{Int64,1}, ::Array{Tuple{UnitRange{Int64}},1}, ::Array{Array{Int64,1},1}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:82
 [6] DArray(::Function, ::Tuple{Int64}, ::Array{Int64,1}, ::Array{Int64,1}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:169
 [7] #distribute#69(::Array{Int64,1}, ::Array{Int64,1}, ::Function, ::SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:542
 [8] distribute(::SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:535
 [9] _bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:119 [inlined]
 [10] bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:115 [inlined]
 [11] bcdistribute_args at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:122 [inlined]
 [12] bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:111 [inlined]
 [13] copyto! at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:61 [inlined]
 [14] materialize!(::DArray{Float64,1,Array{Float64,1}}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false},SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}}}) at ./broadcast.jl:751
 [15] top-level scope at none:0

My questions are:

  1. What is happening that makes A .= B[2:end] - B[1:end-1] so slow?
  2. Why does A .= B[2:end] .- B[1:end-1] fail?
  3. Is there any remedy to make the statements of this kind fast with DistributedArrays (with minimal changes to the syntax used here)?

Thank you very much!

Sam

PS: note that this setup probably works only with the master version of DistributedArrays (see the related post and its solution).

UPDATE

Below are some benchmark results showing that A .= B[2:end] - B[1:end-1] is extremely slow, as noted above. They moreover show that even the simpler statement A .= B[1:end-1] * 2.0 performs very badly, while B .= B * 2.0 performs much better. It therefore seems that DistributedArrays currently cannot handle the selection of sub-arrays well. The benchmark output shows that many allocations took place; one explanation could be that a new array is allocated for each sub-array expression (e.g. B[1:end-1]).
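The slice-copy part of this hypothesis is easy to check on plain Arrays (the DArray path presumably adds remote transfers on top). A quick sketch (run it twice to exclude compilation effects from the measurement):

b = rand(1000);

@allocated b[1:end-1]   # nonzero: the slice expression allocates a fresh copy of the data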

The code used for benchmarking

@everywhere using DistributedArrays
using BenchmarkTools

nx = 1024^2;
A = zeros(nx-1);
B = (1:nx).^2.0;
A = distribute(A);
B = distribute(B);

function f!(A, B)
    A .= B[2:end] - B[1:end-1];   # slices + non-fused subtraction
end

function g!(A, B)
    A .= B[1:end-1] * 2.0;        # slice + non-fused scaling
end

function h!(B)
    B .= B * 2.0;                 # whole array, non-fused (allocates B * 2.0)
end

function j!(B)
    B .= B .* 2.0;                # whole array, fused, in place
end

bench_f = @benchmark f!($A, $B)
bench_g = @benchmark g!($A, $B)
bench_h = @benchmark h!($B)
bench_j = @benchmark j!($B)

display(bench_f); println()
display(bench_g); println()
display(bench_h); println()
display(bench_j); println()

The benchmarking results

> srun -u -C gpu -n 1 julia -p 12 test.jl 
srun: job 903966 queued and waiting for resources
srun: job 903966 has been allocated resources
BenchmarkTools.Trial: 
  memory estimate:  5.05 GiB
  allocs estimate:  160694245
  --------------
  minimum time:     187.977 s (0.31% GC)
  median time:      187.977 s (0.31% GC)
  mean time:        187.977 s (0.31% GC)
  maximum time:     187.977 s (0.31% GC)
  --------------
  samples:          1
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  2.53 GiB
  allocs estimate:  80349655
  --------------
  minimum time:     93.760 s (0.37% GC)
  median time:      93.760 s (0.37% GC)
  mean time:        93.760 s (0.37% GC)
  maximum time:     93.760 s (0.37% GC)
  --------------
  samples:          1
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  270.23 KiB
  allocs estimate:  2822
  --------------
  minimum time:     2.194 ms (0.00% GC)
  median time:      2.323 ms (0.00% GC)
  mean time:        4.622 ms (1.78% GC)
  maximum time:     138.691 ms (0.00% GC)
  --------------
  samples:          1081
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  129.36 KiB
  allocs estimate:  1178
  --------------
  minimum time:     1.003 ms (0.00% GC)
  median time:      1.026 ms (0.00% GC)
  mean time:        1.064 ms (1.79% GC)
  maximum time:     42.051 ms (96.83% GC)
  --------------
  samples:          4683
  evals/sample:     1

The behaviour of A .= B[2:end] - B[1:end-1] might be in agreement with the performance tips section in the official Julia documentation:

(1) “each * and + operation (…) allocates a new temporary array and executes in a separate loop” [1].
(2) “an array “slice” expression like array[1:5, :] creates a copy of that data (except on the left-hand side of an assignment, where array[1:5, :] = ... assigns in-place to that portion of array)” [2].

To prevent (1), the dot syntax can normally be used: “nested “dot calls” are fusing: they are combined at the syntax level into a single loop, without allocating temporary arrays” [1]. The current implementation of DistributedArrays, however, does not support the full dot syntax for this expression: A .= B[2:end] .- B[1:end-1] fails as documented in the topic description.
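On plain Arrays, the effect of fusion is easy to demonstrate; in the following sketch (the function names are mine), nonfused! allocates a temporary array for each non-dotted operation, while fused! executes a single in-place loop:

using BenchmarkTools

a = rand(1000); b = rand(1000); y = similar(a);

nonfused!(y, a, b) = (y .= a*2 + b*3)     # a*2, b*3 and their sum each allocate a temporary
fused!(y, a, b)    = (y .= a.*2 .+ b.*3)  # one fused loop, no temporaries

@btime nonfused!($y, $a, $b)   # several allocations per call
@btime fused!($y, $a, $b)      # expect 0 allocations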

To prevent (2), views can normally be used [2]. With the current implementation of DistributedArrays, however, using views does not seem to change the allocation behaviour; see the benchmarking experiment at the end of this message [3, 4].
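The copy-vs-view behaviour of (2) can be seen directly on a plain Array: the slice s below is an independent copy, while the view v aliases b (a minimal sketch):

b = collect(1.0:6.0)

s = b[2:end]         # slice: allocates a copy of the data
v = @view b[2:end]   # view: a SubArray referencing b's memory, no copy

v[1] = 0.0           # mutates b[2] through the view
b[2] == 0.0          # true
s[1] == 2.0          # true: the copy was unaffected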

I therefore believe that the answers to my questions are the following (please correct me!):

  1. What is happening that makes A .= B[2:end] - B[1:end-1] so slow?

- Unneeded allocations occur; this is, however, in agreement with the official Julia documentation (see above).

  2. Why does A .= B[2:end] .- B[1:end-1] fail?

- The dot syntax is not yet fully supported in DistributedArrays.

  3. Is there any remedy to make the statements of this kind fast with DistributedArrays (with minimal changes to the syntax used here)?

- No, currently not: full support for the dot syntax and views in DistributedArrays would be required (a manual per-worker workaround is sketched below, though).
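The workaround: compute the difference chunk-by-chunk on each worker using the existing DistributedArrays API (localpart, localindices, procs), touching remote data only at the chunk boundaries. The following is only a sketch (the helper name localdiff! is mine); the scalar fallbacks B[i] and B[i+1] at the chunk boundaries are still slow remote fetches, which a real implementation would batch:

@everywhere using DistributedArrays

# Sketch: A .= B[2:end] .- B[1:end-1], computed on each worker's local chunk.
function localdiff!(A::DArray, B::DArray)
    @sync for p in procs(A)
        @async remotecall_wait(p, A, B) do A, B
            la = localpart(A)
            ia = localindices(A)[1]   # global indices of A owned by this worker
            ib = localindices(B)[1]   # global indices of B owned by this worker
            lb = localpart(B)
            for (k, i) in enumerate(ia)
                # use the local chunk of B where possible; fall back to
                # (slow) remote scalar indexing at the chunk boundaries
                bi  = (i   in ib) ? lb[i   - first(ib) + 1] : B[i]
                bi1 = (i+1 in ib) ? lb[i+1 - first(ib) + 1] : B[i+1]
                la[k] = bi1 - bi
            end
        end
    end
    return A
end

localdiff!(A, B)   # in place; no large temporaries on the master process

This is essentially what a fast DArray broadcast over shifted views would have to do internally.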

Please correct me or share your thoughts (agree/disagree?)!

Cheers,

Sam

References

[1] Performance Tips · The Julia Language (dot syntax / fusion of vectorized operations), https://docs.julialang.org/en/v1/manual/performance-tips/

[2] Performance Tips · The Julia Language (consider using views for slices), https://docs.julialang.org/en/v1/manual/performance-tips/

[3] The code used for benchmarking

@everywhere using DistributedArrays
using BenchmarkTools

nx = 1024^2;
A = zeros(nx-1);
B = (1:nx).^2.0;
A = distribute(A);
B = distribute(B);

function f!(A, B)
    @views A .= B[2:end] - B[1:end-1];   # views avoid the slice copies, but - is still not fused
end

function g!(A, B)
    @views A .= B[1:end-1] * 2.0;        # view instead of slice; * still not fused
end

function h!(B)
    B .= B * 2.0;                        # whole array, non-fused (allocates B * 2.0)
end

function j!(B)
    B .= B .* 2.0;                       # whole array, fused, in place
end

bench_f = @benchmark f!($A, $B)
bench_g = @benchmark g!($A, $B)
bench_h = @benchmark h!($B)
bench_j = @benchmark j!($B)

display(bench_f); println()
display(bench_g); println()
display(bench_h); println()
display(bench_j); println()

[4] The benchmarking results

> srun -u -C gpu -n 1 julia -p 12 test_view.jl
srun: job 907107 queued and waiting for resources
srun: job 907107 has been allocated resources
BenchmarkTools.Trial: 
  memory estimate:  5.05 GiB
  allocs estimate:  160684632
  --------------
  minimum time:     191.717 s (0.30% GC)
  median time:      191.717 s (0.30% GC)
  mean time:        191.717 s (0.30% GC)
  maximum time:     191.717 s (0.30% GC)
  --------------
  samples:          1
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  2.53 GiB
  allocs estimate:  80344964
  --------------
  minimum time:     93.661 s (0.36% GC)
  median time:      93.661 s (0.36% GC)
  mean time:        93.661 s (0.36% GC)
  maximum time:     93.661 s (0.36% GC)
  --------------
  samples:          1
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  270.16 KiB
  allocs estimate:  2815
  --------------
  minimum time:     2.123 ms (0.00% GC)
  median time:      2.322 ms (0.00% GC)
  mean time:        4.642 ms (1.57% GC)
  maximum time:     127.487 ms (0.00% GC)
  --------------
  samples:          1080
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  129.39 KiB
  allocs estimate:  1180
  --------------
  minimum time:     1.023 ms (0.00% GC)
  median time:      1.046 ms (0.00% GC)
  mean time:        1.094 ms (1.78% GC)
  maximum time:     42.456 ms (96.81% GC)
  --------------
  samples:          4551
  evals/sample:     1
