Hi all,
I am new to Julia and did the following setup with DistributedArrays:
julia -p 2
julia> @everywhere using DistributedArrays
julia> nx = 6;
julia> A = zeros(nx-1);
julia> B = (1:nx).^2.0;
julia> A = distribute(A)
5-element DArray{Float64,1,Array{Float64,1}}:
0.0
0.0
0.0
0.0
0.0
julia> B = distribute(B)
6-element DArray{Float64,1,Array{Float64,1}}:
1.0
4.0
9.0
16.0
25.0
36.0
Then I ran:
julia> A .= B[2:end] - B[1:end-1]
5-element DArray{Float64,1,Array{Float64,1}}:
3.0
5.0
7.0
9.0
11.0
As you can see, it works. However, it is extremely slow (I tried it with large arrays, i.e. a large nx). This means it does unnecessary allocations and/or data transfer. Ideally, this statement should not allocate at all (as A is pre-allocated), and each worker should only fetch one value from the boundary of each neighboring worker's local array.
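To illustrate what I mean by fetching only boundary values, here is a rough sketch of that communication pattern written out manually (diff_local! is just a hypothetical helper of mine, not part of DistributedArrays; it assumes A and B are distributed over the same workers, that length(A) == length(B) - 1, and that localindices is available as in recent versions of the package):

using Distributed
@everywhere using DistributedArrays

# Hypothetical helper: computes A[i] = B[i+1] - B[i] in place, fetching at
# most one remote element per chunk boundary instead of whole sub-arrays.
function diff_local!(A::DArray, B::DArray)
    @sync for p in procs(A)
        @spawnat p begin
            a  = localpart(A)
            b  = localpart(B)
            ia = localindices(A)[1]  # global indices of this worker's chunk of A
            ib = localindices(B)[1]  # global indices of this worker's chunk of B
            for (k, i) in enumerate(ia)
                l = i - first(ib) + 1  # position of B[i] in the local chunk
                r = l + 1              # position of B[i+1] in the local chunk
                left  = checkbounds(Bool, b, l) ? b[l] : B[i]    # remote fetch only at boundary
                right = checkbounds(Bool, b, r) ? b[r] : B[i+1]  # remote fetch only at boundary
                a[k] = right - left
            end
        end
    end
    return A
end

diff_local!(A, B)  # same result as A .= B[2:end] - B[1:end-1]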
I also tried to do the element-wise subtraction explicitly with the broadcast operator .- to see if this changes anything, but it failed:
julia> A .= B[2:end] .- B[1:end-1]
ERROR: MethodError: no method matching SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}(::Array{Float64,1})
Closest candidates are:
SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}(::Any, ::Any, ::Any, ::Any) where {T, N, P, I, L} at subarray.jl:14
Stacktrace:
[1] empty_localpart(::Type, ::Int64, ::Type) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:66
[2] macro expansion at ./task.jl:264 [inlined]
[3] macro expansion at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:84 [inlined]
[4] macro expansion at ./task.jl:244 [inlined]
[5] DArray(::Tuple{Int64,Int64}, ::Function, ::Tuple{Int64}, ::Array{Int64,1}, ::Array{Tuple{UnitRange{Int64}},1}, ::Array{Array{Int64,1},1}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:82
[6] DArray(::Function, ::Tuple{Int64}, ::Array{Int64,1}, ::Array{Int64,1}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:169
[7] #distribute#69(::Array{Int64,1}, ::Array{Int64,1}, ::Function, ::SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:542
[8] distribute(::SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}) at /users/omlins/.julia/dev/DistributedArrays/src/darray.jl:535
[9] _bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:119 [inlined]
[10] bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:115 [inlined]
[11] bcdistribute_args at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:122 [inlined]
[12] bcdistribute at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:111 [inlined]
[13] copyto! at /users/omlins/.julia/dev/DistributedArrays/src/broadcast.jl:61 [inlined]
[14] materialize!(::DArray{Float64,1,Array{Float64,1}}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false},SubArray{Float64,1,DArray{Float64,1,Array{Float64,1}},Tuple{UnitRange{Int64}},false}}}) at ./broadcast.jl:751
[15] top-level scope at none:0
My questions are:
- What is happening that makes A .= B[2:end] - B[1:end-1] so slow?
- Why does A .= B[2:end] .- B[1:end-1] fail?
- Is there any remedy to make statements of this kind fast with DistributedArrays (with minimal changes to the syntax used here)?
Thank you very much!
Sam
PS: note that this setup probably only works with the master version of DistributedArrays (see the related post and its solution).
UPDATE
Below are some benchmark results showing that A .= B[2:end] - B[1:end-1] is extremely slow, as noted above. Moreover, the results show that even the simpler statement A .= B[1:end-1] * 2.0 performs very badly, while B .= B * 2.0 performs much better. Thus, it seems that DistributedArrays currently cannot handle the selection of sub-arrays well. The benchmark output shows that many allocations took place; one explanation could be that a new array is allocated for each sub-array expression (e.g. B[1:end-1]).
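If I read the stacktrace above correctly (the MethodError mentions SubArray{Float64,1,DArray,...}), the slow statement effectively expands to something like the following (a hedged reconstruction on my part, not verified package internals):

t1 = B[2:end]    # a SubArray wrapping the DArray (cf. the MethodError above)
t2 = B[1:end-1]  # another SubArray wrapping the DArray
t3 = t1 - t2     # generic fallback materializes a plain Array, reading the
                 # DArray one scalar getindex at a time
A .= t3          # the temporary is then re-distributed and copied into A

That would account for both the allocations and the data transfer.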
The code used for benchmarking
@everywhere using DistributedArrays
using BenchmarkTools
nx = 1024^2;
A = zeros(nx-1);
B = (1:nx).^2.0;
A = distribute(A);
B = distribute(B);
function f!(A, B)
    A .= B[2:end] - B[1:end-1]  # stencil with two sub-array selections
end
function g!(A, B)
    A .= B[1:end-1] * 2.0       # one sub-array selection, then scaling
end
function h!(B)
    B .= B * 2.0                # no sub-array selection (non-fused product)
end
function j!(B)
    B .= B .* 2.0               # no sub-array selection (fully broadcast)
end
bench_f = @benchmark f!($A, $B)
bench_g = @benchmark g!($A, $B)
bench_h = @benchmark h!($B)
bench_j = @benchmark j!($B)
display(bench_f); println()
display(bench_g); println()
display(bench_h); println()
display(bench_j); println()
The benchmarking results (the four trials below correspond to f!, g!, h! and j!, in this order)
> srun -u -C gpu -n 1 julia -p 12 test.jl
srun: job 903966 queued and waiting for resources
srun: job 903966 has been allocated resources
BenchmarkTools.Trial:
memory estimate: 5.05 GiB
allocs estimate: 160694245
--------------
minimum time: 187.977 s (0.31% GC)
median time: 187.977 s (0.31% GC)
mean time: 187.977 s (0.31% GC)
maximum time: 187.977 s (0.31% GC)
--------------
samples: 1
evals/sample: 1
BenchmarkTools.Trial:
memory estimate: 2.53 GiB
allocs estimate: 80349655
--------------
minimum time: 93.760 s (0.37% GC)
median time: 93.760 s (0.37% GC)
mean time: 93.760 s (0.37% GC)
maximum time: 93.760 s (0.37% GC)
--------------
samples: 1
evals/sample: 1
BenchmarkTools.Trial:
memory estimate: 270.23 KiB
allocs estimate: 2822
--------------
minimum time: 2.194 ms (0.00% GC)
median time: 2.323 ms (0.00% GC)
mean time: 4.622 ms (1.78% GC)
maximum time: 138.691 ms (0.00% GC)
--------------
samples: 1081
evals/sample: 1
BenchmarkTools.Trial:
memory estimate: 129.36 KiB
allocs estimate: 1178
--------------
minimum time: 1.003 ms (0.00% GC)
median time: 1.026 ms (0.00% GC)
mean time: 1.064 ms (1.79% GC)
maximum time: 42.051 ms (96.83% GC)
--------------
samples: 4683
evals/sample: 1
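For comparison, the same stencil on a plain local Array is essentially allocation-free when written with @views (a sketch with the same nx; floc! is just a hypothetical name and timings are not included here):

using BenchmarkTools

nx = 1024^2
Al = zeros(nx-1)
Bl = (1:nx).^2.0

function floc!(A, B)
    @views A .= B[2:end] .- B[1:end-1]  # views avoid copying the slices
end

@benchmark floc!($Al, $Bl)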