# Using Distributed: computational efficiency

## Background:

I want to fill a matrix with elements and want to use `@distributed` for acceleration. Here is my code:

```julia
using BenchmarkTools

function fill(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i
    end
    return S
end
```

```julia
@btime fill(1000)
# 4.334 ms (2 allocations: 7.63 MiB)
```
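(As an aside, linear indexing `S[i]` walks the matrix in column-major order, so the values fill down each column first. A tiny sketch, using a hypothetical `fill_demo` that also returns the result:)

```julia
function fill_demo(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i      # linear index i maps to column-major position (row, col)
    end
    return S
end

fill_demo(2)
# 2×2 array, columns filled first: [1.0 3.0; 2.0 4.0]
```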


The following is my parallel code:

```julia
using Distributed
@everywhere using SharedArrays

function fill_shared(N)
    S = SharedMatrix{Int64}(N, N)
    @sync @distributed for i in 1:length(S)
        S[i] = i
    end
    return S
end
```

```julia
@btime fill_shared(1000)
# 5.458 ms (552 allocations: 23.44 KiB)
```


## Question:

Why does my code become slower after parallelization, and where is the error?

I’m not sure there are any errors here, but simply writing a million integers into a matrix might not be enough work to amortize the additional overhead introduced by the parallelization.


Hi, I think there are two separate issues here. One is that the function you are benchmarking also allocates the array to be filled; allocating a SharedArray may take a bit longer than a usual Array. The other, as @nilshg commented, is that there is some overhead when calling the workers, which as I read somewhere is on the order of milliseconds.
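To separate the two effects, here is a quick sketch comparing just the allocation cost of the two array types (timings are machine-dependent; the point is the relative difference):

```julia
using BenchmarkTools, SharedArrays

N = 1000

# Plain Array: a single heap allocation.
@btime zeros($N, $N)

# SharedArray: backed by a memory-mapped shared segment, so creation
# involves extra setup work beyond the raw allocation.
@btime SharedMatrix{Float64}($N, $N)
```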

I made some benchmarks removing the array allocation to observe the overhead:

```julia
using LinearAlgebra, BenchmarkTools, Distributed, SharedArrays

function fill(S)
    for i = 1:length(S)
        S[i] = i
    end
end

function bch_fill(N)
    S = zeros(N, N)
    @btime fill($S)
    @everywhere GC.gc()
    return
end

function dist_fill(S)
    @sync @distributed for i = 1:length(S)
        S[i] = i
    end
end

function bch_dist_fill(N)
    S = SharedArray{Int}(N, N)
    @btime dist_fill($S)
    @everywhere GC.gc()
    return
end
```

```julia
# Start benchmarking.
bch_fill(1000)
bch_fill(10_000)

# Distributed benchmarks (the session was started with 2 workers).
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)

addprocs(2)
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)
```


I get:

```
  1.504 ms (0 allocations: 0 bytes)
154.073 ms (0 allocations: 0 bytes)

# Distributed benchmarks.
num workers: 2
  1.149 ms (303 allocations: 14.41 KiB)
 95.813 ms (311 allocations: 14.53 KiB)
num workers: 4
  1.383 ms (619 allocations: 28.63 KiB)
 71.406 ms (627 allocations: 28.75 KiB)
```


So yes, I think there is some overhead, on the order of milliseconds, if you use Distributed on only a 1000×1000 array. However, if you want to fill a larger array, say 10,000×10,000, then the parallel computation is faster.
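For intuition on why larger arrays amortize the cost: `@distributed` splits the index range into one contiguous chunk per worker, so the fixed per-worker messaging cost is paid once per chunk regardless of how large the chunk is. A rough sketch of such a partition (the helper `chunk_ranges` is hypothetical; the actual partitioning logic is internal to Distributed):

```julia
# Split 1:n into p contiguous ranges of near-equal length,
# similar in spirit to how @distributed assigns work to p workers.
function chunk_ranges(n::Int, p::Int)
    chunks = UnitRange{Int}[]
    start = 1
    for k in 1:p
        # Give each remaining chunk an equal share of what is left.
        stop = start + div(n - start + 1, p - k + 1) - 1
        push!(chunks, start:stop)
        start = stop + 1
    end
    return chunks
end

chunk_ranges(10, 3)  # → [1:3, 4:6, 7:10]
```

Whether each chunk holds a thousand or a million indices, the per-chunk dispatch cost is the same, which is why the 10,000×10,000 case scales better.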

```julia
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
```

Hello, I agree with you. A small follow-up question: suppose I increase the size of the array, for example:

```julia
function fill(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i
    end
    return S
end

@btime fill(100000)
```


It will return an error:

```
OutOfMemoryError()
```


Does creating a 100000×100000 array require a lot of memory? My computer has 8 GB of RAM, and I don’t understand this very well. Is my code wrong, or is my computer’s performance simply not enough?

```julia
julia> sizeof(zeros(10_000, 10_000))
800000000

julia> Base.summarysize(zeros(10_000, 10_000))
800000040
```


so roughly 0.8 GB for a 10,000×10,000 `Array{Float64, 2}` — your array is 100 times this size, so it would need about 80 GB.
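The arithmetic can be sketched directly (`mem_gb` is a hypothetical helper): an N×N `Matrix{Float64}` holds N² elements of 8 bytes each.

```julia
# Estimated size in GB of an N×N Matrix{Float64}: N^2 elements × 8 bytes each.
mem_gb(N) = N^2 * 8 / 1e9

mem_gb(10_000)   # → 0.8   (fits comfortably in 8 GB of RAM)
mem_gb(100_000)  # → 80.0  (10× the machine's RAM, hence OutOfMemoryError())
```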

Thank you, I understand. 