I want to fill a matrix with elements and use @distributed for acceleration. Here is my code:
using BenchmarkTools

function fill(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i
    end
    return S
end
@btime fill(1000)
4.334 ms (2 allocations: 7.63 MiB)
The following is my parallel code:
using Distributed
addprocs(2)
@everywhere using SharedArrays
function fill_shared(N)
    S = SharedMatrix{Int64}(N, N)
    @sync @distributed for i in 1:length(S)
        S[i] = i
    end
    return S
end
@btime fill_shared(1000)
5.458 ms (552 allocations: 23.44 KiB)
Question:
Why does my code take longer after parallelization, and where is the error?
I’m not sure there are any errors here, but simply filling a 1000×1000 matrix with integers might not be enough work to amortize the additional overhead introduced by the parallelization.
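To get a rough feel for that fixed cost, you could time a @distributed loop whose body does essentially no work. This is just a minimal sketch (the distributed_overhead helper is only for illustration, and the exact numbers will depend on your machine and worker count):

using Distributed, BenchmarkTools
# Assumes workers have already been added, e.g. with addprocs(2).
function distributed_overhead(n)
    # The loop body does no real work, so what @btime measures is mostly the
    # fixed cost of partitioning the range, shipping the chunks to the
    # workers, and waiting for them with @sync.
    @sync @distributed for i in 1:n
        nothing
    end
end
@btime distributed_overhead(1_000_000)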
Hi, I think there are two separate issues here. One is that the function you are benchmarking also allocates the array to be filled, and allocating a SharedArray may take a bit longer than allocating a regular Array. The other, as @nilshg commented, is that there is some overhead when calling the workers, which, as I read somewhere, is on the order of milliseconds.
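To illustrate the first point, you could also benchmark just the allocations on their own. A minimal sketch (timings will of course vary by machine):

using Distributed, SharedArrays, BenchmarkTools
# Assumes workers have already been added with addprocs(2).
@btime zeros(Int, 1000, 1000)          # ordinary Array: a plain allocation
@btime SharedMatrix{Int}(1000, 1000)   # SharedArray: also maps a shared-memory
                                       # segment on every participating process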
I ran some benchmarks with the array allocation removed to isolate the overhead:
using LinearAlgebra, BenchmarkTools, Distributed, SharedArrays

# Serial fill of an ordinary Array.
function fill(S)
    for i in 1:length(S)
        S[i] = i
    end
end

function bch_fill(N)
    S = zeros(N, N)
    @btime fill($S)
    @everywhere GC.gc()   # force a GC on all processes between runs
    return
end

# Distributed fill of a SharedArray.
function dist_fill(S)
    @sync @distributed for i in 1:length(S)
        S[i] = i
    end
end

function bch_dist_fill(N)
    S = SharedArray{Int}(N, N)
    @btime dist_fill($S)
    @everywhere GC.gc()
    return
end
# Start benchmarking.
bch_fill(1000)
bch_fill(10_000)
addprocs(2);
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)
addprocs(2);
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)
I get:
# Serial benchmarks.
1.504 ms (0 allocations: 0 bytes)
154.073 ms (0 allocations: 0 bytes)
# Distributed benchmarks.
num workers: 2
1.149 ms (303 allocations: 14.41 KiB)
95.813 ms (311 allocations: 14.53 KiB)
num workers: 4
1.383 ms (619 allocations: 28.63 KiB)
71.406 ms (627 allocations: 28.75 KiB)
So yes, I think there is some overhead on the order of milliseconds when you use Distributed for an array as small as 1000×1000. However, for a larger array, say 10000×10000, the parallel computation is faster.
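To put rough numbers on that, dividing the serial timings above by the distributed ones gives the approximate speedups:

# 10000×10000 fill, using the timings reported above:
154.073 / 95.813   # ≈ 1.6× speedup with 2 workers
154.073 / 71.406   # ≈ 2.2× speedup with 4 workers
# 1000×1000 fill: the overhead eats most of the gain.
1.504 / 1.149      # ≈ 1.3× with 2 workers
1.504 / 1.383      # ≈ 1.1× with 4 workers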
julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
JULIA_NUM_THREADS = 12
Hello, I agree with you. A small follow-up question: suppose I increase the size of the array, for example:
function fill(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i
    end
    return S
end
@btime fill(100000)
It will return an error:
OutOfMemoryError()
Does creating a 100000×100000 array really require that much memory? My computer has 8 GB of RAM, and I don't understand this very well. Is my code wrong, or is my computer just not powerful enough?
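For reference, a quick back-of-the-envelope estimate (assuming the default Float64 element type returned by zeros, 8 bytes per element) shows how much memory a 100000×100000 matrix needs:

# 100_000 * 100_000 elements * 8 bytes per Float64
100_000 * 100_000 * 8 / 2^30   # ≈ 74.5 GiB

That is roughly ten times the 8 GB of RAM available, so an OutOfMemoryError is expected for an array of that size regardless of how the code is written.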