Using Distributed: computational efficiency

Background:

I want to fill the elements of a matrix and use @distributed to speed it up. Here is my code:

using BenchmarkTools

# Serial version: write each linear index into an N×N matrix.
function fill(N)
    S = zeros(N, N)
    for i in 1:length(S)
        S[i] = i
    end
end

@btime fill(1000)
  4.334 ms (2 allocations: 7.63 MiB)

The following is my parallel code:

using Distributed
addprocs(2)
@everywhere using SharedArrays

# Parallel version: workers write into a SharedArray via @distributed.
function fill_shared(N)
    S = SharedMatrix{Int64}(N, N)
    @sync @distributed for i in 1:length(S)
        S[i] = i
    end
end
@btime fill_shared(1000)
  5.458 ms (552 allocations: 23.44 KiB)

Question:

Why does the code take longer after parallelization :joy:, and where is my mistake?

I'm not sure there are any errors here, but simply writing a million integers into a 1000×1000 matrix might not be enough work to amortize the additional overhead introduced by the parallelization.
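For instance, here is a rough sketch (not from the thread; heavy, fill_heavy!, and fill_heavy_dist! are made-up names, and it assumes the workers were already added as in the code above) where each iteration does substantially more work than a single integer store, so the distributed version has a chance to come out ahead even at 1000×1000:

using Distributed, SharedArrays, BenchmarkTools
# assumes addprocs(2) and @everywhere using SharedArrays were already run

# Artificially expensive per-element computation, so each iteration carries
# enough work to amortize the distributed scheduling overhead.
@everywhere heavy(i) = sum(sin(i + j) for j in 1:100)

function fill_heavy!(S)
    for i in 1:length(S)
        S[i] = heavy(i)
    end
    return S
end

function fill_heavy_dist!(S)
    @sync @distributed for i in 1:length(S)
        S[i] = heavy(i)
    end
    return S
end

@btime fill_heavy!($(zeros(1000, 1000)));
@btime fill_heavy_dist!($(SharedMatrix{Float64}(1000, 1000)));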


Hi, I think there are two separate issues here. One is that the function you are benchmarking also allocates the array to be filled, and allocating a SharedArray may take a bit longer than allocating a regular Array. The other, as @nilshg commented, is that there is some overhead when calling the workers, which as I read somewhere is on the order of milliseconds.
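To see the first point in isolation, one could time just the two allocations (a quick sketch, assuming the two workers are already added):

using SharedArrays, BenchmarkTools

# Plain Array allocation vs. SharedArray allocation: the latter also has to
# map a shared-memory segment and make it known to the participating workers.
@btime zeros(1000, 1000);
@btime SharedMatrix{Float64}(1000, 1000);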

I made some benchmarks removing the array allocation to observe the overhead:

using LinearAlgebra, BenchmarkTools, Distributed, SharedArrays

# Serial fill of a preallocated array with its linear indices.
function fill(S)
    for i in 1:length(S)
        S[i] = i
    end
end

function bch_fill(N)
    S = zeros(N, N)
    @btime fill($S)
    @everywhere GC.gc()
    return 
end

# Distributed fill of a preallocated SharedArray.
function dist_fill(S)
    @sync @distributed for i in 1:length(S)
        S[i] = i
    end
end

function bch_dist_fill(N)
    S = SharedArray{Int}(N, N)
    @btime dist_fill($S)
    @everywhere GC.gc()
    return 
end

# Start benchmarking.

bch_fill(1000)
bch_fill(10_000)

addprocs(2);
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)

addprocs(2);
println("num workers: $(nworkers())")
bch_dist_fill(1000)
bch_dist_fill(10_000)

I get:

# Serial benchmarks.
  1.504 ms (0 allocations: 0 bytes)
  154.073 ms (0 allocations: 0 bytes)

# Distributed benchmarks.
num workers: 2
  1.149 ms (303 allocations: 14.41 KiB)
  95.813 ms (311 allocations: 14.53 KiB)
num workers: 4
  1.383 ms (619 allocations: 28.63 KiB)
  71.406 ms (627 allocations: 28.75 KiB)

So yes, there is some overhead on the order of milliseconds when using Distributed, so it does not pay off for just filling a 1000×1000 array. However, if you wanted to write a larger array, say 10000×10000, the parallel computation would be faster.
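If the per-iteration scheduling of @distributed is a concern, another pattern (in the spirit of the shared-array examples in the Julia manual; fill_chunk! and fill_by_chunks! below are my own names) is to hand each worker its own contiguous chunk of indices via localindices:

using Distributed, SharedArrays
# assumes workers have already been added with addprocs(...)
@everywhere using SharedArrays

# Each worker fills only the block of linear indices assigned to it.
@everywhere function fill_chunk!(S)
    for i in localindices(S)
        S[i] = i
    end
end

function fill_by_chunks!(S)
    @sync for p in procs(S)            # processes that map this SharedArray
        @async remotecall_wait(fill_chunk!, p, S)
    end
    return S
end

S = SharedMatrix{Int}(10_000, 10_000)
fill_by_chunks!(S)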

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_NUM_THREADS = 12

Hello, I agree with you. A small follow-up question: suppose I increase the size of the array, for example:

function fill(N)
    S = zeros(N,N)
    for i in 1:length(S)
        S[i] = i
    end  
end

@btime fill(100000)

It returns an error:

OutOfMemoryError()

Does creating a 100000×100000 array really require that much memory? My computer has 8 GB of RAM, and I don't understand this very well. Is my code wrong, or is my computer simply not powerful enough? :sweat_smile:

julia> sizeof(zeros(10_000, 10_000))
800000000

julia> Base.summarysize(zeros(10_000,10_000))
800000040

so roughly 0.8 GB for a 10,000 x 10,000 Array{Float64, 2}. Your array has 100 times as many elements, so it would need about 80 GB, far more than your 8 GB of RAM.
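One way to sanity-check this before allocating is a quick back-of-the-envelope calculation (mem_gb is a hypothetical helper, not a Base function):

# Estimated size in GB of a dense N×N Float64 matrix: N^2 elements × 8 bytes each.
mem_gb(N) = N^2 * sizeof(Float64) / 1e9

mem_gb(10_000)    # ≈ 0.8 GB, fits in 8 GB of RAM
mem_gb(100_000)   # ≈ 80.0 GB, far beyond 8 GB of RAM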

Thank you, I understand. :grin: