I am trying to assemble a matrix using multiple processes. To make sure each process writes only to the part of the matrix it owns, I pass a callback as one of the arguments to the single-process jobs launched by the driver:
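
    function assemble()
        M = N = 1000
        A = zeros(M, N)
        # Callback that accumulates a value into the full matrix
        store = (v, m, n) -> (A[m, n] += v)
        # Use only the worker processes if any are available
        P = procs()
        length(P) > 1 && (P = P[2:end])
        # Row boundaries of the chunk owned by each process
        splits = [round(Int, s) for s in linspace(0, M, length(P)+1)]
        @sync begin
            for (i, p) in enumerate(P)
                start, stop = splits[i]+1, splits[i+1]
                # Translate chunk-local row indices into global row indices
                storei = (v, m, n) -> store(v, start+m-1, n)
                Mi = stop - start + 1
                @async remotecall_wait(assemblechunk, p, Mi, N, storei)
            end
        end
    end

    function assemblechunk(M, N, store)
        for m in 1:M
            for n in 1:N
                v = randn()
                # Write the value through the callback
                store(v, m, n)
            end
        end
    end

    assemble(); @time assemble();
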
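For reference, the row partition computed by `splits` can be checked in isolation. This is a small sketch with hypothetical sizes (10 rows, 3 workers) and a hypothetical helper `row_chunks` that is not part of my code; it is written for Julia 1.x, where `range(0, stop=M, length=k)` takes the place of `linspace(0, M, k)`:

```julia
# Hypothetical helper: the row range each of `nworkers` workers would own.
function row_chunks(M, nworkers)
    # Same boundaries as `splits` in assemble(), using the Julia 1.x `range`.
    splits = [round(Int, s) for s in range(0, stop=M, length=nworkers + 1)]
    # Worker i owns rows splits[i]+1 through splits[i+1].
    return [(splits[i] + 1):splits[i + 1] for i in 1:nworkers]
end

chunks = row_chunks(10, 3)
println(chunks)
```

The ranges are contiguous and non-overlapping, so every row is assigned to exactly one worker.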
I am getting a lot of allocations and a serious slowdown by doing this. The output of @time is:

    0.167571 seconds (5.45 M allocations: 90.761 MiB, 8.63% gc time)

When I replace storei by store (which I know does not give the desired result), the penalty disappears:

    0.011816 seconds (115 allocations: 7.662 MiB)

Any idea where this overhead comes from? Or even how to start analysing it?
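
One possible starting point, as a sketch only: closures store their captured variables as fields, and a captured variable that is assigned more than once gets wrapped in a heap-allocated `Core.Box`, which makes each access type-unstable. Whether that is what happens with storei is an assumption, not something established here. The functions `shifted_boxed` and `shifted_let` below are hypothetical and only illustrate how the field type of a closure can be inspected:

```julia
# Hypothetical examples; not part of the code in the question.

# `offset` is reassigned inside the function, so the closure
# captures it through a Core.Box instead of a concretely typed field.
function shifted_boxed(offset)
    if offset < 0
        offset = -offset
    end
    return m -> m + offset
end

# `let` introduces a fresh binding that is assigned exactly once,
# so the closure can store the value with its concrete type.
function shifted_let(offset)
    if offset < 0
        offset = -offset
    end
    f = let offset = offset
        m -> m + offset
    end
    return f
end

boxed = shifted_boxed(-3)
unboxed = shifted_let(-3)

# Captured variables are fields of the closure object; print their types.
println(fieldtype(typeof(boxed), 1))
println(fieldtype(typeof(unboxed), 1))
```

Applying the same `fieldtype` check to storei, or running `@code_warntype` on a call to it, would show whether a `Core.Box` is involved.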