I am trying to assemble a matrix using multiple processes. To make sure each process writes only to the part of the matrix it owns, I pass a callback as one of the arguments to the single-process jobs launched by the driver.
    function assemble()
        M = N = 1000
        A = zeros(M,N)
        store = (v,m,n) -> (A[m,n] += v)

        P = procs()
        length(P) > 1 && (P = P[2:end])
        splits = [round(Int,s) for s in linspace(0, M, length(P)+1)]

        @sync begin
            for (i,p) in enumerate(P)
                start, stop = splits[i]+1, splits[i+1]
                storei = (v,m,n) -> store(v,start+m-1,n)
                Mi = stop - start + 1
                @async remotecall_wait(assemblechunk, p, Mi, N, storei)
            end
        end
        A
    end

    function assemblechunk(M,N,store)
        for m in 1:M
            for n in 1:N
                v = randn()
                store(v,m,n)
            end
        end
    end

    assemble();
    @time assemble();
I am getting a lot of allocations and a serious slowdown in execution by doing this. The output of @time assemble() is:
0.167571 seconds (5.45 M allocations: 90.761 MiB, 8.63% gc time)
When I replace storei with store (which I know does not give the desired result), the penalty disappears:
0.011816 seconds (115 allocations: 7.662 MiB)
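For scale, here is some back-of-the-envelope arithmetic on the two timings above, as a rough sketch only. It assumes that the M*N = 10^6 calls to the callback account for all of the extra allocations, which I have not verified; the variable names are just for illustration.

```julia
# Numbers copied from the two @time outputs above
calls       = 1000 * 1000   # M * N calls to the callback
slow_allocs = 5.45e6        # allocations reported for the slow run
fast_allocs = 115           # allocations reported for the fast run
slow_mib    = 90.761        # memory reported for the slow run
fast_mib    = 7.662         # memory reported for the fast run

# Difference between the two runs (assumed to come from the callback calls)
extra_allocs = slow_allocs - fast_allocs
extra_bytes  = (slow_mib - fast_mib) * 2^20   # MiB to bytes

println(extra_allocs / calls)         # extra allocations per callback call
println(extra_bytes / extra_allocs)   # bytes per extra allocation
```

If the assumption holds, that works out to about five extra allocations per call, each of roughly 16 bytes.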
Any idea where this comes from, or how I could start analysing it?