Performance hit when using callbacks

I am trying to assemble a matrix using multiple processes. To make sure each process writes only to the part of the matrix it owns, I pass a callback as one of the arguments to the single-process jobs launched by the driver.

function assemble()
    M = N = 1000
    A = zeros(M,N)
    store = (v,m,n) -> (A[m,n] += v)

    P = procs()
    length(P) > 1 && (P = P[2:end])
    splits = [round(Int,s) for s in linspace(0, M, length(P)+1)]

    @sync begin
        for (i,p) in enumerate(P)
            start, stop = splits[i]+1, splits[i+1]
            storei = (v,m,n) -> store(v,start+m-1,n)
            Mi = stop - start + 1
            @async remotecall_wait(assemblechunk, p, Mi, N, storei)
        end
    end
    A
end

function assemblechunk(M,N,store)
    for m in 1:M
        for n in 1:N
            v = randn()
            store(v,m,n)
        end
    end
end

assemble(); @time assemble();

Doing this causes a lot of allocations and a serious slowdown. The output of @time is:

0.167571 seconds (5.45 M allocations: 90.761 MiB, 8.63% gc time)

When I replace storei by store (which I know does not give the desired result), the penalty disappears:

0.011816 seconds (115 allocations: 7.662 MiB)

Any idea where this comes from? Or even how to start analysing it?

Ah…

https://github.com/JuliaLang/julia/issues/15276

Changing the line that assigns start and stop to the following fixed it:

start::Int, stop::Int = splits[i]+1, splits[i+1]
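
For anyone landing here later, here is a minimal single-process sketch of the kind of capture the linked issue describes, as far as I understand it. The function names are made up for illustration and are not part of the code above. In the first version the captured variable is reassigned, so it ends up in a box and the closure cannot infer its type. The second version uses a type annotation on the variable, as in my fix. The third rebinds the value with a let block, which is the other workaround I have seen suggested. In cases like this, @code_warntype should show the captured variable as Core.Box.

```julia
# Hypothetical reduction of the pattern; not the original assemble() code.

# `offset` is assigned twice (as an argument and in the guard), so the closure
# captures it through a box and cannot infer its type.
function shifted_store_boxed(A, offset)
    offset < 0 && (offset = 0)
    return (v, m, n) -> (A[offset + m, n] += v)
end

# Workaround 1: annotate the captured variable itself. It is still boxed, but
# every read is known to be an Int.
function shifted_store_annotated(A, offset)
    off::Int = offset
    off < 0 && (off = 0)
    return (v, m, n) -> (A[off + m, n] += v)
end

# Workaround 2: rebind inside `let`, so the closure captures a fresh variable
# that is assigned exactly once.
function shifted_store_let(A, offset)
    offset < 0 && (offset = 0)
    return let offset = offset
        (v, m, n) -> (A[offset + m, n] += v)
    end
end

A = zeros(4, 3)
shifted_store_boxed(A, 1)(1.0, 1, 1)       # adds 1.0 to A[2, 1]
shifted_store_annotated(A, 1)(1.0, 1, 1)   # same entry
shifted_store_let(A, 1)(1.0, 1, 1)         # same entry
```

All three closures do the same thing; only how the captured variable is stored differs, so the difference shows up in @time and @code_warntype rather than in the result.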