Avoiding JIT Compilation of Remotecall Functions

I am implementing a distributed transpose kernel. It works by splitting an array into equal-sized ‘subarrays’, distributing them to my worker processes, and then performing a remotecall on each worker to a function with the following signature:

@everywhere function do_distributed_transpose(args...)

This function performs some number of iterations of work on each worker’s local section of the array, and at each iteration it also sends transposed ‘blocks’ of that section to the other workers via RemoteChannels. Each worker likewise expects to receive blocks, calling take! on its own RemoteChannels.
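To show the exchange pattern I mean in isolation (this is just an illustrative sketch, not my actual code):

using Distributed
addprocs(2)

# A channel owned by the first worker; the second worker writes into it
inbox = RemoteChannel(() -> Channel{Matrix{Float64}}(1), workers()[1])

# Sender: put a block into the receiver's inbox
remotecall_wait(workers()[2], inbox) do ch
    put!(ch, ones(2, 2))
end

# Receiver: take the block out of its own inbox
block = remotecall_fetch(workers()[1], inbox) do ch
    take!(ch)
end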

do_distributed_transpose also calls helper functions on the same worker, declared in the same way:

@everywhere function helper_function()

When I analyse this code under callgrind I see that, even for large numbers of iterations, 40% of the runtime is spent in calls to the Julia JIT compiler, even after running @everywhere precompile(f, argtypes) for all functions used remotely.
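For reference, the precompile calls take roughly this form (the argument types here are only placeholders, not my exact signatures):

# Ask every process to compile specializations ahead of the remotecalls;
# the tuple lists the argument types to specialize on (placeholders here)
@everywhere precompile(do_distributed_transpose, (Matrix{Float64}, Int))
@everywhere precompile(helper_function, ())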

For a distributed workload, is it recommended to define the distributed functions inside a separate module and then load that module on all processes?
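To make that concrete, the layout I have in mind is roughly the following (the module name and file name are just placeholders):

# transpose_kernel.jl
module TransposeKernel

export do_distributed_transpose

function do_distributed_transpose(args...)
    # ... kernel body ...
end

end # module

# driver script
using Distributed
addprocs(4)
@everywhere include("transpose_kernel.jl")
@everywhere using .TransposeKernel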

Can you perhaps provide a minimal working example with dummy data? Is the performance you get now problematic to you, or are you just curious where those calls come from? As far as I know, you’re not going to get around compiling the functions or making them available on your workers, but there may be some strange behaviour.


It seems that the proportion of time spent performing JIT compilation does not decrease even as I increase the problem size.

Here is a small working example. Sorry about the wait.

using Distributed

addprocs(4)

@everywhere function do_transpose()
    wid = myid() - 2                        # zero-based worker index (worker pids are 2..5)
    range(id) = 1 + id * 25 : (id+1)*25     # row range of the 25-row block owned by worker `id`
    outbound = Array{Float64}(undef,25,25)  # reusable buffer for outgoing blocks
    local incoming
    # Iterations
    for k in 1:10
        local_transpose(A_column,B_column,range(wid))
        for shift in 1 : nworkers() -1
            send_to = (wid + shift) % nworkers()
            receive_from = (wid - shift + nworkers()) % nworkers()
            # Send and receive off diagonal blocks
            @sync begin
                @async incoming = take!(inboxes[receive_from + 1])
                A_range = range(send_to)
                prepare_block(A_column, outbound, A_range)
                @async put!(outboxes[send_to + 1],outbound)
            end
            B_range = range(receive_from)
            accept_block(B_column, incoming, B_range)
        end
    end
    # Return the locally transposed column block so the caller can reassemble B
    return B_column
end

@everywhere function local_transpose(A,B,range)
    A_block = view(A,range,:)
    B_block = view(B,range,:)
    for j in 1 : 25
        for i in 1 : 25
            @inbounds B_block[j,i] += A_block[i,j]
            @inbounds A_block[i,j] += 1.0
        end
    end
end

@everywhere function prepare_block(A,out,range)
    A_block = view(A,range,:)
    for j in 1 : 25
        for i in 1 : 25
            @inbounds out[j,i] = A_block[i,j]
            @inbounds A_block[i,j] += 1.0
        end
    end
end

@everywhere function accept_block(B,incoming,range)
    # Accumulate a received transposed block into the matching block of local B
    B_block = view(B,range,:)
    for j in 1 : 25
        for i in 1 : 25
            @inbounds B_block[i,j] += incoming[i,j]
        end
    end
end

@everywhere function get_channel(proc)
    # Return this worker's inbox reserved for messages coming from worker `proc`
    return inboxes[proc-1]
end

A = [i+j for i in 1:100, j in 1:100]
B = zeros(100,100)

# Phase 1: hand every worker its column blocks and create its inboxes
@sync for proc in workers()
    start = (proc - 2) * 25
    cols = 1 + start : 25 + start
    @async remotecall_wait(proc,A[:,cols],B[:,cols]) do m_A, m_B
        global A_column = m_A
        global B_column = m_B
        global inboxes  = [RemoteChannel(()->Channel{Matrix{Float64}}(1)) for _ in workers()]
    end
end

# Phase 2: once every inbox exists, each worker looks up the other workers'
# inboxes to use as its outboxes (doing this in phase 1 races with their creation)
@sync for proc in workers()
    @async remotecall_wait(proc) do
        global outboxes = [remotecall_fetch(get_channel,p,myid()) for p in sort(workers())]
    end
end

# Run the kernel on every worker and reassemble B from the returned column blocks
work = [(@spawnat proc do_transpose()) for proc in workers()]
B = mapreduce(fetch,hcat,work)
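As a sanity check (assuming the 10 iterations and the +1 increments above, and that the received blocks are accumulated into B), each iteration adds the transpose of the current A and then increments every element of A by one, so the final result should satisfy:

# B = sum over k = 0..9 of (A + k)' = 10*A' .+ 45
@assert B ≈ 10 .* transpose(A) .+ 45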

This problem kernel has been implemented sequentially here:
https://github.com/ParRes/Kernels
The code given above is a minimal working example of a distributed implementation.