I am implementing a distributed transpose kernel. It splits an array into equal-sized ‘subarrays’, distributes them to my worker processes, and then performs a remotecall on each worker to a function with the following signature:
@everywhere function do_distributed_transpose(args...)
This function performs some number of iterations of work on the worker's local section of the array. At each iteration it also sends transposed ‘blocks’ of its array to other workers through RemoteChannels for the other workers to acknowledge. Each worker also expects to receive blocks, calling take! on its own RemoteChannel.
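To make the exchange pattern concrete, here is a minimal sketch of two workers swapping transposed blocks through RemoteChannels. It is not my actual kernel: `exchange_block`, the inbox names, the two-worker setup and the 2×2 data are all hypothetical stand-ins, and it assumes a single machine with local workers.

```julia
using Distributed

# Assumption: a local test setup; add two workers if fewer than two exist.
nworkers() < 2 && addprocs(2)

# Hypothetical stand-in for one iteration of the real kernel: send the
# transposed local block to a peer, then wait for the peer's block.
@everywhere function exchange_block(local_block, inbox, peer_inbox)
    put!(peer_inbox, permutedims(local_block))  # send transposed block to the peer
    return take!(inbox)                         # block until the peer's block arrives
end

w1, w2 = workers()[1:2]

# One inbox per worker, hosted on the worker that calls take! on it.
# Buffer size 1 lets put! return before the peer has called take!.
inbox1 = RemoteChannel(() -> Channel{Matrix{Float64}}(1), w1)
inbox2 = RemoteChannel(() -> Channel{Matrix{Float64}}(1), w2)

a = [1.0 2.0; 3.0 4.0]
b = [5.0 6.0; 7.0 8.0]

f1 = remotecall(exchange_block, w1, a, inbox1, inbox2)
f2 = remotecall(exchange_block, w2, b, inbox2, inbox1)

received_on_w1 = fetch(f1)  # transposed block sent by w2
received_on_w2 = fetch(f2)  # transposed block sent by w1
```

Each inbox lives on the worker that reads from it, so take! is a local operation and only put! crosses process boundaries.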
do_distributed_transpose also calls other functions on the same worker, which are likewise declared with:
@everywhere function helper_function()
When I analyse this code under ‘callgrind’ I see that, even for large numbers of iterations, 40% of the runtime is spent in calls to the Julia JIT compiler, even after running @everywhere precompile(f, args) for all functions used remotely.
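For reference, this is roughly the shape of the precompile step, written so that each worker reports back what precompile returned. It is a sketch under assumptions: `helper_function` here is a hypothetical one-argument stand-in, and `Matrix{Float64}` is an assumed argument type. Note that precompile expects a tuple of argument types rather than argument values.

```julia
using Distributed

# Assumption: a local test setup; add two workers if fewer than two exist.
nworkers() < 2 && addprocs(2)

# Hypothetical stand-in for one of the functions called remotely.
@everywhere helper_function(x::Matrix{Float64}) = permutedims(x)

# Run precompile on every worker for the concrete argument types and
# collect the Bool each call returns, keyed by worker id.
results = Dict(
    p => remotecall_fetch(() -> precompile(helper_function, (Matrix{Float64},)), p)
    for p in workers()
)
```

A `false` entry in `results` would suggest that the signature passed to precompile did not match a method on that worker, which is one thing worth ruling out before looking elsewhere for the JIT time.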
For a distributed workload, is it recommended to compile distributed functions inside of another module and then import the module for all processes?