Best strategy to profile parallel code

Does anyone know a relatively cheap way to profile parallel code?

I’ve seen an old thread on Stack Overflow: Julia: How to profile parallel code. The suggestion there was to add profiling to the functions that are spawned on the workers and then aggregate the profile data on the master process.
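Concretely, that suggestion amounts to something like the following sketch (heavy_computation is just a toy stand-in for the real worker function):

using Distributed, Profile
addprocs(2)
@everywhere using Profile

@everywhere heavy_computation() = sum(abs2, rand(10^7))   # toy stand-in for the real work

# worker-side wrapper: profile the real work and ship the raw data back
@everywhere function profiled_work()
    Profile.clear()
    result = @profile heavy_computation()
    return result, Profile.retrieve()   # (samples, line-info dict)
end

# on the master: fetch the result together with the profile data
fut = @spawnat 2 profiled_work()
result, (data, lidict) = fetch(fut)
Profile.print(data, lidict)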

Such an approach, however, requires writing specialized code just for profiling.
Is there a more economical way to do this?

P.S.

Maybe it can be done with a macro.
Such a macro would be applied to functions: it would inspect the function’s code for @spawnat, @fetchfrom, etc. and wrap the spawned functions with code that facilitates profiling.
But the question then is: how does one access the code of the function that the macro is applied to?

Hmm, this is a bit of a difficult one. There may be better ideas out there, but the first thing I would try is basically the suggestion from Stack Overflow. The way I would go about it would be to create a macro that you inline with your parallel calls.

So something like:

macro parallelprofile(ex)
...
end

function parallelfunc(...)
    @parallelprofile @spawnat ...
end

What the @parallelprofile macro does is wrap the expression in a bit of code that runs @profile on the worker process and then passes the results of Profile.retrieve() back to a known RemoteChannel on the main process.
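A minimal sketch of what I have in mind (it assumes the call site always looks exactly like @spawnat pid expr, with no robust error handling, and PROFILE_RESULTS is just a name I picked for the channel):

using Distributed, Profile
@everywhere using Profile   # assumes workers have already been added

# known channel on the main process, collecting (pid, samples, lidict) tuples
const PROFILE_RESULTS = RemoteChannel(() -> Channel{Any}(100))

macro parallelprofile(ex)
    # expects ex to be exactly `@spawnat pid expr`
    @assert ex.head === :macrocall && ex.args[1] === Symbol("@spawnat")
    pid  = ex.args[3]
    body = ex.args[4]
    quote
        @spawnat $(esc(pid)) begin
            Profile.clear()
            result = @profile $(esc(body))
            # ship (worker id, samples, line-info dict) back to the master
            put!(PROFILE_RESULTS, (myid(), Profile.retrieve()...))
            result
        end
    end
end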

That way you collect all the profiling information from your workers on the main process, where you can inspect the data. Also, there is no need for custom code other than what the macro does.
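Draining the channel on the main process might then look like this (again just a sketch, reusing the names from above; some_work is a placeholder workload):

@everywhere some_work() = sum(abs2, rand(10^7))   # placeholder workload

# run some profiled calls, e.g. one per worker
futs = [@parallelprofile @spawnat p some_work() for p in workers()]
foreach(wait, futs)

# then inspect each worker's profile on the master
while isready(PROFILE_RESULTS)
    pid, data, lidict = take!(PROFILE_RESULTS)
    println("=== worker $pid ===")
    Profile.print(data, lidict)
end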

I would say that in most cases, trying to inspect the code of a method and then writing a new method/function to run in its place is extremely difficult and error-prone.