Difficulty figuring out when to use pmap, @sync, @async, @distributed

I’m trying to optimize some code running in parallel. I’m struggling to figure out the best way to structure this approach, and where to use pmap(), @sync, @async, @distributed, etc.

I am applying a null model randomization to each row of my data. The output is written to a separate file, so the order in which the highest-level operation is done doesn’t matter. Within each row, I simulate the null model a large number of times (here nsim = 100000); each simulation is relatively quick, but the whole set can take some time, which varies depending on the specific row being randomized.

Here is some mock code that gets the idea across:

using Distributed

@everywhere nsim = 100000

# `data` must also be defined on every worker (e.g. with another @everywhere)

@everywhere function run_null_model(my_data)
    # function for randomizing the distribution, and summarizing
end

@everywhere function run_simulation(i)
    my_data = data[i, :]
    # Additional manipulation of my_data takes place here.
    # This is somewhat computationally intensive, with the time
    # varying a lot across rows.
    results = zeros(Float64, nsim)
    # these simulations do not need to be conducted in order; they are random draws
    for j = 1:nsim
        results[j] = run_null_model(my_data)
    end
    # summarized results are written to a per-row file here
    return nothing
end

# run the rows in parallel
pmap(run_simulation, 1:size(data, 1))

Right now I am using pmap on the outermost function. I’ve also explored @sync @distributed, but I don’t really need to sync these up, since each process writes its own file (though maybe I’m misunderstanding what @sync does). And I cannot for the life of me figure out how I might fit a nested @sync/@async pair in here.
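For reference, here is a minimal self-contained sketch of what @sync @distributed does (with a stand-in `simulate` function, not your real per-row work): @sync only makes the calling process block until every iteration has finished on the workers; it does not serialize or reorder the iterations.

```julia
using Distributed
addprocs(2)

@everywhere simulate(i) = i^2   # stand-in for the real per-row work

# Without a reducer, @distributed for returns immediately; @sync makes the
# caller wait until every iteration has actually run on the workers.
@sync @distributed for i in 1:8
    simulate(i)   # in your real code, each iteration writes its own file
end

# With a reducer, @distributed blocks on its own and combines the results:
total = @distributed (+) for i in 1:8
    simulate(i)
end
println(total)   # 1 + 4 + ... + 64 = 204
```

So if each iteration only has side effects (writing files), @sync @distributed and pmap both get the job done; the practical difference is scheduling, since @distributed pre-partitions the range statically while pmap hands out work dynamically.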

I’ve also tried nesting another parallel call inside the simulation loop (e.g., @sync @distributed for j = 1:nsim), but this doesn’t seem to provide any speedup.

Are there any obvious changes that can speed things up? Is there anything I’m doing “wrong” here?

EDIT: Just to add, the computation time for these operations is not consistent across rows of my data. That is, some rows take a second to complete the whole run_simulation block, and some take 60 seconds, depending on the attributes of that specific row. My goal is to minimize downtime, hence putting the pmap at the outermost level.
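On the uneven timings: by default pmap hands out one task at a time, so a worker that finishes a fast row immediately picks up the next one, which is exactly the load balancing you want here. A small self-contained sketch (per-row costs faked with sleep; `fake_row` is just an illustration, not your function):

```julia
using Distributed
addprocs(4)

@everywhere function fake_row(i)
    sleep(0.1 * (i % 3))   # simulate rows that take very different times
    return i
end

# pmap schedules dynamically: a worker finishing a fast row is handed the
# next row right away, so slow rows do not stall the whole batch.
results = pmap(fake_row, 1:12)
println(results)   # results come back in input order: [1, 2, ..., 12]
```

A static split (as @distributed uses) can leave workers idle if one chunk happens to contain all the slow rows, so for highly variable per-row times pmap at the outermost level is a reasonable choice.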

pmap is extremely useful when your function has a long compute time (to offset the overhead). For example, I have a function foo that takes about 5 seconds to run. I usually run 500 of these separately (on a cluster) by using pmap.

I don’t have much experience with the other constructs like @sync and @distributed, but they are lower-level primitives. You use them when you want to parallelize a for loop without incurring too much overhead. For parallelizing loops in a shared-memory environment, Julia 1.3 now has support for proper composable multithreading (Threads.@spawn), so that is probably what you want.
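For completeness, a minimal sketch of the shared-memory threaded loop mentioned above (start Julia with multiple threads, e.g. via the JULIA_NUM_THREADS environment variable; the rand() call here is just a stand-in for one null-model draw):

```julia
using Base.Threads

nsim = 10_000
results = zeros(nsim)

# Each iteration writes to its own slot of `results`, so no locking is needed.
@threads for j in 1:nsim
    results[j] = rand()   # stand-in for one null-model draw
end

println(length(results))   # 10000 draws, filled in by the available threads
```

Since all threads share memory, this avoids the serialization overhead that makes a nested @distributed inner loop unprofitable, though with only a few threads the outer pmap is still doing most of the work.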

I hope someone else chimes in and explains it a bit more.