Distributed parallel loops

I am learning distributed parallel programming in Julia and would like to clarify a few beginner questions about the example below.

using Distributed

addprocs(4)

@everywhere function foo(n)
    sleep(0.001)
    return randn(10)
end

function inner_loop(func, n_times)
    # Call func(10) n_times in parallel; the (+) reducer sums the returned vectors.
    return @distributed (+) for var = fill(10, n_times)
        func(var)
    end
end

function main(n_outer_loops)
    out = zeros(10)
    for i = 1:n_outer_loops
        out += inner_loop(foo, 10)
    end
    return out
end

r = main(4)

My inner loop runs in parallel via the @distributed macro with the (+) reduction. Does the reduction wait until func has completed on every worker and only then add the results together?
For example: Worker 1 completes func, … Worker N completes func. Afterward, everything is added to out.

Or does it add each result to the solution as soon as func completes on a worker? That is: Worker 1 completes func and adds its result to out, … Worker N completes func and adds its result to out.
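To make the question concrete, here is a minimal sketch of the behaviour as I understand it from the Distributed documentation: with a reducer, each worker reduces its own chunk locally, the final reduction happens on the calling process, and the call blocks until that is done. The worker count of 2 and the integer loop body are my own assumptions for illustration, not part of my real code.

```julia
using Distributed

# Assumption: start 2 local workers only if none exist, so the snippet can be re-run.
nprocs() == 1 && addprocs(2)

# With a reducer, @distributed blocks until all chunks are finished.
# Each worker folds its own chunk with (+); the caller then folds the partial sums.
total = @distributed (+) for i = 1:100
    i
end
println(total)  # 5050

# Without a reducer, @distributed returns immediately; @sync waits for completion.
@sync @distributed for i = 1:4
    sleep(0.01)
end
```

If this reading is right, the behaviour sits between my two descriptions: results are combined per worker as they are produced, and the per-worker partial results are combined at the end.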

Since only the inner loop is parallelized, is this a suitable implementation? Or does Julia set up and tear down the parallel machinery on every iteration of the outer loop, incurring non-trivial overhead? If so, can it be set up once, with the jobs then assigned inside inner_loop?
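As a rough check of the overhead question, the sketch below assumes that the worker processes started by addprocs stay alive between calls, so each @distributed call only ships the loop body to the existing workers. It also shows one possible alternative I am considering, which fuses the outer and inner loops into a single @distributed call. The name main_flat and the worker count are hypothetical.

```julia
using Distributed

# Assumption: start 2 local workers only if none exist.
nprocs() == 1 && addprocs(2)

@everywhere function foo(n)
    sleep(0.001)
    return randn(10)
end

# The set of worker ids should be unchanged by repeated @distributed calls.
before = workers()
for outer = 1:3
    @distributed (+) for i = 1:10
        i
    end
end
after = workers()
println(before == after)

# Hypothetical alternative: one @distributed call covering all outer * inner iterations.
function main_flat(n_outer, n_inner)
    return @distributed (+) for k = 1:(n_outer * n_inner)
        foo(10)
    end
end

r_flat = main_flat(4, 10)
```

The fused version only applies when the outer iterations are independent of each other, which is the case in my example but may not hold in general.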

I have read that @distributed splits the iterations evenly across the workers, whereas pmap can do some load balancing. However, pmap returns a vector of results after each call to inner_loop, which I do not need. Is it possible to use pmap in the same way as @distributed, so that instead of returning a vector it adds each result to the out variable?
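The closest thing I have come up with so far is the sketch below, which lets pmap distribute the calls and then folds the returned vector with reduce on the calling process. The name inner_loop_pmap is hypothetical, and this still builds the temporary vector of results, so it is only a partial answer to what I am asking.

```julia
using Distributed

# Assumption: start 2 local workers only if none exist.
nprocs() == 1 && addprocs(2)

@everywhere function foo(n)
    sleep(0.001)
    return randn(10)
end

# pmap hands items to whichever worker is free, which gives the load balancing.
# The results come back as a vector, which the caller then sums with (+).
function inner_loop_pmap(func, n_times)
    results = pmap(func, fill(10, n_times))
    return reduce(+, results)
end

out = inner_loop_pmap(foo, 10)
```

This could replace inner_loop in main without other changes, since it also returns a single length-10 vector.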