Mixing vectorization and parallelization


I have a Monte Carlo computation that involves computing an ensemble average of random vectors. If I were to compute this using vectorization, it would look like:

Ybar = zeros(d);
for j = 1:nsamples
    Ybar .+= f(randn()) ./ nsamples
end
where f is some vector-valued function. In contrast, using shared memory parallelization, I would do

Ybar = @parallel (+) for j = 1:nsamples
    f(randn())/nsamples
end

Do people have any recommendations for how to maximize the efficiency of these kinds of computations?


If the operation is such that you can calculate statistics for parts of the data and then merge them (e.g. a mean), then you can do that.


That’s not shared-memory parallelism; @parallel distributes the work across separate worker processes. For shared memory you would want Threads.@threads, storing an intermediate sum for each thread and then summing the intermediates.

The two code snippets above also give different results: the first returns a vector and the second returns a number, so I don’t see how the comparison makes sense. See How to add arrays using multithreading? for doing the second example’s job with threading.
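For reference, the per-thread-accumulator approach described above might look like the sketch below. The vector-valued f and the helper ensemble_mean are made up for illustration; indexing the accumulators by threadid() assumes the loop body does not yield mid-iteration, which holds for this simple body.

```julia
using Base.Threads

# Hypothetical vector-valued f, standing in for the original.
f(x) = [x, x^2]

function ensemble_mean(f, d, nsamples)
    # One partial-sum vector per thread, so threads never write to the same buffer.
    partials = [zeros(d) for _ in 1:nthreads()]
    @threads for j in 1:nsamples
        partials[threadid()] .+= f(randn())
    end
    # Merge the per-thread partial sums, then normalize.
    sum(partials) ./ nsamples
end
```

With f as above, ensemble_mean(f, 2, nsamples) approaches [0, 1] as nsamples grows, since E[x] = 0 and E[x^2] = 1 for x ~ N(0, 1).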


So I see that this is not actually shared-memory parallelism; I was really just referring to working on a shared-memory machine.
Regarding your second point: if f:\mathbb{R}\to \mathbb{R}^d, is there no way to get @parallel (+) to do the desired reduction?


Oh, I misunderstood. If f returns a vector, then it should just work as expected, and both examples would be doing the same thing. Broadcasting helps with doing in-place operations to avoid allocations, but to do a reduction, data has to be sent between processes anyway, and I believe it gets allocated in the receiving process’s memory on arrival. So are you asking whether you can avoid allocating an output vector while summing the individual results after they are received? Try defining an in-place function that wraps + and use that in the reduction instead: it can overwrite its first argument and then return it, avoiding allocation of the output. The arguments to the reduction are not used after the reduction, so I think this should be fine.
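A sketch of the in-place reduction idea, assuming a recent Julia where @parallel has been renamed @distributed (the names f and add! are made up for illustration):

```julia
using Distributed
addprocs(2)  # spawn two worker processes; adjust to taste

# Hypothetical vector-valued f, standing in for the original.
@everywhere f(x) = [x, x^2]

# In-place wrapper around +: overwrites its first argument and returns it,
# so the reduction reuses one buffer instead of allocating a new vector
# at every step.
@everywhere add!(a, b) = (a .+= b; a)

nsamples = 100_000
Ybar = @distributed (add!) for j in 1:nsamples
    f(randn()) ./ nsamples
end
```

Mutating the first argument is safe here because each value fed to the reducer is a freshly received (or freshly computed) array that is not used again afterwards.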


Yeah, I was basically asking what the “clever” way is to do a @parallel (+) reduction when the reduced variable is vector- or array-valued.