I have a Monte Carlo computation that involves computing an ensemble average of random vectors. If I were to compute this using vectorization, it would look like:
Ybar = zeros(d); for j = 1:nsamples; Ybar .+= f(randn()) ./ nsamples; end
Here f is some vector-valued function. In contrast, using shared-memory parallelization, I would do
Ybar = @parallel (+) for j = 1:nsamples; f(randn())/nsamples; end
Do people have any recommendations for how to maximize the efficiency of these kinds of computations?
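For context, here is a self-contained sketch of both variants, using a stand-in `f(x) = [x, x^2]` (the real `f` could be anything vector-valued). It avoids allocating a fresh vector for the running average on every iteration in the serial case, and in the threaded case accumulates into per-thread partial sums that are combined once at the end; the `Threads.@threads`/`threadid()` pattern assumes one accumulator per thread and is only a sketch, not necessarily the fastest approach:

```julia
using Base.Threads

f(x) = [x, x^2]  # hypothetical stand-in for the actual vector-valued f

# Serial accumulation: divide by nsamples once at the end
# instead of on every iteration.
function mc_mean_serial(f, d, nsamples)
    Ybar = zeros(d)
    for _ in 1:nsamples
        Ybar .+= f(randn())
    end
    Ybar ./ nsamples
end

# Shared-memory version: each thread adds into its own buffer,
# so there is no write contention on a single shared Ybar.
function mc_mean_threaded(f, d, nsamples)
    partials = [zeros(d) for _ in 1:nthreads()]
    @threads for j in 1:nsamples
        partials[threadid()] .+= f(randn())
    end
    sum(partials) ./ nsamples
end
```

With `f(x) = [x, x^2]`, both estimators should approach `[0, 1]` (the first two moments of a standard normal) as `nsamples` grows. Since `f` allocates a small vector per sample here, a further step would be an in-place `f!(buf, x)` writing into a preallocated buffer.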