I have a Monte Carlo computation that involves computing an ensemble average of random vectors. Computed serially, with broadcasted updates, it would look like:

```
Ybar = zeros(d);
for j = 1:nsamples
    @. Ybar += f(randn())/nsamples;
end
```

where `f` is some vector-valued function. In contrast, using a parallel reduction with `@parallel`, I would do

```
Ybar = @parallel (+) for j = 1:nsamples
    f(randn())/nsamples
end
end
```
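A third option I have considered is multithreading with per-thread partial sums, so no lock is needed on the accumulator. This is only a sketch under assumptions: `f(x) = [x, x^2]` is a hypothetical stand-in for my actual vector-valued function, and `d` and `nsamples` are placeholder sizes.

```julia
using Base.Threads

f(x) = [x, x^2]      # hypothetical stand-in for the real f
d = 2                # dimension of the output vectors
nsamples = 1_000_000

# One accumulator per thread avoids a data race on a shared Ybar;
# each thread only ever writes to its own entry of `partials`.
partials = [zeros(d) for _ in 1:nthreads()]
@threads for j in 1:nsamples
    partials[threadid()] .+= f(randn())
end
# Combine the partial sums and divide once at the end,
# rather than dividing inside the loop on every iteration.
Ybar = sum(partials) ./ nsamples
```

Would something like this be expected to beat the `@parallel` reduction for cheap `f`, given that it avoids inter-process communication?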

Do people have any recommendations for how to maximize the efficiency of these kinds of computations?