Mixing vectorization and parallelization

gideonsimpson · March 18, 2018, 10:06pm

I have a Monte Carlo computation that involves computing an ensemble average of random vectors. If I were to compute this using vectorization, it would look like:

Ybar = zeros(d);
for j = 1:nsamples
    @. Ybar += f(randn())/nsamples;
end

where f is some vector valued function. In contrast, using shared memory parallization, I would do

Ybar = @parallel(+) for j=1:nsamples
    f(randn())/nsamples;
end

Do people have any recommendations for how to maximize the efficiency of these kinds of computations?

Tamas_Papp · March 20, 2018, 12:29pm

If the operation is such that you can calculate statistics for parts of the data then merge (eg a mean), then you can do that. This may help:

mohamed82008 · March 20, 2018, 1:53pm

That’s not shared memory parallelism. You would want to use Threads.@threads, storing an intermediate sum for each thread then summing the intermediates.

Also the above 2 code snippets give different results. The first returns a vector and the second returns a number, so I don’t see how the comparison makes sense. See How to add arrays using multithreading? for doing the second example’s job with threading.

gideonsimpson · March 20, 2018, 5:03pm

So I see that this is not actually shared memory parallelism. I was really just referring to working on a shared memory machine.
Regarding your second point, if f(x):\mathbb{R}\to \mathbb{R}^d, is there no way to get @parallel(+) to do the desired reduction?

mohamed82008 · March 21, 2018, 12:23am

Oh I misunderstood. If f returns a vector, then it should just work as expected, and then both examples would be doing the same. Broadcasting helps with doing inplace operations to avoid allocations, but to do reduction, data has to be sent between processors anyways so when it is received I believe it gets allocated in the respective processor’s memory. So are you asking if you can avoid allocating an output vector while summing the individual results after they get received? Try defining an inplace function that wraps + and use that in the reduction instead. It can overwrite the first argument and then return it, so as to avoid allocation of the output. The arguments to the reduction are not used after reduction so I think it should be fine.

gideonsimpson · March 23, 2018, 4:18pm

Yea, I was basically asking what was the “clever” way to do a parallel(+) reduction when the reduced variable was vector/array valued.

Topic		Replies	Views
Parallelizing Monte Carlo - using buffers, avoiding global variables New to Julia hpc , parallel , distributed , sharedarrays , simulations	3	754	August 26, 2022
Vectorize and parallelize wsample function New to Julia	5	1201	August 21, 2019
Parallel version of code not giving the right answer New to Julia	2	325	November 21, 2020
Multithreading and Monte Carlo in 1.7 General Usage multithreading , random	12	906	December 31, 2021
Some Questions about Parallelization New to Julia parallel	8	1031	March 18, 2018

Mixing vectorization and parallelization

Related topics