Running simple loop in parallel

parallel

#1

Hi,
I am trying to do some simple computations inside a single or double loop, and I would like to parallelize them, but so far I haven’t been able to figure it out. Here is a simple example of what I want to do:


using Optim

const n_z = 5
const n_b = 5
const J = 60
const Ht_grid = [1.17; 1.50; 1.92]
const n_ht = length(Ht_grid)
const B_grid = [0.0;  0.00356596; 0.0285277; 0.096281; 0.228222]
const pension_grid = [0.280896; 0.309223; 0.340406; 0.374734; 0.412523]

const vvv_opt_r =     zeros(n_z,n_b,J);
const policy_opt_r =  zeros(2,n_z,n_b,J);


function test(x, y, z, w)
    return x + y + z + w
end

for i_z in 1:n_z, i_b in 1:n_b
    y = pension_grid[i_z]
    b = B_grid[i_b]
    bprime_opt_r = zeros(n_ht)
    vvv_r = zeros(n_ht)
    for i_ht in 1:n_ht
        htilde = Ht_grid[i_ht]
        fopt(x) = test(x, b, y, htilde)
        # univariate minimization of fopt on [0, maximum(B_grid)]
        res = optimize(fopt, 0.0, maximum(B_grid))
        bprime_opt_r[i_ht] = Optim.minimizer(res)
        vvv_r[i_ht] = -Optim.minimum(res)
    end

    vvv_opt_r[i_z, i_b, J], idx = findmax(vvv_r)
    policy_opt_r[1, i_z, i_b, J] = bprime_opt_r[idx]
    policy_opt_r[2, i_z, i_b, J] = idx
end

Essentially the code loops over i_z and i_b, calculates something for each value of i_z and i_b (independently), and saves the results in vvv_opt_r, policy_opt_r and c_opt_r.
All the functions (value_age_J_r), arrays and constants (vvv_opt_r, policy_opt_r and c_opt_r, qb, current_rent, etc.) are predefined in a different file that I load with the include("path") command, as well as the packages (such as QuantEcon).

Ideally I would like to loop over both i_z and i_b in parallel and save the results to vvv_opt_r, etc., as they come.

Thanks for the help,
Myroslav


#2

Please provide a minimal working example, and quote your code


#3

done


#4

Would simple @threads as described in the manual not work? It seems you wouldn’t run into race conditions/your code seems threadsafe.
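A minimal sketch of what this could look like, assuming the constants and output arrays from the MWE are in scope (the loop body here is just a placeholder for the real computation):

```julia
# Threads.@threads splits the iterations of the annotated loop across
# the threads Julia was started with (JULIA_NUM_THREADS).
Threads.@threads for i_z in 1:n_z
    for i_b in 1:n_b
        # Each (i_z, i_b) pair writes to a distinct slice of the output
        # arrays, so the iterations are independent and thread-safe.
        vvv_opt_r[i_z, i_b, J] = pension_grid[i_z] + B_grid[i_b]
    end
end
```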


#5

You should start by putting the main code in a function.
You should be able to use DistributedArrays.jl for the parallelization.
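As a rough sketch of the multiprocessing route, the simplest starting point is Distributed’s pmap rather than DistributedArrays itself (the work function here is a placeholder):

```julia
using Distributed
addprocs(4)  # start 4 worker processes

# The work function must be defined on every worker.
@everywhere work(i_z, i_b) = i_z + i_b  # placeholder for the real computation

# pmap farms the independent (i_z, i_b) evaluations out to the workers
# and collects the results on the master process.
results = pmap(t -> work(t...), [(i_z, i_b) for i_z in 1:5, i_b in 1:5])
```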


#6

This is one of the domains that I am still very unclear on, so maybe you can help me with this. At what point does it become better to use things like DistributedArrays and multiprocessing-style parallelism over shared-memory parallelism? Personally I haven’t found a use case for these things yet, but admittedly I’m normally not dealing with GBs to TBs worth of data, nor have I found the need to parallelize over multiple cluster nodes yet.

But does it all just come down to data size, or the necessity to parallelize over nodes? A lot of the example code seems to be on semi-trivial array sizes, and from personal experience shared-memory threading seems to do a lot better in most cases. Has this changed drastically in the last half year? (That was the last time I tried the multiprocessing stuff.)


#7

You can benchmark both. I’ve found it has changed as Julia has developed and also depends on the specific problem. I didn’t take (have) the time to figure out exactly why. But, it might be easy to test a simplified version of your problem.


#8

Is there any chance you can show how it’s done? Either on my example or on a simpler one? Thanks


#9
Threads.@threads for i_z in 1:n_z
    for i_b in 1:n_b
        ...
    end
end

This should work; if I remember correctly, nested threading works. If not, you might need to squash the two indices into one.
Oh, and remember to run export JULIA_NUM_THREADS=... before launching Julia.
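If squashing the two indices into one does turn out to be necessary, one way is to thread over a flattened index (a sketch, with the loop body elided):

```julia
idxs = CartesianIndices((n_z, n_b))  # all (i_z, i_b) pairs
Threads.@threads for k in 1:length(idxs)
    i_z, i_b = Tuple(idxs[k])
    # ... original loop body for this (i_z, i_b) pair ...
end
```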


#10

I’m not sure what that does.

But threading only the outer loop is usually better. I tried threading the OP’s code with BenchmarkTools and found no difference in speed using all four cores. You need something in the loop that uses more CPU time per iteration.


#11

Wrap the code in a function so that you can benchmark it:

function runexample()
    Threads.@threads for i_z in 1:n_z
        for i_b in 1:n_b
            y = pension_grid[i_z]
   ...

Then do

julia> using BenchmarkTools;
julia> @btime runexample()
  4.274 ms (202029 allocations: 3.12 MiB)

Try with and without threading, and threading on different loops.

EDIT: But make the arrays bigger. Also, you are allocating in the inner loop, which may be important. You can allocate the arrays outside the loops, one set per thread, and index them by threadid (look up how to get this in the docs). Then you can zero them before each iteration rather than reallocating. That takes time and adds complexity, so I’m not sure it’s worth it. You might want to try distributed parallelism first.
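A sketch of the per-thread preallocation idea (the function and buffer names here are made up; the loop body stands in for the MWE’s computation):

```julia
function runexample_prealloc()
    # One scratch buffer per thread, indexed by threadid(), so threads
    # never share a buffer.
    bufs = [zeros(n_ht) for _ in 1:Threads.nthreads()]
    Threads.@threads for i_z in 1:n_z
        vvv_r = bufs[Threads.threadid()]
        for i_b in 1:n_b
            fill!(vvv_r, 0.0)  # reset in place instead of reallocating
            # ... fill vvv_r and reduce into the output arrays ...
        end
    end
end
```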


#12

It doesn’t work, or it doesn’t provide any speedup? I assume that in the actual code a bit more is happening than in the MWE. I agree that you should usually thread only the outer loop; I just saw that it was only 5 iterations, so I thought that maybe doing the full 10 would give at least a bit more reason to thread :smiley:.

On the outer-loop threading: in situations where one runs a lot of matrix operations that rely on BLAS, it might be better to lower the number of outer threads in favor of the number of threads used by BLAS (I think BLAS.set_num_threads(nthreads) is the way to set those). Some testing on the best ratio might yield minor speedups. Also, when using the full number of threads for the outer loop, be sure to set the BLAS thread count to 1, otherwise the threads might interfere and severely impact performance (this happened to me a couple of times).
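For reference, disabling BLAS threading before a fully threaded outer loop looks like this:

```julia
using LinearAlgebra

# Give all cores to the @threads outer loop and stop BLAS from
# spawning its own threads underneath it.
BLAS.set_num_threads(1)
```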


#13

Threading on the outer loop runs the inner loop in separate threads “automatically” so-to-speak, just because the entire inner loop is running in separate threads.

EDIT:
If you have enough memory for multiple copies of data that varies with iterations, it is better to disable BLAS threading and thread at the outermost loop possible. This is general advice from my experience. But, it depends on details of your problem. If you are operating on huge matrices with builtin functions that consume significant time, then threading only via BLAS might be better.

Ha, Discourse is showing your response before I am done. I see your point. But it would take more work to get 25 threads (and this involves yet another parameter: how many cores do you have? And will hyperthreading work well or not? This may depend on the chipset and the details of the computation).


#14

Right, that’s true. The way I wrote it would mean that the inner loop itself would also get multithreaded. Imagine that the code runs on a 25-core machine: instead of using only 5 threads (since in the MWE n_z = 5), each of the 5 threads would spawn another 5 to solve the inner loop.

But I just checked base/threadingconstructs.jl, and it turns out that’s not what would happen, so let me change my post. :smile:


#15

Also, you are violating the tip Access arrays in memory order, along columns. Although in this case, it probably will not make any difference.
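For concreteness, Julia arrays are column-major, so the first index should vary fastest:

```julia
A = zeros(1000, 1000)
# Good: the inner loop walks down a column, i.e. contiguous memory.
for j in 1:size(A, 2), i in 1:size(A, 1)
    A[i, j] = i + j
end
```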


#16

Thanks for the advice, guys. I will try the benchmarking and see what it does.
And yes, my actual code will do more computationally intensive stuff inside the loop, so this might speed up the overall code quite a bit!


#17

And regarding the number of threads: it’s at most equal to the number of cores, right? There is no way I can run 8 threads at the same time on a 4-core machine?


#18

You can, but this would completely trash the performance of your code (unless your CPU has hyperthreading, but even that would probably not yield any improvement).


#19

I find that hyperthreading does lead to some improvement for most things. If it’s all BLAS or hardcore-optimized math then no, but that usually isn’t the case. I tend to see a 10–30% improvement.


#20

I had a similar question, and wrote up the answers (various parallelizing choices) in

http://cookbook.tips/!julia/doku.php?id=parallel

hope this helps.

regards,

/iaw