Fastest way to run a for loop in parallel

I am a new Julia user and I am trying to run a for loop in parallel using Threads.@threads, but I don’t know if this is the fastest way to do it.
The for loop is:

    res = zeros(length(posicao),2)
    Threads.@threads for i = 1:length(posicao)
        res[i,:] = resolve_prob_individuo(precos,posicao[i],r_emprestimo)
    end
    return sum(res,dims=1)

The function resolve_prob_individuo returns a vector of floats of size 2. I am interested only in the sum of these vectors. Each call to resolve_prob_individuo takes ~1.4 seconds, and the vector posicao has length 36, so this loop takes ~12 seconds. I am running this on a Ryzen 3600 with 12 threads. Is there any way to improve this?
I will run this loop thousands of times, so even a small improvement could make a huge difference.


It might not help much but you could try:

    # One atomic accumulator per component, so the result stays a length-2 sum
    totals = (Threads.Atomic{Float64}(0.0), Threads.Atomic{Float64}(0.0))
    Threads.@threads for i = 1:length(posicao)
        res = resolve_prob_individuo(precos, posicao[i], r_emprestimo)
        Threads.atomic_add!(totals[1], res[1])
        Threads.atomic_add!(totals[2], res[2])
    end
    return [totals[1][], totals[2][]]

It reduces memory usage somewhat and moves the addition into the threads. There may be some contention on the atomic operations, but with a single CPU (multiple cores) you probably won’t feel it.

Just for giggles you might also try:

    totals = (Threads.Atomic{Float64}(0.0), Threads.Atomic{Float64}(0.0))
    @sync begin
        for i = 1:length(posicao)
            local j = i
            Threads.@spawn begin
                res = resolve_prob_individuo(precos, posicao[j], r_emprestimo)
                Threads.atomic_add!(totals[1], res[1])
                Threads.atomic_add!(totals[2], res[2])
            end
        end
    end
    return [totals[1][], totals[2][]]

(The intermediate variable ‘j’ is not strictly needed: Julia’s for loop creates a fresh binding of i on each iteration, so capturing ‘i’ directly in the @spawn body is safe.)

If the time to execute resolve_prob_individuo varies wildly then this might show better performance.
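If atomics feel awkward, the same spawn-per-item idea can be written without them by collecting the tasks and summing their fetched results, which also keeps the result as a length-2 vector. A minimal sketch, with resolve_prob_individuo replaced by a cheap stand-in so it runs on its own:

```julia
# Stand-in for the real (expensive) function; returns a length-2 vector.
resolve_prob_individuo(precos, p, r) = [p * precos, p * r]

function sum_spawned(precos, posicao, r_emprestimo)
    # One task per item; @spawn gives dynamic load balancing for free.
    tasks = [Threads.@spawn resolve_prob_individuo(precos, p, r_emprestimo) for p in posicao]
    # Element-wise sum of the length-2 results.
    return sum(fetch, tasks)
end

sum_spawned(2.0, [1.0, 2.0, 3.0], 0.5)  # == [12.0, 3.0]
```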

Lastly, you might consider updating resolve_prob_individuo to return a Tuple instead of a 2-element array. A Tuple is stack-allocated, so there would be no allocations for the result instead of two.
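For instance (the function body here is just a stand-in to illustrate the return type):

```julia
# Returning a Tuple of Float64 keeps the result on the stack: no heap
# allocation per call, unlike a 2-element Vector.
resolve_prob_individuo(precos, p, r) = (p * precos, p * r)

# Tuples can still be combined element-wise with broadcasting:
a = resolve_prob_individuo(2.0, 1.0, 0.5)  # (2.0, 0.5)
b = resolve_prob_individuo(2.0, 2.0, 0.5)  # (4.0, 1.0)
a .+ b                                     # == (6.0, 1.5)
```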


You can look at the load on your processor. If it struggles, I have found that using ThreadsX.map! sometimes helps.

If you have to run the loop thousands of times, consider parallelizing not the loop itself but those multiple runs, provided, of course, that they do not depend sequentially on one another.

Specifically about that loop: you will get better scaling if you split the loop in two, one over the number of threads, the other over the number of operations per thread. Something like:

nthreads = Threads.nthreads()
result = zeros(nthreads)
n_per_thread = cld((number of calculations), nthreads)  # ceiling division
Threads.@threads for it in 1:nthreads
    first = (it - 1) * n_per_thread + 1
    last = min(first + n_per_thread - 1, (number of calculations))
    for i in first:last
        result[it] += ...
    end
end
sum(result)

Of course you would have to tune the details.
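Filled in with a cheap stand-in for the per-item work, the chunked pattern might look like this (all names here are illustrative, not from the original code):

```julia
# Stand-in for the expensive per-item computation.
work(p) = (p, 2p)

function chunked_sum(items)
    nt = Threads.nthreads()
    n = length(items)
    per = cld(n, nt)                  # items per chunk, rounded up
    partial = zeros(nt)               # one accumulator slot per chunk
    Threads.@threads for it in 1:nt
        lo = (it - 1) * per + 1
        hi = min(lo + per - 1, n)     # last chunk may be shorter (or empty)
        for i in lo:hi
            a, b = work(items[i])
            partial[it] += a + b      # accumulate into this chunk's slot
        end
    end
    return sum(partial)
end

chunked_sum(1:4)  # == sum(p + 2p for p in 1:4) == 30
```

Indexing `partial` by the chunk index `it` (the loop variable) rather than by the thread id keeps each chunk's accumulation independent, so no atomics are needed.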

(If you have spare time, you might want to take a look at this class, where I discuss the parallelization of the calculation of the potential energy between particles; the principles are the same. It is quite basic, and in Portuguese: https://youtu.be/V70tvYdv8QY)


I’m a beginner to Julia, but I’ve parallelized some of my code recently, here are some things I’ve found helpful.

First, try logging your threads using ThreadPools.jl (GitHub: tro3/ThreadPools.jl, docs at https://tro3.github.io/ThreadPools.jl). This will give you a good picture of whether your threads take roughly the same amount of time, and how many are actually running. It will also help tell you whether most of the time is spent actually running the threads.

Second, try running `top` on the command line while your job is running. This will give you similar information but help separate overhead from actual processing.
