I am a new user of Julia and I am trying to run a for loop in parallel using Threads, but I don't know if this is the fastest way to do it.
The for loop is:

res = zeros(length(posicao), 2)
Threads.@threads for i = 1:length(posicao)
    res[i, :] = resolve_prob_individuo(precos, posicao[i], r_emprestimo)
end
return sum(res, dims=1)

The function resolve_prob_individuo returns a vector of two floats. I am only interested in the sum of these vectors. Each call to resolve_prob_individuo takes ~1.4 seconds. The vector posicao has length 36. This for loop takes ~12 seconds. I am running it on a Ryzen 3600 with 12 threads. Is there any way to improve this?
I will run this for loop thousands of times, so any small improvement could make a huge difference.

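# One atomic accumulator per component, so the two sums stay separate
total1 = Threads.Atomic{Float64}(0.0)
total2 = Threads.Atomic{Float64}(0.0)
Threads.@threads for i = 1:length(posicao)
    res = resolve_prob_individuo(precos, posicao[i], r_emprestimo)
    Threads.atomic_add!(total1, res[1])
    Threads.atomic_add!(total2, res[2])
end
return (total1[], total2[])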

This reduces the memory usage somewhat and moves the addition into the threads. There may be some contention on the atomic operations, but with a single CPU (multiple cores) you probably won't feel it.

Just for giggles you might also try:

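# One atomic accumulator per component, so the two sums stay separate
total1 = Threads.Atomic{Float64}(0.0)
total2 = Threads.Atomic{Float64}(0.0)
@sync begin
    for i = 1:length(posicao)
        local j = i
        # Each call becomes its own task; @sync waits for all of them
        Threads.@spawn begin
            res = resolve_prob_individuo(precos, posicao[j], r_emprestimo)
            Threads.atomic_add!(total1, res[1])
            Threads.atomic_add!(total2, res[2])
        end
    end
end
return (total1[], total2[])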

(The intermediate variable 'j' is probably not needed: a Julia for loop creates a fresh binding of 'i' on each iteration, so using 'i' directly inside the @spawn should be fine.)

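If the time to execute resolve_prob_individuo varies widely between calls, this might show better performance, because each call becomes its own task that the scheduler can run on whichever thread is free.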

Lastly, you might consider changing resolve_prob_individuo to return a Tuple instead of a 2-element array. I believe that with a Tuple there would be no memory allocations for the result, instead of 2 allocations.
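
A minimal sketch of the difference, using two hypothetical stand-ins (resultado_vetor and resultado_tupla, which are not from the original code) for a function that returns two floats. How many allocations you actually save depends on the Julia version and on what the real function does internally, so it is worth checking with @allocated or @time rather than taking it for granted.

```julia
# Hypothetical stand-ins for resolve_prob_individuo; only the return type differs.
resultado_vetor(x) = [x, 2x]   # builds a heap-allocated Vector{Float64}
resultado_tupla(x) = (x, 2x)   # returns an immutable Tuple{Float64,Float64}

# Sum both components over many calls, keeping the running totals in a tuple.
function soma_tuplas(xs)
    total = (0.0, 0.0)
    for x in xs
        total = total .+ resultado_tupla(x)  # element-wise tuple addition
    end
    return total
end

xs = collect(1.0:36.0)
soma = soma_tuplas(xs)
```

The caller then reads the two sums as soma[1] and soma[2], just as it would index the array version.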

If you have to run the loop thousands of times, you should consider parallelizing not the loop itself, but those multiple runs. Of course, this only works if each run does not depend on the result of the previous one.
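
A rough sketch of that idea, assuming the runs really are independent. The function executa_rodada is a hypothetical placeholder for one complete run (the whole inner loop, executed serially), and executa_todas is just an illustrative name.

```julia
# Hypothetical stand-in for one complete, independent run.
# Replace the body with the real serial loop over posicao.
function executa_rodada(k)
    s1, s2 = 0.0, 0.0
    for i in 1:36
        s1 += k * i
        s2 += k + i
    end
    return (s1, s2)
end

function executa_todas(n_rodadas)
    resultados = Vector{Tuple{Float64,Float64}}(undef, n_rodadas)
    # Each iteration writes only to its own slot, so no locking is needed.
    Threads.@threads for k in 1:n_rodadas
        resultados[k] = executa_rodada(k)
    end
    return resultados
end

resultados = executa_todas(100)
```

With thousands of runs there is far more work to spread over the threads than the 36 iterations of a single loop, so the load tends to balance better.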

Specifically about that loop: you will get better scaling if you split the loop in two, an outer loop over the number of threads and an inner loop over the operations assigned to each thread. Something like:

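nthreads = Threads.nthreads()
n = length(posicao)
n_per_thread = cld(n, nthreads)  # ceiling division, so every index is covered
result = zeros(nthreads, 2)      # one row of partial sums per chunk
Threads.@threads for it in 1:nthreads
    i_first = (it - 1) * n_per_thread + 1
    i_last = min(it * n_per_thread, n)  # the last chunk may be shorter
    for i in i_first:i_last
        res = resolve_prob_individuo(precos, posicao[i], r_emprestimo)
        result[it, 1] += res[1]
        result[it, 2] += res[2]
    end
end
sum(result, dims=1)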

Of course you would have to tune the details.
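
One possible variation on the same chunking idea, sketched with tasks instead of an index calculation: each chunk gets its own task that returns a partial sum, and the partial sums are combined at the end, so no atomics or shared array are needed. The names calcula and soma_paralela are hypothetical, and calcula only stands in for resolve_prob_individuo (assumed here to return a tuple of two floats).

```julia
# Hypothetical stand-in for resolve_prob_individuo, returning two floats.
calcula(x) = (x, 2x)

function soma_paralela(f, itens; n_blocos = Threads.nthreads())
    # Chunk size, rounded up; max(1, ...) guards against an empty input.
    tamanho = max(1, cld(length(itens), n_blocos))
    # One task per chunk; each task returns its own partial sum.
    tarefas = map(Iterators.partition(itens, tamanho)) do bloco
        Threads.@spawn begin
            parcial = (0.0, 0.0)
            for item in bloco
                parcial = parcial .+ f(item)
            end
            parcial
        end
    end
    # Wait for every task and combine the partial sums on the calling task.
    return reduce((a, b) -> a .+ b, fetch.(tarefas); init = (0.0, 0.0))
end

total = soma_paralela(calcula, collect(1.0:36.0))
```

Passing a larger n_blocos than the number of threads gives smaller chunks, which may help if the cost per call is uneven.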

(If you have spare time, you might want to take a look at this class, where I discuss the parallelization of the calculation of the potential energy between particles. The problem is different, but the principles are the same, and it is quite basic. In Portuguese: https://youtu.be/V70tvYdv8QY)

Second, try running top on the command line when running your job. This will give you similar information but help separate overhead from actual processing.