An update to this topic, which may be useful for someone reaching this thread.
Hardware
First, the hardware where I benchmark the code has this processor:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
with 16 cores (32 threads at most).
All the examples are run with JULIA_NUM_THREADS=8, which is well below the number of physical cores and hardware threads available.
This is important because I do not see the same level of fluctuation on my personal computer (a regular Intel laptop with 4 cores / 8 threads), so the behavior is somehow hardware-specific.
Julia version: 1.6.3
I don't know if anything changed in 1.7.0 that might affect this. In any case, I can reproduce what I had in 1.6.3. Some initial tests appear to indicate that 1.7.0 does behave differently, but I won't mix that in here for now.
1. Baseline
The typical output I had was something like this (the histogram, made with UnicodePlots, results from running the same calculation 30 times):
 [ 4.0, 6.0)  ┤▇▇ 1
 [ 6.0, 8.0)  ┤ 0
 [ 8.0, 10.0) ┤▇▇ 1
 [10.0, 12.0) ┤▇▇ 1
 [12.0, 14.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 20
 [14.0, 16.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇ 7
              └                                      ┘
                              Frequency
As one can see, sometimes (here once) I get a 5-second run in the exact same problem in which most frequently I get a 13-second run. The 5-second time is the expected one, considering the progression of times from smaller to larger systems and the dependence of the time on the size of the problem (number of particles in this case).
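For reference, the timing loop behind these histograms can be sketched like this; here run_calculation is a hypothetical stand-in for the actual computation being benchmarked, and the histogram call is plain UnicodePlots:

```julia
using UnicodePlots

# Hypothetical stand-in for the actual calculation being benchmarked
run_calculation() = sum(sqrt(i) for i in 1:10^6)

# Run 30 times and collect the wall-clock time of each run
times = Float64[]
for _ in 1:30
    push!(times, @elapsed run_calculation())
end

# Histogram of the run-time distribution, as shown above
println(histogram(times, nbins=6, xlabel="Frequency"))
```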
Important: in this and the following case, the multi-threaded loop is structured as:
@threads for it in 1:nthreads()
    for task in splitter(...) # the workload doesn't differ much
        # run task
    end
end
Thus, the number of batches is equal to the number of threads available (in this case, 8).
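As a self-contained sketch of this pattern: the splitter below is a hypothetical contiguous-range partitioner (the actual splitter in the code may differ), and the "task" is just a sum so the result can be checked:

```julia
using Base.Threads

# Hypothetical splitter: partitions 1:n into nchunks contiguous ranges
# of nearly equal length (the real splitter may differ)
function splitter(ichunk, nchunks, n)
    base, r = divrem(n, nchunks)
    len = base + (ichunk <= r ? 1 : 0)
    first = 1 + (ichunk - 1) * base + min(ichunk - 1, r)
    return first:(first + len - 1)
end

n = 1000
partial = zeros(nthreads())  # one accumulator per thread: no races
@threads for it in 1:nthreads()
    for i in splitter(it, nthreads(), n)  # one batch per thread
        partial[it] += i                  # "run task"
    end
end
@assert sum(partial) == n * (n + 1) ÷ 2
```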
2. JULIA_EXCLUSIVE=1
Start Julia with:
JULIA_EXCLUSIVE=1 julia-1.6.3/bin/julia -t 8
This has a great impact on the fluctuations and on the performance:
 [5.0, 5.5) ┤▇▇▇ 1
 [5.5, 6.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
 [6.0, 6.5) ┤ 0
 [6.5, 7.0) ┤▇▇▇▇▇▇▇▇ 3
 [7.0, 7.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12
            └                                      ┘
                            Frequency
Now we get much closer to the expected time in most runs. Still, there are fluctuations, and the time is often not ideal.
3. Using @spawn
Now I changed the loop structure to:
@sync for it in 1:nbatches
    @spawn for task in splitter()
        # run task
    end
end
The point here is that now I have the flexibility to increase the number of batches and use any free threads. Using nbatches=8 (thus nothing new for the moment), and NOT using JULIA_EXCLUSIVE=1, there is already an improvement relative to (1), though the fluctuations are huge:
 [ 4.0, 6.0)  ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
 [ 6.0, 8.0)  ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7
 [ 8.0, 10.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10
 [10.0, 12.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5
 [12.0, 14.0) ┤▇▇▇▇ 1
 [14.0, 16.0) ┤▇▇▇▇ 1
              └                                      ┘
                              Frequency
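The same sketch, now with the @sync/@spawn structure, decouples the number of batches from the number of threads; the scheduler assigns each spawned task to whatever thread is free (splitter is the same hypothetical range partitioner as before):

```julia
using Base.Threads

# Hypothetical contiguous-range partitioner (the real splitter may differ)
function splitter(ichunk, nchunks, n)
    base, r = divrem(n, nchunks)
    len = base + (ichunk <= r ? 1 : 0)
    first = 1 + (ichunk - 1) * base + min(ichunk - 1, r)
    return first:(first + len - 1)
end

n = 1000
nbatches = 32              # can now exceed nthreads()
partial = zeros(nbatches)  # one accumulator per batch: no races
@sync for it in 1:nbatches
    @spawn for i in splitter(it, nbatches, n)
        partial[it] += i   # "run task"
    end
end
@assert sum(partial) == n * (n + 1) ÷ 2
```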
4. Increasing nbatches
Increasing the number of batches (here to 32), without JULIA_EXCLUSIVE=1, provides a huge performance benefit in the average run:
 [4.0, 4.5) ┤▇▇▇▇▇▇▇▇ 4
 [4.5, 5.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18
 [5.0, 5.5) ┤▇▇▇▇ 2
 [5.5, 6.0) ┤ 0
 [6.0, 6.5) ┤▇▇▇▇▇▇▇▇ 4
 [6.5, 7.0) ┤▇▇▇▇ 2
            └                                      ┘
                            Frequency
The worst runs (those taking 14-16 seconds) disappear from the list.
5. Back with JULIA_EXCLUSIVE=1, with nbatches=8
With exclusive mode, but with a small number of batches, there is an improvement relative to (3), which shows that it is not the workload imbalance between threads that causes the longer times (14 s), but the uneven times obtained from the different threads.
 [4.0, 4.5) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13
 [4.5, 5.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9
 [5.0, 5.5) ┤ 0
 [5.5, 6.0) ┤ 0
 [6.0, 6.5) ┤▇▇▇▇▇▇ 2
 [6.5, 7.0) ┤▇▇▇▇▇▇▇▇ 3
 [7.0, 7.5) ┤▇▇▇▇▇▇▇▇ 3
            └                                      ┘
                            Frequency
6. JULIA_EXCLUSIVE=1 and nbatches=32
            ┌                                      ┐
 [4.0, 4.2) ┤▇▇ 1
 [4.2, 4.4) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 21
 [4.4, 4.6) ┤▇▇ 1
 [4.6, 4.8) ┤▇▇▇ 2
 [4.8, 5.0) ┤▇▇▇▇▇ 3
 [5.0, 5.2) ┤▇▇▇ 2
            └                                      ┘
                            Frequency
With @spawn, 32 batches, and JULIA_EXCLUSIVE=1, all runs fall within the "correct" time.
Conclusions
The original problem is associated with the uneven run time of each thread, but not because the workload differs much between tasks (the workload is, in fact, very homogeneous). For some unknown reason, some tasks sometimes take much longer to run, causing the overall slowdown in the original implementation. JULIA_EXCLUSIVE=1 greatly reduces that random uneven performance of the threads, so the problem is minimized. However, optimal performance is obtained only when that is combined with spawning more tasks than threads, so that tasks can run asynchronously on any available thread; that is achieved by reducing the size of the batches.
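The per-task unevenness can be observed directly by timing each spawned task; in this sketch (names are illustrative and the workload is a uniform dummy, not the actual calculation), a large max/min ratio between task times reveals the slow outliers:

```julia
using Base.Threads

# Spawn nbatches identical tasks and record each task's own elapsed time
function task_times(nbatches)
    times = zeros(nbatches)
    sums = zeros(nbatches)
    @sync for it in 1:nbatches
        @spawn begin
            t0 = time()
            s = 0.0
            for i in 1:10^6   # homogeneous workload in every batch
                s += sqrt(i)
            end
            sums[it] = s      # keep the result so the work isn't optimized away
            times[it] = time() - t0
        end
    end
    return times
end

t = task_times(8)
println("max/min task time ratio: ", maximum(t) / minimum(t))
```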
I do not observe all these variations in every hardware I have access to.
Concerning CellListMap, the package in question: version 0.7, a breaking release to appear soon, incorporates the multi-threading changes and the possibility of tuning the number of batches for specific calculations, to improve overall parallel performance.
I really appreciate all the help received here, the tips and suggestions, which allowed me to improve the implementation and understand the problem.
Fastest possible
For this test case, a 3M-particle calculation, the fastest performance on this machine is obtained with julia -t32 and nbatches=2^13=8192, for which we have (exclusive mode does not help here):
            ┌                                      ┐
 [1.8, 2.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14
 [2.0, 2.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13
 [2.2, 2.4) ┤▇▇▇▇▇ 2
 [2.4, 2.6) ┤▇▇▇ 1
            └                                      ┘
                            Frequency