What are the recommended strategies/packages to deal with the parallelization of many (thousands of) tasks that are:
- Individually relatively fast (the cost associated with spawning tasks may not be negligible).
- Possibly somewhat heterogeneous.
- Possibly have to mutate a large array in arbitrary positions.
Probably the possibilities are different depending on what the output is. There are two classes of problems in my case:
- The output is a scalar, and it is conceivable to just compute (and even store) it ntasks times and reduce the result at the end (e.g. the total energy of a system of particles); see the sketch right after this list.
- The output is a large array, which can be mutated in any position by any of the tasks. The array is large, such that it cannot be copied ntasks (thousands of) times (e.g. the forces acting on each particle of a system). In this case, I have always opted to create nthreads() copies of the output and let the tasks on each thread mutate one of the copies. Here locks start to be a possibility, but since the tasks may be fast, locking is prohibitive.
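For the scalar case, here is a minimal sketch of the compute-and-reduce pattern described in the first item. The function task_energy is a placeholder of my own for the real per-task computation, and the use of @threads here is only for illustration:

using Base.Threads

task_energy(i) = 1.0 / i      # placeholder for the real per-task computation

ntasks = 10_000
partial = zeros(ntasks)       # one slot per task; each task writes only its own slot
@threads for itask in 1:ntasks
    partial[itask] = task_energy(itask)
end
total_energy = sum(partial)   # reduce at the end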
The alternatives I have considered so far are:
a) Run nthreads() tasks, and randomly distribute the work among them.
- Good: cost of spawning is low.
- Bad: if one of the threads gets a larger workload, it ends up running alone at the end while the others sit idle.
Basic syntax:
@threads for it in 1:nthreads()
    for itask in random_task_splitter(ntasks, nthreads())
        output[it] = ... work on one task
    end
end
This is what I'm doing now, but I notice a non-negligible performance tail associated with the heterogeneity of the workload each thread gets.
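For concreteness, here is a runnable sketch of (a) for the array case, with nthreads() copies of a force-like output that are summed at the end. The function task_force! and the shuffled strided split (standing in for random_task_splitter) are placeholders of my own, not the actual implementation:

using Base.Threads, Random

npart  = 10^5
ntasks = 10^4
forces = [zeros(npart) for _ in 1:nthreads()]    # one copy of the output per thread

# placeholder for the real task: mutates arbitrary positions of f
task_force!(f, itask) = (f[mod1(itask, length(f))] += 1.0)

perm = randperm(ntasks)                          # shuffle the tasks to spread heterogeneity
@threads for it in 1:nthreads()
    for itask in @view perm[it:nthreads():end]   # one possible "random_task_splitter"
        task_force!(forces[it], itask)           # each chunk mutates its own copy
    end
end
total_forces = reduce(+, forces)                 # combine the copies at the end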
b) Run ntasks tasks with @threads:
- Good: nothing really (because the tasks are assigned to each thread statically).
- Bad: cost of spawning many tasks.
- Bad: requires using threadid() to identify the copy of the output the task must write to, which is discouraged (by @tkf); see the note after the code below.
Basic syntax:
@threads for itask in 1:ntasks
    it = threadid()
    output[it] = ... work on one task
end
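A side note on the threadid() issue in (b): on Julia 1.8 and later, where @threads defaults to the :dynamic schedule, the commonly cited mitigation when one does index buffers by threadid() is to request the :static schedule, which pins iterations to threads. A minimal sketch, with a placeholder work function of my own (and assuming only the default thread pool, so that threadid() stays within 1:nthreads()):

using Base.Threads

work(itask) = sin(itask)                 # placeholder for the real task

ntasks = 10^4
output = zeros(nthreads())               # one accumulator per thread
@threads :static for itask in 1:ntasks
    output[threadid()] += work(itask)    # safe here only because the schedule is :static
end
total = sum(output)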
c) The same as (b), but using ThreadPools.@qthreads:
- Good: good distribution of the workload.
- Bad: the cost of spawning is high, and if the tasks are fast, this becomes slow compared to (a).
- Bad: also requires identifying the output copy by threadid().
Basic syntax:
using ThreadPools   # provides @qthreads

ThreadPools.@qthreads for itask in 1:ntasks
    it = threadid()
    output[it] = ... work on one task
end
I have had some good results with this one when the tasks are more expensive, but for faster tasks it slows everything down relative to (a).
Any insight is appreciated.