Data parallelism is a useful approach to parallel programming because it lets us describe what to compute rather than how to compute it. This is especially nice for parallel computing, since specifying the "how" often leads to subtle concurrency bugs. However, each program may have specific scheduling requirements for obtaining decent performance. In particular, it is challenging to create a single scheduler that can manage very imbalanced loads, terminatable reductions, resource requirements, backpressure, and/or preserving the primary thread for latency-oriented tasks.
FoldsThreads.jl tries to solve this problem by providing various thread-based data-parallel execution mechanisms with different performance characteristics. It can be used with any of the JuliaFolds/*.jl packages such as FLoops.jl, Transducers.jl, and Folds.jl (ref: [ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia). The aim is to make it possible to specify the execution mechanism after you have described what to compute. This way, you can improve the performance of your parallel code by just swapping the executor passed to the data-parallel API. Quoting the README:
```
                              Executors
                     ,-----------------------.
   Algorithms        |    FoldsThreads.jl    |     Data structures
,----------------.   |-----------------------|   ,-----------------.
|  FLoops,       |   |  ThreadedEx*          |   |  Array,         |
|  Folds,        |   |  WorkStealingEx,      |   |  Tables,        |
|  Transducers,  |---|  DepthFirstEx,        |---|  Dict,          |
|  FGenerators,  |   |  TaskPoolEx,          |   |  Set, ...       |
|  OnlineStats,  |   |  NondeterministicEx,  |   `-----------------'
|  DataTools, ...|   |  ...                  |
`----------------'   `-----------------------'
```
(\* `ThreadedEx` is the default executor provided by Transducers.jl.)

- `WorkStealingEx` implements work stealing (continuation stealing). Useful for load-balancing.
- `DepthFirstEx` implements depth-first scheduling. Useful for `findfirst`-type computations.
- `TaskPoolEx`: task pool executor. Useful for fine execution control (e.g., backpressure and "background" threads).
- `NondeterministicEx`: an executor for parallelizing computations with non-parallelizable iterators.
These executors can be passed to the `executor` argument of `@floop executor for ... end` and as the last argument of the Folds.jl API (e.g., `Folds.map(f, xs, executor)`). See the documentation for more information.
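For example, here is a minimal sketch of swapping executors without touching the algorithm (the package names are from this post; the loop body and input are arbitrary):

```julia
using FLoops          # @floop, @reduce
using Folds           # Folds.map, Folds.sum, ...
using FoldsThreads    # WorkStealingEx, DepthFirstEx, ...

xs = 1:1000

# FLoops.jl: the executor goes right after `@floop`.
@floop WorkStealingEx() for x in xs
    @reduce(s = 0 + x^2)
end

# Folds.jl: the executor is the optional last argument.
ys = Folds.map(x -> x + 1, xs, DepthFirstEx())
```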
Benchmarks
I ran a few benchmarks. I hope these results give you some idea of when to use each executor.
Work stealing is useful for load-balancing
- Scripts for benchmarks and analysis, and the result data: Makefile · GitHub
- Additional figures and comments in a rendered notebook: Jupyter Notebook Viewer
I benchmarked

```julia
xs = 1:2^13
Folds.sum($f, xs, $Executor(basesize = 1))
```

where `f` spins for 100 μs on `nworks` of the items (i.e., one in every `length(xs) ÷ nworks`) in the input collection `xs`. This models a parallel reduce over a wildly skewed run-time distribution. I compared the default `ThreadedEx` executor and the `WorkStealingEx` executor.
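To make this concrete, here is a hedged sketch of the workload shape; the `spin` helper and the `nworks` value are illustrative, not the exact benchmark script (which is linked above):

```julia
using BenchmarkTools
using Folds, FoldsThreads
using Transducers: ThreadedEx

# Busy-wait for roughly `seconds` to emulate CPU-bound work.
function spin(seconds)
    t0 = time_ns()
    while time_ns() - t0 < seconds * 1e9
    end
end

xs = 1:2^13
nworks = 64                        # illustrative: number of "heavy" items
const stride = length(xs) ÷ nworks

# One in every `stride` items spins for ~100 μs; the rest are nearly free.
f(x) = (x % stride == 0 && spin(100e-6); x)

@btime Folds.sum($f, $xs, ThreadedEx(basesize = 1))
@btime Folds.sum($f, $xs, WorkStealingEx(basesize = 1))
```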
`WorkStealingEx` performs better than `ThreadedEx` when the run-time distribution is unbalanced enough. Furthermore, the run time of `WorkStealingEx` is much more consistent than that of `ThreadedEx`.
Depth-first scheduling is useful for low latency in the `findfirst` "hit case"
- Scripts for benchmarks and analysis, and the result data: Makefile · GitHub
- Additional figures and comments in a rendered notebook: https://nbviewer.jupyter.org/gist/tkf/bba44b8d1ee7038a802d53d182e87a06/analysis.ipynb#Notes
I benchmarked `findfirst` with the needle at different places:

```julia
xs = rand(2^23)
xs[needleloc] = 2
@benchmarkable(Folds.findfirst(>(1), xs, $Executor(basesize = $basesize)))
```
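For a runnable instance of this snippet (the executor, `basesize`, and `needleloc` choices here are arbitrary):

```julia
using BenchmarkTools
using Folds, FoldsThreads

xs = rand(2^23)
needleloc = 2^10                  # arbitrary: put the needle near the start
xs[needleloc] = 2

b = @benchmarkable Folds.findfirst(>(1), $xs, DepthFirstEx(basesize = 2^13))
run(b)
```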
NOTE: the x axis (`needleloc`) is logarithmic; i.e., it magnifies the performance benefits of `DepthFirstEx`.
`DepthFirstEx` is useful when the latency of the "hit case" matters and/or the needle is expected to be somewhere near the beginning of the collection. Otherwise, `ThreadedEx` (the default) is a good option. For consistent latency, `WorkStealingEx` may be useful.
Also, note that the speedup is much more drastic when the medians (left) are compared. This means that `DepthFirstEx` not only performs better but also behaves much more consistently. The variability of the other executors is probably due to the randomized scheduling nature of Julia's task runtime.
Random thoughts
- The implementation of `WorkStealingEx` is rather crazy and I'm a bit scared of releasing it. Sorry if it breaks your code. Please switch back to `ThreadedEx`, and it'd be great if you can come up with a minimal reproducible example!
- My impression from playing with different schedulers is that, even though people comment on the overhead of `@spawn`, it is a very fast (the fastest?) concurrency mechanism compared to other concurrency primitives in Julia. This means that writing a custom scheduler to improve latency is rather hopeless (expected?). However, I think improving throughput is possible (e.g., `WorkStealingEx`) by exploiting properties that the default scheduler cannot assume, and there are more interesting things we can do. For example, I made the result of `WorkStealingEx` deterministic by enforcing the shape of the task tree to be static (although the scheduling itself is dynamic). We could reduce the pressure on the GC, and maybe also improve throughput, either by imposing exact associativity or by not requiring a deterministic result.
- Although I've been "complaining" that using a custom scheduler in library packages is not good practice, since it reduces the composability of parallel programs (e.g., Overhead of `Threads.@threads` - #16 by tkf and Automatically fusing together several for-loops - #18 by tkf), I do understand that approaches like ThreadPools.jl, which executes tasks only in "background" threads, are sometimes required. That's why I added `TaskPoolEx`, which is based on the trick I learned from ThreadPools.jl (see the sketch after this list). I think it's nice that packages written on top of JuliaFolds/*.jl can be mixed with latency-oriented code by just passing around the executor object.
- Most of these executors can be extracted as a slightly generalized version of the divide-and-conquer algorithmic skeleton (aka parallel skeleton). That would let us write, e.g., mergesort and quicksort on top of different executors. I'm thinking of doing this after adding a few more executors, to clarify the API we need.
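A minimal sketch of the `TaskPoolEx` idea mentioned above; the `background` keyword shown here is my reading of the FoldsThreads documentation, so check the `TaskPoolEx` docstring for the exact options:

```julia
using Folds, FoldsThreads

# Assumption: `background = true` asks TaskPoolEx to run tasks only on
# non-primary threads, keeping thread 1 free for latency-sensitive work
# (e.g., a REPL or GUI event loop).
ex = TaskPoolEx(background = true)

total = Folds.sum(x -> x^2, 1:10^6, ex)
```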