After reading the post by @StefanKarpinski about compiler work priorities here, where multithreading (via PARTR) is now planned as priority #2, I figured it was a good time to ask a question that’s been on my mind for a while: how do we plan to implement multithreading in Julia at the user and package author level, such that the following goals are met:
- Packages that are multithread-capable are composable in an efficient and elegant manner.
- Avoids the need for expensive locks/atomics when running in single-threaded mode, without requiring package authors to write a bunch of additional code to switch a “thread-safe mode” on or off in functions depending on the current threading mode.
- Avoids the need to manually add locks/atomics to possibly multithread-capable data structures (like `DataFrames`). Additionally, avoids the proliferation of duplicate types that vary only in whether they’re thread-safe.
- Potentially permits certain packages to leave their data structures and algorithms unmodified, while still being able to benefit from some form of automated multithreading (maybe based on Cassette or similar? See the proposal below for an example).
Now, for context on why I consider these goals important, look at the approach a language like C takes to multithreading: C often requires that thread-aware data structures and algorithms be specifically implemented alongside, or instead of, their non-thread-aware counterparts. This adds a lot of duplication (especially to data structures which require locks or atomics to become thread-safe), a large potential for bugs, and general bloat within packages which aspire to operate properly in a threaded environment.
In comparison to C, if we instead look at Halide, we see that its approach to parallelism is based on separating the algorithm from its storage and scheduling. This gives the nice property that an algorithm can be written plainly, without worrying about how it might interact with a multithreaded (specifically, GPU-parallel) environment.
Of course, there are benefits and downsides to each approach, and many libraries are available for languages like C that make multithreading much easier to implement and get right. However, multithreading is still something that’s very hard to implement properly and safely, especially for users who haven’t spent significant time learning about, and practising with, multithreading on a regular basis. More specifically, I don’t think anyone should have to get into the dirty details of multithreading, unless they really want to eke out the last bits of performance possible.
I’d like to know what people think about this topic, and whether there are suitable approaches out there that we could implement in Julia to achieve the goals listed at the top of this post. Just to start things off, here is my very abstract proposal for how I would do things (warning: this may be total nonsense!):
At the highest level, there is a set of keywords/macros/patterns which turn on multithreading for a subset of code: things like launching a `Task` with a certain keyword given to `schedule()`, a revised `@threads` macro, etc. As the thread-enabled code is compiled before execution, a pass is performed across the code and all accessible variables:
- The number of available threads is determined, and either now or later on during this pass, additional threads are launched.
- The variables within the code that are potentially accessible by multiple threads (whether they’re allocated locally, or accessible via closure or as globals) are tracked based on how and where they’re accessed. If a read-modify-write occurs, those read and write points are individually marked.
- At each marked access point, the pass will determine what sort of safeguards (locks/atomics) must be inserted to allow the executing threads to safely access each marked variable. As an example, if a field within a shared `struct` is read, its value incremented by 1, and the updated value written back to the `struct`, then the pass will insert code to allocate a shared `Lock`, insert a `lock()` call before the read, and an `unlock()` call after the write.
- If multiple potentially-shared `struct`s (say, a `MyStruct` that may be mutable) are allocated locally, then the pass may make a temporary variant of `Vector` which adds locks or atomics to accommodate threaded access (NOTE: somehow, magically, these temporary variants will still have the same types as their originals, so as to not break dispatch).
- As necessary to improve performance, or as specified by the caller, the pass may try planning different combinations of thread scheduling and atomic insertions (or even disable multithreading if it’ll add too much overhead).
- If we see an `eval()` or other dynamic code, force single-threaded mode during (and possibly after) its execution.
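To make the lock-insertion step above concrete, here is a minimal sketch of the transformation I have in mind, written out by hand with today's `Threads.@threads` and `ReentrantLock`. Everything here is hypothetical: currently the lock must be written manually, but the proposed pass would detect the read-modify-write on the shared field and insert the lock allocation, `lock()`, and `unlock()` calls automatically.

```julia
mutable struct Counter
    n::Int
end

function increment_all!(c::Counter, iters::Int)
    l = ReentrantLock()              # inserted by the pass: one shared lock for `c`
    Threads.@threads for i in 1:iters
        lock(l)                      # inserted before the read
        c.n += 1                     # the user's plain read-modify-write
        unlock(l)                    # inserted after the write
    end
    return c.n
end

increment_all!(Counter(0), 1000)     # == 1000 regardless of thread count
```

The user would only write the plain loop body; the surrounding synchronization is what the pass would conceptually emit.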
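And here is a hedged sketch of the "temporary variant of `Vector`" idea: a wrapper type that takes a lock around each forwarded operation. In the proposal this variant would be generated automatically by the pass and would (somehow) keep the original type for dispatch; the `LockedVector` name and its methods are purely illustrative and exist nowhere today.

```julia
struct LockedVector{T}
    data::Vector{T}
    lock::ReentrantLock
end
LockedVector(v::Vector{T}) where {T} = LockedVector{T}(v, ReentrantLock())

# Forward a couple of operations, holding the lock around each access.
Base.push!(lv::LockedVector, x) = (lock(() -> push!(lv.data, x), lv.lock); lv)
Base.length(lv::LockedVector) = lock(() -> length(lv.data), lv.lock)

lv = LockedVector(Int[])
Threads.@threads for i in 1:100
    push!(lv, i)                     # safe: each push! holds the lock
end
length(lv)                           # == 100
```

The hard part, as noted above, is that such a generated variant would need to remain indistinguishable from `Vector` for dispatch purposes, which no mechanism in Julia currently provides.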