Question regarding Julia's planned approach to composable, safe, and easy multithreading

multithreading

#1

After reading the post by @StefanKarpinski about compiler work priorities here, where multithreading (via PARTR) is now planned as priority #2, I figured it was a good time to ask a question that’s been on my mind for a while: how do we plan to implement multithreading in Julia at the user and package author level, such that the following goals are met:

  1. Packages that are multithread-capable are composable in an efficient and elegant manner.
  2. Avoids the need for expensive locks/atomics when running in single-threaded mode (and not requiring package authors to implement a bunch of additional code to switch “thread-safe mode” on or off in functions depending on the current threading mode).
  3. Avoids the need to manually add locks/atomics to possibly multithread-capable data structures (like Arrays or Dicts or DataFrames). Additionally, avoids the proliferation of duplicate types that only vary by whether they’re thread-safe or not.
  4. Potentially permit certain packages to not modify their data structures or algorithms at all, but still have the possibility to benefit from some form of automated multithreading (maybe based on Cassette or similar? see proposal below for example).

Now, for context on why I consider the previous goals are important, consider the approach that a language like C takes to multithreading: C often requires that thread-aware data structures and algorithms be specifically implemented alongside, or instead of, their non-thread-aware counterparts. This adds a lot of duplication (especially to data structures which require locks or atomics to become thread-safe), a large potential for bugs, and just general bloat within packages which aspire to properly operate within a threaded environment.

In comparison to C, if we instead look at Halide, we see that their approach to parallelism is based on the separation of the algorithm from its storage and scheduling. This gives the nice property that an algorithm can be written plainly, without having to worry about any details about how the algorithm might interact in a multithreaded (specifically, GPU-parallel) environment.

Of course, there are benefits and downsides to each approach, and many libraries are available for languages like C that make multithreading much easier to implement and get right. However, multithreading is still something that’s very hard to implement properly and safely, especially for users who haven’t spent significant time learning about, and practising with, multithreading on a regular basis. More specifically, I don’t think anyone should have to get into the dirty details of multithreading, unless they really want to eke out the last bits of performance possible.

I’d like to know what people think about this topic, and whether there are suitable approaches out there that we could implement in Julia to achieve the previously-stated goals listed at the top of this post. Just to start things off, here is my very abstract proposal for how I would do things (warning: this may be total nonsense!):

At the highest level, there are a set of keywords/macros/patterns which turn on multithreading for a subset of code. Things like launching a Task with a certain keyword given to schedule(), a revised @threads macro, etc. As the thread-enabled code is then compiled before execution, a pass is performed across the code and all accessible variables:

  1. The number of available threads is determined, and either now or later on during this pass, additional threads are launched.
  2. The variables within the code that are potentially accessible by multiple threads (whether they’re allocated locally, or accessible via closure or as globals) are tracked based on how and where they’re accessed. If a read-modify-write occurs, those read and write points are individually marked.
  3. At each marked access point, the pass will determine what sort of safeguards (locks/atomics) must be implemented to allow the executing threads to safely access each marked variable. As an example, if a field within a shared struct is read, its value incremented by 1, and then the updated value is written back to the struct, then the pass will insert code to allocate a shared Lock, insert a lock() call before the read, and an unlock() call after the write.
  4. If multiple, potentially shared structs (say a Vector{MyStruct}, where MyStruct may be mutable) are allocated locally, then the pass may make a temporary variant of MyStruct and/or Vector which adds locks or atomics to accommodate threaded access (NOTE: somehow, magically, these temporary variants will still have the same types as their originals, so as to not break dispatch).
  5. As necessary to improve performance, or as specified by the caller, the pass may try planning different combinations of thread scheduling and atomic insertions (or even disable multithreading if it’ll add too much overhead).
  6. If we see an eval() or other dynamic code, force single-threaded mode during (and possibly after) its execution.

#2

there is a PR doing this:


#3

Yes I know, that’s the PARTR I referred to above. However, that PR doesn’t actually implement the necessary infrastructure for user and package code to become thread-safe automagically. It only implements changes at the level of libuv and the Task, Channel, etc. level to support parallel depth-first scheduling (and other assorted things I’m not mentioning since it’s a huge PR).


#4

@kpamnany’s talk at the CAJUN meetup from last year addresses some of your questions:

In particular, by allowing the PARTR depth-first thread scheduler to map tasks to kernel threads and do work-stealing, composabilty is addressed: with this design, you can write threaded libraries and then write concurrent code that calls those libraries and it composes. Moreover, since the scheduling is depth-first, the innermost threaded library call gets the parallelism first, which is a unique property of PARTR and what makes it cutting edge research. This means that you get the parallelism where you want it—at the lowest level library where typically, someone has spent a lot of time, energy and expertise making sure that threads are used effectively.

The rest of your questions are about locking, which presumes that locking is the right way to do synchronization of threads. Julia’s new parallel runtime once it’s fully hooked up will be very similar to Go’s. Since Go has been very successful for writing highly concurrent code, we should take as many lessons from it as we can. The biggest lesson is nicely explained in this blog post:

https://blog.golang.org/share-memory-by-communicating

The key quote from this post is:

Do not communicate by sharing memory; instead, share memory by communicating.

In other words, instead of sharing data structures and protecting them with locks, let a single task manage the data structure and use channels to communicate to that task; protect access to anything that needs synchronization that way. Note the complete absence of locks or slow, complex concurrent data structures in the Go example code at the end. Instead of locking and using concurrent data structures, idiomatic Go programs use normal data structures without locks but only a single task “owns” a data structure at any time and is allowed to modify it. (Unlike Rust, this is not enforced by the compiler, but it just a matter of program design and convention.) This can take the form of sending requests to update a structure to a single task whose job is to handle such requests or it can take the form of passing the references to the data structure around and having whatever task holds the reference to it make the updates before passing the data structure on to another task.

In this approach to multithreading, locks are far less important than they are in the traditional C++/Java style where many threads are trying to safely access a shared data structure at the same time. You may need them occasionally, but I hope that we can avoid needing to have “threaded” and “non-threaded” versions of every data structure imaginable. The flip side is that the task and channel implementations have to be really good and fast to make the Go approach practical. But we want that anyway.


#5

Ahh yes, I was waiting for someone to mention this video. :smiley: As you say, it only addresses some of the questions above, but it definitely is a key piece of the puzzle (namely, defining where the boundaries between threads exist, and how the scheduling works).

In my opinion, the more important part for end-users and package authors is your second point, where you look to Go for inspiration. In regards to what you’ve picked out from that article, how do we plan to approach “ownership” of a data structure? Do we, 1. Intend to introduce new keywords to implement ownership/borrowing of data structures? And if so, how will these data structures be tracked for aliasing and such? Or do we, 2. Intend to tell authors/users that they have to use certain patterns and idioms to become multithread-capable, like what Go seems to enforce through its semantics and tooling?

I worry that if we adopt the latter approach, a lot of existing code will have to be significantly rewritten to make use of Tasks and Channels and Conditions, where they weren’t previously used. This may mean that a lot of previously-elegant and concise code becomes bloated or unreadable just to support multithreading. Of course, if we just want to become Go, then that’s totally reasonable, but I don’t think that’s what’s desired.

Instead, in my mind, the allure of Julia is that one can just “write the formula”, and things will just work and be quite fast without having to do too much work. If you told me that to effectively use multithreading, I’d have to litter my functions with various odd keywords that have nothing to do with my domain, I’d be confused and potentially annoyed, since that just doesn’t feel like the “julian” approach to doing things.

I’m sure that I’m just going off on an irrelevant tangent, and people have already thought through this and come up with a powerful, elegant solution (and preferably some examples so that we can see what things could look like in the near future), but I haven’t yet seen anything of the sort, so I’m of course just a bit hesitant to assume that everything is just going to unfold perfectly after PARTR is merged and enabled. :wink:


#6

Happy to see discussion on this.

I can’t speak to whether there will be a ‘recommended’ way to implement your parallel application or library. The Go/Erlang way is elegant but single task owned data structures are tricky to scale for high data parallelism whereas the other approach Stefan describes is essentially flat combining (https://dl.acm.org/citation.cfm?id=1810540) which is a form of lock-free data structure. Regardless, you will always pay a penalty for whatever approach you use to manage contention, when there is no contention. Is that penalty too high for single threaded programs? Is there a way to support OP’s goals #2 and #3?

All partr does, by design, is support OP’s goal #1 (in order to do so BTW, it requires that you stop thinking about threads for the most part and only think in terms of parallel tasks).

Various parallel programming models can be built atop partr. My own inclination is towards something like Skandium (https://ieeexplore.ieee.org/abstract/document/5452456), i.e. an algorithmic skeleton framework written in Julia. However, I don’t think we should or are able to mandate a ‘one true way’ that works well for everyone at this point.

I’d also like to add: you absolutely have to consider parallelism to make effective use of it – this starts at an algorithmic level as some algorithms simply scale better than others. You have to determine what the unit or grain of parallelism is for your application or package, i.e. what is a task. This does not have nothing to do with your domain.


#7

Maybe a side question and I wouldn’t want to take this interesting thread off-topic, but is it clear that depth first is typically the best choice? My limited expertise with multithreading comes from trying to write a m x k times k x n multithreaded matrix multiplication, e.g. in a cache oblivious manner (divide and conquer). Wouldn’t you start by first dividing m and n into blocks to distribute to different threads, and then within the threads slice up further, so as to giving each thread a large chunk of work? Depth first in a divide and conquer approach would result in the very last tiny blocks be distributed over the different threads, in which case I could imagine that there is quite a bit of overhead?


#8

Completely agree, that was one of my other main issues while thinking through how such a pattern would work out. At a certain point, you just can’t grant ownership of one or a few data structures to only a single task and expect continued performance increases as the number of parallel-executing tasks increases.

This is really interesting! I once had an idea like this, but specific to a certain data structure I was working with :smile: This seems like it could be really useful for both struct and Vector parallel access (if I’m reading this paper correctly).

While I do wish there were a “one true way” for parallelism that was easy, efficient, and scalable, we all know that nothing like that really exists; so yes, I think supporting different approaches to exploiting the functionality provided by partr is necessary and also beneficial. I would just hope that there’s still the potential to mix-and-match different approaches, especially when packages make different choices about which paradigm(s) they want to support.

I agree with you in a general sense, in that my initial statement that " I don’t think anyone should have to get into the dirty details of multithreading, unless they really want to eke out the last bits of performance possible." is an unreasonable request. However, what I was getting at, is that there may be the possibility for one or more forms of “automatic” parallelization of arbitrary algorithms that can be done without the user having to put any effort into writing task-focused code. It’s certainly possible, in limited cases, to automatically thread certain algorithms in such a way that it won’t impact the end result, e.g. element-wise broadcast. I’d really like to see at least a package implementation of something like this so that lazy people like me don’t have to work too hard :slight_smile: (Of course, now that I’ve said it, I’ll probably be the one that gets to try to implement this!)

Also, in regards to “This does not have nothing to do with your domain.”: I don’t intend to imply that domain specialists cannot and should not think carefully about how they want to support multitasking/parallel execution of their code; this is of course necessary to extract the best performance possible (as I stated in my original post), among other benefits. However, I don’t think you can say that every person who uses or will use Julia, will want to have to worry about multitasking when writing their code. They still want the benefits of multithreaded execution, of course, and so I think we should at least consider some approach that appeals to users who don’t want to invest the extra time and research required to comprehend and implement multitasking within their code. My initial proposal at the end of my original post was intended to be a first step towards visualizing how something like that might work out in practice. Hopefully you agree that something like what I proposed has value and is feasible in at least a few real-world circumstances :smiley:


#9

One higher level programming model that one could probably put on top of PARTR might be something like PLINQ. Query.jl is all designed to enable something like that, so we would “only” need to write a backend that utilizes PARTR :slight_smile: But the architecture for that should be more or less there.


#10

Check out Strided.jl. Here too, I first divide the dimensions into large chunks which are distributed over the different threads, and then within the threads, I still use a blocking strategy when different arrays have different optimal loop orders. So I am really interested to see how PARTR will change that philosophy.


#11

What about code that’s not written with any multithreading in mind at all, but that might not be thread-safe? E.g. if I have a function that manipulates some global data structure, that code will likely not work as expected if it ends up getting called from multiple threads simultaneously, right?


#12

True, that is an issue. We’ll have to see how it goes.


#13

IMO it is now a parallel world and programmers need to start getting used to that. :slight_smile:


#14

There are always costs and benefits to consider. If we can get parallelization easily, and/or the algorithm amenable to it, then sure, it is nice. However, if parallelization takes effort (learning new things, coding, debugging), and/or the gains are not significant, then it may not be worth it.

I think @jpsamaroo makes some important points about composability/transparency. Having a framework which is opt-in for coders and invisible to users (I can call functions which may or may not parallelize, and I don’t need to care unless I want to) would be nice.