Questions on parallel programming terminology

I am using this week to learn the parallel programming features of the language in more depth. I confess that I am a bit confused with the terminology used throughout the documentation, and would like to ask a few questions.

Questions

  1. In terms of execution pools, I understand that we have two types in general: processes and threads. The Julia documentation refers to processes as either procs() or workers() (i.e. all processes minus the master process in a :master_worker topology), and refers to real threads as 1:Threads.nthreads(), correct? I saw that third concept in the documentation called Task or Coroutine, or even “green threads”. From what I grasped, “green threads” aren’t real threads, what are they then? Should new users learn about them at all?

  2. Can you please explain the difference between Channel, RemoteChannel and Future? First, I understand that these three concepts are restricted to process pools, so nothing to do with threads. Second, I understand that a Channel is a buffer where you can place and take data, and that a RemoteChannel is the same concept with the only difference that it also works across processes. If that is the case, I wonder why the documentation is emphasizing this difference. Couldn’t it just mention Channel in general? To me it feels like it is explaining the general concept of Array via internal types like SubArray, OffsetArray. Third, I understand that a Future is just a channel that is returned by a remote function call? Please correct me if I am wrong.

  3. I started reading the docs of pmap from the Distributed section (i.e. section that deals with process pools) and noticed that it has an intriguing option distributed=false that makes it possible to send work to multiple “tasks” instead of processes. In this context, it seems that “tasks” are a specific thing: Julia coroutines. Can we submit work to remote processes running on remote machines, and use multi-threading there with a specific number of real Threads threads instead of “green threads” as well? Also, this functionality needs to necessarily live in pmap? How to deal with this hybrid type of parallelism more explicitly?

  4. I understood that all functions with remote* prefix refer to process pools. We have the main function called remotecall that allows calls of functions on remote processes as expected, and variants like remotecall_fetch and remotecall_wait that do something extra like fetching the result directly or blocking the execution until the remote process finishes execution. I couldn’t get however the remote_do documentation. What is it? I’ve also understood that all these remote* functions come in two flavors: remote*(f, pid, ...) and remote*(f, pool, ...). The first flavor assigns the work to a specific remote process, and the second flavor waits for any process in the pool to become available before calling the first flavor.

I would like to help improve the documentation after I have a more solid understanding of the concepts. If you can please reply to specific points, that would help a lot.

8 Likes

Yes, if what you mean by “real thread” is OS thread.

I think it depends. If you want to, say, implement something like pmap, you probably should. But if you just want to use pmap then it’s probably OK to forget about it.

Channels and futures in general are communication mechanisms in concurrent programming. The Julia type Channel can be used for communication within a process and RemoteChannel and Future are for communication across processes. So, Channel cannot be used with a process pool (but it can be used with thread pool; e.g., with @spawn and @threads).

(Though looking at the second question, maybe you already get this idea?)

Note my emphasis on concurrency and not parallelism. The difference is a bit subtle and I’m not sure everybody agrees with a single definition. But I like this explanation by Rob Pike:

I’m bringing it up because you may be confused that Channel is used even without threading, i.e., parallelism. Concurrency is about dealing with multiple things at once so Channel is useful even though you use only one CPU (e.g., using @async).

Future can be used for passing at most one value (or an exception) to the peer while RemoteChannel can be used for passing arbitrary number of values.

I don’t think you can do it just with Distributed. One option is to use Transducers.jl: Thread- and process-based parallelisms in Transducers.jl (+ some news)

I think we need more high-level abstractions and easy-to-use APIs, developed outside stdlib.

The documentation says

Unlike remotecall, it does not store the result of computation, nor is there a way to wait for its completion.

So, it’s just an API that does not send back the result by default. You need to use RemoteChannel etc. for communication.

9 Likes

Thank you @tkf for the helpful reply. Below are my follow-up questions.

I’ve asked another question about pmap where I was curious why it is using Julia Task (i.e. Coroutines) instead of real threads? I remember someone with an HPC background criticizing the green threading model. So if Threads is a better model of threading, why we keep Task around? And why it is used in pmap for instance?

Nice, I like learning correct terminology, it helps a lot. I will start calling it concurrent from now on. My question is still valid though. Maybe the documentation could hide this internal details about Channel versus RemoteChannel and simply talk about the general concept of Channel? Similar to when we discuss arrays in Julia. Perhaps they both need to show up in the beginning of the documentation because users will need to use them explicitly most of the time?

P.S.: :heart: the video

Transducers.jl in my TOREAD list of parallelism in Julia. I’ve seen the reduce and dreduce operations in the docs this week as well. Need more time to investigate what is out there.

Fully agree. I’ve actually started a package with that specific goal this week while I was reading the docs. Currently it is just a playground for ideas and for grasping the concepts. But if it turns out useful, I will share it here.

I see, so it is basically a remotecall for which we don’t need the result. I am assuming that it can run much faster then.

1 Like

Tasks/Green threads are useful when you are performing something but have to wait. Wait for the disk to load a file, wait for the disk to save a file, wait for the network to receive some data, wait for the network to send some data.

So if have 100 unrelated things you want to do that all involve both waiting and high CPU usage if you just have threads you would need to create 100 threads. That way when one thread waits, the OS can schedule another thread to use the CPU.

There are two downsides here. The first is that threads require a contiguous chunk of memory pre-allocated for the stack, so threads are not “free”. The second if all the threads are doing CPU calculations the OS will keep switching between them trying to give each thread fair use of the CPU. This can actually hurt performance because first context switching between threads is not without cost, second if the threads are working on different memory locations the CPU cache will loose it’s effectiveness on a thread switch.

With tasks/green threads you only create as many threads as CPU cores. When one tasks waits, another task is scheduled by julia, that new tasks runs as long as it needs the CPU. So once a task is running on a core it stays running on that core, with all the benefits of no context switches and the CPU cache loaded for that task. The down side here is that the ALL other tasks are going to wait until the CPU is free. There is no fair use, which if you just want the the output is fine, if one of those tasks is providing feedback to the user, maybe not so much.

Currently, it appears that a task is only scheduled on the thread that created it. This is not ideal since it requires that the programmer balance the tasks between threads. However soon (hopefully) the ability to schedule a task on which ever thread is idle should be coming out.

So in general tasks/green threads works will when you have lots more individual tasks to perform “simultaneously” than you have CPU cores.

4 Likes

Thank you. Can you please elaborate on this? You mean serial execution? If I need to wait anyways, what is the advantage of deferring the calculation to some other place, in this case a “green thread”?

That is how I understand it. If I have 100 unrelated things to do which are “simple enough” compared to the full program, I can create 100 threads and let them run in parallel, correct?

Fully get it. Threads are not free. Do we have a rule of thumb for a maximum number of CPU threads in general? Is it the number of physical cores? logical cores?

I think that is where the concurrency versus parallelism idea comes into play as @tkf mentioned. I don’t get it why it is beneficial to express the computation with multiple tasks given that the execution is still serial? What is the benefit?

Thank you for the clarifications. The “Tasks vs. Threads” idea is what is still confusing to me. I appreciate if you can explain it with a simple concrete example. If you have :julia: code that you can share to illustrate these examples, it is even better.

The classic example where concurrency is useful without parallelism is asynchronous IO. You can create an asynchronous task that waits for a file to load from disk while the main program continues doing other things with the CPU. @oxinabox made a nice blog post about this:

https://white.ucc.asn.au/2018/07/14/Asynchronous-and-Distributed-File-Loading.html

I recommend reading it and paying particular attention to the distinctions between the serial loader, the asynchronous loader and the distributed loader. The first is purely serial, the second exhibits concurrency but not parallelism, and the third is truly parallel, using multiple CPU cores at the same time.

Yes, but it’s best to think of tasks as light weight threads. The advantage of threads is fair CPU usage, each thread will get to spend some time on the CPU even if another thread still needs it. The advantage of tasks is light weight task switching, the time spent switching from task A to task B is much more efficient that with threads.

Creating 100 threads to perform heavy calculations will take longer than creating 100 task to perform those same heavy calculations (if you have less than 100 cores). This is because the CPU won’t have to spend time task switching, which involves saving the current CPU state and loading the saved CPU state for the next task. Task switching I believe has gotten even more expensive with the various CPU vulnerabilities that have been recently discovered. Now the OS has to make SURE that the new thread can’t gleam any information from the previous tasks execution.

This is hard to answer. First because in Julia tasks currently only run on the thread that scheduled them. In Go tasks are scheduled on whichever thread is idle. So under Go the rule of thumb is don’t use threads unless you need something to run consistently at a certain time. A heartbeat would be a good example, or providing an initial response to a user action.

Having tasks only run on 1 thread does have some advantages, thread synchronization is expensive, so if a set of tasks all run on the same thread, there is no real synchronization needed. But this is only really helpful if the total CPU power needed will fit on 1 core comfortably. In this situation what you are really doing is paralleling the “waiting” tasks, like accessing multiple files or multiple network streams.

Once Julia has the ability to schedule a task on any thread, then I believe the rule will be based on the model.

If you have all the data at once and it can be processed in parallel, then threads are probably the way to go. An array of data is a great example of this, you load the entire array, split it into chunks based on the number of threads/cores and process away. If the data comes in sporadically, is processed mostly independently, then tasks are the way to go. Handling HTTP requests would be a great example of this.

1 Like

I don’t know the real answer as I don’t know how pmap was developed. But my guess is that it simply because pmap predates stable/robust threading support in Julia.

The reason why Task is used for pmap (or actually throughout Distributed) is actually related to the point pixel27 is mentioning. This is because the main work for Distributed is passing around the messages (functions, arguments, etc.) via the networking I/O. So the main work of Distributed is to wait for other process to send or receive something. You don’t have to use CPUs in parallel for this.

Networking is actually the main reason behind the hype of green threading (or any “user space” concurrency paradigm). Creating one OS thread for each connection is too much of a overhead (see also C10k problem - Wikipedia) so people had to come up with a way to implement thread-like-thing in the user space (i.e., not relying on OS/kernel). Conceptually, OS and green thread are essentially doing the same thing; i.e., providing a way to handle concurrency. OS threads are executed by the kernel (OS) and green threads are executed by programming language/framework runtime (e.g., Julia).

OSes had threads before multi-core CPUs are popular. If you don’t have multiple cores you can’t compute anything in parallel. Still, threads were useful programming paradigm (and that’s why it makes sense for Python/Ruby/etc. to have GIL/GVL even though they have “real” (= OS) threading support).

However, since OS sits between user space software (e.g., julia) and the hardware (CPU), OS thread is the API that user space software has to use for implementing parallel scheduler. This may be the source of confusion that treats OS threading as a synonym of parallelism.

Let me first clarify some concepts/APIs in Julia. There are @async f() and Threads.@spawn f() in Julia. Behind the scene, both of them create a task (t = Task(f)) and then hand it to the scheduler for concurrent execution (schedule(t)). The difference between @async and Threads.@spawn is that the execution of the latter may happen in parallel.

In Julia, programmers describe concurrency using Tasks. This gives the julia runtime an opportunity to run them in parallel, by mapping Tasks to OS threads which in turn mapped to CPUs by the OS.

So, Base.Threads is actually using Tasks. There is nothing “bad” about it. Actually, it’s much better than using OS threads directly.

If your question is “why not let OS’s threading scheduler handle the parallel execution”, I think a short answer is that, since every thread/task runtime has some trade-offs, it’s much better to implement a runtime that is geared towards high-performance computing within julia. You can find more information in: Announcing composable multi-threaded parallelism in Julia

Do you mean https://docs.julialang.org/en/v1.5-dev/manual/distributed-computing/#Channels-and-RemoteChannels-1 ? Or maybe the fifth and seventh paragraphs in https://docs.julialang.org/en/v1.5-dev/manual/distributed-computing/ ? Yeah, I guess it should spend a bit more sentences to explain the conceptual part.

Maybe? Another reason could be that it’s handy to have different synchronization point. There is a sentence I ignored:

A successful invocation indicates that the request has been accepted for execution on the remote node.

So, you can know that f is started on some worker while it’s harder to do on other remote* variants.

3 Likes