When could I need more threads than CPU cores, given 1.7+ task migration?

Title basically says it all.

To elaborate, the number of OS threads running in parallel is at most the number of virtual CPU cores. The reason for having many more threads is that some threads can or must idle, so there is opportunity to run another thread on a core. There’s a parallel to Julia’s tasks scheduled on OS threads; when a task idles, there is opportunity to run another task on a thread, which also encourages the OS to keep it scheduled on the cores.

Threads migrate across cores, that is a thread does not necessarily idle then run on the same core. Julia tasks used to be and can still optionally be stuck to a thread, which could justify spreading them among a bigger number of threads to mitigate a thread hoarding ready tasks from other idle threads full of idle tasks, or a busy task holding up other ready tasks in that same thread. My understanding is that tasks migrating across threads effectively makes them migrate just as well across cores, so I only need to make many more tasks. Have I missed another benefit of setting more threads than cores?

One case where this can be useful is if you have synchronous code that waits on external resources (e.g. webservers).

I am also interested what people have to say on the topic.

Let me share some recent observations running multi-threaded computations on a 64C/128T machine. (If some inner workings stuff is misstated below, it’s only a consequence on my ignorance and not intent to misrepresent)

  1. The GC gets called rather frequently and pauses all threads to do its job. Running on 128T I would get only 50% CPU usage and half of the wall time was spent in GC. Somewhat fixed by hunting allocations down and switching some stuff to inplace.

  2. Splitting a task into 128 subtasks and sending them to each of the 128 threads would result in some threads being done early and waiting on laggards. Fixed by implementing a bounded worker pool, where long lived workers pull data to be processed from a shared channel. Combined with the previous item, this almost decreased total wall time by a factor of 2 (CPU ~95% of 128T the whole time).

  3. (Unrelated to 1) and 2)) Calling some multi-threaded facility like SciML’s EnsembleThreads() can potentially call many somewhat small computations meaning you spend a decent amount of time with overhead. Fixed by handling the multi-threaded aspect of the simulation manually myself to have, again, long lived threads. Maybe there’s a smarter way to do it in SciML, I’m still exploring.

This is all in the past week, and I’m still discovering what problems (or I should say sub-optimal outcomes) can surface highly threaded workloads.

That’s a good example. If your tasks are going to be stuck doing I/O.

As in the part of the program not scheduled in Tasks? I thought that’s scheduled like an unsticky Task, gives up a thread when idle and restarts on any available one. How would more threads than cores help?
EDIT: It appears to be sticky, so the main “task” (can we call it that?) could be held up by a busy Task or hold up other sticky Tasks on its thread. Is that what you meant?

julia> current_task().sticky
true

I think technically the GC waits for all Tasks (plus the main program) to pause at safepoints. Then the GC runs on the available threads, by default the number of worker threads. I’m using threads to strictly mean OS threads here; Tasks are Julia’s flavor of green threads, and logical processors could be called hardware threads, but I’m deliberately avoiding those terms.

I think technically the GC waits for all Tasks (plus the main program) to pause at safepoints.

Yes and no… Tasks can only be paused at safepoints, so the GC does wait on all tasks, but it does so by waiting on all threads and knowing that any task not on a thread is already paused.

As the others said, you need more threads if some of them get blocked on IO.

If you do your IO through the julia systems (ultimately libuv), then your julia thread will park the task, fire off the IO request, and grab a different task. So no reason to have more threads than cores.

If you do blocking IO that bypasses the julia scheduler, then the IO request hits the kernel, which puts the entire OS-thread to sleep, and the julia scheduler never gets the opportunity / safepoint to mount a different task.

Why would you do that? Bypassing the julia scheduler?

For example because you MMap a file that happens to be on a network share in australia. Your code reads an integer from an array, the OS kernel page-faults, network packets move across oceans, and all this time your OS thread is parked by the OS kernel. While you’re still holding that spinlock and other julia threads / cpu-cores are furiously spinning and make an impression of a space heater (seriously, julia spinlocks should fall back on a futex).

(the same applies if your mmaped file is on a local drive. Your SSD latency is high enough compared to CPU speeds that you should treat it as async)

I was only aware of Libc.systemsleep’s blocking IO, that helps. Is that a necessity of Mmap or just how it’s implemented currently? Are there other things in base Julia or the mainstream ecosystem that can force a thread to idle alongside the task or is it mostly a concern for interop?

Are you referring to something else here or are you saying Mmap waiting has spinlocks? I thought a spinlock forced a thread to keep running on a core for the fastest response to input.

Answering only this part, no they should not.

It will break realtime code because the conventional wisdom “spinlocks are always bad” does not apply in practice.

No, it forces a task to keep running on an OS thread. Which OS thread runs when on which hardware thread is a kernel scheduler decision, and we never inform the kernel about our spinlocks.

I wanted to say: Accessing mmapped memory – or any memory really, if you have swap enabled – can block on the OS-level. This means that your OS thread gets parked by the kernel, and julia never knows about this.

This has the extra unhappy consequence that if you hold a spinlock, you’re not going to release it until the OS kernel has finished the underlying IO (that you didn’t know about) and continues running your thread.

This also has the consequence that you could profitably oversubscribe your system, i.e. run more OS threads than you have hardware threads / logical cores: When you block on the OS-level, then the OS kernel can use the free hardware thread to run something else. Downside is the bad interaction with the current julia spinlock implementation.

Sorry for the diversion…

Spinlocks in julia have three functions / reasons why somebody would use a spinlock:

  1. use in places where julia-task migration would be bad, i.e. forbid the julia scheduler from grabbing a different task and
  2. try to keep running on your core, e.g. to reduce latency, or realtime, or etc,
  3. cases where the author believes contention is very low.

My believe is that most actual uses of spinlock in the julia ecosystem are of type (1) and (3), not (2). That includes implicit locks for large atomics, all the scheduler locks, etc.

Let’s think about locks where we don’t want task-migration. This means that our lock must not interact with the julia scheduler (e.g. because the lock is used inside the scheduler code).

Lock implementations for this job exist on a continuum:

  1. try to acquire the lock via atomic
  2. spin for some time, retry
  3. use some backoff strategy to increase the spin-time in-between subsequent attempts to acquire the lock (no backoff is also a strategy).
  4. eventually (or never) inform the OS kernel that we are stuck on a lock, via futex / os_sync_wait_on_address / _umtx_op / waitonaddress

There are various reasonable choices for these tunable parameters, which depend on a lot of details (expected contention, expected length of critical section, latency goals, etc).

What I am saying is that for userspace processes on general purpose computers running general purpose operating systems, there are very few cases where a good choice for (4) is “spin until the lock is free or the kernel preempts you, never informing it about the issue”.

If everything works as intended, then it doesn’t matter whether we “spin for a relatively long time” or “spin for an unbounded amount of time”: The code for “we have spun for a freaking millisecond, something is wrong, let’s tell the kernel” never executes, and the actually executed sequence of instructions on the happy and moderately unhappy paths are almost the same.

The big problem with never informing the kernel is the following: Suppose the lock-holder got pre-empted while holding the lock, for example because system load was high (e.g. the number of julia threads is equal to the number of cores, and some other process on your machine needed to do something, like e.g. my browser because I’m typing on discourse). Now the other threads reach the lock and start spinning. They never release their cpu-cores, so we cannot make progress until the kernel scheduler decides to run the lock-holder again, which needs a free cpu core, so we are stuck until either somebody else gets preempted or the background process finishes. So we get a doom-spiral: More system-load → higher chance of lock-holders getting preempted, and longer time until they get re-scheduled → more system-load. This kind of breakdown sucks, and depends on things outside the control of the programmer (it depends on the other processes the user runs on the same machine).

You can view this as a kind of priority inversion: The lock-holder is sleeping because the system is under high load, and the system is under high load because all the waiters are spinning.

That basically causes a hiccup of a couple of milliseconds.

…so when is unlimited spinning the right tool for the job? If your threads are pinned, you have informed the kernel that the lock-users are all realtime-threads, you don’t mind the wasted electricity, and you do care very much about latency, which implies that you’re running without garbage collection.

I think it would be much more reasonable to put the unhappy path (couldn’t acquire the lock after spinning for some reasonable amount of time) into some small C file, and ask that rare realtime users LD_PRELOAD a version that doesn’t inform the kernel.

If you’re saying “this doesn’t happen in practice, nobody wants to spend the effort to implement and test such tricky OS-dependent code”, then fine. If you’re saying that we are currently doing the right thing, then I disagree.

I’ll just note that you will observe this effect easily: Just spawn a handful of threads contending on a spinlock (or atomic), and oversubscribe your system (e.g. start two julia processes in parallel).

I’d suggest that you never oversubscribe a system running parallel jobs, spinlocks or not. Any decent resource manager software will enforce this.

That’s very much a software quality question. If the code you’re running is beautiful and can keep all cores busy all the time, sure, you’re absolutely right.

If your code is a hot mess and frequently switches between “can use all cores” and “effectively single-threaded” and “effectively IO bound” phases, then spawning multiple processes in parallel on the same machine can be an easy win. Sometimes you’re oversubscribed, sometimes you’re undersubscribed.

Common julia-related example would be “compile something” (compilers are a hot mess, not a beautiful matmul). If you build LLVM from sources for tinkering with julia, then of course you tell ninja to parallelize, up to core count / memory constraints! (and you better use mold for linking…)

That makes sense, if we ask the OS to do something, it’s at liberty to manage OS threads. I’m assuming that doesn’t apply if we’re consistently crunching numbers in RAM, but please correct me if I’m wrong.

So how do we pick the number of OS threads when we might have a lot of disk or network IO? Too little stalls our program, too much uses up memory that could have gone elsewhere. I’ve read that Go somewhat makes this easier by initially spawning a number of specified threads (defaulting to logical cores) and more threads when too many are asleep to run that number simultaneously; however, the threadpool doesn’t decrease, so it can be thought of as delaying the number of threads we would specify at the start for a similar Julia program.

Yes.

Ideally, the julia task-scheduler can redistribute tasks that do disk or network IO to different hardware threads. So that’s not a reason to have more OS threads than logical cores.

However – the julia task-scheduler can only do that if it is properly informed about the disk/network IO. If you go behind the task-scheduler’s back and directly ask the OS kernel, then it can’t do this. This includes C-libraries that don’t integrate with the julia scheduler (maybe they use pthreads mutexes internally), and it also includes hidden things like the whole mmap digression (access memory → page fault → kernel does IO on your behalf).

If your code does that a lot, e.g. due to messy C libraries you’re calling, then oversubscription can make sense (but also consider @threadcall – that exists specifically for this purpose!).

The golang (also modern JVM) solution is ideal: If the language runtime detects that a lot of its OS threads are blocked, outside of its control, then it spawns more hardware threads. We currently can’t do that in julia, so you need to set out the number beforehand.

Maybe somebody here has a good profiling script that measures when julia-threads are blocked on the OS-level without the julia scheduler knowing about it? That would be the ideal answer: Profile your workload.

I think I’m missing something here: what exactly is a hardware thread? I’ve been reading that as a logical core, but those can’t be liberally spawned, let alone by the Go runtime. I’ve been calling the threads we specify at a Julia process’s outset and the threads that Go spawns “OS threads.”

I’m interpreting this as the OS being just as capable of letting an OS thread stay awake to do other things during IO as it can block. That would be consistent with M:N schedulers being possible at all. Why let that happen for Mmap, though? Is it just too low-level or stuck in C libraries to tweak for Julia’s scheduler?

The docstring also mentions that the threads come from a fixed threadpool for the process, so blocking threads can only go so far. It’s still nice that a Julia thread isn’t being blocked by a C function though.

Sorry for inconsistent terminology. I used “hardware thread” as “logical core”. The OS scheduler is responsible for putting OS threads on logical cores. The julia scheduler is responsible for putting julia tasks on OS threads controlled by the julia scheduler.

The “good way” of doing IO (implemented in the julia runtime) does the following: We register the fact that we will be waiting for some IO, somewhere in the libuv stuff. Then we tell the OS kernel to do “async IO”, i.e. please schedule that IO, and inform the libuv / “julia io system” once the result is ready. Then we inform the julia scheduler that our task is waiting for IO, please put it on ice and see if another julia task is ready to run on the now-vacated OS thread.

When the IO operation is done, the OS kernel wakes up the IO thread, and tells it about the finished operation. The IO thread looks at the result, and finds the julia task that was waiting on it. Then it tells the julia scheduler that the corresponding julia task is ready to run again. Then the IO thread tells the kernel to please go to sleep again, until there are new events to handle

If you bypass the julia runtime and do IO the “bad way”, then nobody told the julia scheduler that your task is waiting on IO. Instead, the OS kernel puts the entire OS thread to sleep.

Of course you can also do IO the good way without libuv! For that you need to do basically the same thing that the julia runtime does.

MMap is very low level indeed. It’s a mix of direct hardware (virtual memory! The MMU! Shiny new thing, ~40 years ago) and OS support (page faults). I used it as an illustrative example that “doing IO the bad way” can unavoidably happen to you, the programmer, due to things that are beyond your direct control, instead done by the operating system and the user of the program.

The way to address this would be to do what e.g. golang does: The julia runtime detects that, over some time (1) there are more tasks ready to run in the julia scheduler queue than OS threads in the threadpool, and (2) the system is not fully occupied, i.e. the OS kernel cannot find enough unblocked OS threads to feed the CPU.

Then one can draw the conclusion that probably some julia-owned OS threads in the common pool are currently not running because they are waiting on blocking IO / locks / etc that bypassed the julia scheduler. So one should start more OS threads and add them to the common threadpool, up to a limit (look, sometimes the entire system is IO bandwidth bound, not IO latency bound!).

We cannot do that, because the common threadpool size is fixed from startup; and that assumption is baked into a lot of the package ecosystem.

A frustrating thing is that it’s hard to tell when and how “good” or “bad” is done. For writing a string to a text file, I followed Cthulhu.jl only to be stuck at a ccall:

function unsafe_write(s::IOStream, p::Ptr{UInt8}, nb::UInt)
    iswritable(s) || throw(ArgumentError("write failed, IOStream is not writeable"))
    return Int(@_lock_ios s ccall(:ios_write, Csize_t, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), s.ios, p, nb))
end

And it’s not obvious to me where in ios_write involves informing the scheduler for task switching (“good”) or even moving the write to a background thread like @threadcall (mitigated “bad”).