When could I need more threads than CPU cores, given 1.7+ task migration?

Title basically says it all.

To elaborate, the number of OS threads running in parallel is at most the number of virtual CPU cores. The reason for having many more threads is that some threads can or must idle, so there is opportunity to run another thread on a core. There’s a parallel to Julia’s tasks scheduled on OS threads; when a task idles, there is opportunity to run another task on a thread, which also encourages the OS to keep it scheduled on the cores.

Threads migrate across cores; that is, a thread that idles does not necessarily resume on the same core. Julia tasks used to be, and optionally still can be, stuck to a thread. That could justify spreading them over a larger number of threads, to mitigate one thread hoarding ready tasks while other threads sit idle holding only idle tasks, or a busy task holding up other ready tasks on the same thread. My understanding is that tasks migrating across threads effectively lets them migrate across cores just as well, so I only need to create many more tasks. Have I missed another benefit of setting more threads than cores?
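To make the question concrete, here is a minimal sketch of what I mean by "just make many more tasks". The `spin` workload and the factor of 8 are my own illustrative choices. It spawns far more non-sticky tasks than threads and records which thread each one finished on.

```julia
using Base.Threads: @spawn, nthreads, threadid

# Hypothetical CPU-bound workload, used only for illustration.
function spin(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

ntasks = 8 * nthreads()   # many more tasks than threads

# Threads.@spawn creates non-sticky tasks, so the scheduler decides
# which thread runs each one.
tasks = map(1:ntasks) do _
    @spawn begin
        spin(100_000)
        threadid()        # report the thread this task finished on
    end
end
ids = fetch.(tasks)

println("logical cores: ", Sys.CPU_THREADS)
println("Julia threads: ", nthreads())
println("threads used:  ", sort(unique(ids)))
```

Started with `julia -t auto`, I would expect the work to spread across the available threads without asking for more threads than cores, though the exact placement is up to the scheduler.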

One case where this can be useful is if you have synchronous code that waits on external resources (e.g. webservers).

I am also interested in what people have to say on the topic.

Let me share some recent observations from running multi-threaded computations on a 64C/128T machine. (If I misstate any of the inner workings below, that is down to my ignorance, not an intent to misrepresent.)

  1. The GC gets called rather frequently and pauses all threads to do its job. Running on 128T, I would get only 50% CPU usage, and half of the wall time was spent in GC. Somewhat fixed by hunting down allocations and switching some operations to in-place versions.

  2. Splitting a task into 128 subtasks and sending one to each of the 128 threads would result in some threads finishing early and waiting on laggards. Fixed by implementing a bounded worker pool, where long-lived workers pull data to be processed from a shared channel. Combined with the previous item, this decreased total wall time by almost a factor of 2 (CPU at ~95% of 128T the whole time).

  3. (Unrelated to items 1 and 2.) Calling a multi-threaded facility like SciML’s EnsembleThreads() can launch many fairly small computations, meaning a decent amount of time goes to overhead. Fixed by handling the multi-threaded aspect of the simulation manually to have, again, long-lived threads. Maybe there’s a smarter way to do it in SciML; I’m still exploring.
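For reference, the bounded worker pool from item 2 looks roughly like the sketch below. This is a simplified stand-in rather than my actual code: `process`, `run_pool`, and the `Int` job type are hypothetical names chosen for the example.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical unit of work whose cost grows with x, so jobs are uneven.
process(x) = sum(sqrt(i) for i in 1:(x * 1_000))

function run_pool(inputs; nworkers = nthreads())
    jobs    = Channel{Int}(length(inputs))
    results = Channel{Tuple{Int,Float64}}(length(inputs))

    # Queue every job up front, then close so workers stop once it drains.
    foreach(x -> put!(jobs, x), inputs)
    close(jobs)

    # A fixed number of long-lived workers pull jobs until none are left,
    # so a worker that finishes early simply takes the next job.
    workers = map(1:nworkers) do _
        @spawn for x in jobs
            put!(results, (x, process(x)))
        end
    end
    foreach(wait, workers)
    close(results)

    return Dict(collect(results))   # maps each input to its result
end

out = run_pool(1:200)
println(length(out), " results computed")
```

The point of the design is that the number of workers is bounded by the thread count while the number of jobs is not, so no thread sits idle waiting on a laggard as long as jobs remain in the channel.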

This is all from the past week, and I’m still discovering what problems (or, I should say, sub-optimal outcomes) can surface in highly threaded workloads.

That’s a good example: more threads help if your tasks are going to be stuck doing I/O.

As in the part of the program not scheduled in Tasks? I thought that’s scheduled like an unsticky Task, which gives up its thread when idle and restarts on any available one. How would more threads than cores help?
EDIT: It appears to be sticky, so the main “task” (can we call it that?) could be held up by a busy Task or hold up other sticky Tasks on its thread. Is that what you meant?

julia> current_task().sticky
true
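For comparison, here is a small check of the same flag on spawned tasks. It only uses `@async`, `Threads.@spawn`, and the `sticky` field shown above; the variable names are mine.

```julia
# @async creates a sticky task: it stays on the thread that started it.
sticky_async = fetch(@async current_task().sticky)

# Threads.@spawn creates a non-sticky task that any thread may pick up.
sticky_spawn = fetch(Threads.@spawn current_task().sticky)

println("@async sticky:         ", sticky_async)
println("Threads.@spawn sticky: ", sticky_spawn)
```

As far as I can tell, only the `Threads.@spawn` task reports `false`, which matches the idea that the main "task" and `@async` tasks are the ones that can hold each other up on a single thread.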

I think technically the GC waits for all Tasks (plus the main program) to pause at safepoints. Then the GC runs on the available threads, by default the number of worker threads. I’m using threads to strictly mean OS threads here; Tasks are Julia’s flavor of green threads, and logical processors could be called hardware threads, but I’m deliberately avoiding those terms.

I think technically the GC waits for all Tasks (plus the main program) to pause at safepoints.

Yes and no… Tasks can only be paused at safepoints, so the GC does wait on all tasks, but it does so by waiting on all threads and knowing that any task not on a thread is already paused.

As the others said, you need more threads if some of them get blocked on IO.

If you do your IO through the julia systems (ultimately libuv), then your julia thread will park the task, fire off the IO request, and grab a different task. So no reason to have more threads than cores.

If you do blocking IO that bypasses the julia scheduler, then the IO request hits the kernel, which puts the entire OS-thread to sleep, and the julia scheduler never gets the opportunity / safepoint to mount a different task.
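A rough way to see the difference, using sleeping as a stand-in for IO: `sleep` goes through the Julia scheduler, while `Libc.systemsleep` blocks the OS thread. The `timed_batch` helper and the task counts below are just illustrative assumptions.

```julia
using Base.Threads: @spawn

# Run `n` tasks that each call `pause(secs)`; return the elapsed wall time.
function timed_batch(pause, n, secs)
    t0 = time()
    tasks = map(1:n) do _
        @spawn pause(secs)
    end
    foreach(wait, tasks)
    return time() - t0
end

# Warm-up runs so compilation does not distort the timings.
timed_batch(sleep, 1, 0.01)
timed_batch(Libc.systemsleep, 1, 0.01)

n, secs = 8, 0.1
t_yield = timed_batch(sleep, n, secs)             # parks the task, frees the thread
t_block = timed_batch(Libc.systemsleep, n, secs)  # puts the OS thread itself to sleep

println("yielding sleep: ", round(t_yield; digits = 2), " s")
println("blocking sleep: ", round(t_block; digits = 2), " s")
```

With fewer threads than tasks, the yielding batch should take about `secs` in total, while the blocking batch should take roughly `n * secs / Threads.nthreads()`. That gap is the case where extra threads buy you something.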

Why would you do that? Bypassing the julia scheduler?

For example, because you mmap a file that happens to be on a network share in Australia. Your code reads an integer from an array, the OS kernel page-faults, network packets move across oceans, and all this time your OS thread is parked by the OS kernel. Meanwhile you’re still holding that spinlock, and other julia threads / cpu-cores are furiously spinning and doing an impression of a space heater (seriously, julia spinlocks should fall back on a futex).

(The same applies if your mmapped file is on a local drive. SSD latency is high enough compared to CPU speeds that you should treat it as async.)

I was only aware of Libc.systemsleep as something that blocks the thread, so that helps. Is that a necessity of Mmap or just how it’s implemented currently? Are there other things in base Julia or the mainstream ecosystem that can force a thread to idle alongside the task, or is it mostly a concern for interop?

Are you referring to something else here or are you saying Mmap waiting has spinlocks? I thought a spinlock forced a thread to keep running on a core for the fastest response to input.