- It does not scale for fine-grained tasks because of lock contention
- It’s not NUMA-aware
As you might know, many schedulers for task parallelism have been proposed, and continue to be. I suspect there is no single “best” scheduler; there are always trade-offs. In some cases, work-stealing schedulers would perform better.
My proposal is to build an abstraction for customizing the scheduling policy in Julia, and to make it possible to publish user-defined schedulers as packages, so that users can choose the scheduler best suited to their purpose.
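To give a feel for it, here is a minimal, hypothetical sketch of such an API. `AbstractScheduler`, `with_scheduler`, and the concrete scheduler types are all invented names for illustration, not anything that exists in Base today:

```julia
abstract type AbstractScheduler end

struct DepthFirstScheduler   <: AbstractScheduler end  # PARTR-like default
struct WorkStealingScheduler <: AbstractScheduler end  # could come from a package

const SCHED_KEY = :current_scheduler

current_scheduler() = get(task_local_storage(), SCHED_KEY, DepthFirstScheduler())

# Install `sched` for the dynamic extent of `f()`. A real runtime would
# consult it whenever a task spawned in this scope is scheduled or woken;
# it would also need to propagate it to child tasks, which plain
# task-local storage does not do.
with_scheduler(f, sched::AbstractScheduler) =
    task_local_storage(f, SCHED_KEY, sched)

with_scheduler(WorkStealingScheduler()) do
    @assert current_scheduler() isa WorkStealingScheduler
    # hypothetically, Threads.@spawn here would enqueue into this scheduler
end
```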
This is wrong. The current implementation of PARTR may be sub-optimal for very small work units because of the overhead of creating tasks and assigning them to workers, but those are both O(1) or O(nworkers) costs. Claims (in the quoted issue) that the trouble scaling to large problems seen in some other model also applies here are empty without data or analysis. Even the quoted issue acknowledges that lock contention is irrelevant here. PARTR is designed to encourage data locality, so why should NUMA be an issue?
To end on a more positive note, variant schedulers are indeed desirable, and since @tkf is actively exploring their potential uses, his topic suggestions are golden.
I missed the point that the lock contention issue should be addressed by MultiQueues. I suspected that PDF scheduling had some scaling issues at large scale (~100 cores) because of its centralized nature, but I can’t say more without actually measuring its performance. I’ll take it back.
Because PARTR is designed so that workers execute nearby tasks in a depth-first manner, they tend to touch nearby data at the same time. That’s why it has good data locality on UMA machines, but sharing the same working set among NUMA nodes is not a good idea: it increases remote memory accesses, which causes performance degradation. In HPC, it’s considered good practice to keep a distinct set of data on each NUMA node to avoid remote accesses.
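For concreteness, here is a minimal sketch of that practice in Julia, relying on the usual Linux “first-touch” page-placement policy (a page lands on the NUMA node of the thread that first writes it). It assumes threads are pinned to cores (e.g. via `JULIA_EXCLUSIVE=1` or a package like ThreadPinning.jl) and uses the `:static` schedule so the thread that initializes a chunk keeps working on it:

```julia
using Base.Threads

nchunks = nthreads()
chunks = Vector{Vector{Float64}}(undef, nchunks)

@threads :static for i in 1:nchunks
    # Allocate *and* first-write on the thread that will keep using the data,
    # so (on Linux) the pages land on that thread's NUMA node.
    chunks[i] = zeros(Float64, 1_000_000)
end

@threads :static for i in 1:nchunks
    # With :static the same thread touches the same chunk again: local accesses.
    chunks[i] .+= 1.0
end
```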
Here’s a video of a Swift for TensorFlow design meeting on parallel abstractions that touches on a protocol for custom schedulers, parallel iterators for data pipelines (@tkf might be interested in that), and related things. It has some interesting ideas.
Regarding NUMA, I should have elaborated (or maybe just exposed more naivete): a thread scheduler naturally works for shared memory, so why isn’t NUMA (i.e., poorly shared memory) better handled at the application level, by separate processes which the operating system should put on separate sockets? It seems that the problem-and-system-specific numbers for NUMA allocation would be readily available to the application but not to a general-purpose scheduler.
Yes, the compute should ideally be scheduled near the data, so in principle the scheduler needs to be aware of the data layout as well as the available compute resources and the task tree structure. This information is available to the application layer, and the best performance will always come from the application carefully handling all of this “by hand”. But naively that introduces tight coupling between high-level algorithms and the underlying compute resources, and most applications won’t be written to be aware of various compute backends.
Ideally it would be nice to run applications which weren’t built with NUMA machines in mind with good efficiency, so I think it’s reasonable to ask the following question: To what extent can a scheduler mitigate NUMA overhead using a scheduling heuristic which is only aware of the task tree structure exposed by typical parallel algorithms?
I’m wondering how far we can go with simple heuristics like breadth-first across NUMA nodes and then depth-first inside (see the sketch below). Combined with Julian coding patterns like the mutate-or-widen strategy that encourage node-local reuse of data structures, maybe we can write decent memory-architecture-agnostic algorithms. Maybe it’s also useful to apply similar heuristics to the cache hierarchy so that the task scheduling is even friendlier to cache-oblivious algorithms?
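As a toy illustration of that heuristic: split the top of the task tree breadth-first, with one coarse task per NUMA node, then recurse depth-first within each node. Everything here is hypothetical; there is no spawn-on-a-given-node primitive today, so in practice you’d approximate it by pinning thread subsets to nodes:

```julia
function numa_map(f, xs, nnodes)
    parts = Iterators.partition(xs, cld(length(xs), nnodes))
    tasks = map(enumerate(parts)) do (node, part)
        # Breadth-first: one coarse task per (hypothetical) NUMA node `node`.
        Threads.@spawn begin
            # Depth-first inside the node: work sequentially (or with nested
            # spawns) over a node-local chunk, reusing node-local memory.
            map(f, part)
        end
    end
    return reduce(vcat, fetch.(tasks))
end

numa_map(x -> x^2, 1:16, 4)  # 4 coarse chunks, one per node
```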
And NUMA is just one of many scheduler-related issues; others include task-creation overheads, priorities, fairness, and so on. I believe custom schedulers can benefit many types of programs written in Julia.
(Anyway, the relationship between schedulers and algebraic effects seems interesting. I’ll dig deeper into it. Thanks.)
Thanks for sharing the references. It’d be great if you could bring such state-of-the-art schedulers to Julia.
I agree! I think a very “dumb” example is to use a static scheduler in a given dynamic scope. This is useful for getting deterministic results with computations using thread-local random number generators. With non-deterministic scheduling, it’s not straightforward to get deterministic results with RNGs. I understand there are fancier ways to get back this property. But I think a naive static scheduler is a nice option especially when you know that the workload is uniform across tasks.
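To make this concrete, here is a small sketch using the real `:static` schedule option of `Threads.@threads` (Julia ≥ 1.5) together with explicitly seeded per-thread RNGs. Because `:static` fixes both the iteration-to-thread mapping and each thread’s iteration order, the output is reproducible from run to run (for a fixed thread count):

```julia
using Random, Base.Threads

function deterministic_sample(n)
    rngs = [MersenneTwister(1234 + t) for t in 1:nthreads()]  # one RNG per thread
    out = Vector{Float64}(undef, n)
    @threads :static for i in 1:n
        # :static pins iteration i to a fixed thread in a fixed order, so
        # each rngs[threadid()] is consumed in the same sequence every run.
        out[i] = rand(rngs[threadid()])
    end
    return out
end

deterministic_sample(8) == deterministic_sample(8)  # true: run-to-run reproducible
```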
It’s interesting you bring up RNGs: I feel it’s unfortunate that we have “global” (or thread-local) RNG state:
- As you said, it introduces non-determinism.
- It can’t be swapped out without passing an RNG context argument through to the innermost function call (exactly the same problem as with logger contexts).
- In general, using thread-local state for tasks interacts badly with the desire to auto-migrate tasks between threads (you’d better not yield while holding any thread-local state!).
So maybe access to the RNG state would be well modeled with an algebraic effect. The trick of course would be somehow making it efficient enough…
Yeah, that’s related to the fancier solution I was thinking of. It’d be nice to hook “splittable” PRNGs into the @spawn mechanism and make the state task-local rather than thread-local. I mentioned it at Brew a Parallel RNG? - #5 by tkf before (see also Parallel Mersenne Twister - #15 by oschulz). Providing this via an effect system sounds like an interesting and possibly elegant solution.
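To sketch what I mean (with an emphatically naive `split_rng`; a real design would use a PRNG with a proper split/jump operation, and the runtime would do the splitting inside @spawn itself rather than in user code):

```julia
using Random

# Naive hash-based "split": derive a child RNG deterministically from the
# parent. Not statistically sound; purely for illustration.
split_rng(rng, i) = MersenneTwister(hash((rand(rng, UInt64), i)))

function fork_join_with_rng(f, rng, xs)
    tasks = map(enumerate(xs)) do (i, x)
        child = split_rng(rng, i)   # split on the parent task, *before* spawning,
        Threads.@spawn f(child, x)  # so the parent RNG is never shared across tasks
    end
    # Each task owns an RNG fixed at spawn time, so the result does not
    # depend on which thread ran which task, or in what order.
    return fetch.(tasks)
end

fork_join_with_rng((r, x) -> x + rand(r), MersenneTwister(42), 1:4)
```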
I agree that there’s a lot of value to having a static scheduler available; I hope to look at this soon. However, I’m unconvinced by using RNGs as an example. Determinism in the presence of parallelism is a much bigger problem and not so easily resolved.
As for the NUMA question, many folks have tried really hard to handle it transparently, but to the best of my knowledge, nobody runs real code that shares memory across NUMA domains – one process per NUMA domain is the rule of thumb. Research is always good – someone might come up with something cool – but QPI/UPI, network card memory maps, and now GPUs/accelerators make the cross-domain traffic too unpredictable.
Let me add: a project that I think would be very useful would be a work-stealing scheduler as a drop-in replacement for the current PDF scheduler. Compile-time switching would be fine for this. We could then use these for a proper comparative survey using a small set of multi-threaded applications that require nested parallelism.
I think I understand that it’s hard to get deterministic output from parallel computation in general. But wouldn’t a static scheduler make using RNGs in a deterministic manner significantly easier, if not trivial?
(More precisely, my understanding is that you “just” need to write a program where the task tree constructed by @spawns is deterministic, in the sense that it only depends on the input of the computation (and not on run-time properties like timing). By “static scheduler” I’m assuming that it will assign each task to a consistent thread with consistent ordering every time it is given the same shape of task tree.)
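For example, a divide-and-conquer sum has exactly this property: the spawn tree depends only on the input length, never on timing:

```julia
# The shape of the task tree below is fully determined by length(xs) and the
# fixed cutoff, so a (hypothetical) static scheduler would map the same tasks
# to the same threads in the same order on every run.
function tsum(xs)
    if length(xs) <= 1024                 # input-determined cutoff
        return sum(xs)
    end
    mid = length(xs) ÷ 2
    left = Threads.@spawn tsum(@view xs[1:mid])
    right = tsum(@view xs[mid+1:end])
    return fetch(left) + right
end

xs = randn(10_000)
tsum(xs) ≈ sum(xs)  # same value, same task tree, every run
```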
Hi Kiran, it’s great to have someone chiming in here who already has real experience implementing schedulers (unlike myself at least; I’m not sure about other people on this thread :-)). I agree this project sounds very practical and has quite a concrete outcome, both of which are great.
Out of curiosity, do you have any thoughts about pluggable, dynamically scoped schedulers (as seems to be enabled by treating spawn as an effect in an algebraic effect system)? The idea seemed cool, but I wasn’t sure whether having multiple schedulers would solve real practical problems. I also felt there was a potential mismatch between the simple concurrent examples I saw when reading about algebraic effects and a potential need for deeper runtime integration, particularly when tasks are parallel and not merely concurrent.