I’ve been thinking about ways to propagate information about available workers (and possibly other resources) in scenarios with nested computations. Say we have 10000 workers available and want to run a high-level distributed algorithm that will scale up to 100 workers. Each parallel part of the high-level algorithm will in turn use a lower-level parallel algorithm internally that can also scale up to 100 workers. So all 10000 workers could be utilized. The question is: when coding this in a modular fashion, how will each instance of the low-level algorithm know which workers it’s allowed to use? There could be different low-level algorithms to choose from, and some methods might also be used stand-alone in other situations, so we don’t want to code it as a monolithic thing.
On the thread level, the partr scheduler has pretty much solved this problem (since Julia v1.3). To my knowledge, we currently don’t have such a scheduler for (possibly distributed) worker processes. I was thinking about some kind of simple solution we could use until we have a fancy scheduler for workers, like we now have for threads.
Could task_local_storage be used to propagate information about available/assigned resources (mainly workers) in a hierarchy of tasks? We’d need a way to pass it along when spawning new local/remote tasks, of course, since task-local storage isn’t inherited by child tasks automatically.
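To make the idea concrete, here is a minimal sketch of what explicit hand-off could look like. All names here (`WORKERS_KEY`, `assigned_workers`, `spawn_with_workers`) are hypothetical, not an existing API; the point is just that the parent reads its own assignment and sets an explicit subset in each child task:

```julia
# Hypothetical key under which a task's assigned worker IDs are stored.
const WORKERS_KEY = :available_workers

# Worker IDs assigned to the current task (empty if nothing was assigned).
assigned_workers() = get(task_local_storage(), WORKERS_KEY, Int[])

# Spawn a task that runs `f` with an explicit slice of the parent's workers.
# The closure runs in the child task, so task_local_storage is set there.
function spawn_with_workers(f, workers::Vector{Int})
    @async begin
        task_local_storage(WORKERS_KEY, workers)
        f()
    end
end

# Example: a "high-level" task holding 6 workers splits them between two
# "low-level" child tasks, which each see only their own slice.
task_local_storage(WORKERS_KEY, collect(1:6))
pool = assigned_workers()
halves = [pool[1:3], pool[4:6]]
results = [fetch(spawn_with_workers(assigned_workers, h)) for h in halves]
# results == [[1, 2, 3], [4, 5, 6]]
```

For remote tasks the same hand-off would work, except the subset has to travel as an argument of the `remotecall` and be installed in task-local storage on the far side.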
If so, could we come up with a community standard on which key names/values to use in
task_local_storage, so that we can pass resource availability information through different parallel computing packages (e.g. Transducers.jl, CC @tkf) in a hierarchical computation?
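As a strawman for such a convention, the key could hold a small `Dict` of resources rather than just a worker list, so it can grow to cover other resource types. The key name and helper functions below are made up for illustration, not a proposed final API:

```julia
# Hypothetical "standard" key that cooperating packages would agree on.
const RESOURCE_KEY = :PARALLEL_RESOURCES

# Resource table for the current task; empty Dict if nothing was assigned.
current_resources() = get(task_local_storage(), RESOURCE_KEY, Dict{Symbol,Any}())

# Run `f` with a modified resource table; the previous value is restored
# afterwards (this uses the three-argument form of task_local_storage).
with_resources(f, res::Dict{Symbol,Any}) = task_local_storage(f, RESOURCE_KEY, res)

# A package could then ask how many workers it is allowed to use, falling
# back to an empty list when called stand-alone outside any hierarchy:
n = with_resources(Dict{Symbol,Any}(:workers => collect(2:5))) do
    length(get(current_resources(), :workers, Int[]))
end
# n == 4
```

The vendor-neutral part would just be the key name and the value schema; each package could ship its own tiny helpers like these without depending on the others.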