Overhead of `Threads.@threads`

If I understand correctly, the normal paradigm would be to submit X jobs, and then idle threads would “steal” these from a primary queue when they have no other work to do.
I would like the ability to ask “how many cores aren’t busy?” though, to divide fine grained work up into that many chunks.
If you’re processing dense arrays and aren’t using cache-level blocking, there aren’t discrete chunks of work.
Note that even for things like matrix multiplication, you won’t always perform cache-level blocking; the matrices have to be fairly large before that’s profitable. In other words, cache blocks provide an upper limit on block size; if the ratio of threads to array size is high enough, your preferred per-thread block sizes will fall below that limit.
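To make that limit concrete, here’s a toy calculation. The `cache_block = 256` value is made up purely for illustration; the point is only the `min()` between the cache limit and the per-thread share of the problem:

```julia
# Toy illustration of the block-size argument. `cache_block` is a hypothetical
# "profitable cache-block size"; the chosen block is the smaller of that limit
# and the per-thread share of the problem.
blocksize(N, nthreads; cache_block = 256) = min(cache_block, cld(N, nthreads))

blocksize(10_000, 8)   # 256 — big problem, few threads: the cache limit binds
blocksize(1_000, 64)   # 16  — small problem, many threads: the per-thread share binds
```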

With roughly continuous chunks of work rather than discrete ones, the optimum is whatever minimizes overhead. If the work is going to run in parallel on 7 threads, dividing it into 7 pieces yields the best throughput (minimum overhead), and definitely yields the best latency. E.g., if you have 70 time-units of work and there is zero threading overhead, dividing it into 7 pieces takes 10 time-units, while dividing it into 10 pieces takes 14: the first 7 pieces finish after 7 time-units, and the remaining 3 pieces take another 7 while most threads sit idle.
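A sketch of that arithmetic, under the same idealization of zero scheduling overhead:

```julia
# Idealized makespan: `work` time-units split into `nchunks` equal pieces,
# executed on `nthreads` threads with zero scheduling overhead. Each "round"
# retires at most `nthreads` pieces.
makespan(work, nthreads, nchunks) = cld(nchunks, nthreads) * (work / nchunks)

makespan(70, 7, 7)   # 10.0 — one 10-unit piece per thread
makespan(70, 7, 10)  # 14.0 — two rounds of 7-unit pieces, the second round mostly idle
```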

I think we can do a lot in Julia, although Stefan argues that not fracturing the ecosystem is important.

True. Lots of workloads benefit from hyperthreading (though dense linear algebra does not).

That’s a good idea.
Executing a task is non-blocking, but `__wait` (which seems to be part of the API, so I should remove those leading underscores) is of course blocking.
So the simplest fix would be to have `__wait` check whether a task has been started. If it hasn’t, it should “steal” the work and execute it serially.
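A minimal sketch of that idea (not the package’s actual implementation; `Job`, `try_run!`, and `wait_or_steal!` are names made up for illustration): each job carries its work as a closure plus an atomic status word, and whichever side wins the compare-and-swap from “queued” to “running” executes the work.

```julia
# Whoever wins the CAS from QUEUED to RUNNING executes the job, so the waiter
# "steals" any job no worker has picked up yet and runs it serially.
const QUEUED, RUNNING, DONE = 0, 1, 2

struct Job
    f::Function
    state::Threads.Atomic{Int}
end
Job(f) = Job(f, Threads.Atomic{Int}(QUEUED))

function try_run!(job::Job)
    # Returns true if this caller won the race and executed the job.
    Threads.atomic_cas!(job.state, QUEUED, RUNNING) == QUEUED || return false
    job.f()
    job.state[] = DONE
    return true
end

function wait_or_steal!(job::Job)
    try_run!(job) && return          # not started yet: run it serially right here
    while job.state[] != DONE        # otherwise wait for the worker to finish it
        yield()
    end
end
```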

I think it’d be great for the scheduler to support different kinds of behavior. For example, maybe `f` in `sum(f(x) for x in xs if p(x))` does some dense linear algebra that it’d like to thread. If the matrices are moderately small, it may want to run them on e.g. 4 threads (on a computer with potentially 32, 64, or 128+ threads iterating over `xs`).
The thing I’d like there (for the smaller, statically scheduled code) is the ability to request “up to X” threads as described earlier. These “up to X” tasks should be scheduled together as a group; aside from lowering the overhead of starting them, this would benefit tasks that want low-overhead, fine-grained synchronization across the threads they actually run on (i.e., if a set of tasks synchronizes via spin locks, you want them to start at the same time).
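A hypothetical sketch of what such a grouped “up to X” launch might look like. The `spawn_group` name and its guarantees are made up; underneath it is plain `Threads.@spawn`, which does not actually guarantee that the group members run simultaneously, which is why the rendezvous loop below yields instead of spinning:

```julia
# Launch "up to X" tasks as a group and tell each one how many peers actually run.
function spawn_group(f::F, X::Integer) where {F}
    n = min(X, Threads.nthreads())      # grant "up to X", never more than nthreads()
    counter = Threads.Atomic{Int}(0)    # shared arrival counter for a toy barrier
    tasks = [Threads.@spawn f(i, n, counter) for i in 1:n]
    foreach(wait, tasks)
    return n
end

# Example task body: rendezvous with the rest of the group before doing the work
# statically assigned to member `i` of `n`.
function work(i, n, counter)
    Threads.atomic_add!(counter, 1)
    while counter[] < n
        # A true group scheduler could spin here (e.g. with a CPU pause) because all
        # members would be guaranteed to be running; with plain @spawn we must yield.
        yield()
    end
    # ... fine-grained work for member i of n goes here ...
end

spawn_group(work, 4)
```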

I think we’d need a (runnable) motivating example to pick the best behaviors at each level, but that’d be an interesting example with respect to composability.
