This is a question for both the current state and planned future for multithreading.
One of the more common opportunities for parallelism is tree-like recursion (like e.g. for sorting).
```julia
function do_stuff!(args)
    left, right = split_work(args)
    res_left = do_stuff!(left)
    res_right = do_stuff!(right)
    return merge_results!(args, res_left, res_right)
end
```
where the two `do_stuff!` calls could be done in parallel. How I would imagine this to work, for maximal convenience, is
```julia
function do_stuff!(args)
    left, right, is_large = split_work(args)
    if !is_large
        res_left = do_stuff!(left)
        res_right = do_stuff!(right)
        return merge_results!(args, res_left, res_right)
    else
        @defer res_right = do_stuff!(right)
        res_left = do_stuff!(left)
        @wait res_right
        return merge_results!(args, res_left, res_right)
    end
end
```
So let’s unpack what this hypothetical API would do: `@defer` would put the desired work package into our thread’s work pool and wake up an idle thread if one exists. `@wait` would check whether the work package is (a) untouched, (b) finished, or (c) currently work-in-progress. If it is untouched, we run it now. If it is finished, we grab its result and continue. If it is work-in-progress, we register a continuation waiting on the result and try to grab a different work package from our thread-local list. If our thread has no more work packages, we steal another thread’s work package. If there are none, we go to sleep.
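To make the intended semantics concrete, here is roughly the same dependency structure expressed with generic task spawning (a sketch only, assuming a `Threads.@spawn`/`fetch`-style API; it produces the same results but makes none of the (a)/(b)/(c) scheduling guarantees above, and `fetch` simply blocks rather than running the deferred package inline):

```julia
# Sketch: the @defer/@wait structure written with plain task spawn/fetch.
# split_work and merge_results! are the same placeholders as above.
function do_stuff_tasks!(args)
    left, right, is_large = split_work(args)
    if !is_large
        res_left = do_stuff_tasks!(left)
        res_right = do_stuff_tasks!(right)
        return merge_results!(args, res_left, res_right)
    else
        # plays the role of @defer: hand the right half to the scheduler
        task_right = Threads.@spawn do_stuff_tasks!(right)
        res_left = do_stuff_tasks!(left)
        # plays the role of @wait, except that it blocks instead of
        # running the package inline when it is still untouched (case (a))
        res_right = fetch(task_right)
        return merge_results!(args, res_left, res_right)
    end
end
```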
So my question is: Is this kind of recursive multi-threading already possible (using various `ccall`s into the runtime)? Is it planned?
Obviously there are details in how to schedule this kind of multithreading (e.g., after everything is sequential again, we want to switch execution back to thread 1). However, one would hope that highly parallel problems could have minimal overhead: cases (b) and (c), which require synchronization, would be rare, and threads going to sleep or waking up would be even rarer. Once we have saturated the threadpool, execution within each thread would be mostly sequential and in the same order as the sequential code, which the programmer has hopefully optimized for good L1/L2 temporal cache locality.
Compared to more user-controlled scheduling (like partr?), we give up L3 temporal cache locality: sometimes (e.g. in BLAS?) you have multiple jobs that share a lot of read access; these jobs should be scheduled to run concurrently, because L3 is shared between cores.
On the other hand, many algorithms instead have the property that concurrency comes at a price in lock contention and/or cache synchronization. For parallel sorting, one would probably prefer this kind of scheduling, i.e. “far away” work packages run concurrently, rather than “close together” work packages.
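To make the sorting example concrete, here is a minimal sketch of such a sort (the name `psort!`, the cutoff value, and the use of `Threads.@spawn` as a stand-in for the hypothetical `@defer` are all my own illustration): the two halves of the array, i.e. the “far away” work packages, run concurrently, while subproblems below the cutoff stay sequential and cache-friendly.

```julia
# Sketch: recursive parallel merge sort in the proposed style.
# The "far away" halves v[lo:mid] and v[mid+1:hi] run concurrently;
# below `cutoff` the recursion stays sequential for cache locality.
function psort!(v, lo=firstindex(v), hi=lastindex(v); cutoff=1 << 13)
    if hi - lo < cutoff
        sort!(view(v, lo:hi))                    # sequential base case
        return v
    end
    mid = (lo + hi) >>> 1
    t = Threads.@spawn psort!(v, lo, mid; cutoff=cutoff)  # @defer-like
    psort!(v, mid + 1, hi; cutoff=cutoff)                 # do our half inline
    wait(t)                                               # @wait-like join
    # merge the two sorted halves through a temporary buffer
    tmp = similar(v, hi - lo + 1)
    i, j, k = lo, mid + 1, 1
    while i <= mid && j <= hi
        if v[i] <= v[j]
            tmp[k] = v[i]; i += 1
        else
            tmp[k] = v[j]; j += 1
        end
        k += 1
    end
    while i <= mid; tmp[k] = v[i]; i += 1; k += 1; end
    while j <= hi;  tmp[k] = v[j]; j += 1; k += 1; end
    copyto!(v, lo, tmp, 1, length(tmp))
    return v
end
```

With e.g. `julia -t 8`, each thread then spends most of its time on one contiguous region of the array, which is exactly the “far away packages concurrent, close together packages sequential” schedule argued for above.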