Multithreading for recursive operations

This is a question about both the current state and the planned future of multithreading.

One of the more common opportunities for parallelism is tree-like recursion (e.g. in sorting):

function do_stuff!(args)
    left, right = split_work(args)
    res_left = do_stuff!(left)
    res_right = do_stuff!(right)
    return merge_results!(args, res_left, res_right)
end

where the two do_stuff! calls could run in parallel. How I would imagine this working, for maximal convenience, is

function do_stuff!(args)
    left, right, is_large = split_work(args)
    if !is_large
        res_left = do_stuff!(left)
        res_right = do_stuff!(right)
        return merge_results!(args, res_left, res_right)
    else
        @defer res_right = do_stuff!(right)
        res_left = do_stuff!(left)
        @wait res_right
        return merge_results!(args, res_left, res_right)
    end
end

So let’s unpack what this hypothetical API would do: @defer would put the desired work package into our thread’s work pool and wake up an idle thread if one exists. @wait would check whether the work package is (a) untouched, (b) finished, or (c) currently work-in-progress. If it is untouched, we run it now. If it is finished, we grab its result and continue. If it is work-in-progress, we register a continuation waiting on the result and try to grab a different work package from our thread-local list; if our thread has no more work packages, we steal another thread’s; if there are none at all, we go to sleep.
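To make the three cases concrete, here is a hypothetical sketch of the state machine @wait would drive. Every name in it (WorkPackage, wait_package, …) is invented for illustration, not a real or proposed API, and the continuation/work-stealing machinery is elided down to a yield loop:

mutable struct WorkPackage
    state::Threads.Atomic{Int}   # 0 = untouched, 1 = in progress, 2 = finished
    thunk::Function
    result::Any
end
WorkPackage(thunk) = WorkPackage(Threads.Atomic{Int}(0), thunk, nothing)

function wait_package(pkg::WorkPackage)
    # case (a): untouched -- claim it atomically and run it on this thread
    if Threads.atomic_cas!(pkg.state, 0, 1) == 0
        pkg.result = pkg.thunk()
        pkg.state[] = 2
        return pkg.result
    end
    # case (b): finished -- the loop below exits immediately and we grab the result
    # case (c): in progress -- a real scheduler would register a continuation and
    # go steal other work packages; this sketch just yields until the result is ready
    while pkg.state[] != 2
        yield()
    end
    return pkg.result
end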

So my question is: Is this kind of recursive multi-threading already possible (using various ccalls into the runtime)? Is it planned?

Obviously there are details in how to schedule this kind of multithreading (e.g., after everything is sequential again, we want to switch execution back to thread 1). However, one would hope that highly parallel problems could run with minimal overhead: cases (b) and (c), which require synchronization, would be rare, and threads going to sleep or waking up would be rarer still. Once we have saturated the thread pool, execution within each thread would be mostly sequential and in the same order as the sequential code, which the programmer has hopefully optimized for good L1/L2 temporal cache locality.

Compared to more user-controlled scheduling (like Partr?), we give up L3 temporal cache locality: sometimes (e.g. in BLAS?) you have multiple jobs that share a lot of read access; such jobs should be scheduled to run concurrently, because L3 is shared between cores.

On the opposite side, many algorithms instead have the property that concurrency comes at a price in lock contention and/or cache synchronization. In the example of parallel sorting, one would probably prefer this kind of scheduling, i.e. “far-apart work packages run concurrently”, as opposed to “close-together work packages run concurrently”.


On current master (v1.3-DEV), tasks are run on threads. Using a macro offered by Jeff Bezanson and reposted in another thread, one can get what you describe:

right_task = @par do_stuff!(res_right, right)  # scheduled as a task; may run on another thread
do_stuff!(res_left, left)                      # meanwhile, do the left half on this thread
wait(right_task)                               # block until the right half is done
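For reference, the macro itself is tiny; reconstructed from memory it is essentially the following (a sketch, not necessarily the exact posted code), and it is essentially what became Threads.@spawn:

macro par(expr)
    thunk = esc(:(() -> ($expr)))
    quote
        local task = Task($thunk)
        task.sticky = false   # allow the scheduler to run the task on any thread
        schedule(task)
        task
    end
end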

In my experiments so far I find that there is moderate overhead, and one has to be diligent about type inference (function barriers are often helpful). But it works, and I see good scaling: e.g. a basic recursive task-decomposed merge sort of Float64 on 1 core is 1.6x slower than the serial version in Base, but running on 8 cores speeds the tasked version up by 6x.
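For concreteness, a minimal task-decomposed merge sort along these lines could look as follows; this is a sketch with an arbitrary cutoff and my own helper names, not the exact code behind the numbers above:

const SEQ_CUTOFF = 1 << 13           # below this length, just sort serially

function pmergesort!(v::AbstractVector, buf::AbstractVector=similar(v),
                     lo::Int=1, hi::Int=length(v))
    len = hi - lo + 1
    if len <= SEQ_CUTOFF
        sort!(view(v, lo:hi))        # serial base case, also a function barrier
        return v
    end
    mid = (lo + hi) >>> 1
    t = @par pmergesort!(v, buf, lo, mid)   # sort the left half on another task
    pmergesort!(v, buf, mid + 1, hi)        # sort the right half on this thread
    wait(t)
    i, j = lo, mid + 1                      # merge the sorted halves into buf
    for k in lo:hi
        if i <= mid && (j > hi || v[i] <= v[j])
            buf[k] = v[i]; i += 1
        else
            buf[k] = v[j]; j += 1
        end
    end
    copyto!(v, lo, buf, lo, len)            # copy the merged run back into v
    return v
end

Called as pmergesort!(rand(10^6)); the two recursive calls touch disjoint ranges of v and buf, so they can run concurrently without locks.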

Data dependencies, task priorities, and cache-friendly algorithm design are obviously still up to the programmer (for now).

I’ve learned enough about concurrent programming to be full of admiration for the PARTR implementors, principally Kiran, Jameson, and Jeff IIUC. This is really nice infrastructure.


Thanks for that explanation and link! This works splendidly and gives a nice API. The docs appear to be quite out-of-sync with master, though (I stopped reading at “Note: Currently, all tasks in Julia are executed in a single OS thread co-operatively.”).

However, the scheduling is quite different from what I imagined: As advertised, partr appears to attempt to run the innermost parallelization opportunities concurrently. With two threads and an expensive f():

julia> const t1 = Ref(1); const t2 = Ref(1);

julia> function _parmap(f, A, lo=1, hi=length(A)+1)
           if hi == lo+1
               f()
               if Threads.threadid() == 1
                   A[lo] = t1[]       # thread 1 writes positive counter values
                   t1[] += 1
               else
                   A[lo] = -t2[]      # other threads write negative values
                   t2[] += 1
               end
           else
               m = lo + (hi-lo)>>1
               t = @par _parmap(f, A, m, hi)   # upper half as a task
               _parmap(f, A, lo, m)            # lower half on this thread
               wait(t)
           end
           nothing
       end
julia> t1[]=1; t2[]=1; A=zeros(Int,10); _parmap(f,A); A
10-element Array{Int64,1}:
  1
 -3
 -1
  2
 -2
  3
 -5
  4
 -4
  5

What I originally imagined would have been deterministic (in the limit of a very expensive f): A == [1, 2, 3, 4, 5, -1, -2, -3, -4, -5].

I guess I’ll have to read some more on the scheduler and learn to deal with it. (I care about the schedule because I have some reasonable guesses about which tasks can run concurrently without issues, and a more expensive conflict resolution for when these guesses prove wrong.)

Regarding overhead, it appears that 100 µs is an OK task size and 30 µs is too small (on my machine).
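For anyone who wants to reproduce that kind of number: a hypothetical micro-benchmark (illustrative only, not the measurement I actually ran) would compare spawning tasks of a known duration against running them serially:

function busy(us)            # spin for roughly `us` microseconds
    t0 = time_ns()
    while time_ns() - t0 < 1000 * us
    end
end

tasked(n, us) = foreach(wait, [(@par busy(us)) for _ in 1:n])
serial(n, us) = foreach(_ -> busy(us), 1:n)

# Compare e.g. @time tasked(1000, 30) against @time serial(1000, 30): if the
# tasked version is not faster despite multiple threads, tasks of that size
# are too small to amortize the scheduling overhead.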