FLoops ThreadedEx scheduling

I’ve been studying different types of loop scheduling for multithreading and I can’t quite understand how the ThreadedEx executor in FLoops/FoldsThreads works.
The docs say:

this executor takes a divide-and-conquer approach. That is to say, it first recursively halves the input collection until each part (base case) is smaller than or equal to basesize. Each base case is then executed in a single Task. The results of the base cases are then combined pair-wise in distinct Tasks (re-using the ones created for reducing the base cases)
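If I try to picture that in code, my reading is something like the sketch below (my own mental model, not the actual FoldsThreads implementation; dac_mapreduce is just my name, and I’m ignoring the task re-use detail):

using Base.Threads: @spawn

# Mental model of ThreadedEx-style scheduling: recursively halve the range,
# run each base case in its own Task, combine partial results pair-wise.
function dac_mapreduce(f, op, range; basesize)
    if length(range) <= basesize
        return mapreduce(f, op, range)                     # base case: plain sequential loop
    end
    mid = (first(range) + last(range)) ÷ 2
    left = @spawn dac_mapreduce(f, op, first(range):mid; basesize)
    right = dac_mapreduce(f, op, (mid + 1):last(range); basesize)
    return op(fetch(left), right)                          # pair-wise combination
end

dac_mapreduce(identity, +, 1:1_000_000; basesize = 250_000)   # splits into 4 base cases of size 250_000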

Does that mean that, if we use FLoops only to distribute work among threads rather than to perform a reduction, each Task would, after the split into base cases, run a chunk of the loop whose size corresponds to the basesize? And how would the later pair-wise combination of Tasks work if there is no reduction?

I tested a loop with it iterations on n threads, using basesize = div(it, n) with it divisible by n. I tracked the task distribution with:

using FLoops
using Base.Threads: nthreads, threadid

n = nthreads()
dist = [Int[] for _ in 1:n]               # one bucket of iteration indices per thread
@floop ThreadedEx(basesize = div(it, n)) for i in 1:it
    push!(dist[threadid()], i)            # record which thread ran iteration i
    ...
end
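After the loop, the per-thread iteration counts are then simply:

length.(dist)   # number of iterations recorded for each thread id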

I was expecting to find the same number of iterations on every thread, but that isn’t what I found. I also wasn’t expecting, based on what the documentation says, that the iteration count on any one thread would exceed the basesize, yet that is what I encountered. So I think I haven’t understood the process well.

(Maybe an aside: typically, simple tweaking of basesize is the best “very cheap” way to get “good enough” performance. For load-balancing, simply reduce basesize. If the given computation is not big enough, increase basesize. I should put this in a more prominent location in the manual.)
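For example, something like this (just a sketch; splitting into ~8 chunks per thread is an arbitrary choice):

using FLoops
using Base.Threads: nthreads

xs = 1:1_000_000

# Smaller basesize => more, finer-grained chunks => better load-balancing;
# larger basesize => fewer, bigger chunks => less scheduling overhead.
ex = ThreadedEx(basesize = cld(length(xs), 8 * nthreads()))   # ~8 chunks per thread

@floop ex for x in xs
    @reduce(s += x)
end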

What did you have in the ...? It has to be sufficiently time-consuming in order to observe the Julia scheduler doing something. If it’s too short, the Julia scheduler may re-use the same worker thread to execute the given tasks.

The main difficulty here is that, if ... is too long, any clever things that FoldsThreads can do become negligible. This is why [ANN] FoldsThreads.jl: A zoo of pluggable thread-based data-parallel execution mechanisms has somewhat more elaborate examples to emphasize the differences between the schedulers.

Another difficulty is the interaction with the GC. If the iteration space is not too large (i.e., length(xs) below is not too large), I’d simply preallocate a record array and record the thread id used in the loop:

record = zeros(Int, length(xs))          # one slot per iteration

@floop ... for (i, x) in enumerate(xs)   # ... = the executor, e.g. ThreadedEx(basesize = ...)
    ...                                  # the actual loop body
    record[i] = threadid()               # which thread executed iteration i
end

where the loop body ... should be some non-trivial computation. You’d also need to gather some samples of record data to get a “typical” picture of the execution.
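Something along these lines, for example (only a sketch; run_trial, the stand-in loop body, and the sample count are all placeholders of mine):

using FLoops
using Base.Threads: threadid, nthreads

function run_trial(xs; basesize)
    record = zeros(Int, length(xs))
    @floop ThreadedEx(basesize = basesize) for (i, x) in enumerate(xs)
        y = sum(sqrt(j) for j in 1:x)    # stand-in for a non-trivial body
        record[i] = threadid()
        @reduce(total += y)              # keep the work from being optimized away
    end
    return record
end

samples = [run_trial(1:20_000; basesize = 5_000) for _ in 1:10]

# Tally how many iterations each thread executed in each sample.
[count(==(t), r) for r in samples, t in 1:nthreads()]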


The ... in my case contains a loop that performs simple arithmetic related to factorization and can run for up to roughly i iterations, so the inner loop does more work with every successive iteration of the outer loop.

In literal code it’s:

# Accumulate the divisor pairs of i into sum.
factor = 2
done = i
while factor < done
    if i % factor == 0
        sum += factor + div(i, factor)   # add both divisors of the pair
        done = div(i, factor)            # no factor beyond i ÷ factor needs testing
        if done == factor
            sum -= factor                # i == factor^2: count that divisor only once
        end
    end
    factor += 1
end
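Since this inner loop can run up to roughly i steps, the cost per outer-loop iteration grows with i, so the raw iteration counts below may not be the whole story; a rough per-thread work proxy would be something like this (reusing dist from my first snippet; just an idea of mine):

iters_per_thread = length.(dist)                      # how many iterations each thread got
work_per_thread  = [sum(d; init = 0) for d in dist]   # sum of i as a rough O(i) work proxy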

I ran a few more tests to see what I could relate to your answer, and I noticed that thread 1 always had the highest number of iterations. Based on what you said, could I say that the first iterations aren’t time-consuming enough, so the Julia scheduler re-uses the main thread (thread 1)? Does the Julia scheduler have higher priority in distributing the tasks than the ThreadedEx implementation?

These are some of the per-thread iteration counts that I got:

  • 50k iterations and 4 threads: 18751, 12500, 12500, 6250 iterations per thread
  • 350k iterations and 4 threads: 131251, 87500, 87500, 43750 iterations per thread