After reading the WIP blogpost https://github.com/JuliaLang/www.julialang.org/pull/1904, I’m trying to adapt to the new recommended approach to multithreading, now that tasks can migrate between threads. I would like to get some confirmation that I’m understanding the new recommendation correctly.
As I understand it, the new approach is to forego `Threads.@threads` altogether in favor of a chunk-centric `@spawn`. Your input data is divided into a number of chunks of equal length, and a task is spawned on each of these chunks. You can have many chunks per thread, creating a kind of common task queue. Each task takes care of allocating any storage it needs before iterating over its assigned chunk, instead of the caller doing it beforehand as in the `@threads` approach. This way `threadid()` is avoided and task migration is not a problem.
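In schematic form, I picture the pattern like this (a rough sketch; the name `tmap` and the generic `f` are just placeholders of mine):

```julia
using Base.Threads: @spawn, nthreads

# Split the input, spawn one task per chunk, then collect the results.
function tmap(f, xs; number_of_chunks = nthreads())
    chunk_size = cld(length(xs), number_of_chunks)  # round up: at most number_of_chunks chunks
    tasks = map(Iterators.partition(xs, chunk_size)) do chunk
        @spawn map(f, chunk)                        # each task iterates over its own chunk
    end
    return reduce(vcat, fetch.(tasks))
end
```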
There is a twist in my case. My local storage is a complicated `Solver()` object with potentially large internal preallocated space. I cannot have each task allocating a new copy of `Solver()`; that is too expensive. One solver copy per thread is OK, but one per task is suboptimal.
This would be my code doing things the Old Way (using `@threads :static`):
```julia
using Base.Threads: @threads, nthreads, threadid

function solve(inputs)
    solvers = [Solver() for _ in 1:nthreads()]          # one solver per thread
    solutions = Vector{Solution}(undef, length(inputs))
    # :static pins each iteration's task to a fixed thread, so threadid() is stable
    @threads :static for i in eachindex(inputs)
        solutions[i] = solve!(solvers[threadid()], inputs[i])
    end
    return solutions
end
```
where each `solve!(solver, input)` mutates internal `solver` fields. Here we pay the price of having `nthreads()` copies of `Solver()`.
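For concreteness, here is a hypothetical stand-in with the shape I have in mind (the real `Solver` is more complicated; the `workspace` field, its size, and the `Solution` type below are made up):

```julia
struct Solution
    value::Float64
end

# Stand-in for the real thing: a large preallocated workspace that is
# expensive to allocate and gets mutated on every call to solve!.
struct Solver
    workspace::Vector{Float64}
end
Solver() = Solver(Vector{Float64}(undef, 10^7))

function solve!(solver::Solver, input)
    fill!(solver.workspace, input)   # mutates the solver's internal buffer
    return Solution(sum(solver.workspace))
end
```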
With the new task-based approach I would need to bundle each chunk with its own copy of the storage. This would be the analogous code:
```julia
using Base.Threads: @spawn, nthreads

function solve(inputs; number_of_chunks = nthreads())
    chunk_size = cld(length(inputs), number_of_chunks)  # round up: at most number_of_chunks chunks
    chunks = Iterators.partition(inputs, chunk_size)
    solvers = [Solver() for _ in chunks]                # one solver per chunk
    tasks = map(zip(solvers, chunks)) do (solver, chunk)
        @spawn map(input -> solve!(solver, input), chunk)
    end
    return collect(Iterators.flatten(fetch.(tasks)))    # collect so we return a Vector as before
end
```
Question 1: If `number_of_chunks = nthreads()`, aren't the two approaches equivalent?
If `number_of_chunks > nthreads()` we should get some degree of load balancing with the second approach, which is nice. However, we also increase the number of `Solver()` copies, which is not so nice.
Question 2: Is there any way we can get `number_of_chunks > nthreads()` but still use only `nthreads()` copies of the `Solver()` storage that get reused?
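One idea I've been playing with for Question 2: keep the solvers in a `Channel` and let tasks borrow them, so at most `nthreads()` copies exist no matter how many chunks we spawn. A rough sketch (I don't know whether this is idiomatic, hence the question):

```julia
using Base.Threads: @spawn, nthreads

function solve_pooled(inputs; number_of_chunks = 2 * nthreads())
    chunk_size = cld(length(inputs), number_of_chunks)
    chunks = Iterators.partition(inputs, chunk_size)
    pool = Channel{Solver}(nthreads())              # pool of reusable solvers
    foreach(_ -> put!(pool, Solver()), 1:nthreads())
    tasks = map(chunks) do chunk
        @spawn begin
            solver = take!(pool)                    # borrow a solver (blocks if none is free)
            try
                map(input -> solve!(solver, input), chunk)
            finally
                put!(pool, solver)                  # hand it back for the next task
            end
        end
    end
    return collect(Iterators.flatten(fetch.(tasks)))
end
```

Extra chunks simply wait on `take!` until a solver is free, so we keep the load balancing while capping the number of copies at `nthreads()`.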
I think Question 2 should be possible (and optimal) if we had sticky tasks by default. We could then still use `threadid()` and get load balancing by spawning lots of sticky tasks that would be digested as threads become idle. So non-sticky tasks seem like a drag in my (probably naive) view.
Question 3: What do we gain from non-sticky tasks? Can we disable them somehow so we can still use `threadid()`, or would that be a bad idea for some reason?
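For reference, my mental model of stickiness (as far as I can tell from the docs): each `Task` carries a `sticky` flag; `@async` tasks are sticky to their thread, while `Threads.@spawn` tasks are not and may migrate:

```julia
t1 = @async nothing           # sticky: stays on the thread that scheduled it
t2 = Threads.@spawn nothing   # non-sticky: the scheduler may migrate it
@show t1.sticky t2.sticky     # true, false
```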
XRef: Multithreading, preallocation of objects, threadid() and task migration