You would also have to count the overhead of creating the cache, which is done once per task. I find the Channel-based approach useful in order to get maximum re-use of every constructed cache object.
Thank you for the reply and the explanation. It should be included in the manual.
Now, the pattern that @joaquinpelle posted, that is, an expensive for loop where
- all iterations are independent,
- a scratch workspace is required for each iteration, and
- the computational cost per iteration might be unpredictable, so chunking won't be a good solution,
is the most common non-trivial parallel computation pattern.
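Concretely, I mean something like this serial loop (a hypothetical sketch; expensive_kernel! and the workspace shape are just stand-ins for the real work):

function compute_serial!(results, A)
    scratch = zeros(3)  # one workspace, reused across iterations
    for i in eachindex(A)
        fill!(scratch, 0.0)  # reset the workspace for this iteration
        results[i] = expensive_kernel!(scratch, A[i])  # cost varies with i
    end
    return results
end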
From the semantics of @threads that @Mason provided, it is evident that @threads cannot be used in this use case. So how do we code this parallel pattern in the most efficient and transparent way?
Allocating scratch space for every iteration up front would be very wasteful, and allocating and deallocating scratch space inside each iteration would add high overhead.
Since we programmers do think in the way @joaquinpelle described, shouldn't there be a @tasks for and a taskid() that would allow the allocation of a scratch workspace per task and dynamically generate an adequate number of tasks to keep the resources of the computer saturated?
The usual method would be to @spawn n tasks, each one making its own cache, and then pass them work via a channel from the spawning task.
If the work is extremely short-lived, the overhead of the channel could be problematic, but if you're talking a millisecond of work or longer, the channel shouldn't be an issue. At least that's my understanding.
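Something along these lines (an untested sketch; the work inside the loop is a stand-in):

using Base.Threads: @spawn

function process_all(items; ntasks = Threads.nthreads())
    work = Channel{eltype(items)}(length(items))
    foreach(item -> put!(work, item), items)
    close(work)  # workers stop once the channel is drained
    results = Channel{Float64}(length(items))
    @sync for _ in 1:ntasks
        @spawn begin
            cache = zeros(3)   # one cache per task, created once
            for item in work   # take work items until the channel is empty
                fill!(cache, 0.0)
                put!(results, sum(cache) + item)  # stand-in for real work
            end
        end
    end
    close(results)
    return collect(results)  # note: results arrive in completion order, not input order
end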
I think channels allow one to do that, with this pattern:
julia> nbuff = 5;

julia> channels = Channel{Vector{Int}}(nbuff);

julia> for _ in 1:nbuff
           put!(channels, zeros(Int, 3))  # pre-allocate one buffer per slot
       end

julia> @sync for i in 1:10 # ntasks
           Threads.@spawn begin
               buff = take!(channels)  # blocks until a buffer is free
               # use buff
               put!(channels, buff)    # return the buffer for the next task
           end
       end
The spawned tasks will run on the first available thread, and each one will make progress only once it can take a buffer from the channel. (afaiu there isn't really a need for serial chunking in this case.)
Thank you, and sorry for being dense.
Could you elaborate on why you generate 10 tasks but allocate only 5 scratch workspaces? Is 10 meant to represent the 10K iterations from the original example loop for i in eachindex(A)? Is 5 the expected number of concurrent threads that would optimally saturate a computer while running this computation?
That's how I understand it. Since the tasks wait for a free buffer from the channel, only 5 can run at once.
Nice! This seems like a very flexible way to manage the maximum number of buffers. What is the overhead of take! and put! in practice? Does this pattern always make sense, or do we need very inhomogeneous/long task runtimes?
Edit: I just did a quick benchmark of a take! followed by a put!, and it seems to take only 91 ns on my machine. I assume there would only be an issue if your task runtimes are of that order or shorter.
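Something like this with BenchmarkTools should reproduce that kind of measurement (a sketch; the numbers are machine-dependent):

julia> using BenchmarkTools

julia> ch = Channel{Vector{Int}}(1);

julia> put!(ch, zeros(Int, 3));

julia> @btime begin
           buff = take!($ch)  # acquire the buffer
           put!($ch, buff)    # release it back
       end;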
This is the classic producer-consumer pattern, with the queue of scratch workspaces as the shared resource. The worker tasks act as consumers when they acquire a scratch workspace and as producers when they release it. The overhead should be negligible.
In the solution proposed by @lmiq, do we need to pad each scratch workspace with enough bytes to avoid false sharing? Whenever the end of one scratch buffer lands in the same cache line as the beginning of the next, any write to one of the two will invalidate the cached copy of the other's elements.
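For example, one might round each buffer's length up to whole cache lines (a sketch; note this only pads the length, it does not control the alignment of where the allocation starts, and each Vector is a separate heap allocation anyway):

const CACHE_LINE = 64  # bytes; typical on current x86-64 and ARM CPUs

function padded_workspace(n, ::Type{T}) where {T}
    per_line = CACHE_LINE ÷ sizeof(T)   # assumes sizeof(T) divides the line size
    npad = cld(n, per_line) * per_line  # round length up to whole cache lines
    buf = zeros(T, npad)
    return view(buf, 1:n)  # expose only the requested n elements
end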