Multithreading with shared memory caches

You would also have to count the overhead of creating the cache, which is done once per task. I find the Channel-based approach useful in order to get maximum re-use of every constructed cache object.

Thank you for the reply and the explanation. It should be included in the manual.

Now, the pattern that @joaquinpelle posted, that is, to have an expensive for loop where

  • all iterations are independent
  • a scratch workspace is required for each iteration,
  • the computational cost per iteration might be unpredictable, so chunking won’t be a good solution

is the most common non-trivial parallel computation pattern (a serial sketch of this pattern is shown below).
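
For concreteness, a serial sketch of that pattern (the kernel name expensive_kernel!, the array A, and the workspace size are illustrative, not from the original example):

A = rand(10_000)
results = similar(A)
workspace = zeros(3)                  # scratch space, reused across iterations
for i in eachindex(A)
    fill!(workspace, 0)               # reset the scratch space for this iteration
    # results[i] = expensive_kernel!(workspace, A[i])   # cost may vary a lot with i
end

The question is how to parallelize this loop while still reusing the workspaces.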

From the semantics of @threads that @Mason provided, it is evident that @threads cannot be used here. So how do we code this parallel pattern in the most efficient and transparent way?

Pre-allocating a separate scratch workspace for every iteration would be very wasteful, and allocating and deallocating a workspace inside each iteration would add significant overhead.

Since we programmers do think in the way @joaquinpelle described, shouldn’t there be a @tasks for and a taskid() that would allow allocating a scratch workspace per task, and that would dynamically generate an adequate number of tasks to keep the computer’s resources saturated?

The usual method would be to @spawn n tasks, each of which makes its own cache, and then pass them work via a channel from the spawning task.

If the work is extremely short-lived, the overhead of the channel could be problematic, but if you’re talking about a millisecond of work or longer, the channel shouldn’t be an issue. At least that’s my understanding.
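
A minimal sketch of that worker-pool idea (the function name run_pool, the Any-typed results channel, and the toy cache are my own illustration, not a reference implementation):

using Base.Threads

function run_pool(items, nworkers = nthreads())
    jobs = Channel{eltype(items)}(length(items))
    results = Channel{Any}(length(items))
    foreach(item -> put!(jobs, item), items)
    close(jobs)                            # workers stop once the channel is drained
    @sync for _ in 1:nworkers
        @spawn begin
            cache = zeros(3)               # each task builds its own cache exactly once
            for item in jobs               # keep pulling work until jobs is empty and closed
                fill!(cache, 0)
                # ... expensive work on `item` using `cache` ...
                put!(results, (item, sum(cache)))
            end
        end
    end
    close(results)
    return collect(results)
end

With nworkers around nthreads(), each cache is constructed once and then reused for every item that its task happens to pull from the jobs channel.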

I think channels allow one to do that, with this pattern:

julia> nbuff = 5;

julia> channels = Channel{Vector{Int}}(nbuff);

julia> for _ in 1:nbuff
           put!(channels, zeros(Int, 3))
       end

julia> @sync for i in 1:10 # ntasks
           Threads.@spawn begin
               buff = take!(channels)
               # use buff
               put!(channels, buff)
           end
       end

the spawned tasks will occupy the first available thread, and each one will proceed once a buffer becomes available in the channel (afaiu there isn’t really a need for serial chunking in this case).

Thank you, and sorry for being dense.

Could you elaborate on why you generate 10 tasks but allocate only 5 scratch workspaces?

Does 10 represent the 10K iterations from the original example loop for i in eachindex(A)?
Is 5 the expected number of concurrent threads that would optimally saturate a computer while running this computation?

That’s how I understand it. Since the tasks wait for a free buffer from the channel, only 5 can run at once.

Nice! This seems like a very flexible way to manage the maximum number of buffers. What is the overhead of the take! and put! in practice? Does this pattern always make sense, or do we need to have very inhomogeneous/long task runtimes?

Edit: I just did a quick benchmark of a take! followed by a put!, and it seems to take only 91 ns on my machine. I assume there would only be an issue if your task runtimes are of that order or shorter.
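
For reference, the measurement can be reproduced roughly like this with BenchmarkTools (the actual number will depend on the machine and on contention from other tasks):

using BenchmarkTools

ch = Channel{Vector{Int}}(5)
foreach(_ -> put!(ch, zeros(Int, 3)), 1:5)

# one uncontended acquire/release round trip on the buffer channel
@btime begin
    buff = take!($ch)
    put!($ch, buff)
end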

This is the classic producer-consumer pattern, with the queue of scratch workspaces as the shared resource. The worker threads act as consumers when they acquire a scratch workspace and as producers when they release it. The overhead should be negligible.

In the solution proposed by @lmiq, do we need to pad each scratch workspace with enough bytes to avoid false sharing? Whenever the end of one scratch buffer lands in the same cache line as the beginning of the next, a write to either buffer will invalidate that cache line for the core working on the other.
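
If padding did turn out to be necessary, one way would be to round each buffer length up to a multiple of the cache-line size (64 bytes is assumed here; the real value is hardware dependent):

cacheline = 64                                       # assumed cache-line size in bytes
n = 3                                                # useful elements per workspace
padded = cld(n * sizeof(Int), cacheline) * cacheline ÷ sizeof(Int)
buffers = [zeros(Int, padded) for _ in 1:5]          # use only the first n entries of each

Whether this is needed at all depends on where the allocator places each Vector's data, so it would be worth measuring before adding the padding.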