Reusing @local cache when multithreading

Hi everyone! I am trying to update an old project of mine where I have three nested for loops like this:

scratch = zeros(Float32, tree_nodes, nthreads())   # one scratch column per thread
collect_here = zeros(Float32, nthreads())
for i in eachindex(V)
  @threads for j in eachindex(M)
    @views collect_here[threadid()] = tree_traversal!(scratch[:, threadid()], tree)
  end
  result = do_stuff(collect_here)
  ...
end

Here tree_traversal! does considerable work on a binary tree and needs a relatively large amount of scratch memory. In the old setting I could allocate all of the memory in one go beforehand and avoid any locking. After moving to OhMyThreads and setting the scratch memory up with @local, I am now spending roughly 20% of my time in the tasks allocating this memory (measured with @btime), which is a penalty I cannot really afford… :frowning: I am aware of the :static scheduler option, but I would like to avoid it.

using OhMyThreads

for i in eachindex(V)
  collect_here = @tasks for j in eachindex(M)
    @set collect = true
    @local scratch = zeros(Float32, tree_nodes)   # allocated once per task
    tree_traversal!(scratch, tree)
  end
  result = do_stuff(collect_here)
  ...
end

Is there a well-supported way to set up a pool of scratch buffers, so that the allocated memory can be reused between tasks across iterations of the outer loop? Or am I stuck allocating this memory in every such iteration?


Did you try the Channel-based scheme described in the OhMyThreads docs? That seems to suit your workload. It does use locks under the hood, but the user doesn’t have to think about them.
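For reference, that pattern can be sketched roughly as follows. The buffers live in a Channel that outlives the outer loop, so they are allocated exactly once; each task borrows one and hands it back when done. Note that tree_nodes, tree, tree_traversal!, and the loop ranges below are stand-ins for your actual code, not real definitions:

```julia
using Base.Threads: @threads, nthreads

# Stand-ins so the sketch is self-contained; replace with your real code.
tree_nodes = 1_000
tree = nothing
tree_traversal!(scratch, tree) = (scratch .= 0f0; sum(scratch))
V, M = 1:4, 1:32

# A pool of scratch buffers, allocated once up front and reused by every
# task across all iterations of the outer loop.
pool = Channel{Vector{Float32}}(nthreads())
foreach(_ -> put!(pool, zeros(Float32, tree_nodes)), 1:nthreads())

collect_here = zeros(Float32, length(M))
for i in eachindex(V)
    @threads for j in eachindex(M)
        scratch = take!(pool)          # borrow a buffer (blocks only if none is free)
        try
            collect_here[j] = tree_traversal!(scratch, tree)
        finally
            put!(pool, scratch)        # always return it, even on error
        end
    end
    # result = do_stuff(collect_here)
end
```

take! and put! do lock internally, but with as many buffers as threads the channel is almost never empty, so tasks rarely block on it.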