Multithreading with separate caches for each thread?

Hi everyone,
I am writing a finite element solver. Since the solver takes a lot of time on a single core, I am trying to parallelize it. The solver method has several function calls which share some cache. This is done to keep the solver from allocating repeatedly and to reduce time spent in garbage collection.

To parallelize this, I define an array containing one cache struct per thread. To prevent any race conditions, I also track which caches are in use with a free_cache boolean array.

The code will look something like this:

using Base.Threads: @spawn

chunks = Iterators.partition(eachindex(A), div(length(A), Threads.nthreads()))
cache = [Cache{Float64, Int64}() for _ in 1:Threads.nthreads()]
free_cache = fill(true, Threads.nthreads())

tasks = map(chunks) do chunk
    @spawn do_something(A, chunk, cache, free_cache)
end
result = maximum(fetch.(tasks))

And the function’s definition looks like:

function do_something(A, chunk, cache, free_cache)
    free_idx = findfirst(!=(0), free_cache)  # find a cache that is not in use
    free_cache[free_idx] = 0                 # mark it as taken
    ### SOME COMPUTATION using cache[free_idx] ####
    free_cache[free_idx] = 1                 # release it again
    return result
end

Upon testing, this approach produces the correct result. I wanted to ask if this is good practice and whether there are any issues that can occur with this approach?

I believe you have a potential race condition if two different tasks finish free_idx = findfirst(!=(0), free_cache) before running the next line.

One way to resolve this would be to assign a specific cache to each task. For example, you could do something like

tasks = map(chunks) do chunk
    free_cache_for_task = ....
    @spawn do_something(A, chunk, cache, free_cache_for_task)
end

Basically you want to minimize the amount of shared memory that gets passed to @spawn.
For more detail, see PSA: Thread-local state is no longer recommended
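
A more concrete version of that sketch (a minimal example only; it assumes there are at most Threads.nthreads() chunks and a hypothetical variant of do_something that takes a single cache) could look like:

tasks = map(enumerate(chunks)) do (i, chunk)
    local_cache = cache[i]  # the cache is picked before spawning, so no race
    Threads.@spawn do_something(A, chunk, local_cache)
end
result = maximum(fetch.(tasks))

Since each task is handed its own cache up front, the free_cache bookkeeping is not needed at all.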

Got it! Thanks. But there still remains the problem of allocating a lot.
Is it not possible to allocate once and use the same cache throughout?

You mean you want to reuse the caches later again with a new set of tasks? You could e.g. put the Caches into a Channel, and each task just take!s a cache from the channel at the start instead of instantiating a new one, then put!s it back into the Channel after finishing. A Channel is thread-safe.
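
A minimal sketch of that pattern, reusing the Cache type and chunks from above (and again assuming a hypothetical do_something variant that takes a single cache argument):

using Base.Threads: @spawn

cache_pool = Channel{Cache{Float64, Int64}}(Threads.nthreads())
for _ in 1:Threads.nthreads()
    put!(cache_pool, Cache{Float64, Int64}())  # fill the pool once up front
end

tasks = map(chunks) do chunk
    @spawn begin
        c = take!(cache_pool)      # blocks until a cache is free
        try
            do_something(A, chunk, c)
        finally
            put!(cache_pool, c)    # always return the cache, even on error
        end
    end
end
result = maximum(fetch.(tasks))

The same cache_pool can then be reused by any later batch of tasks, so the caches are only ever allocated once.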


I like ChunkSplitters.jl for this kind of thing
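
For reference, a rough sketch of that approach (assuming the chunks-with-keyword-n form of the ChunkSplitters API, which has changed between versions, and the same hypothetical single-cache do_something as above):

import ChunkSplitters
using Base.Threads: @spawn

tasks = map(ChunkSplitters.chunks(eachindex(A); n = Threads.nthreads())) do idxs
    @spawn begin
        local_cache = Cache{Float64, Int64}()  # one cache per task, allocated once
        do_something(A, idxs, local_cache)
    end
end
result = maximum(fetch.(tasks))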


You can use task-local storage. I use a macro (or two) to keep e.g. vectors around in each task, i.e. typically each unit of work you @spawn:

# Cache an object of `type`, built by `makeit`, in the current task's local storage.
macro tlscache(type, makeit)
    sym = Expr(:quote, gensym("tls"))  # unique key per call site
    quote
        get!(() -> $(esc(makeit)), task_local_storage(), ($sym, $(esc(type))))::$(esc(type))
    end
end

# Variant that default-constructs the object with `type()`.
macro tlscache(type)
    sym = Expr(:quote, gensym("tls"))
    quote
        get!(() -> $(esc(type))(), task_local_storage(), ($sym, $(esc(type))))::$(esc(type))
    end
end

It’s used like this:

v = @tlscache Vector{Int}
resize!(v, 23)
w = @tlscache Matrix{Float64} zeros(15,20)
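
Applied to the original example, a hypothetical worker could then look something like this (a sketch only; the computation itself is whatever the solver does with its scratch space):

function do_something(A, chunk)
    buf = @tlscache Vector{Float64}  # created once per task, reused on later calls
    resize!(buf, length(chunk))
    ### SOME COMPUTATION using buf ####
    return result
end

Each task gets its own buf, so nothing is shared between tasks, while repeated calls from the same task reuse the same vector.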

An example of the same idea but without macros can be seen here: PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads() - #26 by Mason