Writing to a shared buffer in multi-threaded for loop

I have a for loop that iterates over millions of elements, each of which are processed quickly. Elements update random parts of a shared buffer. I’m trying to parallelize this operation in a performant way. I have considered the following design approaches, but they each have problems:

  1. Storing multiple copies of the shared buffer in a Channel, then reducing the buffers at the end. Unfortunately the time cost of getting and putting the buffer dominates the computation time, making this approach very slow.
  2. Using a buffer per thread, and reducing at the end. Unfortunately there is no reliable way of getting a thread index.
  3. Creating a TaskLocalValue. Notwithstanding the fact that the buffer is large, and the number of tasks could be large too (for load-balancing), using too much memory, I don’t see a way to reduce TaskLocalValues after the for loop, since they get cleaned up.

Can anyone recommend a better solution?

TaskLocalValues.jl are just a thin wrapper around Base.task_local_storage, which provides more flexibility. Using the latter, when you create the local buffer for the first time in a given task, you could store it in a thread-global array (protected by a lock) to later reduce over, i.e. something like:

buf = get!(task_local_storage(), :MY_BUFFER_SYMBOL) do
    # create new buffer if it doesn't exist yet for this task
    lock(my_buffers_lock) do
        newbuf = create_new_buffer()
        push!(my_buffers, newbuf)
        newbuf
    end
end::SomeBufferType

Alternatively, you can replace your threaded loop with a loop over equal-sized chunks of the array, via ChunkSplitters.jl, and allocate one buffer per chunk, as described here: No more threadid indexing? [thread-local storage] - #13 by lmiq — the tradeoff is that this pushes more load-balancing responsibility onto you.

I think the description is too vague. Is it really random? Or not.

If let me have a wild guess, it sounds like constructing a LP model with the GRBaddvar api.

say, you have 1million variables, the buffer is the constraint matrix.

it probably doesn’t make sense to use multithreading in the context here.

It’s a for loop to compute the Hessian of an optimization problem. Each element is a cost function that updates various sub-blocks of the Hessian. The sub-blocks are known, so not really random, but each cost function can update multiple sub-blocks, so it isn’t possible to ensure different threads access mutually exclusive parts of the buffer - unless I substantially change the program design (which has got me thinking :sweat_smile:).

Interesting guess :grinning_face_with_smiling_eyes:

In my context, multi-threading definitely makes sense.