Editing an array with multiple threads

Is it possible to update the values of an array (of ints) (size is known and fixed in advance) using multiple threads in a safe way? I can’t partition the array so that each part is only touched by certain threads, because a thread only learns which indices to edit as a result of its work; they are not known in advance. In OpenMP I can just mark the section of code as critical. Are locks in Julia appropriate for this? If so, I could use a lock to force the threads to edit the array one at a time. I also thought of using a vector of atomics, but I’m not sure if that’s appropriate either.

I have a similar question about IO, can I force each thread to only write to a file one at a time with a lock?

I think what you need is a Base.Threads.Mutex. Then, use it to protect the part of the code that writes to the array.
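For example (a minimal sketch; the array contents and the index computation are made up, and `ReentrantLock` works the same way if `Threads.Mutex` is unavailable in your Julia version):

```julia
# Shared array and a lock protecting it
A = zeros(Int, 100)
lk = ReentrantLock()

Threads.@threads for i in 1:1000
    idx = mod1(i, 100)        # stand-in for an index computed by the thread's work
    lock(lk) do
        A[idx] += 1           # only one thread at a time executes this block
    end
end
```

The same pattern answers the IO question: wrap each write to the file in `lock(lk) do ... end` and the threads will write one at a time.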

That will be very slow.

I would think that as long as each thread is working on its own “segment” of the array it should be fine. The issue would be the CPU cache: if CPU A writes a value that isn’t flushed to memory and CPU B tries to read that value, it may get the original (still in memory) instead of the new value. But as long as the threads only access their own portion of the array you will be fine.

You would just need to create the array before you start any of the threads.

My two worries would be that Julia does some data protection under the covers, or (I think this is unlikely) that there is some delay flushing the CPU cache for the worker threads at the end. I would assume that the Julia code that waits for the threads to finish makes sure the CPU caches are flushed before the main thread continues.

There is no need to think about CPU caches etc. Just code so you don’t have data races. The rest will be taken care of.


I’m facing a similar issue, but threads will not be working on their own subsets of the array. Essentially I have threads calculating array subsets and those values need to be accumulated (via sum) in some very large master array.

I’d like to avoid having temporary objects for each thread that are then summed in a final step, as this will inflate memory consumption (64-bit arrays with up to 100,000,000 elements or more). Is there a thread-safe way to do this currently (or might there be with the multi-threading improvements that are being developed)?

Thanks very much for any insight!

If the threads are just reading from the array, then you don’t need any synchronization. i.e. you create/populate the array, start up multiple threads to read values from it and aggregate those results.

If the threads are reading a value, performing an operation then updating it (or another offset in the array), then you either need to use a SpinLock() or Mutex() to control access, or you might be able to use atomic operations to perform the updates, like an atomic test and set.
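A rough sketch of the atomic route (the names and sizes here are illustrative, and this assumes the accumulator can be a vector of `Threads.Atomic{Int}`):

```julia
# Each slot is an atomic counter; atomic_add! makes concurrent
# read-modify-write updates safe without a lock.
counts = [Threads.Atomic{Int}(0) for _ in 1:10]

Threads.@threads for i in 1:1000
    idx = mod1(i, 10)                      # stand-in for a computed offset
    Threads.atomic_add!(counts[idx], 1)    # lock-free increment
end
```

Note that `Threads.Atomic` only supports primitive types, so this doesn’t directly extend to arbitrary element types.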


@pixel27 Thanks! I’m reading the array and writing to it: on each thread, at each iteration of a loop, I compute a new array and add it back to the original, so I think Mutex might be the way to go. Will give it a shot! The overhead will hopefully be negligible, since the operations I’m doing are pretty costly (the addition itself will generally be much less than 0.1% of the total compute time in each iteration).

There is one important exception here, and that’s if you’re accessing sequential elements in an array from different threads. A very easy way to fall into this trap is by writing to A[threadid()], which seems like it’d nicely segment your array and prevent data races. It will indeed prevent data races; it’ll be very safe and very slow. Sometimes LLVM will save you by emulating the array access with a temporary value when the loop body is trivial enough, but you can see massive slowdowns when it doesn’t:

julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)

julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)

julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)

The thing to remember about caches and such is that they’re meant to be an invisible abstraction to speed things up. That means that doing something that would otherwise violate cache coherency will do the right thing but it’ll ruin the speedup because the CPU cores need to communicate in order to prevent breaking the semantics. This is known as false sharing.


That’s interesting, thanks. What’s the recommended way to get somewhere to accumulate per thread, instead of A[threadid()]?

I just want to note that my comment was only regarding correctness. For performance there is indeed a much much longer story.

Is there an example of some user code that uses Mutex to protect code blocks in Julia, so that only one thread can execute them at a time?

Not really since the new threading work hasn’t been released yet.

I haven’t yet played with threading in Julia precisely because of all the potential challenges I see in terms of this. Thus take this as guesses, not facts.

  1. If possible, partition the vector A into nThreads big blocks rather than individual items:
partitionPoints = range(1, stop=length(A), length=nThreads + 1)

should normally give better cache coherency than using
idx mod nThreads

  2. In terms of temporary work spaces, a
tempWorkAreas = Vector{Any}(undef, nThreads)

might be useful. Normally boxing works against you, since the compiler generates slower code, whereas without the box the data can be tightly packed in adjacent memory locations. In this case, though, you don’t want the work spaces of separate threads packed tightly together, since that might cause the cache coherency protocol to try to keep them in sync, whereas you specifically want them to be worked on individually by separate cores. You would need to be careful that this boxing doesn’t accidentally “spread” (due to the compiler being unable to determine the types coming out of the temp work area), forcing all the memory lookups to be two-stage (boxed) lookups.
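A minimal sketch of the block-partition idea (the names and the trivial per-element work are my own; this assumes the work on each block is independent):

```julia
# Split the indices of A into contiguous blocks, roughly one per thread
A = zeros(Float64, 1000)
nThreads = Threads.nthreads()
blocks = collect(Iterators.partition(eachindex(A), cld(length(A), nThreads)))

Threads.@threads for b in eachindex(blocks)
    for i in blocks[b]
        A[i] = 2.0 * i    # each thread writes only to its own contiguous block
    end
end
```

Contiguous blocks keep each thread’s writes on distinct cache lines (except possibly at block boundaries), which avoids most false sharing.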

Here I am trying to extrapolate the use case from my previous experience, not necessarily your actual use case, since not much detail was provided. In @vlandau’s case, where the addition is only 0.1% of the total compute, I would be tempted to go with an approach of:

  1. having the array global “const” for the duration of the iteration,
  2. calculating a list of all updates required in each thread’s local work space, this can happen in an async manner
  3. have a separate step to merge all of the updates back into the global array at the end of the iteration (consider doing this single threaded again, since it is such a small percentage and threading and locking etc. might just make things more difficult, instead of faster). If you really must go multi-threaded, consider sorting the updates and partitioning the sorted list again to work on separate regions of memory.

This way, you also have a stable global array for the duration of the iteration, compared to the alternative of every thread modifying the global array immediately and thus being at the mercy of race conditions as to what other threads are going to be reading for that iteration.
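The three steps above might be sketched like this (under assumptions of my own: updates can be represented as index => value pairs, the names are illustrative, and buffers indexed by `threadid()` rely on `Threads.@threads` pinning each chunk to one thread):

```julia
# Steps 1-2: the global array is read-only during the iteration;
# each thread records its updates in a private buffer.
A = zeros(Float64, 100)
buffers = [Vector{Pair{Int,Float64}}() for _ in 1:Threads.nthreads()]

Threads.@threads for i in 1:1000
    idx = mod1(i, 100)                           # stand-in for real work
    push!(buffers[Threads.threadid()], idx => 1.0)
end

# Step 3: merge single-threaded, so no locking is needed
for buf in buffers, (idx, v) in buf
    A[idx] += v
end
```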

Just some thoughts, I hope they are useful. But I am keen to see how Julia threading evolves and what the language will be able to add to make things easier.