Scattered Atomic Writes Into Array

I’m porting some OpenCL code that does scattered atomic writes. These writes are quite sparse, meaning that threads rarely touch the same cache line at the same time.

I’m trying to figure out how to do this in Julia but am having some trouble. The atomic operations seem only to operate on individual boxed primitives, which is not very useful for me here, since I need to do atomic operations into an array.

Is there some workaround?

I am not an expert, so this is more of a question:

My first idea would be to create a smaller array of locks and map each lock to a contiguous region of the original array. Would that be highly suboptimal?
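Something like this minimal sketch, I mean (all names here are my own invention, not from any package): partition the array into fixed-size chunks and guard each chunk with its own lock.

```julia
# Hypothetical lock-striping sketch: one lock per fixed-size chunk of the array.
struct StripedLocks
    locks::Vector{ReentrantLock}
    chunksize::Int
end

StripedLocks(n::Int, chunksize::Int) =
    StripedLocks([ReentrantLock() for _ in 1:cld(n, chunksize)], chunksize)

# Apply `f` to `a[i]` while holding the lock that covers index `i`.
function locked_update!(f, s::StripedLocks, a::Vector, i::Int)
    l = s.locks[cld(i, s.chunksize)]
    lock(l) do
        a[i] = f(a[i])
    end
end

a = zeros(Float32, 1000)
s = StripedLocks(length(a), 64)
locked_update!(x -> x + 1f0, s, a, 10)
```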


That would certainly be a reasonable thing to try. In this case I don’t have large amounts of contention over small regions of the array, so the cost of what you suggest should be comparable to that of true atomic operations, though a little more awkward because of the extra resource management, etc.

That’s probably what I’ll do as a backup (or just let the scatter part of the code be single threaded).

I am assuming you are on x86_64?

Good news is that naturally aligned writes of 1–8 bytes are always atomic. Bad news is that atomic writes of larger values are unsupported (they need some locking structure [Edit: So x86_64 does have cmpxchg16b after all. TIL]). This also means that atomic ops on structs larger than 8 bytes are unsupported on your hardware.

You probably know this, and actually need things like atomic_add!, atomic_rmw, and atomic_cas! on e.g. a Ptr{UInt64} extracted via pointer(some_array, index)?

In that case, you should take a look at

Base only defines atomic operations for the boxed primitives – but you can just extend them to Ptr{your_needed_primitive} (copy-paste the code with minimal adjustments). Yes, this is type piracy; but this is imo the OK kind of type piracy (there is only one canonical definition that makes sense).

In case the macro-heavy code in Base is too annoying to follow: you want e.g.

```julia
julia> Threads.atomic_add!(p::Ptr{UInt64}, v::UInt64) =
           Core.Intrinsics.llvmcall("""
               %ptr = inttoptr i64 %0 to i64*
               %rv = atomicrmw add i64* %ptr, i64 %1 acq_rel
               ret i64 %rv""",
               UInt64,                   # return type
               Tuple{UInt64, UInt64},    # argument types
               reinterpret(UInt64, p), v)

julia> a = [UInt64(4), UInt64(5)]

julia> GC.@preserve a Threads.atomic_add!(pointer(a, 2), a[1])
```

(Note the `GC.@preserve a`: a raw pointer obtained from `pointer(a, 2)` is only valid while `a` is kept alive.)

If you dislike type piracy, then just call your function my_atomic_add!.


Sorry, my original phrasing may have been unclear. I want to atomically increment Float32 elements in an array, so it’s a read-modify-write, not just a write.

Thanks for the link. A bit of type piracy is alright by me. I might just try and manually unroll the templated llvmcalls in that file. Hopefully once all the multithreading APIs settle down there will at least be versions of these functions defined for Ptr types.
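For the Float32 read-modify-write case specifically, here is the kind of thing I have in mind, as a rough sketch in the same llvmcall style (the name `atomic_add_float32!` is my own, and this assumes a 64-bit platform): since LLVM’s `atomicrmw` does not accept floating-point add on all versions, the increment is built as a compare-and-swap loop over the 32-bit pattern of the Float32.

```julia
# Hypothetical sketch: atomically add `v` to the Float32 at pointer `p`
# via a CAS loop on its 32-bit representation. Returns the old value.
function atomic_add_float32!(p::Ptr{Float32}, v::Float32)
    while true
        old     = unsafe_load(p)
        oldbits = reinterpret(UInt32, old)
        newbits = reinterpret(UInt32, old + v)
        # cmpxchg returns the value actually seen in memory; if it matches
        # `oldbits`, our `newbits` was stored and we are done.
        seen = Core.Intrinsics.llvmcall("""
            %ptr = inttoptr i64 %0 to i32*
            %pair = cmpxchg i32* %ptr, i32 %1, i32 %2 acq_rel acquire
            %rv = extractvalue { i32, i1 } %pair, 0
            ret i32 %rv""",
            UInt32, Tuple{UInt64, UInt32, UInt32},
            reinterpret(UInt64, p), oldbits, newbits)
        seen == oldbits && return old
    end
end

a = Float32[1.0, 2.0]
GC.@preserve a atomic_add_float32!(pointer(a, 1), 0.5f0)
```

(Again the `GC.@preserve a` is needed to keep the array alive across the raw-pointer operation.)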

Oh I’ve just seen your edit. Thanks! That clarifies things a lot for me.

I forgot the second relevant link:


I agree that this is a shortcoming in the exposed API. Feel free to comment on the issue on github!

Someone there or here on discourse (maybe myself, or yourself) is likely to submit a PR if there is enough popular demand :wink: