Atomic operations issue on StaticArrays with CUDAnative

jeremiedb · May 16, 2020, 7:37am

A basic per-element sum of two Vectors of SVectors works fine, but using atomic_add! results in an error. I'm aware that an atomic operation isn't needed in this particular case, but it's an MWE from a use case where it is needed.

The base example - working:

using CUDAnative, CuArrays, StaticArrays

function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] += y[i]
    end
    return
end

function hist_gpu!(x, y; MAX_THREADS=1024)
    threads = min(MAX_THREADS, length(x))
    # Round up so the last, partially filled block is launched too
    blocks = cld(length(x), threads)
    CuArrays.@sync begin
        @cuda blocks=blocks threads=threads kernel!(x, y)
    end
    return
end

x = rand(SVector{2, Float32}, Int(1e7))
y = rand(SVector{2, Float32}, Int(1e7))
x_gpu = CuArray(x)
y_gpu = CuArray(y)

@CuArrays.time hist_gpu!(x_gpu, y_gpu)

And the failing case with atomic add:

function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        k = Base._to_linear_index(x, i)
        CUDAnative.atomic_add!(pointer(x, k), y[i])
    end
    return
end

@CuArrays.time hist_gpu!(x_gpu, y_gpu)


InvalidIRError: compiling kernel kernel!(CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}, CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)

Is this expected behavior? From other discussions, this error message seems tied to situations where a type couldn't be inferred. But in the example above, I don't see why the atomic operation would fail to infer types when the generic x[i] += y[i] had no issue.

vchuravy · May 16, 2020, 7:24pm

The error message gives you a hint. The element type of the CuDeviceArray you are operating on is an SArray, so pointer(x, k) yields a DevicePtr{SArray}, whereas atomic_add! is only defined for DevicePtr{<:Union{Float32, Float64, Int32}}; see atomics.jl in JuliaGPU/CUDAnative.jl at commit ff0cd45c7c2cde3f6893c9e6747234b4bb64dbef on GitHub. With no matching method, the call cannot be resolved at compile time, which is what "unsupported dynamic function invocation" reports.

An atomic add is not possible here, since you have two fields that you want to update atomically and the hardware can't do that. You could update them individually, and that might be fine for your particular case. Do you really need to atomically update the whole StaticArray in one go? If you can, I would reorder the code so that you make the updates locally and don't need atomics in the first place.
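As a rough illustration of updating the components individually, the sketch below flattens each SVector{2, Float32} into plain Float32 values on the host and issues one atomic add per scalar. It reuses the pointer(x, k) and CUDAnative.atomic_add! calls from the question; the names kernel_atomic! and add_components! are made up for this example, and it has not been tuned for performance.

```julia
using CUDAnative, CuArrays, StaticArrays

# One thread per scalar component. x and y are Float32 arrays, so
# pointer(x, k) points to a Float32, a type atomic_add! accepts.
function kernel_atomic!(x, y)
    k = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if k <= length(x)
        @inbounds CUDAnative.atomic_add!(pointer(x, k), y[k])
    end
    return
end

# Hypothetical launcher, modeled on hist_gpu! from the question.
function add_components!(x, y; MAX_THREADS=1024)
    threads = min(MAX_THREADS, length(x))
    blocks = cld(length(x), threads)
    CuArrays.@sync begin
        @cuda blocks=blocks threads=threads kernel_atomic!(x, y)
    end
    return
end

n = 1_000
x = rand(SVector{2, Float32}, n)
y = rand(SVector{2, Float32}, n)

# Flatten on the host: each SVector becomes one column of a 2 x n Float32 matrix.
x_flat = collect(reshape(reinterpret(Float32, x), 2, n))
y_flat = collect(reshape(reinterpret(Float32, y), 2, n))
x_gpu = CuArray(x_flat)
y_gpu = CuArray(y_flat)

add_components!(x_gpu, y_gpu)
```

Each scalar update is atomic on its own, but the two components of one SVector are not updated as a single unit, so another thread could observe one component updated and the other not yet.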

jeremiedb · May 17, 2020, 12:52am

Thanks for the explanation, it makes much more sense now.
I'm doing histogram accumulation, so I haven't figured out a safe and efficient way to avoid atomics. However, reframing the problem from a matrix of size M x N of SVector{L} into an array of size M x N x L seems like a good alternative.
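A small host-side sketch of that reframing, with arbitrary sizes chosen for illustration. Reinterpreting preserves the memory order, so the component index naturally comes first (L x M x N); getting the M x N x L order requires a permuting copy. Which ordering suits the histogram kernel better is not something this sketch settles.

```julia
using StaticArrays

M, N, L = 3, 4, 2
A = rand(SVector{L, Float32}, M, N)   # M x N matrix of SVector{L, Float32}

# Same memory viewed as scalars: the component index is the fastest dimension.
B = reshape(reinterpret(Float32, vec(A)), L, M, N)

# Copy into M x N x L order, so that C[m, n, l] == A[m, n][l].
C = permutedims(B, (2, 3, 1))
```

Either B (after collect) or C is an array of plain Float32 values, so after upload to the GPU each element can be targeted by atomic_add! on its own.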
