Atomic operations issue on StaticArrays with CUDAnative

A basic element-wise sum of two Vectors of SVectors works fine on the GPU, but using atomic_add! results in an error. I’m aware that an atomic operation isn’t needed in this particular case, but it’s a MWE taken from a use case where it is needed.

The base example - working:

using CUDAnative, CuArrays, StaticArrays

function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] += y[i]
    end
    return
end

function hist_gpu!(x, y; MAX_THREADS=1024)
    thread_i = min(MAX_THREADS, length(x))
    threads = (thread_i)
    blocks = cld(length(x), threads)  # round up so the trailing elements are covered
    CuArrays.@sync begin
        @cuda blocks=blocks threads=threads kernel!(x, y)
    end
    return
end

x = rand(SVector{2, Float32}, Int(1e7))
y = rand(SVector{2, Float32}, Int(1e7))
x_gpu = CuArray(x)
y_gpu = CuArray(y)

@CuArrays.time hist_gpu!(x_gpu, y_gpu)

And the failing case with atomic add:

function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        k = Base._to_linear_index(x, i)
        CUDAnative.atomic_add!(pointer(x, k), y[i])
    end
    return
end

@CuArrays.time hist_gpu!(x_gpu, y_gpu)


InvalidIRError: compiling kernel kernel!(CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}, CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)

Is this expected behavior? From other discussions, this error message seems tied to situations where a type couldn’t be inferred. But in the example above, I don’t see why the atomic operation would fail to infer types while the generic x[i] += y[i] has no issue.

The error message gives you a hint. The element type of the CuDeviceArray you are operating on is an SArray, so pointer(x, k) yields a DevicePtr{SArray}, and atomic_add! is only defined for DevicePtr{<:Union{Float32, Float64, Int32}}, see https://github.com/JuliaGPU/CUDAnative.jl/blob/ff0cd45c7c2cde3f6893c9e6747234b4bb64dbef/src/device/cuda/atomics.jl#L130. With no matching method, the compiler cannot resolve the call statically, which is what "unsupported dynamic function invocation" is reporting.
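To make the dispatch point concrete, here is a small CPU-only sketch. `AtomicAddable` and `toy_atomic_add!` are hypothetical stand-ins that I made up for illustration (they are not part of CUDAnative), and `NTuple{2,Float32}` stands in for `SVector{2,Float32}`. It only shows that a method restricted to a few scalar types has no match for a two-field element type:

```julia
# Hypothetical stand-in for the element types the answer says atomic_add! accepts.
const AtomicAddable = Union{Float32, Float64, Int32}

# Toy function with the same kind of type restriction (NOT the CUDAnative API,
# and not actually atomic; it only mimics the method signature constraint).
function toy_atomic_add!(ref::Base.RefValue{T}, v::T) where {T<:AtomicAddable}
    ref[] += v
    return ref
end

# A scalar Float32 target matches the method...
scalar_ok = hasmethod(toy_atomic_add!, Tuple{Base.RefValue{Float32}, Float32})

# ...but a two-field element type (stand-in for SVector{2,Float32}) does not.
pair_ok = hasmethod(toy_atomic_add!,
                    Tuple{Base.RefValue{NTuple{2,Float32}}, NTuple{2,Float32}})

println("scalar target has a method: ", scalar_ok)
println("2-field target has a method: ", pair_ok)
```

On the CPU a call like the second one would simply throw a MethodError at run time; inside a GPU kernel there is no run-time dispatch to fall back on, so compilation fails instead.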

An atomic add is not possible here, since you have two fields that you want to update atomically and the hardware can’t do that. You could update them individually, and that might be fine for your particular case. Do you really need to atomically update the whole StaticArray in one go? If you can, I would reorder the code so that updates are made locally and you don’t need atomics in the first place.
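As a rough sketch of the "update them individually" option, assuming the data is stored as flat Float32 vectors of length 2N rather than as SVectors (component c of item j at index c + 2(j - 1)): each element is then a scalar that atomic_add! accepts. The names `kernel_atomic!` and `add_atomic_gpu!` are made up for this example, and it needs a CUDA-capable GPU with the same CUDAnative/CuArrays versions as the original post:

```julia
using CUDAnative, CuArrays

# x and y are flat Float32 vectors, so every element is a scalar
# that the hardware can update atomically on its own.
function kernel_atomic!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        # pointer(x, i) now points to a Float32, which atomic_add! supports
        @inbounds CUDAnative.atomic_add!(pointer(x, i), y[i])
    end
    return
end

function add_atomic_gpu!(x, y; MAX_THREADS=1024)
    threads = min(MAX_THREADS, length(x))
    blocks = cld(length(x), threads)  # round up to cover all elements
    CuArrays.@sync begin
        @cuda blocks=blocks threads=threads kernel_atomic!(x, y)
    end
    return
end

N = 10^6
x_host = rand(Float32, 2N)  # 2 components per item, stored contiguously
y_host = rand(Float32, 2N)
x_gpu = CuArray(x_host)
y_gpu = CuArray(y_host)

add_atomic_gpu!(x_gpu, y_gpu)
```

Note that the two components of one item are updated by separate atomic operations, so another thread may briefly observe one component updated and the other not. For pure accumulation that is usually acceptable.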

Thanks for the explanation, it makes much more sense now.
I’m doing histogram accumulation, and I haven’t figured out a safe and efficient way to avoid atomics. However, reframing the problem from a matrix of size M x N of SVector{L} into an array of size M x N x L seems like a good alternative.
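For reference, a minimal CPU-only sketch of that relayout for the 1-D case, using only Base. `NTuple{2,Float32}` is used here as a stand-in for `SVector{2,Float32}` so the snippet runs without StaticArrays; the assumption is that both are isbits types holding two contiguous Float32 values:

```julia
# Three items with two Float32 components each (stand-in for SVector{2,Float32}).
v = [(1f0, 2f0), (3f0, 4f0), (5f0, 6f0)]

# View the same memory as scalars: a flat Float32 vector of length 2 * length(v).
flat = reinterpret(Float32, v)

# Reshape into an L x N matrix, one column per item.
m = reshape(flat, 2, length(v))

println(size(m))   # components x items
println(m[2, 3])   # second component of the third item

# Writes go through to the original storage, one scalar at a time.
m[1, 1] += 10f0
println(v[1])
```

With this layout every element is a plain Float32, which is the kind of target the scalar atomics are defined for.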