Performing a basic per-element sum of two `Vector`s of `SVector`s works fine, but using `atomic_add!`
results in an error. I'm aware that an atomic operation isn't needed in this particular case, but it's an MWE reduced from a use case where it is needed.
The base example - working:
```julia
using CUDAnative, CuArrays, StaticArrays

function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        @inbounds x[i] += y[i]
    end
    return
end
```
```julia
function hist_gpu!(x, y; MAX_THREADS=1024)
    threads = min(MAX_THREADS, length(x))
    blocks = cld(length(x), threads)  # round up so every element is covered
    CuArrays.@sync begin
        @cuda blocks=blocks threads=threads kernel!(x, y)
    end
    return
end
```
```julia
x = rand(SVector{2, Float32}, Int(1e7))
y = rand(SVector{2, Float32}, Int(1e7))
x_gpu = CuArray(x)
y_gpu = CuArray(y)
@CuArrays.time hist_gpu!(x_gpu, y_gpu)
```
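To sanity-check that the kernel actually covered every element, one can compare the GPU result against a plain CPU broadcast (a hedged sketch, assuming `x_gpu` was freshly initialized from `x` before the call):

```julia
# CPU reference: element-wise SVector sum, then compare with the
# in-place GPU result copied back to the host.
x_ref = x .+ y
@assert Array(x_gpu) ≈ x_ref
```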
And the failing case with atomic add:
```julia
function kernel!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        k = Base._to_linear_index(x, i)
        CUDAnative.atomic_add!(pointer(x, k), y[i])
    end
    return
end
```
```julia
@CuArrays.time hist_gpu!(x_gpu, y_gpu)
```
```
InvalidIRError: compiling kernel kernel!(CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}, CuDeviceArray{SArray{Tuple{2},Float32,1,2},1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)
```
Is this expected behavior? From other discussions, this error message seems tied to situations where a type couldn't be inferred. But in the example above, I don't see why the atomic operation would fail to infer types while the generic `x[i] += y[i]` had no issue.
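For reference, one workaround I'm considering (an untested sketch, not confirmed to work: it assumes `atomic_add!` is defined for plain `Float32` device pointers and that `reinterpret` works on a `CuArray`) is to flatten the `SVector` storage to `Float32` so the atomic operates on a supported scalar type:

```julia
# Hypothetical workaround: add the raw Float32 components atomically
# instead of whole SVectors.
function kernel_flat!(x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)  # x holds 2 * 1e7 Float32s after the reinterpret
        CUDAnative.atomic_add!(pointer(x, i), y[i])
    end
    return
end

x_flat = reinterpret(Float32, x_gpu)  # assumes CuArray supports reinterpret
y_flat = reinterpret(Float32, y_gpu)
@cuda blocks=cld(length(x_flat), 1024) threads=1024 kernel_flat!(x_flat, y_flat)
```

This sidesteps the question rather than answering it, though, so I'd still like to understand why the `SVector` version fails.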