Cannot manage to use CUDA.atomic_add!

Hello,

I want to place particles into a 10x10 grid. To do this, I would like to use CUDA.atomic_add!() to count how many particles land in each cell.

Here is a minimal example:

using CUDA
function kernel_test_atomicadd!(arr, N, positions)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # Convert positions to (x, y) grid indices
    x = floor(Int, positions[i,1])+1
    y = floor(Int, positions[i,2])+1

    linear_idx = (y-1) * 10 + x
    # Atomically increment N[x, y], storing the previous value
    n = CUDA.atomic_add!(CUDA.pointer(N, linear_idx), 1)

    # Add the index i of the particle in arr[x,y,n+1]
    arr[x,y,n+1] = i

    return nothing
end

# Create a 10x10 grid to hold particle indices (max 50 per cell)
arr = CUDA.zeros(10,10,50);

# Initialize counter for particles per cell
N = CUDA.zeros(10,10);

# Random particle positions in the 10x10 space
positions = CUDA.rand(256,2)* 10.0;

@cuda threads = 256 kernel_test_atomicadd!(arr, N, positions)

I get this error message:

Reason: unsupported dynamic function invocation (call to atomic_add!)

If I replace pointer() with CUDA.Ref(N[x,y]), I get the same error.

I think I cannot use @atomic, because I need to read the value of N[x,y] before adding 1. Is that correct?

# This runs, but the read and the increment are two separate operations,
# so the pair is not atomic as a whole and does not correspond to what I am looking for
n = N[x,y]
@atomic N[x,y] += 1
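
To show why this two-step version is not what I want, here is a small separate sketch (all names made up) where every thread targets the same cell; since the read and the increment are not atomic together, threads collide:

using CUDA
function kernel_racy!(slots, counter)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    n = counter[1]                # this read is separate from the increment below
    CUDA.@atomic counter[1] += 1
    slots[i] = n + 1              # several threads can end up with the same slot
    return nothing
end

slots = CUDA.zeros(Int, 256);
counter = CUDA.zeros(Int, 1);
@cuda threads = 256 kernel_racy!(slots, counter)
length(unique(Array(slots)))      # typically much less than 256: collisions occurred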

Thank you for your help!

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 552.86.0

CUDA libraries:
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+552.86

Julia packages:
- CUDA: 5.6.1
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX 2000 Ada Generation (sm_89, 11.572 GiB / 15.996 GiB available)

Hi,

On an RTX 3070 I can replicate your problem. Simply using N = CUDA.zeros(Int, 10, 10) fixes it. I think you need to make sure in

n = CUDA.atomic_add!(CUDA.pointer(N, linear_idx), 1)

that eltype(N) == typeof(1), since N = CUDA.zeros(Int32, 10, 10) also gives the same error (the literal 1 is an Int64, not an Int32). I'm not sure why N = CUDA.zeros(10, 10) (i.e. eltype(N) == Float32) combined with 1.0f0 doesn't work though, as

help?> CUDA.atomic_add!
(...)
  This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

indicates Float32 should be supported. But in any case, it's a bit weird to store counts N as Float32s, so just stick to Int (or Int32 with 1i32) 🙂.
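
If you want the increment to always match, you can derive it from the array's element type; a minimal self-contained sketch (function name made up):

using CUDA
function bump!(counts)
    # one(eltype(counts)) is guaranteed to have the counter's element type
    CUDA.atomic_add!(CUDA.pointer(counts, 1), one(eltype(counts)))
    return nothing
end

counts = CUDA.zeros(Int32, 1);
@cuda threads = 1 bump!(counts)
Array(counts)  # 1-element Vector{Int32}: [1]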

No, with appropriately typed N, you can just do

n = CUDA.@atomic N[x, y] += 1

(note: there is no ! in the macro name), which will give you the original (non-incremented) value of N[x, y]. This works (with the literal 1) not only for eltype(N) == Int64, but also for eltype(N) == Int32, though not for Float32.
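
Applied to your kernel, that would look something like this (an untested sketch, assuming eltype(N) == Int32 and the same argument names as your example):

function kernel_test_atomic_macro!(arr, N, positions)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    x = floor(Int32, positions[i, 1]) + Int32(1)
    y = floor(Int32, positions[i, 2]) + Int32(1)
    # the macro returns the old value, so n is the count before this particle
    n = CUDA.@atomic N[x, y] += 1
    arr[x, y, n + 1] = i
    return nothing
end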


Thank you! Indeed, once the types are consistent everywhere, there is no error anymore:

using CUDA
using CUDA: i32
function kernel_test_atomicadd!(arr, N, positions)
    i::Int32 = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
    x::Int32 = floor(Int, positions[i,1i32])+1i32
    y::Int32 = floor(Int, positions[i,2i32])+1i32
    linear_idx::Int32 = (y - 1i32) * 10i32 + x
    n::Int32 = CUDA.atomic_add!(CUDA.pointer(N, linear_idx), 1i32)
    arr[x,y,n+1i32] = i
    return nothing
end

arr = CUDA.zeros(Int32, 10, 10, 50);
N = CUDA.zeros(Int32, 10, 10);
positions = CUDA.rand(Float32, 256, 2) * 10.0f0;

@cuda threads = 256 kernel_test_atomicadd!(arr, N, positions)
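
A few host-side sanity checks I added (my own additions; they assume, as is overwhelmingly likely with 256 particles in 100 cells, that no cell received more than 50 particles):

sum(Array(N)) == 256                           # every particle was counted exactly once
maximum(Array(N)) <= 50                        # no cell overflowed its 50 slots
sort(filter(!=(0), vec(Array(arr)))) == 1:256  # each particle index was stored exactly once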

Thanks!


By the way, here you could write floor(Int32, ...) to directly obtain an Int32, instead of converting the Int64 result only when assigning to x::Int32, e.g.:
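
x = floor(Int32, positions[i, 1i32]) + 1i32  # stays Int32 throughout, no Int64 intermediate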


The reasoning here was that atomic_add! and friends are low-level atomics exposing what the hardware implements, while CUDA.@atomic is the friendlier alternative you should probably try to use if possible: CUDA.@atomic N[linear_idx] += 1.
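
For example, with eltype(N) == Int32, the two spellings inside a kernel would be roughly:

n = CUDA.atomic_add!(CUDA.pointer(N, linear_idx), 1i32)  # low-level: the value's type must match eltype(N) exactly
n = CUDA.@atomic N[linear_idx] += 1                      # macro: a plain literal 1 works here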
