CUDA atomics on complex-valued arrays

Hi,

I am curious whether there is a way to do atomic operations on complex-valued arrays in a CUDA kernel, e.g.

using CUDA

function kernel!(x::CuDeviceArray{T}, y::CuDeviceArray{T}) where T
    # global thread index
    i = (blockIdx().x-1)*blockDim().x + threadIdx().x
    if i <= length(x)
        # atomically accumulate y[i] into x[i]
        CUDA.atomic_add!(pointer(x, i), y[i])
    end
    return nothing
end

function test(T)
    x = CuArray(rand(T, 2, 2))
    y = CuArray(rand(T, 2, 2))

    threads = 100
    blocks = cld(length(x), threads)
    @cuda threads=threads blocks=blocks kernel!(x,y)
end

will work if I run test(Float32), but not test(ComplexF32). I know this is because the atomics do not support complex numbers. I could certainly split a complex atomic operation into two atomics (one on the real part and one on the imaginary part), but then I do not know how to pass a pointer to the real part of the i-th element of x, i.e. x[i].re.

In C++ I can do this simply with atomicAdd(&x[i].x, y[i].x). However, in Julia's CUDA.jl I found that both assigning to x[i].re and taking a pointer to it are troublesome (I ran into errors regarding setindex!).

One might suggest using shared memory to accumulate the sum first, but in my real code the access pattern on y is not simple, so using shared memory is not straightforward.

I would like to ask whether what I want to do is actually impossible, or whether there is a way to tackle this.
Thanks.

You could try reinterpret to create a real array from the complex one and index the real and imaginary parts directly. Also, you can use CUDA.@atomic x[i] += y[i] instead of the manual calls to atomic_add! to get rid of the call to pointer.
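
Something like the following combines both suggestions: reinterpret the complex arrays as Float32 views on the host and use CUDA.@atomic in the kernel. This is an untested sketch, and kernel_atomic!/test_atomic are just made-up names for this example:

using CUDA

function kernel_atomic!(x, y)
    # x and y are the reinterpreted Float32 views of the complex arrays
    i = (blockIdx().x-1)*blockDim().x + threadIdx().x
    if i <= length(x)
        # the macro does the pointer handling for you
        CUDA.@atomic x[i] += y[i]
    end
    return nothing
end

function test_atomic(T)
    x = CuArray(rand(Complex{T}, 2, 2))
    y = CuArray(rand(Complex{T}, 2, 2))

    # real-valued views: twice as many elements as the complex arrays
    x_ = reinterpret(T, x)
    y_ = reinterpret(T, y)

    threads = 100
    blocks = cld(length(x_), threads)
    @cuda threads=threads blocks=blocks kernel_atomic!(x_, y_)
    return x
end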

Thank you! One last question: does @atomic have any actual performance advantage over atomic_op(pointer(...))?

For those who want to use reinterpret, here is an example:

using CUDA

function kernel!(x, y)
    # x and y are the Float32 views of the complex arrays
    i = (blockIdx().x-1)*blockDim().x + threadIdx().x
    if i <= length(x) && i % 2 == 1
        # odd i is the real part of complex element (i+1)÷2;
        # accumulate the real parts x[1], x[3], ..., x[i] into y[i]
        for j = 1:2:i
            CUDA.atomic_add!(pointer(y, i), x[j])
        end
    end
    return nothing
end

function test()
    x = CuArray(ones(ComplexF32, 10))
    y = CuArray(zeros(ComplexF32, 10))

    # real-valued views of the complex arrays (20 Float32 elements each)
    x_ = reinterpret(Float32, x)
    y_ = reinterpret(Float32, y)

    threads = 100
    blocks = cld(length(x_), threads)
    @cuda threads=threads blocks=blocks kernel!(x_, y_)

    println(y) # will print [1.0+0.0im, 2.0+0.0im, ..., 10.0+0.0im]
end

CUDA.@atomic just lowers to the corresponding atomic_xxx! call, but does the array indexing for you, so there is no performance difference.
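
In other words, inside a kernel the two forms below should perform the same atomic update. A minimal sketch, assuming x and y are device arrays of a supported element type such as Float32 (equivalent_updates! is a hypothetical kernel name):

using CUDA

function equivalent_updates!(x, y)
    i = (blockIdx().x-1)*blockDim().x + threadIdx().x
    if i <= length(x)
        # each of these lines atomically adds y[i] to x[i];
        # the macro form lowers to the manual form below
        CUDA.@atomic x[i] += y[i]
        CUDA.atomic_add!(pointer(x, i), y[i])
    end
    return nothing
end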