How do I avoid atomics in this kind of code on GPU?

I want to stress that this is a toy example, not a real calculation. It is intended as a minimal working example showing an approach I want to apply to a problem of mine.

Hello!

I am writing some GPU code which looks like this:

function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            CUDA.@atomic Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

What this code does is increment a counter, NL, whenever a condition is fulfilled, and then add the value of the random number to the array at that index. In reality I would like to use set_index! with an atomic lock instead of +=, but I did not know how to do that with CUDA's atomics.

I’ve noticed that when using atomics, updates to the same location are serialized, stalling other threads while each one completes. I was hoping there was some way I could keep the concept of a counter NL and avoid atomics altogether.
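As an aside, one way to get the “claim a slot via a counter” behaviour I describe above is a sketch like the following. This is an assumption on my part, not tested code: it keeps a single one-element device array COUNTER (a name I made up, initialized to zero) and uses CUDA.atomic_add!, which returns the old value, so each thread receives a distinct index and can then write with a plain store:

```julia
# Hypothetical sketch: replace the per-thread NL counter with one global
# counter that threads atomically increment to claim unique slots.
# COUNTER is assumed to be a one-element CuArray{Int32} initialized to 0.
function TOY_EXAMPLE_COUNTER(Result, COUNTER, RANDOM_NUMBERS)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        x = RANDOM_NUMBERS[i]
        if x <= 0.5f0
            # atomic_add! returns the *old* value, so every thread gets a
            # distinct slot; +1 converts it to a 1-based index.
            slot = CUDA.atomic_add!(pointer(COUNTER, 1), Int32(1)) + 1
            Result[slot] = x  # plain store: this slot belongs to this thread
        end
    end
    return nothing
end
```

Note that this still uses one atomic per accepted element, and the order of slots is nondeterministic, but it avoids two threads ever writing to the same index.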

Below, under “Full Code”, is a complete script that can be executed to showcase the issue. Perhaps someone who is better at GPU programming can help me - thanks a lot! :slight_smile:

Full Code
using CUDA

function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            CUDA.@atomic Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

function TOY_EXAMPLE(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            # BUG
            Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

### Code
N              = 1000
RANDOM_NUMBERS = CUDA.rand(N)
Result_ATOMIC  = CUDA.zeros(N)
Result         = CUDA.zeros(N)

@cuda threads=10 blocks=2 TOY_EXAMPLE_ATOMIC(Result_ATOMIC,RANDOM_NUMBERS)
@cuda threads=10 blocks=2 TOY_EXAMPLE(Result,RANDOM_NUMBERS) #If you use threads=1 and blocks=1 of course it will give right answer too.

# Correct Result Always
BASE_SUM = sum(RANDOM_NUMBERS[RANDOM_NUMBERS .<= 0.5]) # match the kernel's <= 0.5 condition

# Printing
println("Correct Result         : ",BASE_SUM)
println("GPU Code Result Atomic : ",sum(Result_ATOMIC))
println("GPU Code Result        : ",sum(Result))

Kind regards

The typical approach is to first perform the operations in more local memory (e.g. shared memory, since shared-memory atomics are much faster than global ones) and then have a single thread within the block perform the final reduction into global memory.
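To make that concrete for the toy example, here is an untested sketch of the idea, with some assumptions baked in: the block size is fixed at 256 (and must be a power of two for the tree reduction), and instead of scattering into individual slots it accumulates everything into a single running sum in Result[1], which is what the script compares against anyway. Each thread accumulates privately, the block reduces in shared memory, and only one global atomic is issued per block:

```julia
function TOY_EXAMPLE_SHARED(Result, RANDOM_NUMBERS)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    tid    = threadIdx().x

    # Per-block scratch space in fast on-chip shared memory.
    shared = CuStaticSharedArray(Float32, 256)

    # Thread-local accumulation: no atomics at all in the hot loop.
    acc = 0.0f0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        x = RANDOM_NUMBERS[i]
        if x <= 0.5f0
            acc += x
        end
    end
    shared[tid] = acc
    sync_threads()

    # Tree reduction within the block (assumes blockDim().x is a power of 2).
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # One global atomic per block instead of one per element.
    if tid == 1
        CUDA.@atomic Result[1] += shared[1]
    end
    return nothing
end
```

The design trade-off: the number of global atomic operations drops from O(number of accepted elements) to O(number of blocks), at the cost of losing the per-slot layout of the original kernel.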

Okay, thank you for the hint!

I will try to see if I can learn how to do it.

If anyone knows how to transform this example of mine to use shared memory, it would be very beneficial for my learning :slight_smile:

Kind regards