How do I avoid atomics in this kind of code on GPU?

Ahmed_Salih · March 10, 2023, 6:57pm

I want to stress that this is a toy example. It is not a real calculation. It is a toy example, intended to be a minimal working example. It is showing some approach I want to apply to a problem of mine

Hello!

I am writing some GPU code which looks like this:

function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            CUDA.@atomic Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

What this code is that it increments a counter, “NL” whenever a condition is fulfilled and then it adds the value of the random number to the array. In reality I would like to use “set_index!” with an atomic lock, instead of “+=”, but did not know how to do with CUDA.atomics.

I’ve noticed that when using atomics, it locks all the threads and stops all other work. I was hoping there was some way I could use the concept of a counter, “NL” and avoid atomics all together.

Below in “Full Code” I made a code which can be executed with the purpose of showcasing the issue and perhaps someone who is better at GPU programming can help me - thanks a lot !

Full Code

using CUDA

function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            CUDA.@atomic Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

function TOY_EXAMPLE(Result,RANDOM_NUMBERS)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    NL = 0
    @inbounds for i = index:stride:length(RANDOM_NUMBERS)
        RANDOM_NUMBER = RANDOM_NUMBERS[i]

        if RANDOM_NUMBER <= 0.5
            NL += 1

            # BUG
            Result[NL] += RANDOM_NUMBER
        end
    end

    return nothing
end

### Code
N              = 1000
RANDOM_NUMBERS = CUDA.rand(N)
Result_ATOMIC  = CUDA.zeros(N)
Result         = CUDA.zeros(N)

@cuda threads=10 blocks=2 TOY_EXAMPLE_ATOMIC(Result_ATOMIC,RANDOM_NUMBERS)
@cuda threads=10 blocks=2 TOY_EXAMPLE(Result,RANDOM_NUMBERS) #If you use threads=1 and blocks=1 of course it will give right answer too.

# Correct Result Always
BASE_SUM = sum(RANDOM_NUMBERS[RANDOM_NUMBERS .< 0.5])

# Printing
println("Correct Result         : ",BASE_SUM)
println("GPU Code Result Atomic : ",sum(Result_ATOMIC))
println("GPU Code Result        : ",sum(Result))

Kind regards

maleadt · March 10, 2023, 7:57pm

The typical approach is first to perform the operations in more local memory (e.g. shared memory, as shared-memory atomics are much faster) and finally have a single thread within the block perform the final reduction into global memory.

Ahmed_Salih · March 11, 2023, 8:39am

Okay, thank you for the hint!

I will try to see if I can learn how to do it.

If anyone knows how to transform this example of mine to sharedMemory it would be very beneficial for my learning

Kind regards

Topic		Replies	Views
Cannot manage to use CUDA.atomic_add! GPU cuda , atomic	4	51	June 30, 2025
One example from `GPU programming in Julia \| Workshop \| JuliaCon 2021` GPU question , gpu	0	364	April 5, 2022
How to use `@atomic` with CUDA? New to Julia cuda	1	541	October 5, 2020
CUDA atomics on complex-valued arrays GPU cuda	3	1259	April 12, 2021
CUDA.jl - Atomic Addition Error To A Sparse Array Inside CUDA Kernel GPU cuda , error , memory-allocation , sparse , atomic	4	702	March 31, 2022

How do I avoid atomics in this kind of code on GPU?

Related topics