I want to stress that this is a toy example. It is not a real calculation. It is a toy example, intended to be a minimal working example. It is showing some approach I want to apply to a problem of mine
Hello!
I am writing some GPU code which looks like this:
function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
NL = 0
@inbounds for i = index:stride:length(RANDOM_NUMBERS)
RANDOM_NUMBER = RANDOM_NUMBERS[i]
if RANDOM_NUMBER <= 0.5
NL += 1
CUDA.@atomic Result[NL] += RANDOM_NUMBER
end
end
return nothing
end
What this code is that it increments a counter, “NL” whenever a condition is fulfilled and then it adds the value of the random number to the array. In reality I would like to use “set_index!” with an atomic lock, instead of “+=”, but did not know how to do with CUDA.atomics.
I’ve noticed that when using atomics, it locks all the threads and stops all other work. I was hoping there was some way I could use the concept of a counter, “NL” and avoid atomics all together.
Below in “Full Code” I made a code which can be executed with the purpose of showcasing the issue and perhaps someone who is better at GPU programming can help me - thanks a lot !
Full Code
using CUDA
function TOY_EXAMPLE_ATOMIC(Result,RANDOM_NUMBERS)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
NL = 0
@inbounds for i = index:stride:length(RANDOM_NUMBERS)
RANDOM_NUMBER = RANDOM_NUMBERS[i]
if RANDOM_NUMBER <= 0.5
NL += 1
CUDA.@atomic Result[NL] += RANDOM_NUMBER
end
end
return nothing
end
function TOY_EXAMPLE(Result,RANDOM_NUMBERS)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
NL = 0
@inbounds for i = index:stride:length(RANDOM_NUMBERS)
RANDOM_NUMBER = RANDOM_NUMBERS[i]
if RANDOM_NUMBER <= 0.5
NL += 1
# BUG
Result[NL] += RANDOM_NUMBER
end
end
return nothing
end
### Code
N = 1000
RANDOM_NUMBERS = CUDA.rand(N)
Result_ATOMIC = CUDA.zeros(N)
Result = CUDA.zeros(N)
@cuda threads=10 blocks=2 TOY_EXAMPLE_ATOMIC(Result_ATOMIC,RANDOM_NUMBERS)
@cuda threads=10 blocks=2 TOY_EXAMPLE(Result,RANDOM_NUMBERS) #If you use threads=1 and blocks=1 of course it will give right answer too.
# Correct Result Always
BASE_SUM = sum(RANDOM_NUMBERS[RANDOM_NUMBERS .< 0.5])
# Printing
println("Correct Result : ",BASE_SUM)
println("GPU Code Result Atomic : ",sum(Result_ATOMIC))
println("GPU Code Result : ",sum(Result))
Kind regards