Hi again,
As you can see, I'm trying to understand some basic CUDA kernel programming under Julia, so please try not to be too rough with me…
This time I want to build a kernel that takes a bunch of numbers and returns their sum (the same thing CUDA.sum() does). My attempt looks like this:
function my_Sum_2(y, x)
    # global thread index and grid stride
    idx_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x = blockDim().x * gridDim().x
    # grid-stride loop: each thread accumulates its elements into x[1]
    @inbounds for i in idx_x:str_x:length(y)
        @atomic x[1] = x[1] + y[i]
    end
    return nothing
end;
I added the @atomic macro because, as I understand it, this is the proper way to tell the kernel that all the threads are updating the same array element (memory position), so the read-modify-write must happen as one uninterrupted step.
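To convince myself why the atomic update matters, I made a small CPU sketch (using Base.Threads, not CUDA) comparing an atomic counter with a plain one; the plain one can lose updates when several threads run:

```julia
using Base.Threads

acc = Threads.Atomic{Int}(0)   # atomic accumulator: safe read-modify-write
plain = Ref(0)                 # plain cell: racy when nthreads() > 1
Threads.@threads for i in 1:100_000
    Threads.atomic_add!(acc, 1)  # always counts every increment
    plain[] += 1                 # increments can be lost under contention
end
acc[]    # 100_000, always
plain[]  # can come out smaller when multiple threads interleave
```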
The problem is that I get slightly different results every time I run the code:
zzc = 10.0f0 * CUDA.rand(300_000, 2)
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.0026198f6
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.002622f6
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.002625f6
...
while CUDA.sum always gives the same result:
CUDA.sum(zzc)
1-element CuArray{Float32,1}:
3.002624f6
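I suspected the run-to-run differences might come from the order in which the atomic additions land, since Float32 addition is not associative; here is a minimal CPU check of that property (no GPU needed):

```julia
# Float32 addition is not associative: grouping changes the rounded result.
h = eps(Float32) / 2        # 2^-24
a = (1.0f0 + h) + h         # each h is rounded away, so a == 1.0f0
b = 1.0f0 + (h + h)         # h + h == eps(Float32), so b == nextfloat(1.0f0)
a == b                      # false
```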
What am I doing wrong here?
Thanks for your patience,
Ferran.