Hi again,

as you can see, I'm trying to understand some basic CUDA kernel programming under Julia, so please try not to be too rough with me…

This time I want to build a kernel that takes a bunch of numbers and returns their sum (the same thing CUDA.sum() does). My attempt looks like this:

```
function my_Sum_2(y, x)
    # global thread index and grid stride
    idx_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x = blockDim().x * gridDim().x
    # grid-stride loop: every thread adds its share into x[1]
    @inbounds for i in idx_x:str_x:length(y)
        @atomic x[1] = x[1] + y[i]
    end
    return nothing
end;
```

I added the `@atomic` macro because, as I understand it, this is the proper way to tell the kernel that all threads are updating the same array element (the same memory position) safely.

The problem is that I get *slightly* different results each time I run my code:

```
zzc = 10.f0*CUDA.rand(300_000,2)
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.0026198f6
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.002622f6
res = CUDA.zeros(1)
numblocks = 256
@cuda threads = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
3.002625f6
...
```
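My guess is that it might have to do with the order of the additions, since Float32 addition is not associative, but I'm not sure. As a quick CPU-only check (plain Julia, no CUDA; just my own sketch, not my actual kernel), summing the same values in two different orders already gives slightly different answers:

```
using Random

Random.seed!(0)
y = 10f0 .* rand(Float32, 600_000)   # same number of elements as zzc above

s_pairwise = sum(y)        # Julia's sum uses a pairwise reduction
s_serial   = foldl(+, y)   # strict left-to-right accumulation

# The two results are very close but typically differ in the last
# few bits, even though both sum exactly the same Float32 values.
println(s_pairwise)
println(s_serial)
```

So perhaps each kernel launch is just accumulating in a different thread order?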

while CUDA.sum always gives the same result:

```
CUDA.sum(zzc)
1-element CuArray{Float32,1}:
3.002624f6
```

What am I doing wrong here?

Thanks for your patience,

Ferran.