Create a simple CUDA.sum kernel

Hi again,

as you can see, I’m trying to understand some basic CUDA kernel programming in Julia, so please try not to be too rough with me… :slight_smile:

This time I want to build a kernel that takes a bunch of numbers and returns their sum (the same as CUDA.sum() does). My attempt looks like this:

using CUDA

function my_Sum_2(y, x)
    idx_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x = blockDim().x * gridDim().x
    # grid-stride loop: each thread visits every str_x-th element of y
    @inbounds for i in idx_x:str_x:length(y)
        CUDA.@atomic x[1] += y[i]   # atomic add into the single output slot
    end
    return nothing
end;

I added the @atomic macro because, as I understand it, that is the proper way to tell the kernel that all threads are updating the same element of the array (the same memory position) and that those updates should not clobber each other.
The problem is that I get slightly different results each time I run my code:

zzc = 10f0 * CUDA.rand(300_000, 2)

res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.0026198f6
res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.002622f6
res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.002625f6
...

while CUDA.sum always gives the same result

CUDA.sum(zzc)
3.002624f6

What am I doing wrong here?

Thanks for your patience,

Ferran.


The reason you get slightly different results each time you run your function is that the order in which your threads get executed on the GPU is not deterministic, and floating-point math isn’t associative. Each run adds the numbers in a different order, so you get a slightly different answer.
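To see the effect without a GPU, here is a small CPU-side sketch (the array and variable names are just made up for this illustration): summing the very same Float32 values in different orders already gives slightly different answers.

```julia
using Random

# The same Float32 values, summed in three different orders.
xs = 10f0 .* rand(Float32, 300_000)

s1 = sum(xs)           # Julia's default (pairwise) summation order
s2 = sum(shuffle(xs))  # same values, random order
s3 = foldl(+, xs)      # strictly left-to-right accumulation

(s1, s2, s3)           # typically three numbers differing in the last digits
```

None of these is “wrong”; each is a valid rounding of the same exact sum, just like the different values your kernel produces.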

OK, thanks… but that’s a problem then, as it does not seem very reliable. It also happens with Float64, which has much better precision. But it does not happen with CUDA.sum… why?

I looked at the code in CUDA.jl (it’s a more generic mapreduce function), and I think it uses multiple kernel calls to do the sum in a deterministic way. It launches a bunch of blocks that each sum over their portion of the array deterministically and then write their partial result to global memory. It then calls the kernel again recursively until a launch with only one block produces the final result. If you do it that way, it doesn’t matter in which order the blocks execute; a sketch of the idea is below.
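Here is a rough sketch of that idea, not CUDA.jl’s actual implementation. It assumes a reasonably recent CUDA.jl that provides `CuStaticSharedArray` and `sync_threads`, a power-of-two block size of 256, and it finishes the last step on the host instead of re-launching the kernel:

```julia
using CUDA

# Each block reduces its share of `y` into one deterministic partial sum.
function block_sum!(partial, y)
    tid    = threadIdx().x
    nthr   = blockDim().x
    shared = CuStaticSharedArray(Float32, 256)   # must match the launch's thread count

    # Grid-stride accumulation into a thread-local value: no atomics needed.
    acc = 0f0
    @inbounds for i in ((blockIdx().x - 1) * nthr + tid):(nthr * gridDim().x):length(y)
        acc += y[i]
    end
    shared[tid] = acc
    sync_threads()

    # Tree reduction in shared memory: the combination order is fixed,
    # so every run produces the same per-block result.
    s = nthr ÷ 2
    while s >= 1
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    if tid == 1
        @inbounds partial[blockIdx().x] = shared[1]
    end
    return nothing
end

nblocks = 256
partial = CUDA.zeros(Float32, nblocks)
@cuda threads=256 blocks=nblocks block_sum!(partial, zzc)
sum(Array(partial))   # same answer on every run
```

CUDA.jl’s actual mapreduce then treats the partial-sums array as the new input and launches again until a single block produces the final value, which is why its answer is both fast and reproducible.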

It’s also probably faster than your approach. The atomic additions you use serialize every single update to x[1], which is very slow and keeps the GPU from being used efficiently.
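As a middle ground, you can keep your single-output design but accumulate into a thread-local variable first, so each thread issues only one atomic add instead of one per element (the name `my_Sum_3` is just for this sketch). This reduces contention a lot, but the result is still non-deterministic, because the order of the remaining atomic adds still varies between runs:

```julia
function my_Sum_3(y, x)
    idx_x = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x = blockDim().x * gridDim().x
    acc = 0f0
    @inbounds for i in idx_x:str_x:length(y)
        acc += y[i]            # thread-local accumulation, no contention
    end
    CUDA.@atomic x[1] += acc   # one contended update per thread
    return nothing
end
```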