Create a simple CUDA.sum kernel

Ferran_Mazzanti · January 3, 2021, 8:54pm

Hi again,

as you can see, I’m trying to understand some basic CUDA kernel programming under julia, so please try not to be too rough with me…

This time I want to build up a kernel that takes a bunch of numbers and return their sum (the same CUDA.sum() does). My attempt is something like this

function my_Sum_2(y,x)
    idx_x       = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x       = blockDim().x * gridDim().x    
    @inbounds for i in idx_x:str_x:length(y)
        @atomic x[1] = x[1] + y[i]
    end
    nothing
end;

I added the @atomic decorator as I understand this is the proper way to make the kernel understand I want to always point to the same element of the array (or memory position).
The problem is that I get slightly different results when I run my code

zzc = 10.f0*CUDA.rand(300_000,2)

res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.0026198f6
res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.002622f6
res = CUDA.zeros(1)
numblocks      = 256
@cuda threads  = 256 blocks = numblocks my_Sum_2(zzc,res)
res
1-element CuArray{Float32,1}:
 3.002625f6
...

while CUDA.sum always gives the same result

CUDA.sum(zzc)
1-element CuArray{Float32,1}:
 3.002624f6

What am I dong wrong here?

Thanks for your patience,

Ferran.

bcmichael · January 3, 2021, 9:51pm

The reason that you get slightly different results each time you run your function is that the order in which your threads get executed on the gpu is not deterministic and floating point math isn’t associative. Each time you run it the numbers get added in a different order and you get a slightly different answer.

Ferran_Mazzanti · January 3, 2021, 9:58pm

Ok. thanks… but that’s a problem then, as it does not seem very reliable. It also happens with Float64, which has much better accuracy. But that does not happen with CUDA.sum… why?

bcmichael · January 3, 2021, 10:09pm

I looked at the code in CUDA.jl (it’s a more generic mapreduce function) and I think what it is doing is using multiple kernel calls to do the sum in a deterministic way. It looks like it has a bunch of blocks that each sum over a portion of the array deterministically, then writes the results from each block to global memory. Then it calls the kernel again recursively until it eventually gets the final result from a kernel call with only one block. It doesn’t matter which order the blocks execute if you do it that way.

It’s also probably faster than how you do it. Those atomic additions you use are going to be very slow and prevent using the GPU very efficiently.

Topic		Replies	Views
One example from `GPU programming in Julia \| Workshop \| JuliaCon 2021` GPU question , gpu	0	364	April 5, 2022
CUDA sum kernels, threads and blocks, complex values Performance gpu , cuda	2	1406	February 3, 2021
Different calculation results when using CPU vs. GPU with CUDA.jl General Usage question , gpu , cuda , atomic	2	1193	March 2, 2022
Random variations between results of CPU and GPU computation GPU	7	420	May 9, 2023
Simple kernel not working GPU	10	1194	July 12, 2020

Create a simple CUDA.sum kernel

Related topics