CUDA kernel help

Hello, I’m somewhat new to kernel programming and I struggle to understand why the two kernels below don’t do the same thing.

using CUDA

f(x,y) = x+y

function mykernel!(out,space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig-1)*blockDim().x + i
    J = (jg-1)*blockDim().y + j
    out[I,J] = f(space[I],-space[J])
    return nothing
end
function mykernel2!(out,space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig-1)*blockDim().x + i
    J = (jg-1)*blockDim().y + j
    S = CUDA.@cuStaticSharedMem(Float32,(32,))
    if i == 1
        S[j] = space[J]
    end
    CUDA.sync_threads()
    out[I,J] = f(S[i],-S[j])
    return nothing
end

function apply_kernel!(ker,out,space)
    threads = (32,32)
    N = length(space)
    blocks = (div(N,threads[1]),div(N,threads[2]))  # assumes N is a multiple of 32
    @cuda threads=threads blocks=blocks ker(out,space)
    CUDA.synchronize()
end


N = 1024
space = CUDA.rand(N)

out1 = CUDA.zeros(N,N)
apply_kernel!(mykernel!,out1,space)

out2 = CUDA.zeros(N,N)
apply_kernel!(mykernel2!,out2,space)

out3 = f.(space,-space')


CUDA.synchronize()
out1 ≈ out3 # true
out2 ≈ out3 # false
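For reference, on the CPU the broadcast above is just the outer difference out[I,J] = space[I] - space[J]. A small sketch checking that on a toy array (the 4-element `space_cpu` is just for illustration):

```julia
# CPU sketch: f.(x, -x') broadcasts a column vector against a row vector,
# producing the outer difference out[I,J] = x[I] - x[J].
f(x, y) = x + y

space_cpu = rand(Float32, 4)
out_cpu = f.(space_cpu, -space_cpu')   # 4×4 matrix

for I in 1:4, J in 1:4
    @assert out_cpu[I, J] == space_cpu[I] - space_cpu[J]
end
```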

In my mind all blocks are of size (32,32): the threads with id (1,1:32) initialize the shared memory while the others wait, so after `sync_threads` S is fully defined per block (length 32), and S[i] and S[j] should give the right answer. I also don’t see where race conditions would appear, since only the (1,1:32) threads store into shared memory, each in a different location. My only idea is that reading S[i] and S[j] also leads to race conditions, even though it is only reads??

Also, if the second kernel can’t work, do you have a way to make the first kernel read global memory only once?
Thank you!


Consider the block with blockIdx().x == 1 and blockIdx().y == 32. Then you would have 1 <= I <= 32 and 993 <= J <= 1024. Therefore, you need 64 values from space to be able to compute the 32×32 = 1024 values of out in that block. mykernel2! only loads 32 of them (space[993:1024]).
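A quick CPU check of that index arithmetic, using the same formulas as the kernels (blockDim = 32, block (ig, jg) = (1, 32)):

```julia
# Reproduce the kernel's global index computation on the CPU for the
# block (blockIdx().x, blockIdx().y) == (1, 32) with 32×32 threads.
blockdim = 32
ig, jg = 1, 32

Is = [(ig - 1) * blockdim + i for i in 1:blockdim]   # 1:32
Js = [(jg - 1) * blockdim + j for j in 1:blockdim]   # 993:1024

@assert Is == collect(1:32)
@assert Js == collect(993:1024)

# The block reads space[I] for every I and space[J] for every J: 64
# distinct indices, but mykernel2! only stores the 32 values space[J].
needed = union(Is, Js)
@assert length(needed) == 64
```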

You could adjust mykernel2! to load all 64 required values

function mykernel2_fixed!(out,space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig-1)*blockDim().x + i
    J = (jg-1)*blockDim().y + j
    S = CUDA.@cuStaticSharedMem(Float32,(2,32))
    if i == 1
        S[1, j] = space[(ig-1)*blockDim().x + j]  # all possible I
    elseif i == 2
        S[2, j] = space[J]  # all possible J
    end
    CUDA.sync_threads()
    out[I,J] = f(S[1, i],-S[2, j])
    return nothing
end
julia> apply_kernel!(mykernel2_fixed!, out2, space); out2 ≈ out3
true

But due to caching (which helps mykernel!), this ends up being slower than mykernel!:

julia> @btime apply_kernel!($mykernel!, $out1, $space);
  31.500 μs (16 allocations: 448 bytes)

julia> @btime apply_kernel!($mykernel2_fixed!, $out1, $space);
  40.100 μs (16 allocations: 448 bytes)
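To put rough numbers on that: counting one global read per array access and ignoring caching (an idealized back-of-the-envelope count, not a profile), the shared-memory version issues far fewer global loads per block, yet the timings above show the cache already absorbs the naive kernel's redundant reads:

```julia
# Idealized per-block global-load counts (ignoring the cache):
threads_per_block = 32 * 32
loads_naive  = 2 * threads_per_block   # mykernel!: space[I] and space[J] per thread
loads_shared = 2 * 32                  # mykernel2_fixed!: the two staging rows i == 1, 2

@assert loads_naive == 2048
@assert loads_shared == 64
```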

Oh ok, I was so focused on the parallelism that I forgot to check the math. Nice :rofl: Thank you!
