Hello, I'm somewhat new to kernel programming and I'm struggling to understand why the two kernels below don't do the same thing:
using CUDA

f(x, y) = x + y

function mykernel!(out, space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig - 1) * blockDim().x + i   # global row index
    J = (jg - 1) * blockDim().y + j   # global column index
    out[I, J] = f(space[I], -space[J])
    return nothing
end
function mykernel2!(out, space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig - 1) * blockDim().x + i
    J = (jg - 1) * blockDim().y + j
    S = CUDA.@cuStaticSharedMem(Float32, (32,))
    if i == 1                         # threads (1, 1:32) fill the shared tile
        S[j] = space[J]
    end
    CUDA.sync_threads()
    out[I, J] = f(S[i], -S[j])
    return nothing
end
function apply_kernel!(ker, out, space)
    threads = (32, 32)
    N = length(space)
    blocks = (div(N, threads[1]), div(N, threads[2]))   # assumes N is a multiple of 32
    @cuda threads=threads blocks=blocks ker(out, space)
    CUDA.synchronize()
end
N = 1024
space = CUDA.rand(N)

out1 = CUDA.zeros(N, N)
apply_kernel!(mykernel!, out1, space)

out2 = CUDA.zeros(N, N)
apply_kernel!(mykernel2!, out2, space)

out3 = f.(space, -space')
CUDA.synchronize()

out1 ≈ out3 # true
out2 ≈ out3 # false
In my mind, all blocks are of size (32, 32): the threads (1, 1:32) initialize the shared memory while the others wait, so after sync_threads() S is fully defined for each block (length 32), and S[i] and S[j] should give the right values. I also don't see where a race condition could appear, since only the threads with id (1, 1:32) store into shared memory, and each writes to a different location. My only remaining idea is that reading S[i] and S[j] also leads to race conditions somehow, even though it is only reading?
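To make my mental picture concrete, here is a small CPU-side check I put together, reusing space and out3 from above (just a sketch of how I traced things by hand; h_space, S_host and blk are names I made up, and (ig, jg) = (1, 1) is simply the block I worked through):

ig, jg = 1, 1                                   # the block I traced by hand
h_space = Array(space)                          # bring the data back to the CPU
S_host = [h_space[(jg - 1) * 32 + j] for j in 1:32]        # what I think S holds after sync_threads()
blk = [f(S_host[i], -S_host[j]) for i in 1:32, j in 1:32]  # what I think the block writes to out
blk ≈ Array(out3)[1:32, 1:32]                   # true, so for this block my picture seems to hold

For this block everything matches, which is why I don't understand where the difference comes from.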
Also, if the second version can't work, do you have a way to make the first kernel read global memory only once?
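In case it clarifies what I'm after, this is roughly the kind of thing I was imagining: one shared slice of space per axis of the block (an untested sketch; mykernel3!, Sx and Sy are just names I made up, and I'm not sure the loading pattern is right):

function mykernel3!(out, space)   # untested sketch of what I have in mind
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig - 1) * blockDim().x + i
    J = (jg - 1) * blockDim().y + j
    Sx = CUDA.@cuStaticSharedMem(Float32, (32,))   # slice of space along the block's first axis
    Sy = CUDA.@cuStaticSharedMem(Float32, (32,))   # slice of space along the block's second axis
    if j == 1
        Sx[i] = space[I]   # threads (1:32, 1) load the row slice
    end
    if i == 1
        Sy[j] = space[J]   # threads (1, 1:32) load the column slice
    end
    CUDA.sync_threads()
    out[I, J] = f(Sx[i], -Sy[j])
    return nothing
end

The idea would be that each block reads only 64 values from global memory instead of two per thread, but I may well be missing something.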
Thank you!