Consider the block with blockIdx().x == 1 and blockIdx().y == 32. Then you would have 1 <= I <= 32 and 992 <= J <= 1024. Therefore, you need 64 values from space to be able to compute the 1024 values of out. mykernel2 only loads 32 ([992:1024])
You could adjust mykernel2! to load all 64 required values
function mykernel2_fixed!(out,space)
ig = blockIdx().x
jg = blockIdx().y
i = threadIdx().x
j = threadIdx().y
I = (ig-1)*blockDim().x + i
J = (jg-1)*blockDim().y + j
S = CUDA.@cuStaticSharedMem(Float32,(2,32))
if i == 1
S[1, j] = space[(ig-1)*blockDim().x + j] # all possible I
elseif i == 2
S[2, j] = space[J] # all possible J
end
CUDA.sync_threads()
out[I,J] = f(S[1, i],-S[2, j])
return nothing
end
julia> apply_kernel!(mykernel2_fixed!, out2, space); out2 ≈ out3
true
But due to caching (helping mykernel!) this ends up being slower than mykernel!
julia> @btime apply_kernel!($mykernel!, $out1, $space);
31.500 μs (16 allocations: 448 bytes)
julia> @btime apply_kernel!($mykernel2_fixed!, $out1, $space);
40.100 μs (16 allocations: 448 bytes)