Cuda kernel help

Consider the block with blockIdx().x == 1 and blockIdx().y == 32. Then you would have 1 <= I <= 32 and 992 <= J <= 1024. Therefore, you need 64 values from space to be able to compute the 1024 values of out. mykernel2 only loads 32 ([992:1024])

You could adjust mykernel2! to load all 64 required values

function mykernel2_fixed!(out,space)
    ig = blockIdx().x
    jg = blockIdx().y
    i = threadIdx().x
    j = threadIdx().y
    I = (ig-1)*blockDim().x + i
    J = (jg-1)*blockDim().y + j
    S = CUDA.@cuStaticSharedMem(Float32,(2,32))
    if i == 1
        S[1, j] = space[(ig-1)*blockDim().x + j]  # all possible I
    elseif i == 2
        S[2, j] = space[J]  # all possible J
    end
    CUDA.sync_threads()
    out[I,J] = f(S[1, i],-S[2, j])
    return nothing
end
julia> apply_kernel!(mykernel2_fixed!, out2, space); out2 ≈ out3
true

But due to caching (helping mykernel!) this ends up being slower than mykernel!

julia> @btime apply_kernel!($mykernel!, $out1, $space);
  31.500 μs (16 allocations: 448 bytes)

julia> @btime apply_kernel!($mykernel2_fixed!, $out1, $space);
  40.100 μs (16 allocations: 448 bytes)
1 Like