Hi,
I'm trying to optimize the following kernel, written with CUDA.jl:
```julia
function kernel(A, B)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # grid-stride loop over the rows of A and B
    for i in index:stride:size(A, 1)
        α = 0f0  # Float32 literals keep the accumulators type-stable
        β = 0f0
        for j in 1:100
            for k in 1:size(A, 2)
                top = A[i, k]
                bot = B[i, k]
                α += top / bot
                β -= top / (bot * bot)
            end
        end
    end
    return
end
```
I tried to use shared memory, resulting in the following kernel:
```julia
function skernel(A, B, nthreads)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    off = size(A, 2)
    # view-based alternative (spelled out further below):
    # shared = @cuDynamicSharedMem(Float32, 2*off*nthreads)
    # p = view(shared, off*(threadIdx().x-1)+1:off*threadIdx().x)
    # q = view(shared, off*nthreads .+ (off*(threadIdx().x-1)+1:off*threadIdx().x))
    # each thread gets two disjoint slices of dynamic shared memory, at byte
    # offsets matching the shmem = 32*2*n*threads launch parameter
    p = @cuDynamicSharedMem(Float32, off, 32 * 2 * off * (threadIdx().x - 1))
    q = @cuDynamicSharedMem(Float32, off, 32 * off * (2 * (threadIdx().x - 1) + 1))
    for i in index:stride:size(A, 1)
        # stage the current row of A and B in shared memory
        for k in 1:off
            p[k] = A[i, k]
            q[k] = B[i, k]
        end
        α = 0f0
        β = 0f0
        for j in 1:100
            for k in 1:off
                α += p[k] / q[k]
                β -= p[k] / (q[k] * q[k])
            end
        end
    end
    return
end
```
Using shared memory like this results in poor performance (10 times slower for my application). I time both kernels with this function:
```julia
using CUDA, BenchmarkTools  # BenchmarkTools provides @btime

function test_kernels(L, n, threads)
    numblocks = ceil(Int, L / 256)
    snumblocks = ceil(Int, L / threads)
    A = CUDA.rand(L, n)  # Float32 matrices on the GPU
    B = CUDA.rand(L, n)
    @btime CUDA.@sync begin
        @cuda threads=256 blocks=$numblocks kernel($A, $B)
    end
    @btime CUDA.@sync begin
        @cuda threads=$threads blocks=$snumblocks shmem=32*2*$n*$threads skernel($A, $B, $threads)
    end
end
```
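For example, I call it with the parameters of the use case described below:

```julia
test_kernels(32*1024, 81, 8)
```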
My main use case is L = 32*1024 and n = 81; due to the shared-memory limitations of the second kernel, you can only use up to 8 threads per block.
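For scale, here is the shared-memory arithmetic (a sketch; `shmem_bytes` is just a throwaway helper, and 48 KB is a typical per-block limit that may differ on your device):

```julia
# Dynamic shared memory requested per block, in bytes, for n = 81.
# Note the launch reserves 32 bytes per element, while sizeof(Float32) == 4.
shmem_bytes(t) = 32 * 2 * 81 * t
shmem_bytes(8)  # 41472 bytes; a common per-block limit is 48 KB (49152 bytes)
```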
I wonder if this is a poor/useless use of shared memory?
(Using view instead of explicit offsets is a little faster, but still slower than using global memory, even though each value is reused 100 times in the inner loop; see the sketch below.)
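Spelled out, the view-based variant would look roughly like this (a sketch; `vkernel` is just my name for it here, and the launch then needs `shmem = sizeof(Float32) * 2 * n * threads` bytes):

```julia
# Sketch of the view-based variant: one contiguous dynamic shared buffer,
# split into disjoint per-thread slices for p and q.
# Launch with shmem = sizeof(Float32) * 2 * n * nthreads bytes.
function vkernel(A, B, nthreads)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    off = size(A, 2)
    shared = @cuDynamicSharedMem(Float32, 2 * off * nthreads)
    p = view(shared, off*(threadIdx().x - 1)+1:off*threadIdx().x)
    q = view(shared, off*nthreads .+ (off*(threadIdx().x - 1)+1:off*threadIdx().x))
    for i in index:stride:size(A, 1)
        for k in 1:off
            p[k] = A[i, k]
            q[k] = B[i, k]
        end
        α = 0f0
        β = 0f0
        for j in 1:100, k in 1:off
            α += p[k] / q[k]
            β -= p[k] / (q[k] * q[k])
        end
    end
    return
end
```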