ParallelStencil works with Threads but not with CUDA

Hi @Noel_Araujo, thanks for picking up ParallelStencil!

Regarding your issue, it looks like you are exceeding the CUDA limit on the maximum number of threads per block: BLOCKX * BLOCKY * BLOCKZ <= 1024. In your case, BLOCKX * BLOCKY = 512^2, which is not allowed. You could set, for example,

BLOCKX = 32
BLOCKY = 32

which should solve your issue.
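
To make the arithmetic concrete, here is a minimal sketch in plain Julia (no GPU needed) that checks a block shape against the 1024-thread limit and computes how many blocks are needed to cover an array. The helper names `fits_block_limit` and `blocks_for` are hypothetical and not part of ParallelStencil or CUDA.jl, and the 512 x 512 image size is an assumption for illustration.

```julia
# Hypothetical helpers for illustration; not part of ParallelStencil or CUDA.jl.
const MAX_THREADS_PER_BLOCK = 1024  # the limit quoted above

# True if the block shape stays within the per-block thread limit.
fits_block_limit(threads::Tuple) = prod(threads) <= MAX_THREADS_PER_BLOCK

# Number of blocks per dimension needed to cover an array of size `dims`
# (ceiling division, so partially filled blocks at the edges are counted).
blocks_for(dims::Tuple, threads::Tuple) = cld.(dims, threads)

dims = (512, 512)                       # assumed image size
println(fits_block_limit((512, 512)))   # 512^2 threads in one block
println(fits_block_limit((32, 32)))     # 32 * 32 = 1024 threads per block
println(blocks_for(dims, (32, 32)))     # blocks per dimension for 32 x 32 threads
```

With 32 x 32 threads per block, the block is exactly at the limit, and the grid of blocks takes care of covering the rest of the image.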

Note 1: ParallelStencil defines heuristics for cuthreads and cublocks, so you can skip defining the grid and block parameters explicitly and simply launch your kernel as

@time @parallel getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results

Note 2: In addition, you could initialise your arrays in a backend-agnostic fashion as follows

imageInput  = @zeros(nx,ny)
imageInput .= Data.Array(red.(originalImage))

Hope this helps :slight_smile:

(Thanks @carstenbauer for cc)
