Hi @Noel_Araujo, thanks for picking up ParallelStencil!
Regarding your issue, it looks like you are exceeding the CUDA limit on the maximum number of threads per block: BLOCKX * BLOCKY * BLOCKZ <= 1024. In your case, BLOCKX * BLOCKY = 512^2 = 262144, which is not allowed. You could instead set
BLOCKX = 32
BLOCKY = 32
which should solve your issue.
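To make the arithmetic concrete, here is a minimal plain-Julia sketch (no GPU or packages needed) that checks a block size against the limit and works out how many blocks are then needed. The helper names are hypothetical, not part of ParallelStencil or CUDA.jl, and the 512 x 512 image size is my assumption based on your block size.

```julia
# Hypothetical helpers for illustration only; not part of ParallelStencil or CUDA.jl.
const MAX_THREADS_PER_BLOCK = 1024  # the limit quoted above

# Total number of threads in one block
threads_per_block(block::NTuple{3,Int}) = prod(block)

# true if the block size respects the limit
fits_block_limit(block) = threads_per_block(block) <= MAX_THREADS_PER_BLOCK

# Number of blocks per dimension needed to cover n points (rounding up)
blocks_needed(n, block) = cld.(n, block)

imagesize = (512, 512, 1)  # assumed image size

println(fits_block_limit((512, 512, 1)))          # 512^2 threads per block: too many
println(fits_block_limit((32, 32, 1)))            # 32 * 32 = 1024 threads per block: fine
println(blocks_needed(imagesize, (32, 32, 1)))    # blocks per dimension for 32 x 32 blocks
```

With 32 x 32 threads per block, the work that no longer fits in a single block is spread over more blocks, so the grid size has to grow accordingly.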
Note 1: ParallelStencil has built-in heuristics for choosing cuthreads and cublocks, so you can skip defining them explicitly, drop the grid and block parameters, and simply launch your kernel as
@time @parallel getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results
Note 2: You could also initialise your arrays in a backend-agnostic fashion as follows:
imageInput = @zeros(nx,ny)
imageInput .= Data.Array(red.(originalImage))
Hope this helps!
(Thanks @carstenbauer for cc)