ParallelStencil works with Threads but not with CUDA


I have a code of Edge Detection using ParallelStencil. The algorithm is basic, it just compares one pixel to its left and bottom neighbors, if their values are above one threshold, I fill the image(matrix) with a different color(value) to indicate a border.

using Images
using ParallelStencil

const USE_GPU = false
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 2); 
    @init_parallel_stencil(Threads, Float64, 2); 

@parallel_indices (x,y) function getEdges!(originalImage, newImage, threshold)
    if (    x >= 2  && x <= size(originalImage,1) &&
            y >= 2  && y <= size(originalImage,2) )
        currentPixel = originalImage[x,y]
        leftPixel    = originalImage[x-1,y]
        bottomPixel  = originalImage[x,y-1]

        if abs(currentPixel-leftPixel) ≥ threshold || abs(currentPixel-bottomPixel) ≥ threshold
            newImage[x-1,y-1] = 1.0

Now I load the image

originalImage = load("image2.png") 
nx,ny = size(originalImage) # for my particular image, the red channel was enough

imageInput = @zeros(nx,ny)
imageInput .= red.(originalImage)

# using CUDA: CuArray
# imageInput .= CuArray(red.(originalImage))

imageOutput = @zeros(nx,ny)

Define some grid and block parameters

threshold = 0.02 # value depends on the image

BLOCKX = 512
BLOCKY = 512


cuthreads = (BLOCKX, BLOCKX, 1)
cublocks = (GRIDX, GRIDY, 1)
@time @parallel cublocks cuthreads getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results

With Threads, everything works fine, but I want to run with CUDA for comparison, and here is where issues happens.

If I set const USE_GPU = true and uncomment the lines that convert my image into CuArray, I got the error:

ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:91
  [2] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:101 [inlined]
  [3] cuLaunchKernel(f::CUDA.CuFunction, gridDimX::UInt32, gridDimY::UInt32, gridDimZ::UInt32, blockDimX::UInt32, blockDimY::UInt32, blockDimZ::UInt32, sharedMemBytes::Int64, hStream::CUDA.CuStream, kernelParams::Vector{Ptr{Nothing}}, extra::Ptr{Nothing})
    @ CUDA ~/.julia/packages/CUDA/5jdFl/lib/utils/call.jl:26
  [4] #39
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:69 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:33 [inlined]
  [6] macro expansion
    @ ./none:0 [inlined]
  [7] pack_arguments(::CUDA.var"#39#40"{Bool, Int64, CUDA.CuStream, CUDA.CuFunction, CUDA.CuDim3, CUDA.CuDim3}, ::CUDA.KernelState, ::CUDA.CuDeviceMatrix{Float64, 1}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::Float64, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::Int64, ::Int64, ::Int64)
    @ CUDA ./none:0
  [8] #launch#38
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:62 [inlined]
  [9] #44
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:136 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:95 [inlined]
 [11] macro expansion
    @ ./none:0 [inlined]
 [12] convert_arguments
    @ ./none:0 [inlined]
 [13] #cudacall#43
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:135 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:204 [inlined]
 [15] macro expansion
    @ ./none:0 [inlined]
 [16] call(::CUDA.HostKernel{typeof(getEdges!), Tuple{CUDA.CuDeviceMatrix{Float64, 1}, CUDA.CuDeviceMatrix{Float64, 1}, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, Int64, Int64, Int64}}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::Float64, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::Int64, ::Int64, ::Int64; call_kwargs::Base.Pairs{Symbol, Tuple{Int64, Int64, Int64}, Tuple{Symbol, Symbol}, NamedTuple{(:threads, :blocks), Tuple{Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}}}})
    @ CUDA ./none:0
 [17] (::CUDA.HostKernel{typeof(getEdges!), Tuple{CUDA.CuDeviceMatrix{Float64, 1}, CUDA.CuDeviceMatrix{Float64, 1}, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, Int64, Int64, Int64}})(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; threads::Tuple{Int64, Int64, Int64}, blocks::Tuple{Int64, Int64, Int64}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:462
 [18] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:104 [inlined]

I know that my GPU by itself has some issues, thus : Can anyone reproduce my error?

I know that I don’t know kernel programming in GPU, but this code looks so simple that I naively thought that no errors would appear. Can anyone tell me what am I doing wrong?

Thank you for your attention


  • Julia v1.7.2
  • CUDA.jl v3.8.5
    • Nvidia GTX 1080 Ti
  • ParallelStencil.jl v0.6.0

@luraess @samo :point_up:

Hi @Noel_Araujo, thanks for picking up ParallelStencil!

Regarding your issue, it looks like you are violating some CUDA limitations regarding maximal allowed number of threads per block, BLOCKX * BLOCKY * BLOCKZ <= 1024. In your case, you have BLOCKX * BLOCKY = 512^2 which is not allowed. You could set


which should solve your issue.

Note 1: ParallelStencil has cuthreads heuristics defined such that it permits you to skip explicit cuthreads and cublocks definition, dropping grid and block parameter definition and simply launching your kernel as

@time @parallel getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results

Note 2: You could in addition initialise your arrays in a backend-agnostic fashion as following

imageInput  = @zeros(nx,ny)
imageInput .= Data.Array(red.(originalImage))

Hope this helps :slight_smile:

(Thanks @carstenbauer for cc)


I am super impressed by the work on ParallelStencil.jl that really shows how Julia masters can elaborate HPC scientific packages that are truly portable (GPU/CPU) via effective abstractions and clever use of macros !
Do you have plans to extend it to other architectures (Metal, oneAPI, AMD…) or is it better to defer this task to KernelAbstractions.jl ?


@LaurentPlagne : Thanks for your nice words. An AMDGPU backend is a short term objective ( we already have developed basic AMDGPU support in the co-developed package ImplicitGlobalGrid.jl, which is to be released soon). Another rather short term objective is a KernelAbstractions backend, in order to have automatically support for all hardware supported by KernelAbstractions abstractions…


This is a tangential question, but where is the module Data defined? I cannot seem to locate it.

This is a tangential question, but where is the module Data defined? I cannot seem to locate it.

The module Data is created upon calling @init_parallel_stencil macro. This macro then calls the init_parallel_stencil function which defines the module Data upon calling init_parallel_kernel. Bare module def can be found here.
I hope this helps.
Best is to use the search feature in GitHub to track down the paths of interest.