ParallelStencil works with Threads but not with CUDA

Noel_Araujo · April 3, 2022, 12:25am

Hellos

I have a code of Edge Detection using ParallelStencil. The algorithm is basic, it just compares one pixel to its left and bottom neighbors, if their values are above one threshold, I fill the image(matrix) with a different color(value) to indicate a border.

using Images
using ParallelStencil

const USE_GPU = false
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 2); 
else
    @init_parallel_stencil(Threads, Float64, 2); 
end

@parallel_indices (x,y) function getEdges!(originalImage, newImage, threshold)
    if (    x >= 2  && x <= size(originalImage,1) &&
            y >= 2  && y <= size(originalImage,2) )
        
        currentPixel = originalImage[x,y]
        leftPixel    = originalImage[x-1,y]
        bottomPixel  = originalImage[x,y-1]

        if abs(currentPixel-leftPixel) ≥ threshold || abs(currentPixel-bottomPixel) ≥ threshold
            newImage[x-1,y-1] = 1.0
        end
    end
    return
end

Now I load the image

originalImage = load("image2.png") 
nx,ny = size(originalImage) # for my particular image, the red channel was enough

imageInput = @zeros(nx,ny)
imageInput .= red.(originalImage)

# using CUDA: CuArray
# imageInput .= CuArray(red.(originalImage))

imageOutput = @zeros(nx,ny)

Define some grid and block parameters

threshold = 0.02 # value depends on the image

BLOCKX = 512
BLOCKY = 512

GRIDX = nx÷BLOCKX + 1
GRIDY = ny÷BLOCKY + 1

cuthreads = (BLOCKX, BLOCKX, 1)
cublocks = (GRIDX, GRIDY, 1)

@time @parallel cublocks cuthreads getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results

With Threads, everything works fine, but I want to run with CUDA for comparison, and here is where issues happens.

If I set const USE_GPU = true and uncomment the lines that convert my image into CuArray, I got the error:

ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:91
  [2] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:101 [inlined]
  [3] cuLaunchKernel(f::CUDA.CuFunction, gridDimX::UInt32, gridDimY::UInt32, gridDimZ::UInt32, blockDimX::UInt32, blockDimY::UInt32, blockDimZ::UInt32, sharedMemBytes::Int64, hStream::CUDA.CuStream, kernelParams::Vector{Ptr{Nothing}}, extra::Ptr{Nothing})
    @ CUDA ~/.julia/packages/CUDA/5jdFl/lib/utils/call.jl:26
  [4] #39
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:69 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:33 [inlined]
  [6] macro expansion
    @ ./none:0 [inlined]
  [7] pack_arguments(::CUDA.var"#39#40"{Bool, Int64, CUDA.CuStream, CUDA.CuFunction, CUDA.CuDim3, CUDA.CuDim3}, ::CUDA.KernelState, ::CUDA.CuDeviceMatrix{Float64, 1}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::Float64, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::Int64, ::Int64, ::Int64)
    @ CUDA ./none:0
  [8] #launch#38
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:62 [inlined]
  [9] #44
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:136 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:95 [inlined]
 [11] macro expansion
    @ ./none:0 [inlined]
 [12] convert_arguments
    @ ./none:0 [inlined]
 [13] #cudacall#43
    @ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/execution.jl:135 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:204 [inlined]
 [15] macro expansion
    @ ./none:0 [inlined]
 [16] call(::CUDA.HostKernel{typeof(getEdges!), Tuple{CUDA.CuDeviceMatrix{Float64, 1}, CUDA.CuDeviceMatrix{Float64, 1}, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, Int64, Int64, Int64}}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::CUDA.CuDeviceMatrix{Float64, 1}, ::Float64, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::Int64, ::Int64, ::Int64; call_kwargs::Base.Pairs{Symbol, Tuple{Int64, Int64, Int64}, Tuple{Symbol, Symbol}, NamedTuple{(:threads, :blocks), Tuple{Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}}}})
    @ CUDA ./none:0
 [17] (::CUDA.HostKernel{typeof(getEdges!), Tuple{CUDA.CuDeviceMatrix{Float64, 1}, CUDA.CuDeviceMatrix{Float64, 1}, Float64, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, Int64, Int64, Int64}})(::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; threads::Tuple{Int64, Int64, Int64}, blocks::Tuple{Int64, Int64, Int64}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:462
 [18] macro expansion
    @ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:104 [inlined]

I know that my GPU by itself has some issues, thus : Can anyone reproduce my error?

I know that I don’t know kernel programming in GPU, but this code looks so simple that I naively thought that no errors would appear. Can anyone tell me what am I doing wrong?

Thank you for your attention

Notes:

Julia v1.7.2
CUDA.jl v3.8.5
- Nvidia GTX 1080 Ti
ParallelStencil.jl v0.6.0

carstenbauer · April 3, 2022, 6:58am

@luraess @samo

luraess · April 3, 2022, 8:54am

Hi @Noel_Araujo, thanks for picking up ParallelStencil!

Regarding your issue, it looks like you are violating some CUDA limitations regarding maximal allowed number of threads per block, BLOCKX * BLOCKY * BLOCKZ <= 1024. In your case, you have BLOCKX * BLOCKY = 512^2 which is not allowed. You could set

BLOCKX = 32
BLOCKY = 32

which should solve your issue.

Note 1: ParallelStencil has cuthreads heuristics defined such that it permits you to skip explicit cuthreads and cublocks definition, dropping grid and block parameter definition and simply launching your kernel as

@time @parallel getEdges!(imageInput, imageOutput, threshold)
[RGB.(Array(imageInput)); RGB.(Array(imageOutput))] # check results

Note 2: You could in addition initialise your arrays in a backend-agnostic fashion as following

imageInput  = @zeros(nx,ny)
imageInput .= Data.Array(red.(originalImage))

Hope this helps

(Thanks @carstenbauer for cc)

LaurentPlagne · April 3, 2022, 9:06am

I am super impressed by the work on ParallelStencil.jl that really shows how Julia masters can elaborate HPC scientific packages that are truly portable (GPU/CPU) via effective abstractions and clever use of macros !
Do you have plans to extend it to other architectures (Metal, oneAPI, AMD…) or is it better to defer this task to KernelAbstractions.jl ?

samo · April 3, 2022, 5:06pm

@LaurentPlagne : Thanks for your nice words. An AMDGPU backend is a short term objective ( we already have developed basic AMDGPU support in the co-developed package ImplicitGlobalGrid.jl, which is to be released soon). Another rather short term objective is a KernelAbstractions backend, in order to have automatically support for all hardware supported by KernelAbstractions abstractions…

wsshin · May 17, 2022, 3:47pm

This is a tangential question, but where is the module Data defined? I cannot seem to locate it.

luraess · May 17, 2022, 5:40pm

This is a tangential question, but where is the module Data defined? I cannot seem to locate it.

The module Data is created upon calling @init_parallel_stencil macro. This macro then calls the init_parallel_stencil function which defines the module Data upon calling init_parallel_kernel. Bare module def can be found here.
I hope this helps.
Best is to use the search feature in GitHub to track down the paths of interest.

Topic		Replies	Views
With ParallelStencil, is it possible to launch multiple kernels and sync later? GPU	7	102	April 9, 2025
ParallelStencil for Vector Fields GPU question , diffeq	5	867	November 23, 2021
Use ParallelStencil.jl with GPU GPU question , package , installation	3	432	September 1, 2023
How to use the package ParallelStencil.jl to solve linear equation A*x=b in GPU? Modelling & Simulations	9	1569	October 4, 2023
ParallelStencil + ImplicitGlobalGrid with multiple GPUs Julia at Scale	8	862	March 31, 2023

ParallelStencil works with Threads but not with CUDA

Related topics