Kernel compilation error: KernelError: recursion is currently not supported

Hi,
I wrote this kernel function:

```julia
function gpu_kernel_init_format(format_img)
    gpu_lab_image = CuArrays.fill(128, (10, 10, 3))

    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= 10 && j <= 10
        format_img[i, j] = [(gpu_lab_image[i][j][0]), (gpu_lab_image[i][j][1]), (gpu_lab_image[i][j][2]), i, j]
    end
    return nothing
end
```

Calling the kernel:

```julia
gpu_format = CuArrays.fill(0, (10, 10, 5))

@device_code_warntype @cuda blocks=(2,2) threads=(16,16) gpu_kernel_init_format(gpu_format)
```

I added the @device_code_warntype and still couldn't find the problem:

```
GPU compilation of gpu_kernel_init_format(CuDeviceArray{Int64,3,CUDAnative.AS.Global}) failed
KernelError: recursion is currently not supported
```

What should I fix?
Thanks

Please format your code using triple backticks. See "Please read: make it easier to help you".

You are allocating a CuArray within your kernel; this is unsupported. A CuArray is a host-side array: you can only pass it to a device kernel.
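A minimal sketch of the pattern (names and the trivial kernel body are illustrative, assuming CuArrays.jl/CUDAnative.jl as used in this thread): every buffer is created on the host and handed to the kernel as an argument; the kernel itself never allocates.

```julia
using CuArrays, CUDAnative

# Host side: allocate both buffers up front.
lab = CuArrays.fill(128, (10, 10, 3))   # input
out = CuArrays.fill(0,   (10, 10, 5))   # output

# Device side: the kernel only receives and fills existing arrays.
function init_kernel!(out, lab)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(out, 1) && j <= size(out, 2)
        out[i, j, 1] = lab[i, j, 1]   # element-wise writes only
    end
    return nothing
end

@cuda blocks=(2, 2) threads=(16, 16) init_kernel!(out, lab)
```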


Hi, thanks for answering!

I took out the allocation, passed the array in as an additional argument, and got this:

```
Reason: unsupported call through a literal pointer (call to jl_alloc_array_1d)
```

What should I do now?

I am new to Julia, especially GPU programming in Julia. Do you have any recommendations for beginner tutorials? I have seen some examples in the documentation and also enrolled in courses at Julia Academy, but I would like to practice more.

You're still allocating some array from within a kernel. GPU kernels are restricted and cannot just call into arbitrary Julia code. If you're not familiar with GPU computing, I'd recommend using the broadcast abstraction of CuArrays.jl. CUDAnative.jl can be used to create custom kernels, which is a little tricky, as you're experiencing here. Have a look at this tutorial: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/
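For example, the initialization in this thread can be expressed with broadcasting and slicing instead of a hand-written kernel. A sketch (assuming the 10×10×3 input and 10×10×5 output layout from the thread; each `.=` compiles to a GPU kernel behind the scenes):

```julia
using CuArrays

lab = CuArrays.fill(128, (10, 10, 3))
fmt = CuArrays.fill(0,   (10, 10, 5))

# Copy the three channels element-wise into the first three planes.
fmt[:, :, 1:3] .= lab

# Fill the last two planes with the i and j coordinates:
# a length-10 column broadcasts along rows, a 1×10 row along columns.
fmt[:, :, 4] .= CuArray(collect(1:10))                 # fmt[i, j, 4] = i
fmt[:, :, 5] .= CuArray(reshape(collect(1:10), 1, 10)) # fmt[i, j, 5] = j
```

The same code works on plain `Array`s (drop the `CuArray` wrappers), so it can be developed and debugged on the CPU first.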

Thanks!

This is my new kernel:

```julia
function gpu_kernel_init_format(format_img, gpu_lab_image)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= 10 && j <= 10
        format_img[i, j] = [(gpu_lab_image[i][j][0]), (gpu_lab_image[i][j][1]), (gpu_lab_image[i][j][2]), i, j]
    end
    return nothing
end
```

There is no allocation inside it, so what else should I fix?

```julia
format_img[i, j] = [(gpu_lab_image[i][j][0]), (gpu_lab_image[i][j][1]), (gpu_lab_image[i][j][2]), i, j]
```

That allocates an array, right? The `[...]` literal on the right-hand side constructs a new array every time. How would you expect this to work?
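The usual fix is to write each element individually instead of building an array on the right-hand side. A sketch of an allocation-free version of the kernel above (assuming 1-based, three-dimensional indexing and the 10×10×3 / 10×10×5 shapes from the thread):

```julia
function gpu_kernel_init_format(format_img, gpu_lab_image)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= 10 && j <= 10
        # Element-wise writes: no array literal, so nothing is allocated.
        format_img[i, j, 1] = gpu_lab_image[i, j, 1]
        format_img[i, j, 2] = gpu_lab_image[i, j, 2]
        format_img[i, j, 3] = gpu_lab_image[i, j, 3]
        format_img[i, j, 4] = i
        format_img[i, j, 5] = j
    end
    return nothing
end
```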

Sorry about my lack of understanding, but I allocated `format_img` in advance and passed it as an argument.
I just want to copy some values into `format_img`, not to allocate.
This is my code before calling the kernel:

```julia
gpu_format = CuArrays.fill(0, (10, 10, 5))
gpu_lab_image = CuArrays.fill(128, (10, 10, 3))

@device_code_warntype @cuda blocks=(2,2) threads=(16,16) gpu_kernel_init_format(gpu_format, gpu_lab_image)
```

But this code wouldn’t even work on the CPU?

```julia
julia> format = fill(0, (10,10,5));

julia> lab_image = fill(128, (10,10,3));

julia> i = 1; j = 2;

julia> format[i, j] = [(lab_image[i][j][0]), (lab_image[i][j][1]), (lab_image[i][j][2]), i, j]
ERROR: BoundsError
```

Before even considering GPU execution: there's plenty wrong with this. You can't assign a vector to `format[i, j]`; you need to slice or assign element-wise. `lab_image[i][j][0]` uses 0-based indexing (Julia is 1-based), and the indexing should be `lab_image[i, j, 1]`. Please make sure your code works first before trying to port it to the GPU, a pretty unfriendly environment where errors are much harder to debug. And again, you probably don't need custom kernels at all: try working with array abstractions, which you can first develop on `Array` and then port to `CuArray`.
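Concretely, a CPU-only version of the intended copy might look like this (a sketch assuming the same 10×10×3 input and 10×10×5 output shapes); once it produces the right values on plain `Array`s, porting it to the GPU is much less error-prone:

```julia
format    = fill(0,   (10, 10, 5))
format    = fill(0,   (10, 10, 5))
lab_image = fill(128, (10, 10, 3))

for j in 1:10, i in 1:10
    # 1-based, three-dimensional indexing and element-wise assignment.
    format[i, j, 1] = lab_image[i, j, 1]
    format[i, j, 2] = lab_image[i, j, 2]
    format[i, j, 3] = lab_image[i, j, 3]
    format[i, j, 4] = i
    format[i, j, 5] = j
end
```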

OK, thank you!