KernelError: recursion is currently not supported


I have reduced my task to a simple example:

I take each element of array x, build a 3x3 matrix from it, compute the determinant of that matrix, and store the determinant in the corresponding element of array y.

\begin{equation} x=\left[\begin{array}{c} {x_{1}} \\ {x_{2}} \\ {\vdots} \\ {x_{n}} \end{array}\right], \operatorname{mat}[i]=\left[\begin{array}{ccc} {x[i]+1} & {x[i]+2} & {x[i]+5} \\ {x[i]+1} & {x[i]+0} & {x[i]+4} \\ {x[i]+2} & {x[i]+3} & {x[i]+2} \end{array}\right], y=\left[\begin{array}{c} {\det(\operatorname{mat}[1])} \\ {\det(\operatorname{mat}[2])} \\ {\vdots} \\ {\det(\operatorname{mat}[n])} \end{array}\right] \end{equation}

My code:

using CuArrays
using CUDAnative
using LinearAlgebra

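# Build the 3x3 matrix for one element and return its determinant
function test(x)
    mat = [x+1 x+2 x+5
           x+1 x+0 x+4
           x+2 x+3 x+2]
    c = det(mat)
    return c
end

# Grid-stride loop: each thread handles elements index, index+stride, ...
function kernel!(x,y)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:size(x,1)
        y[i] = test(x[i])
    end
    return nothing
end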

x = rand(10000)
y = zeros(10000)
d_x = cu(x)
d_y = cu(y)

numblocks     = ceil(Int, size(x, 1)/256)
@cuda threads = 256 blocks = numblocks kernel!(d_x,d_y)


GPU compilation of kernel!(CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CuDeviceArray{Float32,1,CUDAnative.AS.Global}) failed
KernelError: recursion is currently not supported

Try inspecting the generated code with any of the @device_code_... macros.

 [1] mapreduce_impl at reduce.jl:148 (repeats 2 times)
 [2] det at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\LinearAlgebra\src\triangular.jl:2525
 [3] det at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\LinearAlgebra\src\generic.jl:1421
 [4] kernel! at In[3]:2

 [1] (::CUDAnative.var"#hook_emit_function#100"{CUDAnative.CompilerJob,Array{Core.MethodInstance,1}})(::Core.MethodInstance, ::Core.CodeInfo, ::UInt64) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\irgen.jl:102
 [2] compile_method_instance(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\irgen.jl:149
 [3] macro expansion at C:\Users\zenan\.julia\packages\TimerOutputs\7Id5J\src\TimerOutput.jl:228 [inlined]
 [4] irgen(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\irgen.jl:163
 [5] macro expansion at C:\Users\zenan\.julia\packages\TimerOutputs\7Id5J\src\TimerOutput.jl:228 [inlined]
 [6] macro expansion at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\driver.jl:99 [inlined]
 [7] macro expansion at C:\Users\zenan\.julia\packages\TimerOutputs\7Id5J\src\TimerOutput.jl:228 [inlined]
 [8] #codegen#156(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\driver.jl:98
 [9] #codegen at .\none:0 [inlined]
 [10] #compile#155(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\compiler\driver.jl:47
 [11] #compile#154 at .\none:0 [inlined]
 [12] #compile at .\none:0 [inlined] (repeats 2 times)
 [13] macro expansion at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\execution.jl:392 [inlined]
 [14] #cufunction#200(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(kernel!), ::Type{Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\execution.jl:359
 [15] cufunction(::Function, ::Type) at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\execution.jl:359
 [16] top-level scope at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\execution.jl:176
 [17] top-level scope at gcutils.jl:91
 [18] top-level scope at C:\Users\zenan\.julia\packages\CUDAnative\Phjco\src\execution.jl:173
 [19] top-level scope at In[5]:2


The above example is very similar to the task I actually want to complete. Because I am new to GPU programming, I cannot understand the error message. How can I fix this error? :sweat_smile:

You are calling det on the GPU, which in turn calls mapreduce. That kind of functionality is not available within a kernel, where you can only do relatively simple computations. The array allocation in test is also not possible in a kernel. Just hard-code the expression to calculate the determinant of your 3x3 matrix. Or you could try using StaticArrays, which you can allocate in a kernel (since they are stack-allocated), and it looks like they provide a det method.
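A minimal sketch of the hard-coded approach, assuming the matrix layout from the question. The name test_hardcoded is hypothetical; it is meant as a replacement for test inside kernel!. It uses only scalar arithmetic (cofactor expansion along the first row), so it neither allocates an array nor calls det. The check at the end runs on the CPU and only compares against LinearAlgebra.det.

```julia
using LinearAlgebra  # needed only for the CPU cross-check below

# Hypothetical replacement for `test`: determinant of the 3x3 matrix from
# the question, written out by cofactor expansion along the first row.
# Only scalars are involved, so nothing is allocated.
function test_hardcoded(x)
    a11, a12, a13 = x + 1, x + 2, x + 5
    a21, a22, a23 = x + 1, x + 0, x + 4
    a31, a32, a33 = x + 2, x + 3, x + 2
    return a11 * (a22 * a33 - a23 * a32) -
           a12 * (a21 * a33 - a23 * a31) +
           a13 * (a21 * a32 - a22 * a31)
end

# CPU cross-check against the generic determinant
x = 0.5
mat = [x+1 x+2 x+5
       x+1 x+0 x+4
       x+2 x+3 x+2]
@assert test_hardcoded(x) ≈ det(mat)
```

For this particular matrix the expansion simplifies algebraically to 8x + 15, so the kernel body could even be reduced to y[i] = 8 * x[i] + 15. The general expansion is kept here because the real task presumably uses a different matrix.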


Thank you for your reply! :grinning:

  1. I now understand the problem with computing the determinant.

  2. I ran into another problem: what should I do if I need to create a variable mat inside test(x), i.e. a matrix that depends on the parameter x?

This is not about creating a variable, but about allocating memory, which you (generally) cannot do in a kernel. So either pre-allocate the memory and pass it to your kernel, or use StaticArrays to get stack-allocated memory. You can also use CUDA shared memory, but it has specific semantics that you likely don’t want (values are shared across the threads in a block).
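A sketch of the StaticArrays option, assuming the StaticArrays package is installed. The name test_static is hypothetical. The @SMatrix macro builds a fixed-size 3x3 matrix whose size is part of its type, so no heap allocation is needed, and StaticArrays supplies its own det method for small matrices. The snippet below is only checked on the CPU; whether it compiles inside a kernel depends on the package versions in use.

```julia
using StaticArrays
using LinearAlgebra  # `det` is extended by StaticArrays for SMatrix

# Hypothetical replacement for `test`: the matrix is a stack-allocated
# SMatrix instead of a heap-allocated Array.
function test_static(x)
    mat = @SMatrix [x+1 x+2 x+5
                    x+1 x+0 x+4
                    x+2 x+3 x+2]
    return det(mat)
end

# CPU check with a Float32 input, the element type that cu() produced
# in the error message above
@assert test_static(1.0f0) ≈ 23.0f0
```

Inside kernel! the call would then read y[i] = test_static(x[i]), with the rest of the kernel unchanged.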

I’ve got it. Thanks again! :+1: