What is the maximal number of arguments a CUDAnative kernel can take? argc = 16 yields "Error: invalid kernel call; too many arguments"



I have a kernel function that I launch with @cuda from CUDAnative.jl. It takes 16 arguments:

function TrAXBY_CUDA!(cuTMP,G1,G2,Ind2,IX,JX,VX,IY,JY,VY,nΩ,nK,nnzX,nnzY,dimX,dimY)
         ... ...
         return nothing
end

Using @device_code_warntype, I got the following error:

2 1 ─ %1 = Base.llvmcall::Core.IntrinsicFunction                 │╻╷╷╷ macro expansion
  │        %1(Ptr{Nothing} @0x0000000003e53598, Ptr{Complex{Float64}}, Tuple{})
  │        $(Expr(:throw_undef_if_not, :tid, false))             ││   
  └──      unreachable                                           ││   
┌ Error: invalid kernel call; too many arguments
│   kernel = typeof(TrAXBY_CUDA!)
│   argc = 16
└ @ CUDAnative utils.jl:14
ERROR: LoadError: GPU compilation failed, try inspecting generated code with any of the @device_code_... macros
CompilerError: could not compile TrAXBY_CUDA!(CuDeviceArray{Complex{Float64},4,CUDAnative.AS.Global}, CuDeviceArray{Float64,4,CUDAnative.AS.Global}, CuDeviceArray{Float64,4,CUDAnative.AS.Global}, CuDeviceArray{Int64,1,CUDAnative.AS.Global}, CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CuDeviceArray{Float64,2,CUDAnative.AS.Global}, CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CuDeviceArray{Float64,2,CUDAnative.AS.Global}, Int64, Int64, Int64, Int64, Int64, Int64); kernel returns a value of type Any
 [1] validate_invocation(::CUDAnative.CompilerContext) at /home/yunlong/.julia7/packages/CUDAnative/pfAo/src/validation.jl:15
 [2] compile_function(::CUDAnative.CompilerContext) at ./logging.jl:317
 [3] #cufunction#85(::Base.Iterators.Pairs{Symbol,typeof(TrAXBY_CUDA!),Tuple{Symbol},NamedTuple{(:inner_f,),Tuple{typeof(TrAXBY_CUDA!)}}}, ::Function, ::CuDevice, ::Function, ::Type) at /home/yunlong/.julia7/packages/CUDAnative/pfAo/src/compiler.jl:655
 [4] (::getfield(CUDAnative, Symbol("#kw##cufunction")))(::NamedTuple{(:inner_f,),Tuple{typeof(TrAXBY_CUDA!)}}, ::typeof(cufunction), ::CuDevice, ::Function, ::Type) at ./none:0
 [5] _cuda(::CUDAnative.KernelWrapper{typeof(TrAXBY_CUDA!)}, ::typeof(TrAXBY_CUDA!), ::Tuple{}, ::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::CuDeviceArray{Complex{Float64},4,CUDAnative.AS.Global}, ::CuDeviceArray{Float64,4,CUDAnative.AS.Global}, ::CuDeviceArray{Float64,4,CUDAnative.AS.Global}, ::CuDeviceArray{Int64,1,CUDAnative.AS.Global}, ::CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float64,2,CUDAnative.AS.Global}, ::CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CuDeviceArray{Int64,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float64,2,CUDAnative.AS.Global}, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64) at /home/yunlong/.julia7/packages/CUDAnative/pfAo/src/execution.jl:235
 [6] macro expansion at ./gcutils.jl:89 [inlined]
 [7] top-level scope at /home/yunlong/.julia7/packages/CUDAnative/pfAo/src/reflection.jl:154 [inlined]
 [8] top-level scope at ./<missing>:0
 [9] include at ./boot.jl:317 [inlined]
 [10] include_relative(::Module, ::String) at ./loading.jl:1075
 [11] include(::Module, ::String) at ./sysimg.jl:29
 [12] include(::String) at ./client.jl:393
 [13] top-level scope at none:0

Is it because I am passing too many arguments? What is the limit on the number of arguments for a CUDA kernel?


The maximum is 13 arguments. CUDAnative enforces this, but the underlying cause is a limit in Julia’s type inference.
An easy workaround is to pass a tuple instead; with Julia 0.7 you can destructure that tuple back into variables directly in the kernel signature:

using CUDAnative

kernel((a,b,c,d,e,f,g,h,i,j,k,l,m,n)) = nothing

@cuda kernel((1,2,3,4,5,6,7,8,9,10,11,12,13,14))
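
Applied to the kernel in the question, one option would be to pack only the six Int64 sizes into a tuple, which brings the argument count from 16 down to 11. The sketch below shows the destructuring pattern as plain CPU code so that it runs without a GPU; the function name fill_sizes!, its body, and the values are hypothetical and only stand in for the real kernel.

```julia
# Hypothetical stand-in for a kernel: the six integer sizes arrive as one tuple
# and are destructured in the signature, so the function takes 2 arguments
# instead of 7.
function fill_sizes!(out, (nΩ, nK, nnzX, nnzY, dimX, dimY))
    out[1] = nΩ * nK
    out[2] = nnzX + nnzY
    out[3] = dimX * dimY
    return nothing   # same convention as a GPU kernel: return nothing
end

out = zeros(Int, 3)
sizes = (2, 3, 4, 5, 6, 7)   # made-up values, for illustration only
fill_sizes!(out, sizes)      # plain CPU call; on a GPU the launch would go through @cuda
```

The call site only changes by wrapping the trailing scalars in parentheses; the names inside the function body stay the same, so the rest of the kernel would not need to be touched.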


That was fixed in v0.7. Is the limitation for CUDAnative removed as well?


Ah, I missed the merging of that PR. There’s still going to be a limit, though, unless splatting also works properly now (Cassette-type inference problems).


Oh, you splat too? The tuple limit was eliminated and the splat limit was increased to 32:

Still much better.


OK, I had a quick look: even when destructuring the splat in a generated function, it still runs into issues with 36+ arguments. I didn’t have time to investigate, but it is still an improvement.


I’m fine if it keeps satisfying Moore’s law for argument numbers.