Illegal memory indexing for broadcast of a recursive function

Hello,

I’m trying to implement the generalized Laguerre function, applied element-wise to the elements of a CuArray.

The basic code uses a recursive function to evaluate the Laguerre polynomial:

using CUDA
using LinearAlgebra: BlasFloat
CUDA.allowscalar(false)

function gen_laguerre(n::Integer, alpha::Number, x::T)::T where T <: BlasFloat
    n == 0 && return 1
    n == 1 && return 1 + alpha - x
    # Three-term recurrence: n*L_n = (2n-1+alpha-x)*L_{n-1} - (n-1+alpha)*L_{n-2}
    return ((2*n-1+alpha-x)*gen_laguerre(n-1,alpha,x)-(n-1+alpha)*gen_laguerre(n-2,alpha,x))/n
end

A = CUDA.rand(1000)
gen_laguerre.(10, 3, A)

works, but if I increase the degree of the polynomial,

gen_laguerre.(30, 3, A)

I get a CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/HzJ3E/lib/cudadrv/libcuda.jl:27
  [2] check
    @ ~/.julia/packages/CUDA/HzJ3E/lib/cudadrv/libcuda.jl:34 [inlined]
  [3] cuMemAllocAsync
    @ ~/.julia/packages/CUDA/HzJ3E/lib/utils/call.jl:26 [inlined]
  [4] alloc(::Type{CUDA.Mem.DeviceBuffer}, bytesize::Int64; async::Bool, stream::CuStream, pool::Nothing)
    @ CUDA.Mem ~/.julia/packages/CUDA/HzJ3E/lib/cudadrv/memory.jl:83
  [5] alloc
    @ ~/.julia/packages/CUDA/HzJ3E/lib/cudadrv/memory.jl:71 [inlined]
  [6] actual_alloc(bytes::Int64; async::Bool, stream::CuStream)
    @ CUDA ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:65
  [7] actual_alloc
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:59 [inlined]
  [8] #993
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:417 [inlined]
  [9] retry_reclaim(f::CUDA.var"#993#995"{CuStream, Int64}, isfailed::typeof(isnothing))
    @ CUDA ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:337
 [10] macro expansion
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:416 [inlined]
 [11] macro expansion
    @ ./timing.jl:393 [inlined]
 [12] #_alloc#992
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:413 [inlined]
 [13] _alloc
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:408 [inlined]
 [14] #alloc#991
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:398 [inlined]
 [15] alloc
    @ ~/.julia/packages/CUDA/HzJ3E/src/pool.jl:392 [inlined]
 [16] CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64})
    @ CUDA ~/.julia/packages/CUDA/HzJ3E/src/array.jl:93
 [17] CuArray
    @ ~/.julia/packages/CUDA/HzJ3E/src/array.jl:176 [inlined]
 [18] CuArray
    @ ~/.julia/packages/CUDA/HzJ3E/src/array.jl:187 [inlined]
 [19] similar
    @ ./abstractarray.jl:882 [inlined]
 [20] similar
    @ ./abstractarray.jl:881 [inlined]
 [21] similar
    @ ~/.julia/packages/CUDA/HzJ3E/src/broadcast.jl:11 [inlined]
 [22] copy
    @ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:37 [inlined]
 [23] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(gen_laguerre), Tuple{Int64, Int64, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast ./broadcast.jl:873
 [24] top-level scope
    @ In[25]:1

And afterwards it is impossible to compute even a lower-order Laguerre polynomial.

I can’t figure out the source of the problem.

EDIT
I noticed that on the CPU it takes a long time when n > 10. Maybe this is the reason for the failure. However, while I understand the long computation time on the CPU, I can’t understand why it fails on the GPU.

Yeah, avoiding recursion is probably the easiest solution, regardless of the exact underlying problem.
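For reference, here is a minimal non-recursive sketch (the name `gen_laguerre_iter` is mine, not from the original post). It walks the same three-term recurrence bottom-up with two running values, so it needs no call stack and should broadcast over a CuArray at any degree:

```julia
# Iterative evaluation of the generalized Laguerre polynomial L_n^(alpha)(x),
# using the same three-term recurrence as the recursive version, bottom-up.
function gen_laguerre_iter(n::Integer, alpha::Number, x::T)::T where T <: Number
    n == 0 && return one(T)
    p0, p1 = one(T), T(1 + alpha - x)   # L_0 and L_1
    for k in 2:n
        # k*L_k = (2k-1+alpha-x)*L_{k-1} - (k-1+alpha)*L_{k-2}
        p0, p1 = p1, ((2k - 1 + alpha - x) * p1 - (k - 1 + alpha) * p0) / k
    end
    return p1
end
```

With this, `gen_laguerre_iter.(30, 3, A)` should work without touching the device stack limit.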

But why doesn’t CUDA support recursive functions?

This looks type-unstable. Maybe the `::T` return assertion fixes it, but maybe not enough?

I would fix that, and try inlining everything. Then try passing Val(n) instead of n so that the recursion can be inferred.

julia> function gen_laguerre_2(n::Integer, alpha::Number, x::Number)
           n == 0 && return 1
           n == 1 && return 1 + alpha - x
           return ((2*n-1+alpha-x)*gen_laguerre_2(n-1,alpha,x)-(n-1+alpha)*gen_laguerre_2(n-2,alpha,x))/n
       end
gen_laguerre_2 (generic function with 1 method)

julia> @code_warntype gen_laguerre_2(3, 0.1, 0.1)
MethodInstance for gen_laguerre_2(::Int64, ::Float64, ::Float64)
  from gen_laguerre_2(n::Integer, alpha::Number, x::Number) @ Main REPL[96]:1
Arguments
  #self#::Core.Const(gen_laguerre_2)
  n::Int64
  alpha::Float64
  x::Float64
Body::Union{Float64, Int64}
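To illustrate the Val(n) idea (the name `gen_laguerre_val` is mine): dispatching on `Val{N}` makes the degree a compile-time constant, so every level of the recursion is fully inferred and the `Union{Float64, Int64}` return type disappears. Note the call tree is still exponential in N unless the compiler folds it, so an iterative rewrite is usually the better fix:

```julia
# Base cases as separate methods; one(x) keeps the return type equal to typeof(x).
gen_laguerre_val(::Val{0}, alpha, x) = one(x)
gen_laguerre_val(::Val{1}, alpha, x) = one(x) + alpha - x
# Recursive case: N is a compile-time constant, so each level is concretely typed.
function gen_laguerre_val(::Val{N}, alpha, x) where N
    ((2N - 1 + alpha - x) * gen_laguerre_val(Val(N - 1), alpha, x) -
     (N - 1 + alpha) * gen_laguerre_val(Val(N - 2), alpha, x)) / N
end
```

Broadcasting treats `Val(n)` as a scalar, so `gen_laguerre_val.(Val(10), 3, A)` should work as before.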

It’s likely overflowing the stack, indeed. If you’re bent on using this implementation, you can try querying and increasing the stack size by calling CUDA.limit! with CU_LIMIT_STACK_SIZE.

I can’t find an example in the documentation. Will the following code work?

CUDA.limit!(CUDA.CU_LIMIT_STACK_SIZE, 1000)

where 1000 is the maximum recursion depth I expect.

We don’t re-document all of CUDA, so refer to the NVIDIA documentation for low-level details: CUDA Driver API :: CUDA Toolkit Documentation.

Stack size is expressed in bytes, not in number of recursions.
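A hedged sketch, then: estimate bytes per stack frame rather than counting recursions. The per-frame figure below is a rough guess of mine, not something measured, and the constant spelling follows the snippet above (it may differ between CUDA.jl versions):

```julia
using CUDA

depth = 30          # deepest recursion expected
frame_bytes = 256   # generous over-estimate of one stack frame

# The limit is per-thread and in bytes, not in recursion count.
CUDA.limit!(CUDA.CU_LIMIT_STACK_SIZE, depth * frame_bytes)
```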
