Modifying a thread-local vector within CUDA Dynamic Parallelism

Hi! I’d like to modify an MVector in the context of CUDA Dynamic Parallelism. Here is a minimal example:

using CUDA, StaticArrays

function outer!()
    v = MVector{1, Float32}(undef)
    @cuda dynamic = true inner!(v)
    nothing
end

function inner!(v)
    v[1] = 0.0f0
    nothing
end

@cuda outer!()

I receive the following message:

ERROR: Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --compile-only --verbose --gpu-name sm_70 --output-file /tmp/jl_WVUB2l0xml.cubin /tmp/jl_m9gxUniPUZ.ptx
ptxas /tmp/jl_m9gxUniPUZ.ptx, line 43; error   : Parameter to entry function cannot be an incomplete array.
ptxas /tmp/jl_m9gxUniPUZ.ptx, line 298; error   : Parameter to entry function cannot be an incomplete array.
ptxas fatal   : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_m9gxUniPUZ.ptx
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compile(job::GPUCompiler.CompilerJob)
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/compilation.jl:356
  [3] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler /.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
  [4] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler /.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
  [5] macro expansion
    @ /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:367 [inlined]
  [6] macro expansion
    @ ./lock.jl:267 [inlined]
  [7] cufunction(f::typeof(outer!), tt::Type{Tuple{}}; kwargs::@Kwargs{})
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:362
  [8] cufunction(f::typeof(outer!), tt::Type{Tuple{}})
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:359
  [9] top-level scope
    @ /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:112
 [10] top-level scope
    @ /.julia/packages/CUDA/htRwP/src/initialization.jl:206
Some type information was truncated. Use `show(err)` to see complete types.

Is there a way to do this?

No, that is sadly not possible. MArrays on the GPU currently depend on the compiler being able to inline all functions that use the MArray, so that it can turn the GC allocation into a stack allocation as an optimization.

Since a dynamic-parallelism launch is explicitly a non-inlined call, this optimization cannot happen.

Additionally, I don’t even know whether CUDA C supports this: dynamic parallelism can launch sub-kernels with different launch configurations, and it is not clear to me which thread in the sub-kernel the address of the parent’s thread-local memory would be handed to.
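
If global memory is acceptable in place of the thread-local MVector, one possible workaround (a minimal, untested sketch) is to allocate the buffer on the host and pass it down through the kernels: inside the parent kernel it arrives as a CuDeviceArray, which is a plain isbits value and can be forwarded to a device-side launch.

using CUDA

function outer!(v)
    # v is a CuDeviceArray here; being isbits, it can be passed
    # to a device-side (dynamic parallelism) launch.
    @cuda dynamic = true inner!(v)
    nothing
end

function inner!(v)
    v[1] = 0.0f0
    nothing
end

v = CUDA.zeros(Float32, 1)  # global-memory buffer shared by parent and child
@cuda outer!(v)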


Thank you very much, @vchuravy!

For those who don’t need Dynamic Parallelism in their application, inlining is indeed a working option; thank you for the hint! The example can then be rewritten as follows:

using CUDA, StaticArrays

function outer!()
    v = MVector{1, Float32}(undef)
    inner!(v)
    nothing
end

# Forcing inlining lets the compiler elide the MVector’s GC allocation.
@inline function inner!(v)
    v[1] = 0.0f0
    nothing
end

@cuda outer!()
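
To make sure the allocation was actually elided, one can inspect the generated device code with CUDA.jl’s reflection macros; if inlining succeeded, the LLVM IR for outer! should contain no GC allocation calls for the MVector:

# Print the device-side LLVM IR; with successful inlining there should be
# no GC allocation calls (e.g. gpu_gc_pool_alloc) left for the MVector.
@device_code_llvm @cuda outer!()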