Modifying a thread-local vector within CUDA Dynamic Parallelism

luciano-drozda · February 12, 2024, 4:37pm

Hi! I’d like to modify an MVector in the context of CUDA Dynamic Parallelism. Here is a minimal example:

using CUDA, StaticArrays

function outer!()
    v = MVector{1, Float32}(undef)
    @cuda dynamic = true inner!(v)
    nothing
end

function inner!(v)
    v[1] = 0.0f0
    nothing
end

@cuda outer!()

I receive the following message:

ERROR: Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --compile-only --verbose --gpu-name sm_70 --output-file /tmp/jl_WVUB2l0xml.cubin /tmp/jl_m9gxUniPUZ.ptx
ptxas /tmp/jl_m9gxUniPUZ.ptx, line 43; error   : Parameter to entry function cannot be an incomplete array.
ptxas /tmp/jl_m9gxUniPUZ.ptx, line 298; error   : Parameter to entry function cannot be an incomplete array.
ptxas fatal   : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_m9gxUniPUZ.ptx
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compile(job::GPUCompiler.CompilerJob)
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/compilation.jl:356
  [3] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler /.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
  [4] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler /.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
  [5] macro expansion
    @ /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:367 [inlined]
  [6] macro expansion
    @ ./lock.jl:267 [inlined]
  [7] cufunction(f::typeof(outer!), tt::Type{Tuple{}}; kwargs::@Kwargs{})
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:362
  [8] cufunction(f::typeof(outer!), tt::Type{Tuple{}})
    @ CUDA /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:359
  [9] top-level scope
    @ /.julia/packages/CUDA/htRwP/src/compiler/execution.jl:112
 [10] top-level scope
    @ /.julia/packages/CUDA/htRwP/src/initialization.jl:206
Some type information was truncated. Use `show(err)` to see complete types.

Is there a way to do this?

vchuravy · February 12, 2024, 8:10pm

No, that is sadly not possible. MArrays on the GPU currently depend on the compiler being able to inline all functions that use the MArray, so that it can then turn the GC allocation into a stack-allocated value as an optimization.

Since a dynamic parallelism launch is explicitly a non-inlined function call, this optimization cannot occur.

Additionally, I don’t even know if CUDA C supports this: I think you can use dynamic parallelism to launch sub-kernels with different launch configurations, and it is not clear to me which thread’s local-memory address would be passed to which thread in the sub-kernel.

luciano-drozda · February 13, 2024, 1:19am

Thank you very much @vchuravy!

For those who don’t need Dynamic Parallelism in their application, inlining is indeed a working option. Thank you for the hint! The example can then be rewritten as follows:

using CUDA, StaticArrays

function outer!()
    v = MVector{1, Float32}(undef)
    inner!(v)
    nothing
end

@inline function inner!(v)
    v[1] = 0.0f0
    nothing
end

@cuda outer!()
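
For those who do need Dynamic Parallelism, one possible workaround (not confirmed in this thread, so treat it as a sketch) is to keep the data in global memory instead of a thread-local MVector. A device array allocated on the host lives in global memory, so nothing has to be turned into a stack allocation and its address stays valid in the child kernel. The name `buf` and the use of `CUDA.fill` are my own choices for illustration, and this assumes that your CUDA.jl version and toolkit support `dynamic = true` launches:

```julia
using CUDA

# Child kernel: writes into global memory, which the child grid can access.
function inner!(buf)
    buf[1] = 0.0f0
    nothing
end

# Parent kernel: forwards the device array to a dynamically launched child.
function outer!(buf)
    @cuda dynamic = true inner!(buf)
    nothing
end

# `buf` is a hypothetical one-element buffer allocated from the host.
buf = CUDA.fill(1.0f0, 1)
@cuda outer!(buf)
synchronize()
```

With more than one parent thread, each thread would need its own slot in `buf` (for example, indexed by `threadIdx().x`), since the buffer is no longer private to a thread.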
