Local thread memory in GPU using StaticArrays

Hi all,

I want to write a CUDA kernel that uses some local memory per thread. I read here and here that one way to do this is with StaticArrays, but I couldn’t find an MWE of it.

The code I want to run is below; I need to store some intermediate results in the vector S.

using CuArrays, CUDAnative, CUDAdrv, StaticArrays

const M=128
const N=56

function kernel_staticarrays!(M, N, x_d, n_d)
    index = threadIdx().x
    stride = blockDim().x

    # S = MVector{N+1, Float32}(undef)
    S = SizedVector{N+1, Float32}(undef)

    for p=index:stride:M
        S[1] = 1.0f0
        for i=2:N+1
            S[i] = 0.0f0
        end
        for i=1:M
            i == p && continue
            for j=min(1, N):-1:max(1, N-M+1)
                S[j+1] += x_d[i]*S[j]
            end
        end
        n_d[p] = S[N+1]
    end

    return nothing
end

function main(M, N)
    x_d = CuArrays.rand(Float32, M)
    n_d = CuArrays.fill(0.0f0, M)

    numthreads = 256
    @cuda threads=numthreads kernel_staticarrays!(M, N, x_d, n_d)

    n = Array(n_d)
    display(n)
end

main(M, N)

The error stacktrace when I use SizedVector is

ERROR: LoadError: InvalidIRError: compiling kernel_staticarrays!(Int64, Int64, CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CuDeviceArray{Float32,1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_apply_type)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:11
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:11
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:14
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:16
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:21
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:24
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] macro expansion at /home/iff/arubio/.julia/packages/LLVM/DAnFH/src/interop/base.jl:52
 [2] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/pointer.jl:167
 [3] unsafe_store! at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/pointer.jl:167
 [4] setindex! at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/array.jl:84
 [5] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:24
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] macro expansion at /home/iff/arubio/.julia/packages/LLVM/DAnFH/src/interop/base.jl:52
 [2] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/pointer.jl:167
 [3] unsafe_store! at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/pointer.jl:167
 [4] setindex! at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/device/array.jl:84
 [5] kernel_staticarrays! at /scratch-global/arubio/950034/agp_gpu.jl:24
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/validation.jl:114
 [2] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/driver.jl:188 [inlined]
 [3] macro expansion at /home/iff/arubio/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
 [4] #codegen#156(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/driver.jl:186
 [5] #codegen at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/driver.jl:0 [inlined]
 [6] #compile#155(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/driver.jl:47
 [7] #compile at ./none:0 [inlined]
 [8] #compile#154 at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/compiler/driver.jl:28 [inlined]
 [9] #compile at ./none:0 [inlined] (repeats 2 times)
 [10] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/execution.jl:392 [inlined]
 [11] #cufunction#200(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(kernel_staticarrays!), ::Type{Tuple{Int64,Int64,CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}) at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/execution.jl:359
 [12] cufunction(::Function, ::Type) at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/execution.jl:359
 [13] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/execution.jl:176 [inlined]
 [14] macro expansion at ./gcutils.jl:87 [inlined]
 [15] macro expansion at /home/iff/arubio/.julia/packages/CUDAnative/Phjco/src/execution.jl:173 [inlined]
 [16] main(::Int64, ::Int64) at /scratch-global/arubio/950034/agp_gpu.jl:35
 [17] top-level scope at /scratch-global/arubio/950034/agp_gpu.jl:41
 [18] include at ./boot.jl:328 [inlined]
 [19] include_relative(::Module, ::String) at ./loading.jl:1094
 [20] include(::Module, ::String) at ./Base.jl:31
 [21] exec_options(::Base.JLOptions) at ./client.jl:295
 [22] _start() at ./client.jl:464
in expression starting at /scratch-global/arubio/950034/agp_gpu.jl:41

The stacktrace when I use MVector is essentially identical: the same InvalidIRError, with an unsupported call to jl_f_apply_type and unsupported dynamic invocations of setindex! at the same kernel lines.

What am I doing wrong? Also, do I need to declare M and N as constants? I have also read about cuDynamicSharedMem, but I suspect thread-local memory would perform better here.

Thanks in advance
:smile:

Check your code for type stability.
Either of these lines:

    # S = MVector{N+1, Float32}(undef)
    S = SizedVector{N+1, Float32}(undef)

will be type unstable.

Hi, that’s right, both lines seem to be type unstable. I have also tried SizedArray(S0_d), with S0_d a CuArray, but with no success. What is the proper way to create a type-stable StaticArray inside my kernel?

Thanks

I don’t own an NVIDIA GPU, so I can’t easily test, and I don’t have much experience with CuArrays.
I’m fairly certain the type instability is causing problems; I can’t say whether or not it is the only problem.

On type stability: for a function to be type stable, the types of its outputs must be inferrable from the types of its inputs.
One of the inputs to your function is N, whose type is Int.
What is the type of S? It is a SizedVector{N+1, ...}. If we don’t know the value of N, then we don’t know its type.
So we have to know the value.

julia> using StaticArrays, Test
[ Info: Precompiling StaticArrays [90137ffa-7385-5640-81b9-e52037218182]

julia> makevec1(N) = MVector{N+1,Float32}(undef)
makevec1 (generic function with 1 method)

julia> makevec2(::Val{N}) where {N} = MVector{N+1,Float32}(undef)
makevec2 (generic function with 1 method)

julia> makevec3(::Val{Np1}) where {Np1} = MVector{Np1,Float32}(undef)
makevec3 (generic function with 1 method)

julia> @inferred makevec1(7)
ERROR: return type MArray{Tuple{8},Float32,1,8} does not match inferred return type MArray{_A,Float32,1,_B} where _B where _A<:Tuple
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at REPL[18]:1

julia> @inferred makevec2(Val(7))
8-element MArray{Tuple{8},Float32,1,8} with indices SOneTo(8):
  4.0f-45
  0.0
 -0.12048435
  4.557f-41
 -3.1952376f13
  4.557f-41
 -2.2381268f15
  4.557f-41

julia> @inferred makevec3(Val(8))
8-element MArray{Tuple{8},Float32,1,8} with indices SOneTo(8):
 -5.211842f10
  4.557f-41
 -5.313413f10
  4.557f-41
  5.14f-43
  0.0
  1.33f-43
  0.0

julia> @code_warntype makevec1(7)
Variables
  #self#::Core.Compiler.Const(makevec1, false)
  N::Int64

Body::MArray{_A,Float32,1,_B} where _B where _A<:Tuple
1 ─ %1 = (N + 1)::Int64
│   %2 = Core.apply_type(Main.MVector, %1, Main.Float32)::Type{MArray{Tuple{_A},Float32,1,_A}} where _A
│   %3 = (%2)(Main.undef)::MArray{_A,Float32,1,_B} where _B where _A<:Tuple
└──      return %3

julia> @code_warntype makevec2(Val(7))
Variables
  #self#::Core.Compiler.Const(makevec2, false)
  #unused#::Core.Compiler.Const(Val{7}(), false)

Body::MArray{Tuple{8},Float32,1,8}
1 ─ %1 = ($(Expr(:static_parameter, 1)) + 1)::Core.Compiler.Const(8, false)
│   %2 = Core.apply_type(Main.MVector, %1, Main.Float32)::Core.Compiler.Const(MArray{Tuple{8},Float32,1,8}, false)
│   %3 = (%2)(Main.undef)::MArray{Tuple{8},Float32,1,8}
└──      return %3

julia> @code_warntype makevec3(Val(8))
Variables
  #self#::Core.Compiler.Const(makevec3, false)
  #unused#::Core.Compiler.Const(Val{8}(), false)

Body::MArray{Tuple{8},Float32,1,8}
1 ─ %1 = Core.apply_type(Main.MVector, $(Expr(:static_parameter, 1)), Main.Float32)::Core.Compiler.Const(MArray{Tuple{8},Float32,1,8}, false)
│   %2 = (%1)(Main.undef)::MArray{Tuple{8},Float32,1,8}
└──      return %2

Note that constructing the Val itself is also type unstable (unless the value of N is known at compile time), so you will need to pass N around your program as a type rather than as an integer (via some combination of Vals or type parameters of StaticArrays).

for N in 1:100
    # do something with Val(N)
end

would be doubly bad. The loop body will be type unstable, and the called functions will be recompiled for every new Val.
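One way to confine that dynamic dispatch is a function barrier: convert N to a Val once, so that everything behind the barrier specializes on the value. A minimal CPU-side sketch of this general pattern (makevec is a hypothetical name, not from the thread):

```julia
using StaticArrays

# One dynamic dispatch happens here, where N is a runtime Int...
makevec(N::Int) = makevec(Val(N))

# ...but inside this method N is a compile-time constant, so the
# MVector's size parameter is fully inferred and the body is type stable.
makevec(::Val{N}) where {N} = MVector{N + 1, Float32}(undef)
```

Calling makevec(7) still works from fully dynamic code, at the cost of a single runtime dispatch per call rather than instability throughout the callee.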

Thanks @Elrod! I declared the kernel as

function kernel_staticarrays!(M, x_d, n_d, ::Val{N}) where {N}

and also the main function as

function main(M, ::Val{N}) where {N}

That seems to do the trick. As I read in the docs, calling the function with Val(N) compiles a specialized method for each value of N, so the value of N is known at compile time, if I understood correctly.
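Putting the pieces together, the fixed kernel might look like this (an untested sketch of the CUDAnative-era API, with the loop body elided; I have no GPU to verify on):

```julia
using CuArrays, CUDAnative, StaticArrays

# N now arrives as a type parameter, so MVector{N + 1, Float32} is a
# compile-time constant and the kernel compiles without dynamic dispatch.
function kernel_staticarrays!(M, x_d, n_d, ::Val{N}) where {N}
    index = threadIdx().x
    stride = blockDim().x
    S = MVector{N + 1, Float32}(undef)   # per-thread local storage
    for p = index:stride:M
        # ... same body as in the original post ...
    end
    return nothing
end

# Launch with the size baked into the type:
# @cuda threads=256 kernel_staticarrays!(M, x_d, n_d, Val(N))
```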

Also, leaving N as a const global and not passing it as a function argument works fine, but that is probably not the best option if I’m working in a big module.
