CUDA + StaticArrays weird dynamic function invocation

Hi,
I have a weird issue that is somewhat related to this thread.

When using StaticArrays within a CUDA kernel, it seems to be necessary to add @inbounds before converting arrays/views to static arrays; otherwise this leads to an unsupported dynamic function call error due to these lines in StaticArrays.jl.
Putting @inbounds before the converts/SArray constructors elides the bounds check, so these lines are never hit.

At least in theory.
In the following I have a small example function that compiles and runs fine if executed from the REPL.
If this same function is called from within a test (standard Test.jl), then kernel compilation fails with the same error:

Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) @ StaticArrays ~/julia-depots/gpu/packages/StaticArrays/oOCPP/src/convert.jl:195)

even though the @inbounds statements are there.
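
For reference, the pattern I believe is at play can be sketched in plain Julia without CUDA or StaticArrays. This is only an illustration: `length_mismatch_fail`, `to_tuple`, `convert_checked` and `convert_unchecked` are hypothetical names I made up to mirror the shape of the StaticArrays code, not the actual implementation.

```julia
# Hypothetical stand-in for StaticArrays' `dimension_mismatch_fail`:
# kept out of line, so the throwing path is a separate function call.
@noinline length_mismatch_fail(n, a) =
    throw(DimensionMismatch("expected length $n, got $(length(a))"))

# Hypothetical converter: the length check sits inside a `@boundscheck` block,
# so a caller can opt out of it with `@inbounds` (if this function is inlined).
@inline function to_tuple(::Val{N}, a::AbstractVector) where {N}
    @boundscheck length(a) == N || length_mismatch_fail(N, a)
    return ntuple(i -> a[i], Val(N))
end

# No `@inbounds`: the length check always runs.
convert_checked(a) = to_tuple(Val(3), a)

# With `@inbounds`: the check may be elided, but only when bounds checking
# is not forced on globally (e.g. via --check-bounds=yes).
function convert_unchecked(a)
    @inbounds t = to_tuple(Val(3), a)
    return t
end
```

With a correctly sized input both variants return the same tuple; only the checked one is guaranteed to reach `length_mismatch_fail` on a wrong length.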

I’m a bit out of ideas now… Is this due to the inbounds marker not being propagated due to different inlining behaviour in tests/non-tests?
I tried various things including setting the always_inline parameter for cufunction, but nothing helped. I might be hunting a red herring here…

Here’s a MWE:

dummy MWE
using CUDA
using StaticArrays


function staticarrays()
    function canonical_kernel(::Val{N}, out, array3, matrix, vector) where {N}
        idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        rotation = view(array3, :, :, idx)
        column = view(matrix, :, idx)
        for i in 1:N 
            val = vector[i]
            for j in 1:N
                val += rotation[i, j] * column[j]
            end
            CUDA.@atomic out[i] += val
        end
        nothing
    end

    function sa_kernel(::Val{N}, out, array3, matrix, vector) where {N}
        idx = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
        column = view(matrix, :, idx)
        rotation = view(array3, :, :, idx)
        @inbounds column_static = SVector{N}(column)  # @inbounds is required!
        @inbounds rotation_static = SMatrix{N, N}(rotation)  # @inbounds is required!
        val = rotation_static * column_static + vector
        for i in 1:N 
            val_i = val[i]
            CUDA.@atomic out[i] += val_i
        end
        nothing
    end

    N = 3
    S = 10
    svector = @SVector randn(Float32, N)
    vector = cu(svector)
    matrix = CUDA.randn(N, S)
    array3 = CUDA.randn(N, N, S)
    out_sa = fill!(similar(matrix, N), 0)
    out_canonical = fill!(similar(matrix, N), 0)

    canonical_args = (Val(N), out_canonical, array3, matrix, vector)
    sa_args = (Val(N), out_sa, array3, matrix, svector)

    let kernel = @cuda launch=false sa_kernel(sa_args...)
        available_threads = launch_configuration(kernel.fun).threads

        threads = min(S, available_threads)
        blocks = cld(S, threads)

        kernel(sa_args...; threads, blocks)
    end

    let kernel = @cuda launch=false canonical_kernel(canonical_args...)
        available_threads = launch_configuration(kernel.fun).threads

        threads = min(S, available_threads)
        blocks = cld(S, threads)

        kernel(canonical_args...; threads, blocks)
    end

    out_sa, out_canonical
end

As I’ve said, the example works if I call staticarrays() from the REPL, but it fails with the above error if the same function is called within a Test.jl @testset.

Aside from this issue, what’s the general view on CUDA + StaticArrays? Is it recommended not to use SA within a kernel?

I’ve just quickly made that MWE a package to make this very easy to reproduce:
CUSA.jl

what works:

julia> using CUSA

julia> staticarrays()
(Float32[14.372921, -0.6995831, 6.592437], Float32[14.372921, -0.69958264, 6.5924377])

what doesn’t:

(CUSA) pkg> test
     Testing CUSA

.
.
.

CUDA + Staticarrays: Error During Test at /home/le58wel/Repos/CUSA/test/runtests.jl:4
  Got exception outside of a @test
  InvalidIRError: compiling MethodInstance for (::CUSA.var"#sa_kernel#2")(::Val{3}, ::CUDA.CuDeviceVector{Float32, 1}, ::CUDA.CuDeviceArray{Float32, 3, 1}, ::CUDA.CuDeviceMatrix{Float32, 1}, ::StaticArraysCore.SVector{3, Float32}) resulted in invalid LLVM IR
  Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
  Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
  Reason: unsupported call to an unknown function (call to julia.get_gc_frame_slot)
  Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) @ StaticArrays ~/julia-depots/gpu/packages/StaticArrays/oOCPP/src/convert.jl:195)

Any help or thoughts appreciated!

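Pkg.test runs with --check-bounds=yes, forcing bounds checks even if @inbounds is present. The @boundscheck block that calls dimension_mismatch_fail is therefore compiled into the kernel under Pkg.test, whereas the REPL uses the default bounds-check setting and honours @inbounds.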
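
A quick way to confirm this is to print the bounds-check mode from inside the test process. The sketch below reads `Base.JLOptions().check_bounds`, which is an internal field, so treat it as a debugging aid. `check_bounds_mode` and `run_tests_auto_bounds` are hypothetical helper names, and the second one assumes that flags passed through the `julia_args` keyword of `Pkg.test` take precedence over the `--check-bounds=yes` that Pkg sets itself.

```julia
using Pkg

# Translate the session's --check-bounds setting into a readable name.
# Base.JLOptions().check_bounds is 0 (auto), 1 (yes) or 2 (no).
function check_bounds_mode()
    flag = Base.JLOptions().check_bounds
    return flag == 1 ? "yes" : flag == 2 ? "no" : "auto"
end

# Hypothetical helper: run a package's tests with the default bounds-check mode.
# Assumption: `julia_args` is appended after Pkg's own flags and so overrides them.
run_tests_auto_bounds(pkg::AbstractString) =
    Pkg.test(pkg; julia_args=["--check-bounds=auto"])

println("check-bounds mode: ", check_bounds_mode())
```

Calling `check_bounds_mode()` at the top of test/runtests.jl should show "yes" under a plain `pkg> test`, while a normal REPL session should show "auto". If the override works as assumed, `run_tests_auto_bounds("CUSA")` would let the kernel compile in the tests as it does in the REPL.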
