Hi,

I have a weird issue that is somewhat related to this thread. When using StaticArrays within a CUDA kernel, it seems to be necessary to add `@inbounds` before converting arrays/views to static arrays; otherwise this leads to an unsupported dynamic function call error due to these lines in StaticArrays.jl. Putting `@inbounds` before the converts/`SArray` constructors elides the bounds check, so these lines are never hit. At least in theory.
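To spell out the mechanism I’m relying on: a `@boundscheck` block in an `@inline` callee is removed when that callee is inlined into a caller running under `@inbounds`, which is what should make the error branch (and its dynamic call) disappear. A minimal CPU-side sketch, with made-up names that are not part of the MWE below:

```julia
# @boundscheck code in an inlined callee is elided when the caller
# executes under @inbounds, taking the error branch with it.
@inline function checked_first(v)
    @boundscheck length(v) >= 1 || throw(BoundsError(v, 1))
    return v[1]
end

keep_check(v)  = checked_first(v)            # bounds check kept
elide_check(v) = @inbounds checked_first(v)  # bounds check elided (when inlined)
```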
In the following I have a small example function that compiles and runs fine when executed from the REPL. If the same function is called from within a test (standard Test.jl), the kernel compilation fails with the same error

```
Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) @ StaticArrays ~/julia-depots/gpu/packages/StaticArrays/oOCPP/src/convert.jl:195)
```

even though the `@inbounds` statements are there.

I’m a bit out of ideas now… Is this due to the `@inbounds` marker not being propagated because of different inlining behaviour in tests vs. non-tests? I tried various things, including setting the `always_inline` parameter for `cufunction`, but nothing helped. I might be hunting a red herring here…
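For reference, this is roughly how I set it (a sketch, reusing `sa_kernel` and `sa_args` from the MWE below):

```julia
# Ask the compiler to always inline device functions, hoping @inbounds
# then propagates into the StaticArrays constructors -- the dynamic
# invocation error stayed exactly the same for me.
kernel = @cuda launch=false always_inline=true sa_kernel(sa_args...)
```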
Here’s a MWE:
```julia
using CUDA
using StaticArrays

function staticarrays()
    function canonical_kernel(::Val{N}, out, array3, matrix, vector) where {N}
        idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        rotation = view(array3, :, :, idx)
        column = view(matrix, :, idx)
        for i in 1:N
            val = vector[i]
            for j in 1:N
                val += rotation[i, j] * column[j]
            end
            CUDA.@atomic out[i] += val
        end
        nothing
    end

    function sa_kernel(::Val{N}, out, array3, matrix, vector) where {N}
        idx = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
        column = view(matrix, :, idx)
        rotation = view(array3, :, :, idx)
        @inbounds column_static = SVector{N}(column)         # @inbounds is required!
        @inbounds rotation_static = SMatrix{N, N}(rotation)  # @inbounds is required!
        val = rotation_static * column_static + vector
        for i in 1:N
            val_i = val[i]
            CUDA.@atomic out[i] += val_i
        end
        nothing
    end

    N = 3
    S = 10
    svector = @SVector randn(Float32, N)
    vector = cu(svector)
    matrix = CUDA.randn(N, S)
    array3 = CUDA.randn(N, N, S)
    out_sa = fill!(similar(matrix, N), 0)
    out_canonical = fill!(similar(matrix, N), 0)
    canonical_args = (Val(N), out_canonical, array3, matrix, vector)
    sa_args = (Val(N), out_sa, array3, matrix, svector)

    let kernel = @cuda launch=false sa_kernel(sa_args...)
        available_threads = launch_configuration(kernel.fun).threads
        threads = min(S, available_threads)
        blocks = cld(S, threads)
        kernel(sa_args...; threads, blocks)
    end

    let kernel = @cuda launch=false canonical_kernel(canonical_args...)
        available_threads = launch_configuration(kernel.fun).threads
        threads = min(S, available_threads)
        blocks = cld(S, threads)
        kernel(canonical_args...; threads, blocks)
    end

    out_sa, out_canonical
end
```
As I’ve said, the example works if I call `staticarrays()` from the REPL, but it fails with the above error if the same function is called within a Test.jl `@testset`.
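Concretely, the failing call site is essentially just this (illustrative test set name and check, not my exact suite):

```julia
using Test

@testset "StaticArrays inside CUDA kernel" begin
    out_sa, out_canonical = staticarrays()  # kernel compilation fails here
    @test Array(out_sa) ≈ Array(out_canonical)
end
```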
Aside from this issue, what’s the general view on CUDA + StaticArrays? Is it recommended not to use SA within a kernel?