Problems with LinearAlgebra functions within KernelAbstractions and CUDA

I’m almost done porting my solver over to GPU with the help of CUDA and KernelAbstractions. However, I keep running into problems with LinearAlgebra functions. Take this simple example:

using StaticArrays
using CUDA: allowscalar, CuArray
allowscalar(false)
using KernelAbstractions
using LinearAlgebra: norm2, ×
using BenchmarkTools

N = (100,100,100)
f = Array # CuArray
p = zeros(Float32,N) |> f;
u = rand(Float32,N...,3) |> f;
@kernel function kern(p,u)
    I = @index(Global,Cartesian)
    p[I] = norm2(SVector{3}(I.I) × SVector{3}(u[I,:]))
end
apply!(p,u) = kern(get_backend(p),64)(p,u,ndrange=size(p))
@btime apply!(p,u); # 16.459 ms (1000199 allocations: 76.31 MiB)

This gives a crazy number of allocations running on a CPU, and it won’t run on the GPU at all:

Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:133)
Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) in StaticArrays at C:\Users\gweymouth\.julia\packages\StaticArrays\4uslg\src\convert.jl:190)

If I take out the SVector conversion, the allocations go way up on the CPU, and the GPU throws errors like Reason: unsupported dynamic function invocation (call to mapreduce_empty_iter(f, op, itr, ItrEltype) in Base at reduce.jl:375).

It seems the main issue is with SVector{3}(u[I,:]). Replacing this with SA[u[I,1],u[I,2],u[I,3]] runs and cuts most of the allocations. Can someone explain why?
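For reference, here is the kernel with that substitution (a sketch of my working variant; the setup is the same as above):

```julia
using StaticArrays
using KernelAbstractions
using LinearAlgebra: norm2, ×

# Same kernel as before, but the static vector is built element-wise
# with SA[...] instead of converting the slice u[I,:].
@kernel function kern_sa(p, u)
    I = @index(Global, Cartesian)
    p[I] = norm2(SVector{3}(I.I) × SA[u[I,1], u[I,2], u[I,3]])
end
apply_sa!(p, u) = kern_sa(get_backend(p), 64)(p, u, ndrange=size(p))
```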

Does a view on u[I,:] help?

I know my reply comes late, but for reference I’ll answer anyway:

The issue is actually stated in the error message above (its last line): the dynamic function invocation of dimension_mismatch_fail comes from the bounds check inside the SVector constructor.

You can fix this by dropping the bounds check in the SVector constructor:

v1 = SVector{3}(I.I)
u_view = @view u[I,:]
v2 = @inbounds SVector{3}(u_view)
p[I] = norm2(v1 × v2)

This is also the reason why SA[u[I,1],u[I,2],u[I,3]] works: it doesn’t do a bounds check, which would require a dynamic function call (to throw the error).
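A small CPU-side illustration of the difference (a sketch; the array and index are stand-ins):

```julia
using StaticArrays

u = rand(Float32, 4, 3)

# Conversion from a view: the length is only known at runtime, so the
# SVector constructor keeps a bounds check whose error path builds a
# string — a dynamic call the GPU compiler rejects.
from_view(u, i) = SVector{3}(@view u[i, :])

# @inbounds elides that check, removing the dynamic error path.
from_view_inbounds(u, i) = @inbounds SVector{3}(@view u[i, :])

# SA[...] reads three scalars; the length is known statically, so no
# runtime length check (and no error string) is ever generated.
from_scalars(u, i) = SA[u[i, 1], u[i, 2], u[i, 3]]
```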


That @noinline function should probably have a type signature that forces specialization, in order to make the code GPU compatible. Maybe open an issue on StaticArrays.jl?
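Forcing specialization on the type argument would look roughly like this (a hypothetical sketch of the StaticArrays internals, not the actual definition):

```julia
using StaticArrays

# Unspecialized: with a plain `SA::Type` argument, Julia's heuristics may
# compile a single generic method instance, which the GPU compiler then
# sees as a dynamic call.
# @noinline dimension_mismatch_fail(SA::Type, a::AbstractArray) = ...

# Specialized: declaring and using a `where {SA}` type parameter forces a
# fresh method instance per static-array type, so the call resolves
# statically.
@noinline function dimension_mismatch_fail(::Type{SA}, a::AbstractArray) where {SA}
    throw(DimensionMismatch("expected an array convertible to $(SA), got length $(length(a))"))
end
```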

If I force specialization of the function StaticArrays.dimension_mismatch_fail on a local checkout of StaticArrays, I just get another error:

Reason: unsupported dynamic function invocation (call to print_to_string(xs...) @ Base strings/io.jl:137)
Stacktrace:
 [1] string
   @ ./strings/io.jl:189
 [2] dimension_mismatch_fail
   @ ~/julia-depots/gpu/dev/StaticArrays/src/convert.jl:196
 [3] convert
   @ ~/julia-depots/gpu/dev/StaticArrays/src/convert.jl:201
 [4] StaticArray
   @ ~/julia-depots/gpu/dev/StaticArrays/src/convert.jl:174
 [5] macro expansion
   @ ~/.julia/dev/DiffPointRasterisation/experiments/kernelabstractions.jl:13
 [6] gpu_kern
   @ ~/julia-depots/gpu/packages/KernelAbstractions/jmHRX/src/macros.jl:101
 [7] gpu_kern
   @ ./none:0

Ah yes, that’s because of the string interpolation there. We could add a quirk like the ones we have for Base methods doing the same (CUDA.jl/src/device/quirks.jl at master · JuliaGPU/CUDA.jl · GitHub), but in an extension package for StaticArrays.jl.


Opened an issue here: Force specialization of `dimension_mismatch_fail()` on type parameter? · Issue #1244 · JuliaArrays/StaticArrays.jl · GitHub

@maleadt the issue on the StaticArrays side is fixed, so I wanted to add that quirk you mentioned.
One question though: why as an extension package? StaticArrays is currently a direct/strong dependency of CUDA. Is there a plan to make it a weak dependency, or is there something else I’m not seeing?

I would like to get rid of the dependency at some point; it’s only there to implement KernelAbstractions’ scratchpad memory.
