I’m almost done porting my solver to the GPU with the help of CUDA and KernelAbstractions. However, I keep running into problems with LinearAlgebra functions. Take this simple example:
using StaticArrays
using CUDA: allowscalar,CuArray
allowscalar(false)
using KernelAbstractions
using LinearAlgebra: norm2,×
using BenchmarkTools
N = (100,100,100)
f = Array # CuArray
p = zeros(Float32,N) |> f;
u = rand(Float32,N...,3) |> f;
@kernel function kern(p, u)
    I = @index(Global, Cartesian)
    p[I] = norm2(SVector{3}(I.I) × SVector{3}(u[I, :]))
end
apply!(p,u) = kern(get_backend(p),64)(p,u,ndrange=size(p))
@btime apply!(p,u); # 16.459 ms (1000199 allocations: 76.31 MiB)
This gives a crazy number of allocations when running on the CPU, and it won’t run on the GPU:
Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:133)
Reason: unsupported dynamic function invocation (call to dimension_mismatch_fail(SA::Type, a::AbstractArray) in StaticArrays at C:\Users\gweymouth\.julia\packages\StaticArrays\4uslg\src\convert.jl:190)
If I take out the SVector conversion, the allocations go way up on the CPU, and the GPU throws errors such as Reason: unsupported dynamic function invocation (call to mapreduce_empty_iter(f, op, itr, ItrEltype) in Base at reduce.jl:375).
It seems the main issue is with SVector{3}(u[I,:]). Replacing this with SA[u[I,1],u[I,2],u[I,3]] runs and cuts most of the allocations. Can someone explain why?
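For reference, here is a minimal CPU-only sketch (not from the thread) of the difference between the two constructions: converting a slice goes through an intermediate heap-allocated array plus a run-time length check, while the SA literal indexes the elements directly.

```julia
using StaticArrays

u = rand(Float32, 4, 3)

# Slice first: u[2, :] heap-allocates a Vector{Float32}, and the
# SVector{3} constructor then has to check its length at run time.
v1 = SVector{3}(u[2, :])

# Element-wise: no intermediate array, and the length is known
# statically, so no run-time check is needed.
v2 = SA[u[2, 1], u[2, 2], u[2, 3]]

v1 == v2  # true
```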
That run-time length check is also the reason why SA[u[I,1],u[I,2],u[I,3]] works: the explicit constructor doesn’t do the check, which would require a dynamic function call (for throwing the DimensionMismatch error).
That @noinline function should probably have a type signature that forces specialization, in order to make the code GPU compatible. Maybe open an issue on StaticArrays.jl?
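Until such a fix lands, one pattern that sidesteps the conversion entirely is to build the SVector from an ntuple, so the length is a compile-time constant and no intermediate array or run-time check is involved. A sketch with a hypothetical helper (getvec is my name, not from the thread):

```julia
using StaticArrays

# Hypothetical helper: read the length-3 vector stored along the last
# dimension of u without allocating a slice. Val(3) makes the tuple
# length a compile-time constant, so the SVector construction needs no
# run-time length check and stays GPU-friendly.
@inline getvec(u, I::CartesianIndex) = SVector{3}(ntuple(d -> u[I, d], Val(3)))

u = rand(Float32, 2, 2, 3)
I = CartesianIndex(1, 2)
getvec(u, I) == SA[u[I, 1], u[I, 2], u[I, 3]]  # true
```

Inside the kernel above, this would replace SVector{3}(u[I,:]) with getvec(u, I).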
@maleadt the issue on the StaticArrays side is fixed, so I wanted to add that quirk you mentioned.
One question though: why as an extension package? StaticArrays is currently a direct/strong dependency of CUDA. Is there a plan to make it a weak dependency, or is there something else I’m not seeing?