CUDA on slices

I’m trying to apply a function to slices of a CuArray.
I’ve tried broadcasting, mapslices and various tensor packages.
Is there any way to do this?

using CUDA, Pipe, Interpolations, TensorCast, Distributions

d = Uniform(1,30)
x = Float64.(1:30)
y = rand(d,30,500) |> eachcol
v = rand(d,1000)
gx, gy, gv = CuArray.([x,y,v])

f(  y::AbstractVector{Float64},  v::Float64)::Float64   = linear_interpolation( x,   y )(v)    # eachcol yields SubArrays, so accept any AbstractVector
gf(gy::CuArray{Float64},        gv::Float64)::Float64   = linear_interpolation( gx, gy )(gv)

@cast   o[i,j] :=  f(  y[i],  v[j]  )             #works
@cast  go[i,j] := gf( gy[i], gv[j]  )             #fails

What error does this give you? Is it scalar indexing? If so, that is a limitation of the CUDA programming model: CUDA.jl does not currently vectorize your functions automatically, so accessing CuArray[index] with a scalar index throws an error unless it is explicitly allowed. You should be able to do CuArray[1:100], though.
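
For reference, here is a minimal sketch of that rule (the array a is just illustrative):

using CUDA

a = CuArray(rand(10))

a[1]                    # scalar indexing: warns in the REPL, errors in scripts
CUDA.@allowscalar a[1]  # explicitly allowed, but runs on the CPU and is slow
a[1:5]                  # range indexing is a single GPU-side operation and is fine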
I am not familiar with TensorCast’s implementation and I don’t have a CUDA-capable computer at the moment, but you could look into that.
If you want to do an einsum on the GPU, I remember using Tullio once and it worked.

I’ve paid for 4 more hours here. password Aa12345678
https://9jeqf.launch.juliahub.app/

using CUDA, Pipe, Interpolations, TensorCast, Distributions, Tullio

d = Uniform(1,30)
x = Float64.(1:30)
y = rand(d,30,500) |> eachcol |> collect
v = rand(d,1000)
gx, gy, gv = CuArray.([x,y,v])

f( y,  v) = linear_interpolation( x,  y)(v)
gf(gy, gv) = linear_interpolation(gx, gy)(gv)

@cast    o[i,j] :=  f( y[i],  v[j])        # works
@tullio  o[i,j] :=  f( y[i],  v[j])        # works

@cast   go[i,j] := gf(gy[i], gv[j])        # error
@tullio go[i,j] := gf(gy[i], gv[j])        # error

@cast gives the error

ERROR: GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.

@tullio gives

Warning: Performing scalar indexing on task Task (runnable) @0x00007fc6df124740.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.

For Tullio to work on the GPU you also need to import:

using CUDA, CUDAKernels, KernelAbstractions
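
As a sanity check, here is a minimal sketch of a Tullio call that does run on the GPU, assuming all inputs are plain CuArrays of numbers (the arrays are just illustrative):

using CUDA, CUDAKernels, KernelAbstractions, Tullio

A = CUDA.rand(30, 500)
B = CUDA.rand(500, 1000)

@tullio C[i,j] := A[i,k] * B[k,j]   # compiles to a KernelAbstractions GPU kernel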

From what I can see, the problem is in your call to linear_interpolation.

julia> @tullio o[i,j] := linear_interpolation( gx,   gy[i] )(gv[j])
Error: Reason: unsupported dynamic function invocation (call to linear_interpolation)

For more, you can read the CUDA Troubleshooting docs.
If we dig deeper we can see it comes down to a conversion function:

Reason: unsupported dynamic function invocation (call to convert)

And if we inspect gy we find out that you didn’t really move y to a CuArray:

julia> typeof(gy[1])
SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}

(See how it wraps a Matrix{Float64} and not a CuArray{Float64}; that’s because the broadcast conversion only works one level deep.)

This is the hardest part of GPU programming: GPUs don’t really like lists or Vectors of Vectors. CUDA.jl will throw

ERROR: CuArray only supports element types that are stored inline

if you try to do this.
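
A minimal sketch of that restriction (illustrative arrays):

using CUDA

CuArray([rand(3) for _ in 1:4])   # ERROR: CuArray only supports element types that are stored inline
CuArray(rand(3, 4))               # fine: one dense matrix instead of a Vector of Vectors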

The “easiest” way to do this would be to write a custom kernel, or to use an einsum that accounts for both dimensions and to store gy as a single CuArray of size (30, 500); a sketch of that idea follows below.
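
Since x here is just 1:30 with unit spacing, the linear interpolation itself can be written with plain broadcasting and a gather, keeping the 500 curves in one (30, 500) CuArray. This is only a sketch of that idea, not the Interpolations.jl API, and it assumes CUDA.jl’s vectorized getindex handles the gather gy2[k, :] without falling back to scalar indexing:

using CUDA, Distributions

d   = Uniform(1, 30)
gy2 = CuArray(rand(d, 30, 500))        # 500 curves as columns, sampled at x = 1:30
gv  = CuArray(rand(d, 1000))           # 1000 query points

k  = clamp.(floor.(Int, gv), 1, 29)    # lower knot index for each query (x[k] == k)
t  = gv .- k                           # fractional position inside the interval
lo = gy2[k, :]                         # (1000, 500) gather of the lower knot values
hi = gy2[k .+ 1, :]                    # (1000, 500) gather of the upper knot values

go = permutedims(lo .* (1 .- t) .+ hi .* t)   # (500, 1000); go[i,j] ≈ f(y[i], v[j])

If the gather does fall back to scalar indexing on your CUDA.jl version, the same index arithmetic can be dropped into a hand-written kernel or a Tullio expression instead.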

Sorry, but here my knowledge unfortunately ends. I don’t know your level of experience, but keep in mind that Julia compiles a GPU kernel from your code, in contrast to Python frameworks like PyTorch or CuPy, where the kernels are already compiled and you only call them through an API. In CUDA.jl you therefore generally need to work with arrays of plain numbers.


Thanks, Jaratur.
I tried a few einsum packages. When they do work, they tend to be slower than mapslices on a CPU.

I’ve reposted here: Mapslices very slow

I’d rather not write a custom kernel; I’ll see if there are any alternatives.


Take a look at this (open) issue and a related merged PR for GPU interpolation. It’s immature, but better than trying to roll your own.